2014-07-31 21:54:53

by Paul E. McKenney

Subject: [PATCH v3 tip/core/rcu 0/9]

Hello!

This series provides v3 of a prototype of an RCU-tasks implementation,
which has been requested to assist with trampoline removal. This flavor
of RCU is task-based rather than CPU-based, and has voluntary context
switch, usermode execution, and the idle loop as its only quiescent
states. This selection of quiescent states ensures that at the end
of a grace period, there will no longer be any tasks depending on a
trampoline that was removed before the beginning of that grace period.
This works because such trampolines do not contain function calls,
do not contain voluntary context switches, do not switch to usermode,
and do not switch to idle.
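
As a rough illustration of the rule above (a user-space sketch, not code
from this series; the enum and function names are hypothetical), a task
can still be running in an old trampoline only if it has seen nothing
but preemptions since the grace period began:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event kinds for one task; names are illustrative only. */
enum task_event {
	EV_PREEMPTED,		/* involuntary switch: NOT a quiescent state */
	EV_VOLUNTARY_SWITCH,	/* task entered the scheduler on its own */
	EV_ENTER_USERMODE,	/* task returned to user space */
	EV_ENTER_IDLE,		/* task is the idle loop */
};

/* Return true if the event is one of the RCU-tasks quiescent states. */
static bool is_tasks_qs(enum task_event ev)
{
	return ev == EV_VOLUNTARY_SWITCH ||
	       ev == EV_ENTER_USERMODE ||
	       ev == EV_ENTER_IDLE;
}

/*
 * Given the events one task has gone through since the grace period
 * began, report whether that task might still be executing in a
 * trampoline that was removed before the grace period started.
 */
bool may_still_use_old_trampoline(const enum task_event *evs, size_t n)
{
	assert(evs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (is_tasks_qs(evs[i]))
			return false;	/* passed a quiescent state */
	return true;			/* only preemptions (or nothing) so far */
}
```

The grace period may end only once this function would return false for
every task that was runnable when the grace period began.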

The patches in this series are as follows:

1. Adds the basic call_rcu_tasks() functionality.

2. Provides cond_resched_rcu_qs() to force quiescent states, including
RCU-tasks quiescent states, in long loops.

3. Adds synchronous APIs: synchronize_rcu_tasks() and
rcu_barrier_tasks().

4. Adds GPL exports for the above APIs, courtesy of Steven Rostedt.

5. Adds rcutorture tests for RCU-tasks.

6. Adds RCU-tasks test cases to rcutorture scripting.

7. Adds stall-warning checks for RCU-tasks.

8. Improves RCU-tasks energy efficiency by replacing polling with
wait/wakeup.

9. Documents RCU-tasks stall-warning messages.

Changes from v2:

o Use get_task_struct() instead of do_exit() hooks to synchronize
with exiting tasks, as suggested by Lai Jiangshan.

o Add checks of ->on_rq to the grace-period-wait polling, again
as suggested by Lai Jiangshan.

o Repositioned synchronize_sched() calls and improved their
comments.

Changes from v1:

o The lockdep issue with list locking was finessed by ditching
list locking in favor of having the list manipulated by a single
kthread. This change trimmed about 150 highly concurrent lines
from the implementation.

o Get rid of the scheduler hooks in favor of polling the
per-task count of voluntary context switches, in response
to Peter Zijlstra's concerns about scheduler overhead.

o Passes more aggressive rcutorture runs, which indicates that
an increase in rcutorture's aggression is called for.

o Handled review comments from Peter Zijlstra, Lai Jiangshan,
Frederic Weisbecker, and Oleg Nesterov.

o Added RCU-tasks stall-warning documentation.
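
The polling scheme from the change lists above can be modeled in user
space roughly as follows. This is a simplified sketch: struct model_task
and both functions are hypothetical stand-ins, and the reference counting
done with get_task_struct() is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the task_struct fields used by the scan. */
struct model_task {
	unsigned long nvcsw;		/* voluntary context-switch count */
	unsigned long snap_nvcsw;	/* nvcsw value at grace-period start */
	bool on_rq;			/* runnable (on a runqueue)? */
	bool holdout;			/* still blocking the grace period? */
};

/* Grace-period start: every runnable task becomes a holdout. */
size_t mark_holdouts(struct model_task *tasks, size_t n)
{
	size_t nholdouts = 0;

	assert(tasks != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		tasks[i].holdout = tasks[i].on_rq;
		if (tasks[i].holdout) {
			tasks[i].snap_nvcsw = tasks[i].nvcsw;
			nholdouts++;
		}
	}
	return nholdouts;
}

/* One polling pass: release tasks that switched voluntarily or blocked. */
size_t poll_holdouts(struct model_task *tasks, size_t n)
{
	size_t nholdouts = 0;

	assert(tasks != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!tasks[i].holdout)
			continue;
		if (tasks[i].nvcsw != tasks[i].snap_nvcsw || !tasks[i].on_rq)
			tasks[i].holdout = false;
		else
			nholdouts++;
	}
	return nholdouts;
}
```

A caller would invoke mark_holdouts() once and then repeat
poll_holdouts() at some interval until it returns zero.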

Remaining issues include:

o It is not clear that trampolines in functions called from the
idle loop are correctly handled. Or if anyone cares about
trampolines in functions called from the idle loop.

o The current implementation does not yet recognize tasks that start
out executing in usermode. Instead, it waits for the next
scheduling-clock tick to note them.

o As a result, the current implementation does not handle nohz_full=
CPUs executing tasks running in usermode. There are a couple of
possible fixes under consideration.

o If a task is preempted while executing in usermode, the RCU-tasks
grace period will not end until that task resumes. (Is there
some reasonable way to determine that a given preempted task
was preempted from usermode execution?)

o More about RCU-tasks needs to be added to Documentation/RCU.

o This version creates rcu_tasks_kthread() even if there never will
be any uses, which is expected to be the common case. A future
version might create rcu_tasks_kthread() on demand, as suggested
off-list by Josh Triplett.

o There are probably still bugs.

Thanx, Paul

------------------------------------------------------------------------

b/Documentation/RCU/stallwarn.txt | 33 -
b/Documentation/kernel-parameters.txt | 5
b/fs/file.c | 2
b/include/linux/init_task.h | 9
b/include/linux/rcupdate.h | 52 +
b/include/linux/sched.h | 23
b/init/Kconfig | 10
b/kernel/rcu/rcutorture.c | 44 +
b/kernel/rcu/tiny.c | 2
b/kernel/rcu/tree.c | 14
b/kernel/rcu/tree_plugin.h | 2
b/kernel/rcu/update.c | 292 +++++++++-
b/mm/mlock.c | 2
b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01 | 7
b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot | 1
b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02 | 6
b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02.boot | 1
17 files changed, 460 insertions(+), 45 deletions(-)


2014-07-31 21:55:24

by Paul E. McKenney

Subject: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()

From: "Paul E. McKenney" <[email protected]>

This commit adds a new RCU-tasks flavor of RCU, which provides
call_rcu_tasks(). This RCU flavor's quiescent states are voluntary
context switch (not preemption!), userspace execution, and the idle loop.
Note that unlike other RCU flavors, these quiescent states occur in tasks,
not necessarily CPUs. Includes fixes from Steven Rostedt.

This RCU flavor is assumed to have very infrequent latency-tolerant
updaters. This assumption permits significant simplifications, including
a single global callback list protected by a single global lock, along
with a single linked list containing all tasks that have not yet passed
through a quiescent state. If experience shows this assumption to be
incorrect, the required additional complexity will be added.
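
For readers unfamiliar with the tail-pointer idiom behind the single
global callback list, here is a minimal user-space sketch. The names
(struct cb_head, model_*) are hypothetical, and the locking is omitted;
the kernel code in this patch does the same list manipulation under a
raw spinlock.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's struct rcu_head. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

/* Single global list; the tail pointer makes enqueue O(1). */
static struct cb_head *cbs_head;
static struct cb_head **cbs_tail = &cbs_head;

/* Enqueue a callback at the tail of the list. */
void model_call_rcu_tasks(struct cb_head *hp, void (*func)(struct cb_head *))
{
	assert(hp != NULL && func != NULL);
	hp->next = NULL;
	hp->func = func;
	*cbs_tail = hp;
	cbs_tail = &hp->next;
}

/* Detach the whole list in one step, leaving an empty list behind. */
struct cb_head *model_detach_callbacks(void)
{
	struct cb_head *list = cbs_head;

	cbs_head = NULL;
	cbs_tail = &cbs_head;
	return list;
}

/* Invoke detached callbacks in queue order; return how many ran. */
int model_invoke_callbacks(struct cb_head *list)
{
	int n = 0;

	while (list) {
		struct cb_head *next = list->next; /* func may free list */

		list->func(list);
		list = next;
		n++;
	}
	return n;
}

/* Example callback that just counts invocations. */
int model_cb_count;
void model_count_cb(struct cb_head *hp)
{
	(void)hp;
	model_cb_count++;
}
```

Detaching the entire list at once keeps the time spent under the lock
short, which is what makes a single global lock tolerable for infrequent
updaters.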

Suggested-by: Steven Rostedt <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/init_task.h | 9 +++
include/linux/rcupdate.h | 36 ++++++++++
include/linux/sched.h | 23 ++++---
init/Kconfig | 10 +++
kernel/rcu/tiny.c | 2 +
kernel/rcu/tree.c | 2 +
kernel/rcu/update.c | 171 ++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 242 insertions(+), 11 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6df7f9fe0d01..78715ea7c30c 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -124,6 +124,14 @@ extern struct group_info init_groups;
#else
#define INIT_TASK_RCU_PREEMPT(tsk)
#endif
+#ifdef CONFIG_TASKS_RCU
+#define INIT_TASK_RCU_TASKS(tsk) \
+ .rcu_tasks_holdout = false, \
+ .rcu_tasks_holdout_list = \
+ LIST_HEAD_INIT(tsk.rcu_tasks_holdout_list),
+#else
+#define INIT_TASK_RCU_TASKS(tsk)
+#endif

extern struct cred init_cred;

@@ -231,6 +239,7 @@ extern struct task_group root_task_group;
INIT_FTRACE_GRAPH \
INIT_TRACE_RECURSION \
INIT_TASK_RCU_PREEMPT(tsk) \
+ INIT_TASK_RCU_TASKS(tsk) \
INIT_CPUSET_SEQ(tsk) \
INIT_RT_MUTEXES(tsk) \
INIT_VTIME(tsk) \
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 6a94cc8b1ca0..829efc99df3e 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -197,6 +197,26 @@ void call_rcu_sched(struct rcu_head *head,

void synchronize_sched(void);

+/**
+ * call_rcu_tasks() - Queue RCU callback to invoke after task-based grace period
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all currently executing RCU
+ * read-side critical sections have completed. call_rcu_tasks() assumes
+ * that the read-side critical sections end at a voluntary context
+ * switch (not a preemption!), entry into idle, or transition to usermode
+ * execution. As such, there are no read-side primitives analogous to
+ * rcu_read_lock() and rcu_read_unlock() because this primitive is intended
+ * to determine that all tasks have passed through a safe state, not so
+ * much for data-structure synchronization.
+ *
+ * See the description of call_rcu() for more detailed information on
+ * memory ordering guarantees.
+ */
+void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
+
#ifdef CONFIG_PREEMPT_RCU

void __rcu_read_lock(void);
@@ -294,6 +314,22 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
rcu_irq_exit(); \
} while (0)

+/*
+ * Note a voluntary context switch for RCU-tasks benefit. This is a
+ * macro rather than an inline function to avoid #include hell.
+ */
+#ifdef CONFIG_TASKS_RCU
+#define rcu_note_voluntary_context_switch(t) \
+ do { \
+ preempt_disable(); /* Exclude synchronize_sched(); */ \
+ if (ACCESS_ONCE((t)->rcu_tasks_holdout)) \
+ ACCESS_ONCE((t)->rcu_tasks_holdout) = 0; \
+ preempt_enable(); \
+ } while (0)
+#else /* #ifdef CONFIG_TASKS_RCU */
+#define rcu_note_voluntary_context_switch(t) do { } while (0)
+#endif /* #else #ifdef CONFIG_TASKS_RCU */
+
#if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP)
bool __rcu_is_watching(void);
#endif /* #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP) */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 306f4f0c987a..3cf124389ec7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1273,6 +1273,11 @@ struct task_struct {
#ifdef CONFIG_RCU_BOOST
struct rt_mutex *rcu_boost_mutex;
#endif /* #ifdef CONFIG_RCU_BOOST */
+#ifdef CONFIG_TASKS_RCU
+ unsigned long rcu_tasks_nvcsw;
+ int rcu_tasks_holdout;
+ struct list_head rcu_tasks_holdout_list;
+#endif /* #ifdef CONFIG_TASKS_RCU */

#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
struct sched_info sched_info;
@@ -1998,31 +2003,27 @@ extern void task_clear_jobctl_pending(struct task_struct *task,
unsigned int mask);

#ifdef CONFIG_PREEMPT_RCU
-
#define RCU_READ_UNLOCK_BLOCKED (1 << 0) /* blocked while in RCU read-side. */
#define RCU_READ_UNLOCK_NEED_QS (1 << 1) /* RCU core needs CPU response. */
+#endif /* #ifdef CONFIG_PREEMPT_RCU */

static inline void rcu_copy_process(struct task_struct *p)
{
+#ifdef CONFIG_PREEMPT_RCU
p->rcu_read_lock_nesting = 0;
p->rcu_read_unlock_special = 0;
-#ifdef CONFIG_TREE_PREEMPT_RCU
p->rcu_blocked_node = NULL;
-#endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
#ifdef CONFIG_RCU_BOOST
p->rcu_boost_mutex = NULL;
#endif /* #ifdef CONFIG_RCU_BOOST */
INIT_LIST_HEAD(&p->rcu_node_entry);
+#endif /* #ifdef CONFIG_PREEMPT_RCU */
+#ifdef CONFIG_TASKS_RCU
+ p->rcu_tasks_holdout = false;
+ INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
+#endif /* #ifdef CONFIG_TASKS_RCU */
}

-#else
-
-static inline void rcu_copy_process(struct task_struct *p)
-{
-}
-
-#endif
-
static inline void tsk_restore_flags(struct task_struct *task,
unsigned long orig_flags, unsigned long flags)
{
diff --git a/init/Kconfig b/init/Kconfig
index 9d76b99af1b9..c56cb62a2df1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -507,6 +507,16 @@ config PREEMPT_RCU
This option enables preemptible-RCU code that is common between
the TREE_PREEMPT_RCU and TINY_PREEMPT_RCU implementations.

+config TASKS_RCU
+ bool "Task-based RCU implementation using voluntary context switch"
+ default n
+ help
+ This option enables a task-based RCU implementation that uses
+ only voluntary context switch (not preemption!), idle, and
+ user-mode execution as quiescent states.
+
+ If unsure, say N.
+
config RCU_STALL_COMMON
def_bool ( TREE_RCU || TREE_PREEMPT_RCU || RCU_TRACE )
help
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index d9efcc13008c..717f00854fc0 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -254,6 +254,8 @@ void rcu_check_callbacks(int cpu, int user)
rcu_sched_qs(cpu);
else if (!in_softirq())
rcu_bh_qs(cpu);
+ if (user)
+ rcu_note_voluntary_context_switch(current);
}

/*
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 625d0b0cd75a..f958c52f644d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2413,6 +2413,8 @@ void rcu_check_callbacks(int cpu, int user)
rcu_preempt_check_callbacks(cpu);
if (rcu_pending(cpu))
invoke_rcu_core();
+ if (user)
+ rcu_note_voluntary_context_switch(current);
trace_rcu_utilization(TPS("End scheduler-tick"));
}

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index bc7883570530..50453589e3ca 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -47,6 +47,7 @@
#include <linux/hardirq.h>
#include <linux/delay.h>
#include <linux/module.h>
+#include <linux/kthread.h>

#define CREATE_TRACE_POINTS

@@ -350,3 +351,173 @@ static int __init check_cpu_stall_init(void)
early_initcall(check_cpu_stall_init);

#endif /* #ifdef CONFIG_RCU_STALL_COMMON */
+
+#ifdef CONFIG_TASKS_RCU
+
+/*
+ * Simple variant of RCU whose quiescent states are voluntary context switch,
+ * user-space execution, and idle. As such, grace periods can take one good
+ * long time. There are no read-side primitives similar to rcu_read_lock()
+ * and rcu_read_unlock() because this implementation is intended to get
+ * the system into a safe state for some of the manipulations involved in
+ * tracing and the like. Finally, this implementation does not support
+ * high call_rcu_tasks() rates from multiple CPUs. If this is required,
+ * per-CPU callback lists will be needed.
+ */
+
+/* Lists of tasks that we are still waiting for during this grace period. */
+static LIST_HEAD(rcu_tasks_holdouts);
+
+/* Global list of callbacks and associated lock. */
+static struct rcu_head *rcu_tasks_cbs_head;
+static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
+static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
+
+/* Post an RCU-tasks callback. */
+void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
+{
+ unsigned long flags;
+
+ rhp->next = NULL;
+ rhp->func = func;
+ raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
+ *rcu_tasks_cbs_tail = rhp;
+ rcu_tasks_cbs_tail = &rhp->next;
+ raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+}
+EXPORT_SYMBOL_GPL(call_rcu_tasks);
+
+/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
+static int __noreturn rcu_tasks_kthread(void *arg)
+{
+ unsigned long flags;
+ struct task_struct *g, *t;
+ struct rcu_head *list;
+ struct rcu_head *next;
+
+ /* FIXME: Add housekeeping affinity. */
+
+ /*
+ * Each pass through the following loop makes one check for
+ * newly arrived callbacks, and, if there are some, waits for
+ * one RCU-tasks grace period and then invokes the callbacks.
+ * This loop is terminated by the system going down. ;-)
+ */
+ for (;;) {
+
+ /* Pick up any new callbacks. */
+ raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
+ smp_mb__after_unlock_lock(); /* Enforce GP memory ordering. */
+ list = rcu_tasks_cbs_head;
+ rcu_tasks_cbs_head = NULL;
+ rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
+ raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+
+ /* If there were none, wait a bit and start over. */
+ if (!list) {
+ schedule_timeout_interruptible(HZ);
+ flush_signals(current);
+ continue;
+ }
+
+ /*
+ * Wait for all pre-existing t->on_rq and t->nvcsw
+ * transitions to complete. Invoking synchronize_sched()
+ * suffices because all these transitions occur with
+ * interrupts disabled. Without this synchronize_sched(),
+ * a read-side critical section that started before the
+ * grace period might be incorrectly seen as having started
+ * after the grace period.
+ *
+ * This synchronize_sched() also dispenses with the
+ * need for a memory barrier on the first store to
+ * ->rcu_tasks_holdout, as it forces the store to happen
+ * after the beginning of the grace period.
+ */
+ synchronize_sched();
+
+ /*
+ * There were callbacks, so we need to wait for an
+ * RCU-tasks grace period. Start off by scanning
+ * the task list for tasks that are not already
+ * voluntarily blocked. Mark these tasks and make
+ * a list of them in rcu_tasks_holdouts.
+ */
+ rcu_read_lock();
+ for_each_process_thread(g, t) {
+ if (t != current && ACCESS_ONCE(t->on_rq) &&
+ !is_idle_task(t)) {
+ get_task_struct(t);
+ t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
+ ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
+ list_add(&t->rcu_tasks_holdout_list,
+ &rcu_tasks_holdouts);
+ }
+ }
+ rcu_read_unlock();
+
+ /*
+ * Each pass through the following loop scans the list
+ * of holdout tasks, removing any that are no longer
+ * holdouts. When the list is empty, we are done.
+ */
+ while (!list_empty(&rcu_tasks_holdouts)) {
+ schedule_timeout_interruptible(HZ / 10);
+ flush_signals(current);
+ rcu_read_lock();
+ list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
+ rcu_tasks_holdout_list) {
+ if (ACCESS_ONCE(t->rcu_tasks_holdout)) {
+ if (t->rcu_tasks_nvcsw ==
+ ACCESS_ONCE(t->nvcsw) &&
+ ACCESS_ONCE(t->on_rq))
+ continue;
+ ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
+ }
+ list_del_rcu(&t->rcu_tasks_holdout_list);
+ put_task_struct(t);
+ }
+ rcu_read_unlock();
+ }
+
+ /*
+ * Because ->on_rq and ->nvcsw are not guaranteed
+ * to have full memory barriers prior to them in the
+ * schedule() path, memory reordering on other CPUs could
+ * cause their RCU-tasks read-side critical sections to
+ * extend past the end of the grace period. However,
+ * because these ->nvcsw updates are carried out with
+ * interrupts disabled, we can use synchronize_sched()
+ * to force the needed ordering on all such CPUs.
+ *
+ * This synchronize_sched() also confines all
+ * ->rcu_tasks_holdout accesses to be within the grace
+ * period, avoiding the need for memory barriers for
+ * ->rcu_tasks_holdout accesses.
+ */
+ synchronize_sched();
+
+ /* Invoke the callbacks. */
+ while (list) {
+ next = list->next;
+ local_bh_disable();
+ list->func(list);
+ local_bh_enable();
+ list = next;
+ cond_resched();
+ }
+ }
+}
+
+/* Spawn rcu_tasks_kthread() at boot time. */
+static int __init rcu_spawn_tasks_kthread(void)
+{
+ struct task_struct __maybe_unused *t;
+
+ t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
+ BUG_ON(IS_ERR(t));
+ return 0;
+}
+early_initcall(rcu_spawn_tasks_kthread);
+
+#endif /* #ifdef CONFIG_TASKS_RCU */
--
1.8.1.5

2014-07-31 21:55:42

by Paul E. McKenney

Subject: [PATCH v3 tip/core/rcu 9/9] documentation: Add verbiage on RCU-tasks stall warning messages

From: "Paul E. McKenney" <[email protected]>

This commit documents RCU-tasks stall warning messages and also describes
when to use the new cond_resched_rcu_qs() API.

Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/RCU/stallwarn.txt | 33 ++++++++++++++++++++++++---------
1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
index 68fe3ad27015..ef5a2fd4ff70 100644
--- a/Documentation/RCU/stallwarn.txt
+++ b/Documentation/RCU/stallwarn.txt
@@ -56,8 +56,20 @@ RCU_STALL_RAT_DELAY
two jiffies. (This is a cpp macro, not a kernel configuration
parameter.)

-When a CPU detects that it is stalling, it will print a message similar
-to the following:
+rcupdate.rcu_task_stall_timeout
+
+ This boot/sysfs parameter controls the RCU-tasks stall warning
+ interval. A value of zero or less suppresses RCU-tasks stall
+ warnings. A positive value sets the stall-warning interval
+ in jiffies. An RCU-tasks stall warning starts with the line:
+
+ INFO: rcu_tasks detected stalls on tasks:
+
+ And continues with the output of sched_show_task() for each
+ task stalling the current RCU-tasks grace period.
+
+For non-RCU-tasks flavors of RCU, when a CPU detects that it is stalling,
+it will print a message similar to the following:

INFO: rcu_sched_state detected stall on CPU 5 (t=2500 jiffies)

@@ -174,8 +186,12 @@ o A CPU looping with preemption disabled. This condition can
o A CPU looping with bottom halves disabled. This condition can
result in RCU-sched and RCU-bh stalls.

-o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel
- without invoking schedule().
+o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the
+ kernel without invoking schedule(). Note that cond_resched()
+ does not necessarily prevent RCU CPU stall warnings. Therefore,
+ if the looping in the kernel is really expected and desirable
+ behavior, you might need to replace some of the cond_resched()
+ calls with calls to cond_resched_rcu_qs().

o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
happen to preempt a low-priority task in the middle of an RCU
@@ -208,11 +224,10 @@ o A hardware failure. This is quite unlikely, but has occurred
This resulted in a series of RCU CPU stall warnings, eventually
leading the realization that the CPU had failed.

-The RCU, RCU-sched, and RCU-bh implementations have CPU stall warning.
-SRCU does not have its own CPU stall warnings, but its calls to
-synchronize_sched() will result in RCU-sched detecting RCU-sched-related
-CPU stalls. Please note that RCU only detects CPU stalls when there is
-a grace period in progress. No grace period, no CPU stall warnings.
+The RCU, RCU-sched, RCU-bh, and RCU-tasks implementations have CPU stall
+warning. Note that SRCU does -not- have CPU stall warnings. Please note
+that RCU only detects CPU stalls when there is a grace period in progress.
+No grace period, no CPU stall warnings.

To diagnose the cause of the stall, inspect the stack traces.
The offending function will usually be near the top of the stack.
--
1.8.1.5

2014-07-31 21:55:22

by Paul E. McKenney

Subject: [PATCH v3 tip/core/rcu 5/9] rcutorture: Add torture tests for RCU-tasks

From: "Paul E. McKenney" <[email protected]>

This commit adds torture tests for RCU-tasks. It also fixes a bug that
would segfault for an RCU flavor lacking a callback-barrier function.
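
The bug fix amounts to treating the callback-barrier hook as optional.
A cut-down user-space sketch of that pattern follows; struct
model_torture_ops and the helper names are hypothetical, not the real
rcutorture definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Cut-down stand-in for struct rcu_torture_ops; fields are illustrative. */
struct model_torture_ops {
	void (*sync)(void);
	void (*cb_barrier)(void);	/* may be NULL for some flavors */
	const char *name;
};

/* Call the flavor's callback barrier if it has one; report whether it ran. */
bool model_run_cb_barrier(const struct model_torture_ops *ops)
{
	assert(ops != NULL);
	if (ops->cb_barrier == NULL)
		return false;	/* calling through NULL would crash */
	ops->cb_barrier();
	return true;
}

/* Example barrier hook that just counts invocations. */
int model_barrier_calls;
void model_flavor_barrier(void)
{
	model_barrier_calls++;
}
```

A flavor that leaves .cb_barrier unset is then skipped rather than
dereferenced.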

Signed-off-by: Paul E. McKenney <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
---
include/linux/rcupdate.h | 1 +
kernel/rcu/rcutorture.c | 40 +++++++++++++++++++++++++++++++++++++++-
2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 1f073af940a5..f9d314bbc7b6 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -55,6 +55,7 @@ enum rcutorture_type {
RCU_FLAVOR,
RCU_BH_FLAVOR,
RCU_SCHED_FLAVOR,
+ RCU_TASKS_FLAVOR,
SRCU_FLAVOR,
INVALID_RCU_FLAVOR
};
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index febe07062ac5..6d12ab6675bc 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -602,6 +602,42 @@ static struct rcu_torture_ops sched_ops = {
};

/*
+ * Definitions for RCU-tasks torture testing.
+ */
+
+static int tasks_torture_read_lock(void)
+{
+ return 0;
+}
+
+static void tasks_torture_read_unlock(int idx)
+{
+}
+
+static void rcu_tasks_torture_deferred_free(struct rcu_torture *p)
+{
+ call_rcu_tasks(&p->rtort_rcu, rcu_torture_cb);
+}
+
+static struct rcu_torture_ops tasks_ops = {
+ .ttype = RCU_TASKS_FLAVOR,
+ .init = rcu_sync_torture_init,
+ .readlock = tasks_torture_read_lock,
+ .read_delay = rcu_read_delay, /* just reuse rcu's version. */
+ .readunlock = tasks_torture_read_unlock,
+ .completed = rcu_no_completed,
+ .deferred_free = rcu_tasks_torture_deferred_free,
+ .sync = synchronize_rcu_tasks,
+ .exp_sync = synchronize_rcu_tasks,
+ .call = call_rcu_tasks,
+ .cb_barrier = rcu_barrier_tasks,
+ .fqs = NULL,
+ .stats = NULL,
+ .irq_capable = 1,
+ .name = "tasks"
+};
+
+/*
* RCU torture priority-boost testing. Runs one real-time thread per
* CPU for moderate bursts, repeatedly registering RCU callbacks and
* spinning waiting for them to be invoked. If a given callback takes
@@ -1295,7 +1331,8 @@ static int rcu_torture_barrier_cbs(void *arg)
if (atomic_dec_and_test(&barrier_cbs_count))
wake_up(&barrier_wq);
} while (!torture_must_stop());
- cur_ops->cb_barrier();
+ if (cur_ops->cb_barrier != NULL)
+ cur_ops->cb_barrier();
destroy_rcu_head_on_stack(&rcu);
torture_kthread_stopping("rcu_torture_barrier_cbs");
return 0;
@@ -1534,6 +1571,7 @@ rcu_torture_init(void)
int firsterr = 0;
static struct rcu_torture_ops *torture_ops[] = {
&rcu_ops, &rcu_bh_ops, &rcu_busted_ops, &srcu_ops, &sched_ops,
+ &tasks_ops,
};

if (!torture_init_begin(torture_type, verbose, &rcutorture_runnable))
--
1.8.1.5

2014-07-31 21:55:57

by Paul E. McKenney

Subject: [PATCH v3 tip/core/rcu 7/9] rcu: Add stall-warning checks for RCU-tasks

From: "Paul E. McKenney" <[email protected]>

This commit adds a three-minute RCU-tasks stall warning. The actual
time is controlled by the boot/sysfs parameter rcu_task_stall_timeout,
with values less than or equal to zero disabling the stall warnings.
The default value is three minutes, which means that the tasks that
have not yet responded will get their stacks dumped every three minutes,
until they pass through a voluntary context switch.

Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/kernel-parameters.txt | 5 ++++
kernel/rcu/update.c | 50 +++++++++++++++++++++++++++++--------
2 files changed, 44 insertions(+), 11 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 910c3829f81d..8cdbde7b17f5 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2921,6 +2921,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
rcupdate.rcu_cpu_stall_timeout= [KNL]
Set timeout for RCU CPU stall warning messages.

+ rcupdate.rcu_task_stall_timeout= [KNL]
+ Set timeout in jiffies for RCU task stall warning
+ messages. Disable with a value less than or equal
+ to zero.
+
rdinit= [KNL]
Format: <full_path>
Run specified binary instead of /init from the ramdisk,
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index b7694019e952..e940b86af4e8 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -373,6 +373,10 @@ static struct rcu_head *rcu_tasks_cbs_head;
static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);

+/* Control stall timeouts. Disable with <= 0, otherwise jiffies till stall. */
+static int rcu_task_stall_timeout __read_mostly = HZ * 60 * 3;
+module_param(rcu_task_stall_timeout, int, 0644);
+
/* Post an RCU-tasks callback. */
void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
{
@@ -444,11 +448,33 @@ void rcu_barrier_tasks(void)
}
EXPORT_SYMBOL_GPL(rcu_barrier_tasks);

+/* See if tasks are still holding out, complain if so. */
+static void check_holdout_task(struct task_struct *t,
+ bool needreport, bool *firstreport)
+{
+ if (!ACCESS_ONCE(t->rcu_tasks_holdout) ||
+ t->rcu_tasks_nvcsw != ACCESS_ONCE(t->nvcsw) ||
+ !ACCESS_ONCE(t->on_rq)) {
+ ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
+ list_del_rcu(&t->rcu_tasks_holdout_list);
+ put_task_struct(t);
+ return;
+ }
+ if (!needreport)
+ return;
+ if (*firstreport) {
+ pr_err("INFO: rcu_tasks detected stalls on tasks:\n");
+ *firstreport = false;
+ }
+ sched_show_task(t);
+}
+
/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
static int __noreturn rcu_tasks_kthread(void *arg)
{
unsigned long flags;
struct task_struct *g, *t;
+ unsigned long lastreport;
struct rcu_head *list;
struct rcu_head *next;

@@ -518,22 +544,24 @@ static int __noreturn rcu_tasks_kthread(void *arg)
* of holdout tasks, removing any that are no longer
* holdouts. When the list is empty, we are done.
*/
+ lastreport = jiffies;
while (!list_empty(&rcu_tasks_holdouts)) {
+ bool firstreport;
+ bool needreport;
+ int rtst;
+
schedule_timeout_interruptible(HZ / 10);
+ rtst = ACCESS_ONCE(rcu_task_stall_timeout);
+ needreport = rtst > 0 &&
+ time_after(jiffies, lastreport + rtst);
+ if (needreport)
+ lastreport = jiffies;
+ firstreport = true;
flush_signals(current);
rcu_read_lock();
list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
- rcu_tasks_holdout_list) {
- if (ACCESS_ONCE(t->rcu_tasks_holdout)) {
- if (t->rcu_tasks_nvcsw ==
- ACCESS_ONCE(t->nvcsw) &&
- ACCESS_ONCE(t->on_rq))
- continue;
- ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
- }
- list_del_rcu(&t->rcu_tasks_holdout_list);
- put_task_struct(t);
- }
+ rcu_tasks_holdout_list)
+ check_holdout_task(t, needreport, &firstreport);
rcu_read_unlock();
}

--
1.8.1.5

2014-07-31 21:55:21

by Paul E. McKenney

Subject: [PATCH v3 tip/core/rcu 6/9] rcutorture: Add RCU-tasks test cases

From: "Paul E. McKenney" <[email protected]>

This commit adds the TASKS01 and TASKS02 Kconfig fragments, along with
the corresponding TASKS01.boot and TASKS02.boot boot-parameter files
specifying that rcutorture test RCU-tasks instead of the default flavor.

Signed-off-by: Paul E. McKenney <[email protected]>
---
tools/testing/selftests/rcutorture/configs/rcu/TASKS01 | 7 +++++++
tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot | 1 +
tools/testing/selftests/rcutorture/configs/rcu/TASKS02 | 6 ++++++
tools/testing/selftests/rcutorture/configs/rcu/TASKS02.boot | 1 +
4 files changed, 15 insertions(+)
create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TASKS01
create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot
create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TASKS02
create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TASKS02.boot

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TASKS01 b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01
new file mode 100644
index 000000000000..263a20f01fae
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01
@@ -0,0 +1,7 @@
+CONFIG_SMP=y
+CONFIG_NR_CPUS=2
+CONFIG_HOTPLUG_CPU=y
+CONFIG_PREEMPT_NONE=n
+CONFIG_PREEMPT_VOLUNTARY=n
+CONFIG_PREEMPT=y
+CONFIG_TASKS_RCU=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot
new file mode 100644
index 000000000000..cd2a188eeb6d
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot
@@ -0,0 +1 @@
+rcutorture.torture_type=tasks
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TASKS02 b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02
new file mode 100644
index 000000000000..17b669c8833c
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02
@@ -0,0 +1,6 @@
+CONFIG_SMP=n
+CONFIG_HOTPLUG_CPU=y
+CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_VOLUNTARY=n
+CONFIG_PREEMPT=n
+CONFIG_TASKS_RCU=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TASKS02.boot b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02.boot
new file mode 100644
index 000000000000..cd2a188eeb6d
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02.boot
@@ -0,0 +1 @@
+rcutorture.torture_type=tasks
--
1.8.1.5

2014-07-31 21:55:18

by Paul E. McKenney

Subject: [PATCH v3 tip/core/rcu 3/9] rcu: Add synchronous grace-period waiting for RCU-tasks

From: "Paul E. McKenney" <[email protected]>

It turns out to be easier to add the synchronous grace-period waiting
functions to RCU-tasks than to work around their absence in rcutorture,
so this commit adds them. The key point is that the existence of
call_rcu_tasks() means that rcutorture needs an rcu_barrier_tasks().
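
To show how a synchronous wait can be layered on an asynchronous
callback primitive, here is a single-threaded user-space sketch. All
names are hypothetical, and model_grace_period() merely pretends that a
grace period has elapsed; the kernel instead sleeps on a completion
while the kthread does the work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_head {
	struct model_head *next;
	void (*func)(struct model_head *head);
};

static struct model_head *model_queue;		/* pending callbacks */
static struct model_head **model_queue_tail = &model_queue;

/* Asynchronous primitive: queue func to run after a grace period. */
void model_call(struct model_head *hp, void (*func)(struct model_head *))
{
	assert(hp != NULL && func != NULL);
	hp->next = NULL;
	hp->func = func;
	*model_queue_tail = hp;
	model_queue_tail = &hp->next;
}

/* Stand-in for the kthread: pretend a grace period elapsed, run callbacks. */
void model_grace_period(void)
{
	struct model_head *list = model_queue;

	model_queue = NULL;
	model_queue_tail = &model_queue;
	while (list) {
		struct model_head *next = list->next;

		list->func(list);
		list = next;
	}
}

/* Waiter state: a callback head next to a completion flag. */
struct model_waiter {
	struct model_head head;	/* must stay first for the cast below */
	bool done;
};

static void model_wakeme(struct model_head *hp)
{
	((struct model_waiter *)hp)->done = true;
}

/* Synchronous wait: post a callback, then wait until it has run. */
void model_synchronize(void)
{
	struct model_waiter w = { .done = false };

	model_call(&w.head, model_wakeme);
	while (!w.done)
		model_grace_period();	/* a real waiter would sleep */
}

/* Barrier: with one FIFO queue, waiting for a new callback suffices. */
void model_barrier_tasks(void)
{
	model_synchronize();
}

/* Example callback that just counts invocations. */
int model_ran;
void model_count(struct model_head *hp)
{
	(void)hp;
	model_ran++;
}
```

Because the queue is first-in-first-out, every callback posted before
the barrier has run by the time the barrier's own callback is invoked.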

Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rcupdate.h | 2 ++
kernel/rcu/update.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 57 insertions(+)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index ac87f587a1c1..1f073af940a5 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -216,6 +216,8 @@ void synchronize_sched(void);
* memory ordering guarantees.
*/
void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
+void synchronize_rcu_tasks(void);
+void rcu_barrier_tasks(void);

#ifdef CONFIG_PREEMPT_RCU

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 50453589e3ca..fe866848a063 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -387,6 +387,61 @@ void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
}
EXPORT_SYMBOL_GPL(call_rcu_tasks);

+/**
+ * synchronize_rcu_tasks - wait until an rcu-tasks grace period has elapsed.
+ *
+ * Control will return to the caller some time after a full rcu-tasks
+ * grace period has elapsed, in other words after all currently
+ * executing rcu-tasks read-side critical sections have elapsed. These
+ * read-side critical sections are delimited by calls to schedule(),
+ * cond_resched_rcu_qs(), idle execution, userspace execution, calls
+ * to synchronize_rcu_tasks(), and (in theory, anyway) cond_resched().
+ *
+ * This is a very specialized primitive, intended only for a few uses in
+ * tracing and other situations requiring manipulation of function
+ * preambles and profiling hooks. The synchronize_rcu_tasks() function
+ * is not (yet) intended for heavy use from multiple CPUs.
+ *
+ * Note that this guarantee implies further memory-ordering guarantees.
+ * On systems with more than one CPU, when synchronize_rcu_tasks() returns,
+ * each CPU is guaranteed to have executed a full memory barrier since the
+ * end of its last RCU-tasks read-side critical section whose beginning
+ * preceded the call to synchronize_rcu_tasks(). In addition, each CPU
+ * having an RCU-tasks read-side critical section that extends beyond
+ * the return from synchronize_rcu_tasks() is guaranteed to have executed
+ * a full memory barrier after the beginning of synchronize_rcu_tasks()
+ * and before the beginning of that RCU-tasks read-side critical section.
+ * Note that these guarantees include CPUs that are offline, idle, or
+ * executing in user mode, as well as CPUs that are executing in the kernel.
+ *
+ * Furthermore, if CPU A invoked synchronize_rcu_tasks(), which returned
+ * to its caller on CPU B, then both CPU A and CPU B are guaranteed
+ * to have executed a full memory barrier during the execution of
+ * synchronize_rcu_tasks() -- even if CPU A and CPU B are the same CPU
+ * (but again only if the system has more than one CPU).
+ */
+void synchronize_rcu_tasks(void)
+{
+ /* Complain if the scheduler has not started. */
+ rcu_lockdep_assert(!rcu_scheduler_active,
+ "synchronize_rcu_tasks called too soon");
+
+ /* Wait for the grace period. */
+ wait_rcu_gp(call_rcu_tasks);
+}
+
+/**
+ * rcu_barrier_tasks - Wait for in-flight call_rcu_tasks() callbacks.
+ *
+ * Although the current implementation is guaranteed to wait, it is not
+ * obligated to, for example, if there are no pending callbacks.
+ */
+void rcu_barrier_tasks(void)
+{
+ /* There is only one callback queue, so this is easy. ;-) */
+ synchronize_rcu_tasks();
+}
+
/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
static int __noreturn rcu_tasks_kthread(void *arg)
{
--
1.8.1.5
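
Editorial sketch (not part of the patch): the patch builds the synchronous
wait on top of call_rcu_tasks() by way of wait_rcu_gp(), and implements
rcu_barrier_tasks() as a plain synchronize_rcu_tasks() because there is only
one callback queue. The single-threaded userspace model below illustrates
that idea under loud assumptions: every sketch_* name, struct cb_head,
struct waiter, and count_cb() are hypothetical stand-ins, the "grace period"
is simulated by simply draining the queue, and the busy loop stands in for
the sleep the kernel would do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct rcu_head. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

/* One FIFO callback queue, as in the patch (no locking: single-threaded). */
static struct cb_head *cbs_head;
static struct cb_head **cbs_tail = &cbs_head;

/* Asynchronous primitive: queue a callback to run after a "grace period". */
static void sketch_call(struct cb_head *head, void (*func)(struct cb_head *head))
{
	head->next = NULL;
	head->func = func;
	*cbs_tail = head;
	cbs_tail = &head->next;
}

/* Stand-in for the kthread: pretend a grace period elapsed, run callbacks. */
static void sketch_run_grace_period(void)
{
	struct cb_head *list = cbs_head;

	cbs_head = NULL;
	cbs_tail = &cbs_head;
	while (list) {
		struct cb_head *next = list->next;

		list->func(list);
		list = next;
	}
}

/* On-stack waiter: a callback head plus a flag (a completion in the kernel). */
struct waiter {
	struct cb_head head;	/* first member, so the cast below is valid */
	bool done;
};

static void wake_waiter(struct cb_head *head)
{
	((struct waiter *)head)->done = true;
}

/* Synchronous primitive built on the asynchronous one. */
static void sketch_synchronize(void)
{
	struct waiter w = { .done = false };

	sketch_call(&w.head, wake_waiter);
	while (!w.done)
		sketch_run_grace_period();	/* the kernel would sleep here */
	assert(w.done);
}

/* With one FIFO queue, waiting for a new callback waits for all earlier ones. */
static void sketch_barrier(void)
{
	sketch_synchronize();
}

/* Example callback that only counts invocations. */
static int invoked;

static void count_cb(struct cb_head *head)
{
	(void)head;
	invoked++;
}
```

Queueing two count_cb() callbacks and then calling sketch_barrier() leaves
both invoked, because the waiter's own callback sits behind them in the same
queue. A design with per-CPU queues could not take this shortcut.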

2014-07-31 21:56:39

by Paul E. McKenney

Subject: [PATCH v3 tip/core/rcu 2/9] rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops

From: "Paul E. McKenney" <[email protected]>

RCU-tasks requires the occasional voluntary context switch
from CPU-bound in-kernel tasks. In some cases, this requires
instrumenting cond_resched(). However, there is some reluctance
to countenance unconditionally instrumenting cond_resched() (see
http://lwn.net/Articles/603252/), so this commit creates a separate
cond_resched_rcu_qs() that may be used in place of cond_resched() in
locations prone to long-duration in-kernel looping.

This commit currently instruments only RCU-tasks. Future possibilities
include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
IPI usage.

Signed-off-by: Paul E. McKenney <[email protected]>
---
fs/file.c | 2 +-
include/linux/rcupdate.h | 13 +++++++++++++
kernel/rcu/rcutorture.c | 4 ++--
kernel/rcu/tree.c | 12 ++++++------
kernel/rcu/tree_plugin.h | 2 +-
mm/mlock.c | 2 +-
6 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 66923fe3176e..1cafc4c9275b 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -367,7 +367,7 @@ static struct fdtable *close_files(struct files_struct * files)
struct file * file = xchg(&fdt->fd[i], NULL);
if (file) {
filp_close(file, files);
- cond_resched();
+ cond_resched_rcu_qs();
}
}
i++;
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 829efc99df3e..ac87f587a1c1 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -330,6 +330,19 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
#define rcu_note_voluntary_context_switch(t) do { } while (0)
#endif /* #else #ifdef CONFIG_TASKS_RCU */

+/**
+ * cond_resched_rcu_qs - Report potential quiescent states to RCU
+ *
+ * This macro resembles cond_resched(), except that it is defined to
+ * report potential quiescent states to RCU-tasks even if the cond_resched()
+ * machinery were to be shut off, as some advocate for PREEMPT kernels.
+ */
+#define cond_resched_rcu_qs() \
+do { \
+ rcu_note_voluntary_context_switch(current); \
+ cond_resched(); \
+} while (0)
+
#if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP)
bool __rcu_is_watching(void);
#endif /* #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP) */
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 7fa34f86e5ba..febe07062ac5 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -667,7 +667,7 @@ static int rcu_torture_boost(void *arg)
}
call_rcu_time = jiffies;
}
- cond_resched();
+ cond_resched_rcu_qs();
stutter_wait("rcu_torture_boost");
if (torture_must_stop())
goto checkwait;
@@ -1019,7 +1019,7 @@ rcu_torture_reader(void *arg)
__this_cpu_inc(rcu_torture_batch[completed]);
preempt_enable();
cur_ops->readunlock(idx);
- cond_resched();
+ cond_resched_rcu_qs();
stutter_wait("rcu_torture_reader");
} while (!torture_must_stop());
if (irqreader && cur_ops->irq_capable) {
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f958c52f644d..645a33efc0d4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1650,7 +1650,7 @@ static int rcu_gp_init(struct rcu_state *rsp)
system_state == SYSTEM_RUNNING)
udelay(200);
#endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
- cond_resched();
+ cond_resched_rcu_qs();
}

mutex_unlock(&rsp->onoff_mutex);
@@ -1739,7 +1739,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
/* smp_mb() provided by prior unlock-lock pair. */
nocb += rcu_future_gp_cleanup(rsp, rnp);
raw_spin_unlock_irq(&rnp->lock);
- cond_resched();
+ cond_resched_rcu_qs();
}
rnp = rcu_get_root(rsp);
raw_spin_lock_irq(&rnp->lock);
@@ -1788,7 +1788,7 @@ static int __noreturn rcu_gp_kthread(void *arg)
/* Locking provides needed memory barrier. */
if (rcu_gp_init(rsp))
break;
- cond_resched();
+ cond_resched_rcu_qs();
flush_signals(current);
trace_rcu_grace_period(rsp->name,
ACCESS_ONCE(rsp->gpnum),
@@ -1831,10 +1831,10 @@ static int __noreturn rcu_gp_kthread(void *arg)
trace_rcu_grace_period(rsp->name,
ACCESS_ONCE(rsp->gpnum),
TPS("fqsend"));
- cond_resched();
+ cond_resched_rcu_qs();
} else {
/* Deal with stray signal. */
- cond_resched();
+ cond_resched_rcu_qs();
flush_signals(current);
trace_rcu_grace_period(rsp->name,
ACCESS_ONCE(rsp->gpnum),
@@ -2437,7 +2437,7 @@ static void force_qs_rnp(struct rcu_state *rsp,
struct rcu_node *rnp;

rcu_for_each_leaf_node(rsp, rnp) {
- cond_resched();
+ cond_resched_rcu_qs();
mask = 0;
raw_spin_lock_irqsave(&rnp->lock, flags);
smp_mb__after_unlock_lock();
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 02ac0fb186b8..a86a363ea453 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1842,7 +1842,7 @@ static int rcu_oom_notify(struct notifier_block *self,
get_online_cpus();
for_each_online_cpu(cpu) {
smp_call_function_single(cpu, rcu_oom_notify_cpu, NULL, 1);
- cond_resched();
+ cond_resched_rcu_qs();
}
put_online_cpus();

diff --git a/mm/mlock.c b/mm/mlock.c
index b1eb53634005..bc386a22d647 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -782,7 +782,7 @@ static int do_mlockall(int flags)

/* Ignore errors */
mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
- cond_resched();
+ cond_resched_rcu_qs();
}
out:
return 0;
--
1.8.1.5
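
Editorial sketch (not part of the patch): the point of cond_resched_rcu_qs()
is that the quiescent state is reported even if cond_resched() itself does
nothing. The userspace model below tries to show just that. All sketch_*
names, struct sketch_task, and the two loop functions are hypothetical;
sketch_cond_resched() is deliberately an empty macro to model a kernel in
which the cond_resched() machinery has been shut off, and the preemption
disabling of the real macro is omitted.

```c
#include <assert.h>

/* Hypothetical stand-in for the RCU-tasks field in struct task_struct. */
struct sketch_task {
	int holdout;	/* nonzero while the grace period waits on this task */
};

static struct sketch_task sketch_current_task;
#define sketch_current (&sketch_current_task)

/* Models cond_resched() having been compiled out. */
#define sketch_cond_resched() do { } while (0)

/* Models rcu_note_voluntary_context_switch(): clear the holdout flag. */
#define sketch_note_voluntary_context_switch(t) \
do { \
	if ((t)->holdout) \
		(t)->holdout = 0; \
} while (0)

/* Models cond_resched_rcu_qs(): report the quiescent state, then resched. */
#define sketch_cond_resched_rcu_qs() \
do { \
	sketch_note_voluntary_context_switch(sketch_current); \
	sketch_cond_resched(); \
} while (0)

/* Long loop using plain cond_resched(): never reports a quiescent state. */
static long long_loop_plain(int n)
{
	long sum = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++) {
		sum += i;
		sketch_cond_resched();
	}
	return sum;
}

/* Same loop using the new macro: the holdout flag gets cleared. */
static long long_loop_qs(int n)
{
	long sum = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++) {
		sum += i;
		sketch_cond_resched_rcu_qs();
	}
	return sum;
}
```

With sketch_current->holdout set beforehand, long_loop_plain() leaves the
flag set while long_loop_qs() clears it, which is the behavior the
conversions in this patch rely on.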

2014-07-31 21:57:04

by Paul E. McKenney

Subject: [PATCH v3 tip/core/rcu 8/9] rcu: Improve RCU-tasks energy efficiency

From: "Paul E. McKenney" <[email protected]>

The current RCU-tasks implementation uses strict polling to detect
callback arrivals. This works quite well, but is not so good for
energy efficiency. This commit therefore replaces the strict polling
with a wait queue.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/update.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index e940b86af4e8..f14a79d0d6de 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -371,6 +371,7 @@ static LIST_HEAD(rcu_tasks_holdouts);
/* Global list of callbacks and associated lock. */
static struct rcu_head *rcu_tasks_cbs_head;
static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
+static DECLARE_WAIT_QUEUE_HEAD(rcu_tasks_cbs_wq);
static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);

/* Control stall timeouts. Disable with <= 0, otherwise jiffies till stall. */
@@ -381,13 +382,17 @@ module_param(rcu_task_stall_timeout, int, 0644);
void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
{
unsigned long flags;
+ bool needwake;

rhp->next = NULL;
rhp->func = func;
raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
+ needwake = !rcu_tasks_cbs_head;
*rcu_tasks_cbs_tail = rhp;
rcu_tasks_cbs_tail = &rhp->next;
raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
+ if (needwake)
+ wake_up(&rcu_tasks_cbs_wq);
}
EXPORT_SYMBOL_GPL(call_rcu_tasks);

@@ -498,8 +503,12 @@ static int __noreturn rcu_tasks_kthread(void *arg)

/* If there were none, wait a bit and start over. */
if (!list) {
- schedule_timeout_interruptible(HZ);
- flush_signals(current);
+ wait_event_interruptible(rcu_tasks_cbs_wq,
+ rcu_tasks_cbs_head);
+ if (!rcu_tasks_cbs_head) {
+ flush_signals(current);
+ schedule_timeout_interruptible(HZ/10);
+ }
continue;
}

@@ -591,6 +600,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
list = next;
cond_resched();
}
+ schedule_timeout_uninterruptible(HZ/10);
}
}

--
1.8.1.5
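
Editorial sketch (not part of the patch): the wakeup in call_rcu_tasks() is
issued only when the callback list goes from empty to non-empty, since a
non-empty list means the kthread either is already awake or has a wakeup
pending. The userspace model below isolates that decision. The sketch_*
names and struct sketch_head are hypothetical, the lock is omitted because
the sketch is single-threaded, and the returned flag stands in for the
wake_up() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rcu_head (callback pointer omitted). */
struct sketch_head {
	struct sketch_head *next;
};

/* Head pointer plus pointer-to-tail-pointer, as in the patch. */
static struct sketch_head *sketch_cbs_head;
static struct sketch_head **sketch_cbs_tail = &sketch_cbs_head;

/*
 * Append a callback and report whether the consumer needs a wakeup.
 * The kernel does this under rcu_tasks_cbs_lock with interrupts disabled.
 */
static bool sketch_enqueue(struct sketch_head *rhp)
{
	bool needwake;

	assert(rhp != NULL);
	rhp->next = NULL;
	needwake = !sketch_cbs_head;	/* was the list empty before this add? */
	*sketch_cbs_tail = rhp;
	sketch_cbs_tail = &rhp->next;
	return needwake;
}

/* Detach the whole list, as the kthread does at the top of its loop. */
static struct sketch_head *sketch_dequeue_all(void)
{
	struct sketch_head *list = sketch_cbs_head;

	sketch_cbs_head = NULL;
	sketch_cbs_tail = &sketch_cbs_head;
	return list;
}
```

The first sketch_enqueue() on an empty list returns true, later ones return
false until sketch_dequeue_all() empties the list again. Sampling the head
inside the same critical section as the append is what keeps the check and
the insertion consistent in the real code.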

2014-07-31 21:57:03

by Paul E. McKenney

Subject: [PATCH v3 tip/core/rcu 4/9] rcu: Export RCU-tasks APIs to GPL modules

From: Steven Rostedt <[email protected]>

This commit exports the RCU-tasks APIs, call_rcu_tasks(),
synchronize_rcu_tasks(), and rcu_barrier_tasks(), to GPL-licensed
kernel modules.

Signed-off-by: Steven Rostedt <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
---
kernel/rcu/update.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index fe866848a063..b7694019e952 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -429,6 +429,7 @@ void synchronize_rcu_tasks(void)
/* Wait for the grace period. */
wait_rcu_gp(call_rcu_tasks);
}
+EXPORT_SYMBOL_GPL(synchronize_rcu_tasks);

/**
* rcu_barrier_tasks - Wait for in-flight call_rcu_tasks() callbacks.
@@ -441,6 +442,7 @@ void rcu_barrier_tasks(void)
/* There is only one callback queue, so this is easy. ;-) */
synchronize_rcu_tasks();
}
+EXPORT_SYMBOL_GPL(rcu_barrier_tasks);

/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
static int __noreturn rcu_tasks_kthread(void *arg)
--
1.8.1.5

2014-07-31 23:57:58

by Frederic Weisbecker

Subject: Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()

On Thu, Jul 31, 2014 at 02:55:01PM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <[email protected]>
>
> This commit adds a new RCU-tasks flavor of RCU, which provides
> call_rcu_tasks(). This RCU flavor's quiescent states are voluntary
> context switch (not preemption!), userspace execution, and the idle loop.
> Note that unlike other RCU flavors, these quiescent states occur in tasks,
> not necessarily CPUs. Includes fixes from Steven Rostedt.
>
> This RCU flavor is assumed to have very infrequent latency-tolerant
> updaters. This assumption permits significant simplifications, including
> a single global callback list protected by a single global lock, along
> with a single linked list containing all tasks that have not yet passed
> through a quiescent state. If experience shows this assumption to be
> incorrect, the required additional complexity will be added.
>
> Suggested-by: Steven Rostedt <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> include/linux/init_task.h | 9 +++
> include/linux/rcupdate.h | 36 ++++++++++
> include/linux/sched.h | 23 ++++---
> init/Kconfig | 10 +++
> kernel/rcu/tiny.c | 2 +
> kernel/rcu/tree.c | 2 +
> kernel/rcu/update.c | 171 ++++++++++++++++++++++++++++++++++++++++++++++
> 7 files changed, 242 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index 6df7f9fe0d01..78715ea7c30c 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -124,6 +124,14 @@ extern struct group_info init_groups;
> #else
> #define INIT_TASK_RCU_PREEMPT(tsk)
> #endif
> +#ifdef CONFIG_TASKS_RCU
> +#define INIT_TASK_RCU_TASKS(tsk) \
> + .rcu_tasks_holdout = false, \
> + .rcu_tasks_holdout_list = \
> + LIST_HEAD_INIT(tsk.rcu_tasks_holdout_list),
> +#else
> +#define INIT_TASK_RCU_TASKS(tsk)
> +#endif
>
> extern struct cred init_cred;
>
> @@ -231,6 +239,7 @@ extern struct task_group root_task_group;
> INIT_FTRACE_GRAPH \
> INIT_TRACE_RECURSION \
> INIT_TASK_RCU_PREEMPT(tsk) \
> + INIT_TASK_RCU_TASKS(tsk) \
> INIT_CPUSET_SEQ(tsk) \
> INIT_RT_MUTEXES(tsk) \
> INIT_VTIME(tsk) \
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 6a94cc8b1ca0..829efc99df3e 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -197,6 +197,26 @@ void call_rcu_sched(struct rcu_head *head,
>
> void synchronize_sched(void);
>
> +/**
> + * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
> + * @head: structure to be used for queueing the RCU updates.
> + * @func: actual callback function to be invoked after the grace period
> + *
> + * The callback function will be invoked some time after a full grace
> + * period elapses, in other words after all currently executing RCU
> + * read-side critical sections have completed. call_rcu_tasks() assumes
> + * that the read-side critical sections end at a voluntary context
> + * switch (not a preemption!), entry into idle, or transition to usermode
> + * execution. As such, there are no read-side primitives analogous to
> + * rcu_read_lock() and rcu_read_unlock() because this primitive is intended
> + * to determine that all tasks have passed through a safe state, not so
> + * much for data-structure synchronization.
> + *
> + * See the description of call_rcu() for more detailed information on
> + * memory ordering guarantees.
> + */
> +void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
> +
> #ifdef CONFIG_PREEMPT_RCU
>
> void __rcu_read_lock(void);
> @@ -294,6 +314,22 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
> rcu_irq_exit(); \
> } while (0)
>
> +/*
> + * Note a voluntary context switch for RCU-tasks benefit. This is a
> + * macro rather than an inline function to avoid #include hell.
> + */
> +#ifdef CONFIG_TASKS_RCU
> +#define rcu_note_voluntary_context_switch(t) \
> + do { \
> + preempt_disable(); /* Exclude synchronize_sched(); */ \
> + if (ACCESS_ONCE((t)->rcu_tasks_holdout)) \
> + ACCESS_ONCE((t)->rcu_tasks_holdout) = 0; \
> + preempt_enable(); \
> + } while (0)
> +#else /* #ifdef CONFIG_TASKS_RCU */
> +#define rcu_note_voluntary_context_switch(t) do { } while (0)
> +#endif /* #else #ifdef CONFIG_TASKS_RCU */
> +
> #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP)
> bool __rcu_is_watching(void);
> #endif /* #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP) */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 306f4f0c987a..3cf124389ec7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1273,6 +1273,11 @@ struct task_struct {
> #ifdef CONFIG_RCU_BOOST
> struct rt_mutex *rcu_boost_mutex;
> #endif /* #ifdef CONFIG_RCU_BOOST */
> +#ifdef CONFIG_TASKS_RCU
> + unsigned long rcu_tasks_nvcsw;
> + int rcu_tasks_holdout;
> + struct list_head rcu_tasks_holdout_list;
> +#endif /* #ifdef CONFIG_TASKS_RCU */
>
> #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
> struct sched_info sched_info;
> @@ -1998,31 +2003,27 @@ extern void task_clear_jobctl_pending(struct task_struct *task,
> unsigned int mask);
>
> #ifdef CONFIG_PREEMPT_RCU
> -
> #define RCU_READ_UNLOCK_BLOCKED (1 << 0) /* blocked while in RCU read-side. */
> #define RCU_READ_UNLOCK_NEED_QS (1 << 1) /* RCU core needs CPU response. */
> +#endif /* #ifdef CONFIG_PREEMPT_RCU */
>
> static inline void rcu_copy_process(struct task_struct *p)
> {
> +#ifdef CONFIG_PREEMPT_RCU
> p->rcu_read_lock_nesting = 0;
> p->rcu_read_unlock_special = 0;
> -#ifdef CONFIG_TREE_PREEMPT_RCU
> p->rcu_blocked_node = NULL;
> -#endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
> #ifdef CONFIG_RCU_BOOST
> p->rcu_boost_mutex = NULL;
> #endif /* #ifdef CONFIG_RCU_BOOST */
> INIT_LIST_HEAD(&p->rcu_node_entry);
> +#endif /* #ifdef CONFIG_PREEMPT_RCU */
> +#ifdef CONFIG_TASKS_RCU
> + p->rcu_tasks_holdout = false;
> + INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
> +#endif /* #ifdef CONFIG_TASKS_RCU */
> }
>
> -#else
> -
> -static inline void rcu_copy_process(struct task_struct *p)
> -{
> -}
> -
> -#endif
> -
> static inline void tsk_restore_flags(struct task_struct *task,
> unsigned long orig_flags, unsigned long flags)
> {
> diff --git a/init/Kconfig b/init/Kconfig
> index 9d76b99af1b9..c56cb62a2df1 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -507,6 +507,16 @@ config PREEMPT_RCU
> This option enables preemptible-RCU code that is common between
> the TREE_PREEMPT_RCU and TINY_PREEMPT_RCU implementations.
>
> +config TASKS_RCU
> + bool "Task_based RCU implementation using voluntary context switch"
> + default n
> + help
> + This option enables a task-based RCU implementation that uses
> + only voluntary context switch (not preemption!), idle, and
> + user-mode execution as quiescent states.
> +
> + If unsure, say N.

I don't remember who pointed this out, but this is indeed a purely internal
feature. The user should never need to select this option.

> +
> config RCU_STALL_COMMON
> def_bool ( TREE_RCU || TREE_PREEMPT_RCU || RCU_TRACE )
> help
> diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> index d9efcc13008c..717f00854fc0 100644
> --- a/kernel/rcu/tiny.c
> +++ b/kernel/rcu/tiny.c
> @@ -254,6 +254,8 @@ void rcu_check_callbacks(int cpu, int user)
> rcu_sched_qs(cpu);
> else if (!in_softirq())
> rcu_bh_qs(cpu);
> + if (user)
> + rcu_note_voluntary_context_switch(current);
> }
>
> /*
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 625d0b0cd75a..f958c52f644d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2413,6 +2413,8 @@ void rcu_check_callbacks(int cpu, int user)
> rcu_preempt_check_callbacks(cpu);
> if (rcu_pending(cpu))
> invoke_rcu_core();
> + if (user)
> + rcu_note_voluntary_context_switch(current);
> trace_rcu_utilization(TPS("End scheduler-tick"));
> }
>
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index bc7883570530..50453589e3ca 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -47,6 +47,7 @@
> #include <linux/hardirq.h>
> #include <linux/delay.h>
> #include <linux/module.h>
> +#include <linux/kthread.h>
>
> #define CREATE_TRACE_POINTS
>
> @@ -350,3 +351,173 @@ static int __init check_cpu_stall_init(void)
> early_initcall(check_cpu_stall_init);
>
> #endif /* #ifdef CONFIG_RCU_STALL_COMMON */
> +
> +#ifdef CONFIG_TASKS_RCU
> +
> +/*
> + * Simple variant of RCU whose quiescent states are voluntary context switch,
> + * user-space execution, and idle. As such, grace periods can take one good
> + * long time. There are no read-side primitives similar to rcu_read_lock()
> + * and rcu_read_unlock() because this implementation is intended to get
> + * the system into a safe state for some of the manipulations involved in
> + * tracing and the like. Finally, this implementation does not support
> + * high call_rcu_tasks() rates from multiple CPUs. If this is required,
> + * per-CPU callback lists will be needed.
> + */
> +
> +/* Lists of tasks that we are still waiting for during this grace period. */
> +static LIST_HEAD(rcu_tasks_holdouts);
> +
> +/* Global list of callbacks and associated lock. */
> +static struct rcu_head *rcu_tasks_cbs_head;
> +static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> +static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
> +
> +/* Post an RCU-tasks callback. */
> +void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
> +{
> + unsigned long flags;
> +
> + rhp->next = NULL;
> + rhp->func = func;
> + raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> + *rcu_tasks_cbs_tail = rhp;
> + rcu_tasks_cbs_tail = &rhp->next;
> + raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_tasks);
> +
> +/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
> +static int __noreturn rcu_tasks_kthread(void *arg)
> +{
> + unsigned long flags;
> + struct task_struct *g, *t;
> + struct rcu_head *list;
> + struct rcu_head *next;
> +
> + /* FIXME: Add housekeeping affinity. */
> +
> + /*
> + * Each pass through the following loop makes one check for
> + * newly arrived callbacks, and, if there are some, waits for
> + * one RCU-tasks grace period and then invokes the callbacks.
> + * This loop is terminated by the system going down. ;-)
> + */
> + for (;;) {
> +
> + /* Pick up any new callbacks. */
> + raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> + smp_mb__after_unlock_lock(); /* Enforce GP memory ordering. */

I have no idea exactly what this is ordering against what. "GP" is a vast thing.
Especially for tricky barriers like __after_unlock_lock(), which suggest very
counter-intuitive ordering, a detailed comment would be very welcome :)

> + list = rcu_tasks_cbs_head;
> + rcu_tasks_cbs_head = NULL;
> + rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> + raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> +
> + /* If there were none, wait a bit and start over. */
> + if (!list) {
> + schedule_timeout_interruptible(HZ);

So this thread is going to poll every second? I guess something somewhere
prevents it from running when the system is idle? I'm not familiar with the
whole patchset yet, but even so this looks like very annoying noise. Why not
use something wait/wakeup based?

> + flush_signals(current);
> + continue;
> + }
> +
> + /*
> + * Wait for all pre-existing t->on_rq and t->nvcsw
> + * transitions to complete. Invoking synchronize_sched()
> + * suffices because all these transitions occur with
> + * interrupts disabled. Without this synchronize_sched(),
> + * a read-side critical section that started before the
> + * grace period might be incorrectly seen as having started
> + * after the grace period.
> + *
> + * This synchronize_sched() also dispenses with the
> + * need for a memory barrier on the first store to
> + * ->rcu_tasks_holdout, as it forces the store to happen
> + * after the beginning of the grace period.
> + */
> + synchronize_sched();
> +
> + /*
> + * There were callbacks, so we need to wait for an
> + * RCU-tasks grace period. Start off by scanning
> + * the task list for tasks that are not already
> + * voluntarily blocked. Mark these tasks and make
> + * a list of them in rcu_tasks_holdouts.
> + */
> + rcu_read_lock();
> + for_each_process_thread(g, t) {
> + if (t != current && ACCESS_ONCE(t->on_rq) &&
> + !is_idle_task(t)) {
> + get_task_struct(t);
> + t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
> + ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
> + list_add(&t->rcu_tasks_holdout_list,
> + &rcu_tasks_holdouts);
> + }
> + }
> + rcu_read_unlock();
> +
> + /*
> + * Each pass through the following loop scans the list
> + * of holdout tasks, removing any that are no longer
> + * holdouts. When the list is empty, we are done.
> + */
> + while (!list_empty(&rcu_tasks_holdouts)) {
> + schedule_timeout_interruptible(HZ / 10);

OTOH here it is not annoying, because it should only happen when RCU-tasks
is actually in use, which should be rare.

Thanks.

> + flush_signals(current);
> + rcu_read_lock();
> + list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
> + rcu_tasks_holdout_list) {
> + if (ACCESS_ONCE(t->rcu_tasks_holdout)) {
> + if (t->rcu_tasks_nvcsw ==
> + ACCESS_ONCE(t->nvcsw) &&
> + ACCESS_ONCE(t->on_rq))
> + continue;
> + ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
> + }
> + list_del_rcu(&t->rcu_tasks_holdout_list);
> + put_task_struct(t);
> + }
> + rcu_read_unlock();
> + }
> +
> + /*
> + * Because ->on_rq and ->nvcsw are not guaranteed
> + * to have a full memory barriers prior to them in the
> + * schedule() path, memory reordering on other CPUs could
> + * cause their RCU-tasks read-side critical sections to
> + * extend past the end of the grace period. However,
> + * because these ->nvcsw updates are carried out with
> + * interrupts disabled, we can use synchronize_sched()
> + * to force the needed ordering on all such CPUs.
> + *
> + * This synchronize_sched() also confines all
> + * ->rcu_tasks_holdout accesses to be within the grace
> + * period, avoiding the need for memory barriers for
> + * ->rcu_tasks_holdout accesses.
> + */
> + synchronize_sched();
> +
> + /* Invoke the callbacks. */
> + while (list) {
> + next = list->next;
> + local_bh_disable();
> + list->func(list);
> + local_bh_enable();
> + list = next;
> + cond_resched();
> + }
> + }
> +}
> +
> +/* Spawn rcu_tasks_kthread() at boot time. */
> +static int __init rcu_spawn_tasks_kthread(void)
> +{
> + struct task_struct __maybe_unused *t;
> +
> + t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
> + BUG_ON(IS_ERR(t));
> + return 0;
> +}
> +early_initcall(rcu_spawn_tasks_kthread);
> +
> +#endif /* #ifdef CONFIG_TASKS_RCU */
> --
> 1.8.1.5
>
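
Editorial sketch (not part of the thread): the grace-period detection quoted
above first marks every runnable task as a holdout while snapshotting its
voluntary-context-switch count, then repeatedly rescans until no holdouts
remain. The userspace model below follows that logic under stated
assumptions: struct sketch_task and both sketch_* functions are
hypothetical, an array with an on_holdout_list flag replaces the kernel's
linked list, and task reference counting, RCU protection of the list, and
the synchronize_sched() calls are all left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant fields of struct task_struct. */
struct sketch_task {
	unsigned long nvcsw;		/* voluntary context switches so far */
	int on_rq;			/* nonzero if runnable */
	int holdout;			/* grace period still waits on this task */
	int on_holdout_list;		/* models membership in the holdout list */
	unsigned long nvcsw_snap;	/* nvcsw sampled at grace-period start */
};

/* Start of grace period: mark runnable tasks; return the holdout count. */
static int sketch_mark_holdouts(struct sketch_task *tasks, int n)
{
	int count = 0;

	assert(tasks != NULL || n == 0);
	for (int i = 0; i < n; i++) {
		struct sketch_task *t = &tasks[i];

		if (!t->on_rq)
			continue;	/* already voluntarily blocked */
		t->nvcsw_snap = t->nvcsw;
		t->holdout = 1;
		t->on_holdout_list = 1;
		count++;
	}
	return count;
}

/* One scan pass: drop tasks that passed a quiescent state; return the rest. */
static int sketch_scan_holdouts(struct sketch_task *tasks, int n)
{
	int remaining = 0;

	assert(tasks != NULL || n == 0);
	for (int i = 0; i < n; i++) {
		struct sketch_task *t = &tasks[i];

		if (!t->on_holdout_list)
			continue;
		if (t->holdout) {
			/* No voluntary switch yet and still runnable: keep waiting. */
			if (t->nvcsw_snap == t->nvcsw && t->on_rq) {
				remaining++;
				continue;
			}
			t->holdout = 0;
		}
		/* Flag already cleared by the task itself, or cleared just above. */
		t->on_holdout_list = 0;
	}
	return remaining;
}
```

A task leaves the holdout set in one of three ways here: its nvcsw count
moves past the snapshot, it drops off the runqueue, or something else (such
as the rcu_note_voluntary_context_switch() macro in the quoted patch) has
already cleared its holdout flag.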

2014-08-01 01:14:21

by Lai Jiangshan

Subject: Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()

On 08/01/2014 05:55 AM, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <[email protected]>
>
> This commit adds a new RCU-tasks flavor of RCU, which provides
> call_rcu_tasks(). This RCU flavor's quiescent states are voluntary
> context switch (not preemption!), userspace execution, and the idle loop.
> Note that unlike other RCU flavors, these quiescent states occur in tasks,
> not necessarily CPUs. Includes fixes from Steven Rostedt.
>
> This RCU flavor is assumed to have very infrequent latency-tolerant
> updaters. This assumption permits significant simplifications, including
> a single global callback list protected by a single global lock, along
> with a single linked list containing all tasks that have not yet passed
> through a quiescent state. If experience shows this assumption to be
> incorrect, the required additional complexity will be added.
>
> Suggested-by: Steven Rostedt <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> include/linux/init_task.h | 9 +++
> include/linux/rcupdate.h | 36 ++++++++++
> include/linux/sched.h | 23 ++++---
> init/Kconfig | 10 +++
> kernel/rcu/tiny.c | 2 +
> kernel/rcu/tree.c | 2 +
> kernel/rcu/update.c | 171 ++++++++++++++++++++++++++++++++++++++++++++++
> 7 files changed, 242 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index 6df7f9fe0d01..78715ea7c30c 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -124,6 +124,14 @@ extern struct group_info init_groups;
> #else
> #define INIT_TASK_RCU_PREEMPT(tsk)
> #endif
> +#ifdef CONFIG_TASKS_RCU
> +#define INIT_TASK_RCU_TASKS(tsk) \
> + .rcu_tasks_holdout = false, \
> + .rcu_tasks_holdout_list = \
> + LIST_HEAD_INIT(tsk.rcu_tasks_holdout_list),
> +#else
> +#define INIT_TASK_RCU_TASKS(tsk)
> +#endif
>
> extern struct cred init_cred;
>
> @@ -231,6 +239,7 @@ extern struct task_group root_task_group;
> INIT_FTRACE_GRAPH \
> INIT_TRACE_RECURSION \
> INIT_TASK_RCU_PREEMPT(tsk) \
> + INIT_TASK_RCU_TASKS(tsk) \
> INIT_CPUSET_SEQ(tsk) \
> INIT_RT_MUTEXES(tsk) \
> INIT_VTIME(tsk) \
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 6a94cc8b1ca0..829efc99df3e 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -197,6 +197,26 @@ void call_rcu_sched(struct rcu_head *head,
>
> void synchronize_sched(void);
>
> +/**
> + * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
> + * @head: structure to be used for queueing the RCU updates.
> + * @func: actual callback function to be invoked after the grace period
> + *
> + * The callback function will be invoked some time after a full grace
> + * period elapses, in other words after all currently executing RCU
> + * read-side critical sections have completed. call_rcu_tasks() assumes
> + * that the read-side critical sections end at a voluntary context
> + * switch (not a preemption!), entry into idle, or transition to usermode
> + * execution. As such, there are no read-side primitives analogous to
> + * rcu_read_lock() and rcu_read_unlock() because this primitive is intended
> + * to determine that all tasks have passed through a safe state, not so
> + * much for data-structure synchronization.
> + *
> + * See the description of call_rcu() for more detailed information on
> + * memory ordering guarantees.
> + */
> +void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
> +
> #ifdef CONFIG_PREEMPT_RCU
>
> void __rcu_read_lock(void);
> @@ -294,6 +314,22 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
> rcu_irq_exit(); \
> } while (0)
>
> +/*
> + * Note a voluntary context switch for RCU-tasks benefit. This is a
> + * macro rather than an inline function to avoid #include hell.
> + */
> +#ifdef CONFIG_TASKS_RCU
> +#define rcu_note_voluntary_context_switch(t) \
> + do { \
> + preempt_disable(); /* Exclude synchronize_sched(); */ \
> + if (ACCESS_ONCE((t)->rcu_tasks_holdout)) \
> + ACCESS_ONCE((t)->rcu_tasks_holdout) = 0; \
> + preempt_enable(); \
> + } while (0)
> +#else /* #ifdef CONFIG_TASKS_RCU */
> +#define rcu_note_voluntary_context_switch(t) do { } while (0)
> +#endif /* #else #ifdef CONFIG_TASKS_RCU */
> +
> #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP)
> bool __rcu_is_watching(void);
> #endif /* #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP) */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 306f4f0c987a..3cf124389ec7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1273,6 +1273,11 @@ struct task_struct {
> #ifdef CONFIG_RCU_BOOST
> struct rt_mutex *rcu_boost_mutex;
> #endif /* #ifdef CONFIG_RCU_BOOST */
> +#ifdef CONFIG_TASKS_RCU
> + unsigned long rcu_tasks_nvcsw;
> + int rcu_tasks_holdout;
> + struct list_head rcu_tasks_holdout_list;
> +#endif /* #ifdef CONFIG_TASKS_RCU */
>
> #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
> struct sched_info sched_info;
> @@ -1998,31 +2003,27 @@ extern void task_clear_jobctl_pending(struct task_struct *task,
> unsigned int mask);
>
> #ifdef CONFIG_PREEMPT_RCU
> -
> #define RCU_READ_UNLOCK_BLOCKED (1 << 0) /* blocked while in RCU read-side. */
> #define RCU_READ_UNLOCK_NEED_QS (1 << 1) /* RCU core needs CPU response. */
> +#endif /* #ifdef CONFIG_PREEMPT_RCU */
>
> static inline void rcu_copy_process(struct task_struct *p)
> {
> +#ifdef CONFIG_PREEMPT_RCU
> p->rcu_read_lock_nesting = 0;
> p->rcu_read_unlock_special = 0;
> -#ifdef CONFIG_TREE_PREEMPT_RCU
> p->rcu_blocked_node = NULL;
> -#endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
> #ifdef CONFIG_RCU_BOOST
> p->rcu_boost_mutex = NULL;
> #endif /* #ifdef CONFIG_RCU_BOOST */
> INIT_LIST_HEAD(&p->rcu_node_entry);
> +#endif /* #ifdef CONFIG_PREEMPT_RCU */
> +#ifdef CONFIG_TASKS_RCU
> + p->rcu_tasks_holdout = false;
> + INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
> +#endif /* #ifdef CONFIG_TASKS_RCU */
> }
>
> -#else
> -
> -static inline void rcu_copy_process(struct task_struct *p)
> -{
> -}
> -
> -#endif
> -
> static inline void tsk_restore_flags(struct task_struct *task,
> unsigned long orig_flags, unsigned long flags)
> {
> diff --git a/init/Kconfig b/init/Kconfig
> index 9d76b99af1b9..c56cb62a2df1 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -507,6 +507,16 @@ config PREEMPT_RCU
> This option enables preemptible-RCU code that is common between
> the TREE_PREEMPT_RCU and TINY_PREEMPT_RCU implementations.
>
> +config TASKS_RCU
> + bool "Task_based RCU implementation using voluntary context switch"
> + default n
> + help
> + This option enables a task-based RCU implementation that uses
> + only voluntary context switch (not preemption!), idle, and
> + user-mode execution as quiescent states.
> +
> + If unsure, say N.
> +
> config RCU_STALL_COMMON
> def_bool ( TREE_RCU || TREE_PREEMPT_RCU || RCU_TRACE )
> help
> diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> index d9efcc13008c..717f00854fc0 100644
> --- a/kernel/rcu/tiny.c
> +++ b/kernel/rcu/tiny.c
> @@ -254,6 +254,8 @@ void rcu_check_callbacks(int cpu, int user)
> rcu_sched_qs(cpu);
> else if (!in_softirq())
> rcu_bh_qs(cpu);
> + if (user)
> + rcu_note_voluntary_context_switch(current);
> }
>
> /*
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 625d0b0cd75a..f958c52f644d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2413,6 +2413,8 @@ void rcu_check_callbacks(int cpu, int user)
> rcu_preempt_check_callbacks(cpu);
> if (rcu_pending(cpu))
> invoke_rcu_core();
> + if (user)
> + rcu_note_voluntary_context_switch(current);
> trace_rcu_utilization(TPS("End scheduler-tick"));
> }
>
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index bc7883570530..50453589e3ca 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -47,6 +47,7 @@
> #include <linux/hardirq.h>
> #include <linux/delay.h>
> #include <linux/module.h>
> +#include <linux/kthread.h>
>
> #define CREATE_TRACE_POINTS
>
> @@ -350,3 +351,173 @@ static int __init check_cpu_stall_init(void)
> early_initcall(check_cpu_stall_init);
>
> #endif /* #ifdef CONFIG_RCU_STALL_COMMON */
> +
> +#ifdef CONFIG_TASKS_RCU
> +
> +/*
> + * Simple variant of RCU whose quiescent states are voluntary context switch,
> + * user-space execution, and idle. As such, grace periods can take one good
> + * long time. There are no read-side primitives similar to rcu_read_lock()
> + * and rcu_read_unlock() because this implementation is intended to get
> + * the system into a safe state for some of the manipulations involved in
> + * tracing and the like. Finally, this implementation does not support
> + * high call_rcu_tasks() rates from multiple CPUs. If this is required,
> + * per-CPU callback lists will be needed.
> + */
> +
> +/* Lists of tasks that we are still waiting for during this grace period. */
> +static LIST_HEAD(rcu_tasks_holdouts);
> +
> +/* Global list of callbacks and associated lock. */
> +static struct rcu_head *rcu_tasks_cbs_head;
> +static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> +static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
> +
> +/* Post an RCU-tasks callback. */
> +void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
> +{
> + unsigned long flags;
> +
> + rhp->next = NULL;
> + rhp->func = func;
> + raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> + *rcu_tasks_cbs_tail = rhp;
> + rcu_tasks_cbs_tail = &rhp->next;
> + raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_tasks);
> +
> +/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
> +static int __noreturn rcu_tasks_kthread(void *arg)
> +{
> + unsigned long flags;
> + struct task_struct *g, *t;
> + struct rcu_head *list;
> + struct rcu_head *next;
> +
> + /* FIXME: Add housekeeping affinity. */
> +
> + /*
> + * Each pass through the following loop makes one check for
> + * newly arrived callbacks, and, if there are some, waits for
> + * one RCU-tasks grace period and then invokes the callbacks.
> + * This loop is terminated by the system going down. ;-)
> + */
> + for (;;) {
> +
> + /* Pick up any new callbacks. */
> + raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> + smp_mb__after_unlock_lock(); /* Enforce GP memory ordering. */
> + list = rcu_tasks_cbs_head;
> + rcu_tasks_cbs_head = NULL;
> + rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> + raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> +
> + /* If there were none, wait a bit and start over. */
> + if (!list) {
> + schedule_timeout_interruptible(HZ);
> + flush_signals(current);
> + continue;
> + }
> +
> + /*
> + * Wait for all pre-existing t->on_rq and t->nvcsw
> + * transitions to complete. Invoking synchronize_sched()
> + * suffices because all these transitions occur with
> + * interrupts disabled. Without this synchronize_sched(),
> + * a read-side critical section that started before the
> + * grace period might be incorrectly seen as having started
> + * after the grace period.
> + *
> + * This synchronize_sched() also dispenses with the
> + * need for a memory barrier on the first store to
> + * ->rcu_tasks_holdout, as it forces the store to happen
> + * after the beginning of the grace period.
> + */
> + synchronize_sched();
> +
> + /*
> + * There were callbacks, so we need to wait for an
> + * RCU-tasks grace period. Start off by scanning
> + * the task list for tasks that are not already
> + * voluntarily blocked. Mark these tasks and make
> + * a list of them in rcu_tasks_holdouts.
> + */
> + rcu_read_lock();
> + for_each_process_thread(g, t) {
> + if (t != current && ACCESS_ONCE(t->on_rq) &&
> + !is_idle_task(t)) {
> + get_task_struct(t);
> + t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
> + ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
> + list_add(&t->rcu_tasks_holdout_list,
> + &rcu_tasks_holdouts);
> + }
> + }
> + rcu_read_unlock();
> +
> + /*
> + * Each pass through the following loop scans the list
> + * of holdout tasks, removing any that are no longer
> + * holdouts. When the list is empty, we are done.
> + */
> + while (!list_empty(&rcu_tasks_holdouts)) {
> + schedule_timeout_interruptible(HZ / 10);
> + flush_signals(current);
> + rcu_read_lock();
> + list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
> + rcu_tasks_holdout_list) {
> + if (ACCESS_ONCE(t->rcu_tasks_holdout)) {
> + if (t->rcu_tasks_nvcsw ==
> + ACCESS_ONCE(t->nvcsw) &&
> + ACCESS_ONCE(t->on_rq))
> + continue;
> + ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
> + }
> + list_del_rcu(&t->rcu_tasks_holdout_list);
> + put_task_struct(t);
> + }
> + rcu_read_unlock();

rcu_read_lock() and the RCU variants of the list operations are unneeded here.

> + }
> +
> + /*
> + * Because ->on_rq and ->nvcsw are not guaranteed
> + * to have full memory barriers prior to them in the
> + * schedule() path, memory reordering on other CPUs could
> + * cause their RCU-tasks read-side critical sections to
> + * extend past the end of the grace period. However,
> + * because these ->nvcsw updates are carried out with
> + * interrupts disabled, we can use synchronize_sched()
> + * to force the needed ordering on all such CPUs.
> + *
> + * This synchronize_sched() also confines all
> + * ->rcu_tasks_holdout accesses to be within the grace
> + * period, avoiding the need for memory barriers for
> + * ->rcu_tasks_holdout accesses.
> + */
> + synchronize_sched();
> +
> + /* Invoke the callbacks. */
> + while (list) {
> + next = list->next;
> + local_bh_disable();
> + list->func(list);
> + local_bh_enable();
> + list = next;
> + cond_resched();
> + }
> + }
> +}
> +
> +/* Spawn rcu_tasks_kthread() at boot time. */
> +static int __init rcu_spawn_tasks_kthread(void)
> +{
> + struct task_struct __maybe_unused *t;
> +
> + t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
> + BUG_ON(IS_ERR(t));
> + return 0;
> +}
> +early_initcall(rcu_spawn_tasks_kthread);
> +
> +#endif /* #ifdef CONFIG_TASKS_RCU */
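
For readers following the callback handling in the quoted patch, here is a minimal userspace sketch of the same tail-pointer queue idea: O(1) append through a pointer to the last ->next field, then detaching the whole list in one step before invoking it. This is only an illustration, not the kernel code. The names cb_head, enqueue_cb(), detach_cbs(), invoke_cbs(), and the demo helpers are hypothetical stand-ins, and the locking that the patch does with rcu_tasks_cbs_lock is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct rcu_head (illustration only). */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

/* Queue state: the head pointer plus a pointer to the last ->next field. */
static struct cb_head *cbs_head;
static struct cb_head **cbs_tail = &cbs_head;

/* Append in O(1). The patch does this step under rcu_tasks_cbs_lock. */
static void enqueue_cb(struct cb_head *rhp, void (*func)(struct cb_head *))
{
	rhp->next = NULL;
	rhp->func = func;
	*cbs_tail = rhp;
	cbs_tail = &rhp->next;
}

/* Detach the whole list at once and reset the queue to empty. */
static struct cb_head *detach_cbs(void)
{
	struct cb_head *list = cbs_head;

	cbs_head = NULL;
	cbs_tail = &cbs_head;
	return list;
}

/* Invoke callbacks in queue order and return how many ran. */
static int invoke_cbs(struct cb_head *list)
{
	int n = 0;

	while (list) {
		/* Fetch ->next first: the callback may free or reuse its node. */
		struct cb_head *next = list->next;

		list->func(list);
		list = next;
		n++;
	}
	return n;
}

/* Demo payload: a callback node that carries an id. */
struct numbered_cb {
	struct cb_head head;	/* first member, so a cast recovers the container */
	int id;
};

static int cb_order[3];
static int cb_count;

static void record_cb(struct cb_head *head)
{
	struct numbered_cb *cb = (struct numbered_cb *)head;

	cb_order[cb_count++] = cb->id;
}

/* Queue three callbacks, detach them, invoke them; returns the number run. */
static int demo_callback_queue(void)
{
	static struct numbered_cb cbs[3] = { { .id = 1 }, { .id = 2 }, { .id = 3 } };
	int i;

	cb_count = 0;
	for (i = 0; i < 3; i++)
		enqueue_cb(&cbs[i].head, record_cb);
	return invoke_cbs(detach_cbs());
}
```

Calling demo_callback_queue() from a main() runs the three callbacks in the order they were queued. Detaching the list before invoking it is what lets new callbacks be queued while old ones are still running, which matches the structure of the kthread loop in the patch.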

2014-08-01 01:30:25

by Lai Jiangshan

Subject: Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()

On 08/01/2014 05:55 AM, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <[email protected]>
>
> This commit adds a new RCU-tasks flavor of RCU, which provides
> call_rcu_tasks(). This RCU flavor's quiescent states are voluntary
> context switch (not preemption!), userspace execution, and the idle loop.
> Note that unlike other RCU flavors, these quiescent states occur in tasks,
> not necessarily CPUs. Includes fixes from Steven Rostedt.
>
> This RCU flavor is assumed to have very infrequent latency-tolerant
> updaters. This assumption permits significant simplifications, including
> a single global callback list protected by a single global lock, along
> with a single linked list containing all tasks that have not yet passed
> through a quiescent state. If experience shows this assumption to be
> incorrect, the required additional complexity will be added.
>
> Suggested-by: Steven Rostedt <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> include/linux/init_task.h | 9 +++
> include/linux/rcupdate.h | 36 ++++++++++
> include/linux/sched.h | 23 ++++---
> init/Kconfig | 10 +++
> kernel/rcu/tiny.c | 2 +
> kernel/rcu/tree.c | 2 +
> kernel/rcu/update.c | 171 ++++++++++++++++++++++++++++++++++++++++++++++
> 7 files changed, 242 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index 6df7f9fe0d01..78715ea7c30c 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -124,6 +124,14 @@ extern struct group_info init_groups;
> #else
> #define INIT_TASK_RCU_PREEMPT(tsk)
> #endif
> +#ifdef CONFIG_TASKS_RCU
> +#define INIT_TASK_RCU_TASKS(tsk) \
> + .rcu_tasks_holdout = false, \
> + .rcu_tasks_holdout_list = \
> + LIST_HEAD_INIT(tsk.rcu_tasks_holdout_list),
> +#else
> +#define INIT_TASK_RCU_TASKS(tsk)
> +#endif
>
> extern struct cred init_cred;
>
> @@ -231,6 +239,7 @@ extern struct task_group root_task_group;
> INIT_FTRACE_GRAPH \
> INIT_TRACE_RECURSION \
> INIT_TASK_RCU_PREEMPT(tsk) \
> + INIT_TASK_RCU_TASKS(tsk) \
> INIT_CPUSET_SEQ(tsk) \
> INIT_RT_MUTEXES(tsk) \
> INIT_VTIME(tsk) \
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 6a94cc8b1ca0..829efc99df3e 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -197,6 +197,26 @@ void call_rcu_sched(struct rcu_head *head,
>
> void synchronize_sched(void);
>
> +/**
> + * call_rcu_tasks() - Queue an RCU callback for invocation after a task-based grace period
> + * @head: structure to be used for queueing the RCU updates.
> + * @func: actual callback function to be invoked after the grace period
> + *
> + * The callback function will be invoked some time after a full grace
> + * period elapses, in other words after all currently executing RCU
> + * read-side critical sections have completed. call_rcu_tasks() assumes
> + * that the read-side critical sections end at a voluntary context
> + * switch (not a preemption!), entry into idle, or transition to usermode
> + * execution. As such, there are no read-side primitives analogous to
> + * rcu_read_lock() and rcu_read_unlock() because this primitive is intended
> + * to determine that all tasks have passed through a safe state, not so
> + * much for data-structure synchronization.
> + *
> + * See the description of call_rcu() for more detailed information on
> + * memory ordering guarantees.
> + */
> +void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
> +
> #ifdef CONFIG_PREEMPT_RCU
>
> void __rcu_read_lock(void);
> @@ -294,6 +314,22 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
> rcu_irq_exit(); \
> } while (0)
>
> +/*
> + * Note a voluntary context switch for RCU-tasks benefit. This is a
> + * macro rather than an inline function to avoid #include hell.
> + */
> +#ifdef CONFIG_TASKS_RCU
> +#define rcu_note_voluntary_context_switch(t) \
> + do { \
> + preempt_disable(); /* Exclude synchronize_sched(); */ \
> + if (ACCESS_ONCE((t)->rcu_tasks_holdout)) \
> + ACCESS_ONCE((t)->rcu_tasks_holdout) = 0; \
> + preempt_enable(); \

Why is the preempt_disable() needed here? The comments in rcu_tasks_kthread()
don't persuade me. Maybe it could be removed?
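
For context on the ->rcu_tasks_holdout flag that this macro clears, the following is a rough userspace model of the holdout bookkeeping in the quoted rcu_tasks_kthread(): a task is marked only if it is on a runqueue, and it stops being a holdout once its flag has been cleared, its voluntary-context-switch count has moved, or it has left the runqueue. The struct sim_task type and both helper names are hypothetical. The model is single-threaded, so it deliberately says nothing about the ordering question raised above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of the per-task state in the patch. */
struct sim_task {
	unsigned long nvcsw;		/* voluntary context switches so far */
	bool on_rq;			/* runnable, that is, not voluntarily blocked */
	unsigned long snap_nvcsw;	/* models ->rcu_tasks_nvcsw */
	bool holdout;			/* models ->rcu_tasks_holdout */
};

/* Initial scan: only tasks on a runqueue are marked; snapshot their count. */
static bool mark_if_holdout(struct sim_task *t)
{
	if (!t->on_rq)
		return false;	/* already blocked, nothing to wait for */
	t->snap_nvcsw = t->nvcsw;
	t->holdout = true;
	return true;
}

/*
 * Polling check: the task remains a holdout only while its flag is still
 * set, its count is unchanged, and it is still on a runqueue.
 */
static bool still_holdout(struct sim_task *t)
{
	if (t->holdout && t->snap_nvcsw == t->nvcsw && t->on_rq)
		return true;
	t->holdout = false;
	return false;
}

/* Walk one task through mark, poll, voluntary switch, poll. */
static int demo_holdout(void)
{
	struct sim_task t = { .nvcsw = 5, .on_rq = true };
	int polls_as_holdout = 0;

	if (!mark_if_holdout(&t))
		return -1;
	if (still_holdout(&t))
		polls_as_holdout++;	/* nothing changed yet */
	t.nvcsw++;			/* the task voluntarily switched */
	if (still_holdout(&t))
		polls_as_holdout++;	/* not reached: the count moved */
	return polls_as_holdout;
}
```

In this model, clearing t.holdout directly plays the role of rcu_note_voluntary_context_switch(): the next still_holdout() call then reports the task as done even if its count and runqueue state are unchanged.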

2014-08-01 01:59:11

by Paul E. McKenney

Subject: Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()

On Fri, Aug 01, 2014 at 09:15:34AM +0800, Lai Jiangshan wrote:
> On 08/01/2014 05:55 AM, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <[email protected]>
> >
> > This commit adds a new RCU-tasks flavor of RCU, which provides
> > call_rcu_tasks(). This RCU flavor's quiescent states are voluntary
> > context switch (not preemption!), userspace execution, and the idle loop.
> > Note that unlike other RCU flavors, these quiescent states occur in tasks,
> > not necessarily CPUs. Includes fixes from Steven Rostedt.
> >
> > This RCU flavor is assumed to have very infrequent latency-tolerant
> > updaters. This assumption permits significant simplifications, including
> > a single global callback list protected by a single global lock, along
> > with a single linked list containing all tasks that have not yet passed
> > through a quiescent state. If experience shows this assumption to be
> > incorrect, the required additional complexity will be added.
> >
> > Suggested-by: Steven Rostedt <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > include/linux/init_task.h | 9 +++
> > include/linux/rcupdate.h | 36 ++++++++++
> > include/linux/sched.h | 23 ++++---
> > init/Kconfig | 10 +++
> > kernel/rcu/tiny.c | 2 +
> > kernel/rcu/tree.c | 2 +
> > kernel/rcu/update.c | 171 ++++++++++++++++++++++++++++++++++++++++++++++
> > 7 files changed, 242 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> > index 6df7f9fe0d01..78715ea7c30c 100644
> > --- a/include/linux/init_task.h
> > +++ b/include/linux/init_task.h
> > @@ -124,6 +124,14 @@ extern struct group_info init_groups;
> > #else
> > #define INIT_TASK_RCU_PREEMPT(tsk)
> > #endif
> > +#ifdef CONFIG_TASKS_RCU
> > +#define INIT_TASK_RCU_TASKS(tsk) \
> > + .rcu_tasks_holdout = false, \
> > + .rcu_tasks_holdout_list = \
> > + LIST_HEAD_INIT(tsk.rcu_tasks_holdout_list),
> > +#else
> > +#define INIT_TASK_RCU_TASKS(tsk)
> > +#endif
> >
> > extern struct cred init_cred;
> >
> > @@ -231,6 +239,7 @@ extern struct task_group root_task_group;
> > INIT_FTRACE_GRAPH \
> > INIT_TRACE_RECURSION \
> > INIT_TASK_RCU_PREEMPT(tsk) \
> > + INIT_TASK_RCU_TASKS(tsk) \
> > INIT_CPUSET_SEQ(tsk) \
> > INIT_RT_MUTEXES(tsk) \
> > INIT_VTIME(tsk) \
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 6a94cc8b1ca0..829efc99df3e 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -197,6 +197,26 @@ void call_rcu_sched(struct rcu_head *head,
> >
> > void synchronize_sched(void);
> >
> > +/**
> > + * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
> > + * @head: structure to be used for queueing the RCU updates.
> > + * @func: actual callback function to be invoked after the grace period
> > + *
> > + * The callback function will be invoked some time after a full grace
> > + * period elapses, in other words after all currently executing RCU
> > + * read-side critical sections have completed. call_rcu_tasks() assumes
> > + * that the read-side critical sections end at a voluntary context
> > + * switch (not a preemption!), entry into idle, or transition to usermode
> > + * execution. As such, there are no read-side primitives analogous to
> > + * rcu_read_lock() and rcu_read_unlock() because this primitive is intended
> > + * to determine that all tasks have passed through a safe state, not so
> > + * much for data-structure synchronization.
> > + *
> > + * See the description of call_rcu() for more detailed information on
> > + * memory ordering guarantees.
> > + */
> > +void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
> > +
> > #ifdef CONFIG_PREEMPT_RCU
> >
> > void __rcu_read_lock(void);
> > @@ -294,6 +314,22 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
> > rcu_irq_exit(); \
> > } while (0)
> >
> > +/*
> > + * Note a voluntary context switch for RCU-tasks benefit. This is a
> > + * macro rather than an inline function to avoid #include hell.
> > + */
> > +#ifdef CONFIG_TASKS_RCU
> > +#define rcu_note_voluntary_context_switch(t) \
> > + do { \
> > + preempt_disable(); /* Exclude synchronize_sched(); */ \
> > + if (ACCESS_ONCE((t)->rcu_tasks_holdout)) \
> > + ACCESS_ONCE((t)->rcu_tasks_holdout) = 0; \
> > + preempt_enable(); \
> > + } while (0)
> > +#else /* #ifdef CONFIG_TASKS_RCU */
> > +#define rcu_note_voluntary_context_switch(t) do { } while (0)
> > +#endif /* #else #ifdef CONFIG_TASKS_RCU */
> > +
> > #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP)
> > bool __rcu_is_watching(void);
> > #endif /* #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP) */
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 306f4f0c987a..3cf124389ec7 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1273,6 +1273,11 @@ struct task_struct {
> > #ifdef CONFIG_RCU_BOOST
> > struct rt_mutex *rcu_boost_mutex;
> > #endif /* #ifdef CONFIG_RCU_BOOST */
> > +#ifdef CONFIG_TASKS_RCU
> > + unsigned long rcu_tasks_nvcsw;
> > + int rcu_tasks_holdout;
> > + struct list_head rcu_tasks_holdout_list;
> > +#endif /* #ifdef CONFIG_TASKS_RCU */
> >
> > #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
> > struct sched_info sched_info;
> > @@ -1998,31 +2003,27 @@ extern void task_clear_jobctl_pending(struct task_struct *task,
> > unsigned int mask);
> >
> > #ifdef CONFIG_PREEMPT_RCU
> > -
> > #define RCU_READ_UNLOCK_BLOCKED (1 << 0) /* blocked while in RCU read-side. */
> > #define RCU_READ_UNLOCK_NEED_QS (1 << 1) /* RCU core needs CPU response. */
> > +#endif /* #ifdef CONFIG_PREEMPT_RCU */
> >
> > static inline void rcu_copy_process(struct task_struct *p)
> > {
> > +#ifdef CONFIG_PREEMPT_RCU
> > p->rcu_read_lock_nesting = 0;
> > p->rcu_read_unlock_special = 0;
> > -#ifdef CONFIG_TREE_PREEMPT_RCU
> > p->rcu_blocked_node = NULL;
> > -#endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
> > #ifdef CONFIG_RCU_BOOST
> > p->rcu_boost_mutex = NULL;
> > #endif /* #ifdef CONFIG_RCU_BOOST */
> > INIT_LIST_HEAD(&p->rcu_node_entry);
> > +#endif /* #ifdef CONFIG_PREEMPT_RCU */
> > +#ifdef CONFIG_TASKS_RCU
> > + p->rcu_tasks_holdout = false;
> > + INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
> > +#endif /* #ifdef CONFIG_TASKS_RCU */
> > }
> >
> > -#else
> > -
> > -static inline void rcu_copy_process(struct task_struct *p)
> > -{
> > -}
> > -
> > -#endif
> > -
> > static inline void tsk_restore_flags(struct task_struct *task,
> > unsigned long orig_flags, unsigned long flags)
> > {
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 9d76b99af1b9..c56cb62a2df1 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -507,6 +507,16 @@ config PREEMPT_RCU
> > This option enables preemptible-RCU code that is common between
> > the TREE_PREEMPT_RCU and TINY_PREEMPT_RCU implementations.
> >
> > +config TASKS_RCU
> > + bool "Task_based RCU implementation using voluntary context switch"
> > + default n
> > + help
> > + This option enables a task-based RCU implementation that uses
> > + only voluntary context switch (not preemption!), idle, and
> > + user-mode execution as quiescent states.
> > +
> > + If unsure, say N.
> > +
> > config RCU_STALL_COMMON
> > def_bool ( TREE_RCU || TREE_PREEMPT_RCU || RCU_TRACE )
> > help
> > diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> > index d9efcc13008c..717f00854fc0 100644
> > --- a/kernel/rcu/tiny.c
> > +++ b/kernel/rcu/tiny.c
> > @@ -254,6 +254,8 @@ void rcu_check_callbacks(int cpu, int user)
> > rcu_sched_qs(cpu);
> > else if (!in_softirq())
> > rcu_bh_qs(cpu);
> > + if (user)
> > + rcu_note_voluntary_context_switch(current);
> > }
> >
> > /*
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 625d0b0cd75a..f958c52f644d 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -2413,6 +2413,8 @@ void rcu_check_callbacks(int cpu, int user)
> > rcu_preempt_check_callbacks(cpu);
> > if (rcu_pending(cpu))
> > invoke_rcu_core();
> > + if (user)
> > + rcu_note_voluntary_context_switch(current);
> > trace_rcu_utilization(TPS("End scheduler-tick"));
> > }
> >
> > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > index bc7883570530..50453589e3ca 100644
> > --- a/kernel/rcu/update.c
> > +++ b/kernel/rcu/update.c
> > @@ -47,6 +47,7 @@
> > #include <linux/hardirq.h>
> > #include <linux/delay.h>
> > #include <linux/module.h>
> > +#include <linux/kthread.h>
> >
> > #define CREATE_TRACE_POINTS
> >
> > @@ -350,3 +351,173 @@ static int __init check_cpu_stall_init(void)
> > early_initcall(check_cpu_stall_init);
> >
> > #endif /* #ifdef CONFIG_RCU_STALL_COMMON */
> > +
> > +#ifdef CONFIG_TASKS_RCU
> > +
> > +/*
> > + * Simple variant of RCU whose quiescent states are voluntary context switch,
> > + * user-space execution, and idle. As such, grace periods can take one good
> > + * long time. There are no read-side primitives similar to rcu_read_lock()
> > + * and rcu_read_unlock() because this implementation is intended to get
> > + * the system into a safe state for some of the manipulations involved in
> > + * tracing and the like. Finally, this implementation does not support
> > + * high call_rcu_tasks() rates from multiple CPUs. If this is required,
> > + * per-CPU callback lists will be needed.
> > + */
> > +
> > +/* Lists of tasks that we are still waiting for during this grace period. */
> > +static LIST_HEAD(rcu_tasks_holdouts);
> > +
> > +/* Global list of callbacks and associated lock. */
> > +static struct rcu_head *rcu_tasks_cbs_head;
> > +static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> > +static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
> > +
> > +/* Post an RCU-tasks callback. */
> > +void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
> > +{
> > + unsigned long flags;
> > +
> > + rhp->next = NULL;
> > + rhp->func = func;
> > + raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> > + *rcu_tasks_cbs_tail = rhp;
> > + rcu_tasks_cbs_tail = &rhp->next;
> > + raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> > +}
> > +EXPORT_SYMBOL_GPL(call_rcu_tasks);
> > +
> > +/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
> > +static int __noreturn rcu_tasks_kthread(void *arg)
> > +{
> > + unsigned long flags;
> > + struct task_struct *g, *t;
> > + struct rcu_head *list;
> > + struct rcu_head *next;
> > +
> > + /* FIXME: Add housekeeping affinity. */
> > +
> > + /*
> > + * Each pass through the following loop makes one check for
> > + * newly arrived callbacks, and, if there are some, waits for
> > + * one RCU-tasks grace period and then invokes the callbacks.
> > + * This loop is terminated by the system going down. ;-)
> > + */
> > + for (;;) {
> > +
> > + /* Pick up any new callbacks. */
> > + raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> > + smp_mb__after_unlock_lock(); /* Enforce GP memory ordering. */
> > + list = rcu_tasks_cbs_head;
> > + rcu_tasks_cbs_head = NULL;
> > + rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> > + raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> > +
> > + /* If there were none, wait a bit and start over. */
> > + if (!list) {
> > + schedule_timeout_interruptible(HZ);
> > + flush_signals(current);
> > + continue;
> > + }
> > +
> > + /*
> > + * Wait for all pre-existing t->on_rq and t->nvcsw
> > + * transitions to complete. Invoking synchronize_sched()
> > + * suffices because all these transitions occur with
> > + * interrupts disabled. Without this synchronize_sched(),
> > + * a read-side critical section that started before the
> > + * grace period might be incorrectly seen as having started
> > + * after the grace period.
> > + *
> > + * This synchronize_sched() also dispenses with the
> > + * need for a memory barrier on the first store to
> > + * ->rcu_tasks_holdout, as it forces the store to happen
> > + * after the beginning of the grace period.
> > + */
> > + synchronize_sched();
> > +
> > + /*
> > + * There were callbacks, so we need to wait for an
> > + * RCU-tasks grace period. Start off by scanning
> > + * the task list for tasks that are not already
> > + * voluntarily blocked. Mark these tasks and make
> > + * a list of them in rcu_tasks_holdouts.
> > + */
> > + rcu_read_lock();
> > + for_each_process_thread(g, t) {
> > + if (t != current && ACCESS_ONCE(t->on_rq) &&
> > + !is_idle_task(t)) {
> > + get_task_struct(t);
> > + t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
> > + ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
> > + list_add(&t->rcu_tasks_holdout_list,
> > + &rcu_tasks_holdouts);
> > + }
> > + }
> > + rcu_read_unlock();
> > +
> > + /*
> > + * Each pass through the following loop scans the list
> > + * of holdout tasks, removing any that are no longer
> > + * holdouts. When the list is empty, we are done.
> > + */
> > + while (!list_empty(&rcu_tasks_holdouts)) {
> > + schedule_timeout_interruptible(HZ / 10);
> > + flush_signals(current);
> > + rcu_read_lock();
> > + list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
> > + rcu_tasks_holdout_list) {
> > + if (ACCESS_ONCE(t->rcu_tasks_holdout)) {
> > + if (t->rcu_tasks_nvcsw ==
> > + ACCESS_ONCE(t->nvcsw) &&
> > + ACCESS_ONCE(t->on_rq))
> > + continue;
> > + ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
> > + }
> > + list_del_rcu(&t->rcu_tasks_holdout_list);
> > + put_task_struct(t);
> > + }
> > + rcu_read_unlock();
>
> rcu_read_lock() and the RCU variants of the list operations are unneeded here.

Good point, will change to list_for_each_entry_safe().

Thanx, Paul

> > + }
> > +
> > + /*
> > + * Because ->on_rq and ->nvcsw are not guaranteed
> > + * to have full memory barriers prior to them in the
> > + * schedule() path, memory reordering on other CPUs could
> > + * cause their RCU-tasks read-side critical sections to
> > + * extend past the end of the grace period. However,
> > + * because these ->nvcsw updates are carried out with
> > + * interrupts disabled, we can use synchronize_sched()
> > + * to force the needed ordering on all such CPUs.
> > + *
> > + * This synchronize_sched() also confines all
> > + * ->rcu_tasks_holdout accesses to be within the grace
> > + * period, avoiding the need for memory barriers for
> > + * ->rcu_tasks_holdout accesses.
> > + */
> > + synchronize_sched();
> > +
> > + /* Invoke the callbacks. */
> > + while (list) {
> > + next = list->next;
> > + local_bh_disable();
> > + list->func(list);
> > + local_bh_enable();
> > + list = next;
> > + cond_resched();
> > + }
> > + }
> > +}
> > +
> > +/* Spawn rcu_tasks_kthread() at boot time. */
> > +static int __init rcu_spawn_tasks_kthread(void)
> > +{
> > + struct task_struct __maybe_unused *t;
> > +
> > + t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
> > + BUG_ON(IS_ERR(t));
> > + return 0;
> > +}
> > +early_initcall(rcu_spawn_tasks_kthread);
> > +
> > +#endif /* #ifdef CONFIG_TASKS_RCU */
>
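
The reply above agrees to switch the holdout scan to list_for_each_entry_safe(). As a hedged illustration of why a "safe" traversal matters when entries are deleted during the walk, here is a small userspace sketch. It re-creates a minimal circular doubly-linked list rather than using the kernel's <linux/list.h>, and every name in it (list_node, list_unlink(), struct holdout, scan_holdouts()) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for a circular doubly-linked list head. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_front(struct list_node *n, struct list_node *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

static void list_unlink(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = NULL;	/* poison: following this would crash */
}

static int list_is_empty(const struct list_node *h)
{
	return h->next == h;
}

/* A tracked task; link is the first member so a cast recovers the struct. */
struct holdout {
	struct list_node link;
	int still_holdout;
};

/*
 * One scan pass: unlink every entry that is no longer a holdout.
 * The successor is saved in tmp before the body runs, so unlinking
 * (and poisoning) pos cannot derail the walk.
 */
static int scan_holdouts(struct list_node *head)
{
	struct list_node *pos, *tmp;
	int removed = 0;

	for (pos = head->next, tmp = pos->next; pos != head;
	     pos = tmp, tmp = pos->next) {
		struct holdout *h = (struct holdout *)pos;

		if (h->still_holdout)
			continue;
		list_unlink(pos);
		removed++;
	}
	return removed;
}

/* Two passes over three entries; returns first * 10 + second removals. */
static int demo_scan(void)
{
	struct list_node head;
	struct holdout t[3] = {
		{ .still_holdout = 0 },
		{ .still_holdout = 1 },
		{ .still_holdout = 0 },
	};
	int i, first, second;

	list_init(&head);
	for (i = 0; i < 3; i++)
		list_add_front(&t[i].link, &head);

	first = scan_holdouts(&head);	/* drops t[0] and t[2] */
	assert(!list_is_empty(&head));
	t[1].still_holdout = 0;
	second = scan_holdouts(&head);	/* drops t[1] */
	assert(list_is_empty(&head));
	return first * 10 + second;
}
```

A plain traversal that read pos->next after list_unlink(pos) would dereference the poisoned pointer. Saving the successor first is the property that the _safe iterator provides, and since only the kthread manipulates the holdout list, no RCU protection is needed for the walk itself.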

2014-08-01 02:04:28

by Paul E. McKenney

Subject: Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()

On Fri, Aug 01, 2014 at 01:57:50AM +0200, Frederic Weisbecker wrote:
> On Thu, Jul 31, 2014 at 02:55:01PM -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <[email protected]>
> >
> > This commit adds a new RCU-tasks flavor of RCU, which provides
> > call_rcu_tasks(). This RCU flavor's quiescent states are voluntary
> > context switch (not preemption!), userspace execution, and the idle loop.
> > Note that unlike other RCU flavors, these quiescent states occur in tasks,
> > not necessarily CPUs. Includes fixes from Steven Rostedt.
> >
> > This RCU flavor is assumed to have very infrequent latency-tolerant
> > updaters. This assumption permits significant simplifications, including
> > a single global callback list protected by a single global lock, along
> > with a single linked list containing all tasks that have not yet passed
> > through a quiescent state. If experience shows this assumption to be
> > incorrect, the required additional complexity will be added.
> >
> > Suggested-by: Steven Rostedt <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > include/linux/init_task.h | 9 +++
> > include/linux/rcupdate.h | 36 ++++++++++
> > include/linux/sched.h | 23 ++++---
> > init/Kconfig | 10 +++
> > kernel/rcu/tiny.c | 2 +
> > kernel/rcu/tree.c | 2 +
> > kernel/rcu/update.c | 171 ++++++++++++++++++++++++++++++++++++++++++++++
> > 7 files changed, 242 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> > index 6df7f9fe0d01..78715ea7c30c 100644
> > --- a/include/linux/init_task.h
> > +++ b/include/linux/init_task.h
> > @@ -124,6 +124,14 @@ extern struct group_info init_groups;
> > #else
> > #define INIT_TASK_RCU_PREEMPT(tsk)
> > #endif
> > +#ifdef CONFIG_TASKS_RCU
> > +#define INIT_TASK_RCU_TASKS(tsk) \
> > + .rcu_tasks_holdout = false, \
> > + .rcu_tasks_holdout_list = \
> > + LIST_HEAD_INIT(tsk.rcu_tasks_holdout_list),
> > +#else
> > +#define INIT_TASK_RCU_TASKS(tsk)
> > +#endif
> >
> > extern struct cred init_cred;
> >
> > @@ -231,6 +239,7 @@ extern struct task_group root_task_group;
> > INIT_FTRACE_GRAPH \
> > INIT_TRACE_RECURSION \
> > INIT_TASK_RCU_PREEMPT(tsk) \
> > + INIT_TASK_RCU_TASKS(tsk) \
> > INIT_CPUSET_SEQ(tsk) \
> > INIT_RT_MUTEXES(tsk) \
> > INIT_VTIME(tsk) \
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 6a94cc8b1ca0..829efc99df3e 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -197,6 +197,26 @@ void call_rcu_sched(struct rcu_head *head,
> >
> > void synchronize_sched(void);
> >
> > +/**
> > + * call_rcu_tasks() - Queue an RCU callback for invocation after a task-based grace period
> > + * @head: structure to be used for queueing the RCU updates.
> > + * @func: actual callback function to be invoked after the grace period
> > + *
> > + * The callback function will be invoked some time after a full grace
> > + * period elapses, in other words after all currently executing RCU
> > + * read-side critical sections have completed. call_rcu_tasks() assumes
> > + * that the read-side critical sections end at a voluntary context
> > + * switch (not a preemption!), entry into idle, or transition to usermode
> > + * execution. As such, there are no read-side primitives analogous to
> > + * rcu_read_lock() and rcu_read_unlock() because this primitive is intended
> > + * to determine that all tasks have passed through a safe state, not so
> > + * much for data-structure synchronization.
> > + *
> > + * See the description of call_rcu() for more detailed information on
> > + * memory ordering guarantees.
> > + */
> > +void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
> > +
> > #ifdef CONFIG_PREEMPT_RCU
> >
> > void __rcu_read_lock(void);
> > @@ -294,6 +314,22 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
> > rcu_irq_exit(); \
> > } while (0)
> >
> > +/*
> > + * Note a voluntary context switch for RCU-tasks benefit. This is a
> > + * macro rather than an inline function to avoid #include hell.
> > + */
> > +#ifdef CONFIG_TASKS_RCU
> > +#define rcu_note_voluntary_context_switch(t) \
> > + do { \
> > + preempt_disable(); /* Exclude synchronize_sched(); */ \
> > + if (ACCESS_ONCE((t)->rcu_tasks_holdout)) \
> > + ACCESS_ONCE((t)->rcu_tasks_holdout) = 0; \
> > + preempt_enable(); \
> > + } while (0)
> > +#else /* #ifdef CONFIG_TASKS_RCU */
> > +#define rcu_note_voluntary_context_switch(t) do { } while (0)
> > +#endif /* #else #ifdef CONFIG_TASKS_RCU */
> > +
> > #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP)
> > bool __rcu_is_watching(void);
> > #endif /* #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP) */
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 306f4f0c987a..3cf124389ec7 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1273,6 +1273,11 @@ struct task_struct {
> > #ifdef CONFIG_RCU_BOOST
> > struct rt_mutex *rcu_boost_mutex;
> > #endif /* #ifdef CONFIG_RCU_BOOST */
> > +#ifdef CONFIG_TASKS_RCU
> > + unsigned long rcu_tasks_nvcsw;
> > + int rcu_tasks_holdout;
> > + struct list_head rcu_tasks_holdout_list;
> > +#endif /* #ifdef CONFIG_TASKS_RCU */
> >
> > #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
> > struct sched_info sched_info;
> > @@ -1998,31 +2003,27 @@ extern void task_clear_jobctl_pending(struct task_struct *task,
> > unsigned int mask);
> >
> > #ifdef CONFIG_PREEMPT_RCU
> > -
> > #define RCU_READ_UNLOCK_BLOCKED (1 << 0) /* blocked while in RCU read-side. */
> > #define RCU_READ_UNLOCK_NEED_QS (1 << 1) /* RCU core needs CPU response. */
> > +#endif /* #ifdef CONFIG_PREEMPT_RCU */
> >
> > static inline void rcu_copy_process(struct task_struct *p)
> > {
> > +#ifdef CONFIG_PREEMPT_RCU
> > p->rcu_read_lock_nesting = 0;
> > p->rcu_read_unlock_special = 0;
> > -#ifdef CONFIG_TREE_PREEMPT_RCU
> > p->rcu_blocked_node = NULL;
> > -#endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
> > #ifdef CONFIG_RCU_BOOST
> > p->rcu_boost_mutex = NULL;
> > #endif /* #ifdef CONFIG_RCU_BOOST */
> > INIT_LIST_HEAD(&p->rcu_node_entry);
> > +#endif /* #ifdef CONFIG_PREEMPT_RCU */
> > +#ifdef CONFIG_TASKS_RCU
> > + p->rcu_tasks_holdout = false;
> > + INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
> > +#endif /* #ifdef CONFIG_TASKS_RCU */
> > }
> >
> > -#else
> > -
> > -static inline void rcu_copy_process(struct task_struct *p)
> > -{
> > -}
> > -
> > -#endif
> > -
> > static inline void tsk_restore_flags(struct task_struct *task,
> > unsigned long orig_flags, unsigned long flags)
> > {
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 9d76b99af1b9..c56cb62a2df1 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -507,6 +507,16 @@ config PREEMPT_RCU
> > This option enables preemptible-RCU code that is common between
> > the TREE_PREEMPT_RCU and TINY_PREEMPT_RCU implementations.
> >
> > +config TASKS_RCU
> > + bool "Task-based RCU implementation using voluntary context switch"
> > + default n
> > + help
> > + This option enables a task-based RCU implementation that uses
> > + only voluntary context switch (not preemption!), idle, and
> > + user-mode execution as quiescent states.
> > +
> > + If unsure, say N.
>
> I don't remember who said that, but indeed this is a purely internal
> feature. The user never needs to select that option.

I suspect that you are correct. This way is convenient for me for testing,
but I expect to make it purely internal before long.

> > +
> > config RCU_STALL_COMMON
> > def_bool ( TREE_RCU || TREE_PREEMPT_RCU || RCU_TRACE )
> > help
> > diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> > index d9efcc13008c..717f00854fc0 100644
> > --- a/kernel/rcu/tiny.c
> > +++ b/kernel/rcu/tiny.c
> > @@ -254,6 +254,8 @@ void rcu_check_callbacks(int cpu, int user)
> > rcu_sched_qs(cpu);
> > else if (!in_softirq())
> > rcu_bh_qs(cpu);
> > + if (user)
> > + rcu_note_voluntary_context_switch(current);
> > }
> >
> > /*
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 625d0b0cd75a..f958c52f644d 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -2413,6 +2413,8 @@ void rcu_check_callbacks(int cpu, int user)
> > rcu_preempt_check_callbacks(cpu);
> > if (rcu_pending(cpu))
> > invoke_rcu_core();
> > + if (user)
> > + rcu_note_voluntary_context_switch(current);
> > trace_rcu_utilization(TPS("End scheduler-tick"));
> > }
> >
> > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > index bc7883570530..50453589e3ca 100644
> > --- a/kernel/rcu/update.c
> > +++ b/kernel/rcu/update.c
> > @@ -47,6 +47,7 @@
> > #include <linux/hardirq.h>
> > #include <linux/delay.h>
> > #include <linux/module.h>
> > +#include <linux/kthread.h>
> >
> > #define CREATE_TRACE_POINTS
> >
> > @@ -350,3 +351,173 @@ static int __init check_cpu_stall_init(void)
> > early_initcall(check_cpu_stall_init);
> >
> > #endif /* #ifdef CONFIG_RCU_STALL_COMMON */
> > +
> > +#ifdef CONFIG_TASKS_RCU
> > +
> > +/*
> > + * Simple variant of RCU whose quiescent states are voluntary context switch,
> > + * user-space execution, and idle. As such, grace periods can take one good
> > + * long time. There are no read-side primitives similar to rcu_read_lock()
> > + * and rcu_read_unlock() because this implementation is intended to get
> > + * the system into a safe state for some of the manipulations involved in
> > + * tracing and the like. Finally, this implementation does not support
> > + * high call_rcu_tasks() rates from multiple CPUs. If this is required,
> > + * per-CPU callback lists will be needed.
> > + */
> > +
> > +/* Lists of tasks that we are still waiting for during this grace period. */
> > +static LIST_HEAD(rcu_tasks_holdouts);
> > +
> > +/* Global list of callbacks and associated lock. */
> > +static struct rcu_head *rcu_tasks_cbs_head;
> > +static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> > +static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
> > +
> > +/* Post an RCU-tasks callback. */
> > +void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
> > +{
> > + unsigned long flags;
> > +
> > + rhp->next = NULL;
> > + rhp->func = func;
> > + raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> > + *rcu_tasks_cbs_tail = rhp;
> > + rcu_tasks_cbs_tail = &rhp->next;
> > + raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> > +}
> > +EXPORT_SYMBOL_GPL(call_rcu_tasks);
> > +
> > +/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
> > +static int __noreturn rcu_tasks_kthread(void *arg)
> > +{
> > + unsigned long flags;
> > + struct task_struct *g, *t;
> > + struct rcu_head *list;
> > + struct rcu_head *next;
> > +
> > + /* FIXME: Add housekeeping affinity. */
> > +
> > + /*
> > + * Each pass through the following loop makes one check for
> > + * newly arrived callbacks, and, if there are some, waits for
> > + * one RCU-tasks grace period and then invokes the callbacks.
> > + * This loop is terminated by the system going down. ;-)
> > + */
> > + for (;;) {
> > +
> > + /* Pick up any new callbacks. */
> > + raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> > + smp_mb__after_unlock_lock(); /* Enforce GP memory ordering. */
>
> I have no idea what this is ordering against what, exactly. GP is a vast thing.
> Especially for tricky barriers like __after_unlock_lock(), which suggest very
> counter-intuitive ordering, a detailed comment is very welcome :)

Mostly makes sure that whatever happened before the callback was queued
is seen by everyone to have happened before the grace period started.
Though the synchronize_sched() below may have obsoleted this, will
check.
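
For readers following the queue manipulation: the single global callback
list in this patch is a singly linked list with a tail pointer, so that
call_rcu_tasks() appends in O(1) and the kthread can detach the whole list
in one step. A minimal userspace sketch of that pattern, assuming
hypothetical names (struct cb_head, struct cb_queue, the cb_*() helpers,
and the demo_*() payload, none of which are in the patch) and omitting
rcu_tasks_cbs_lock and all memory barriers, might look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct rcu_head. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

/* Hypothetical queue type; the patch uses two file-scope variables instead. */
struct cb_queue {
	struct cb_head *head;
	struct cb_head **tail;	/* Points at the NULL pointer ending the list. */
};

static void cb_queue_init(struct cb_queue *q)
{
	q->head = NULL;
	q->tail = &q->head;
}

/* O(1) append, mirroring the body of call_rcu_tasks() minus the lock. */
static void cb_enqueue(struct cb_queue *q, struct cb_head *rhp,
		       void (*func)(struct cb_head *head))
{
	assert(q && rhp && func);
	rhp->next = NULL;
	rhp->func = func;
	*q->tail = rhp;
	q->tail = &rhp->next;
}

/* Detach everything queued so far and reset the queue to empty. */
static struct cb_head *cb_detach_all(struct cb_queue *q)
{
	struct cb_head *list = q->head;

	q->head = NULL;
	q->tail = &q->head;
	return list;
}

/* Invoke the detached callbacks in queue order; returns how many ran. */
static int cb_invoke_all(struct cb_head *list)
{
	int n = 0;

	while (list) {
		struct cb_head *next = list->next; /* func() may free list. */

		list->func(list);
		list = next;
		n++;
	}
	return n;
}

/* Demo payload: records the order in which callbacks were invoked. */
struct demo_item {
	struct cb_head rh;	/* First member, so the cast below is valid. */
	int id;
};

static int demo_order[8];
static int demo_count;

static void demo_cb(struct cb_head *head)
{
	struct demo_item *item = (struct demo_item *)head;

	if (demo_count < 8)
		demo_order[demo_count++] = item->id;
}
```

Detaching the list before waiting means that callbacks queued during the
grace period land on a fresh list and wait for the next one.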

> > + list = rcu_tasks_cbs_head;
> > + rcu_tasks_cbs_head = NULL;
> > + rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> > + raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> > +
> > + /* If there were none, wait a bit and start over. */
> > + if (!list) {
> > + schedule_timeout_interruptible(HZ);
>
> So this thread is going to poll every second? I guess something prevents it
> from running when the system is idle somewhere? But I'm not familiar with the
> whole patchset yet. Even without that, it looks like very annoying noise. Why
> not use something wait/wakeup based?

And a later patch does the wait/wakeup thing. Start stupid, add small
amounts of sophistication incrementally.

> > + flush_signals(current);
> > + continue;
> > + }
> > +
> > + /*
> > + * Wait for all pre-existing t->on_rq and t->nvcsw
> > + * transitions to complete. Invoking synchronize_sched()
> > + * suffices because all these transitions occur with
> > + * interrupts disabled. Without this synchronize_sched(),
> > + * a read-side critical section that started before the
> > + * grace period might be incorrectly seen as having started
> > + * after the grace period.
> > + *
> > + * This synchronize_sched() also dispenses with the
> > + * need for a memory barrier on the first store to
> > + * ->rcu_tasks_holdout, as it forces the store to happen
> > + * after the beginning of the grace period.
> > + */
> > + synchronize_sched();
> > +
> > + /*
> > + * There were callbacks, so we need to wait for an
> > + * RCU-tasks grace period. Start off by scanning
> > + * the task list for tasks that are not already
> > + * voluntarily blocked. Mark these tasks and make
> > + * a list of them in rcu_tasks_holdouts.
> > + */
> > + rcu_read_lock();
> > + for_each_process_thread(g, t) {
> > + if (t != current && ACCESS_ONCE(t->on_rq) &&
> > + !is_idle_task(t)) {
> > + get_task_struct(t);
> > + t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
> > + ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
> > + list_add(&t->rcu_tasks_holdout_list,
> > + &rcu_tasks_holdouts);
> > + }
> > + }
> > + rcu_read_unlock();
> > +
> > + /*
> > + * Each pass through the following loop scans the list
> > + * of holdout tasks, removing any that are no longer
> > + * holdouts. When the list is empty, we are done.
> > + */
> > + while (!list_empty(&rcu_tasks_holdouts)) {
> > + schedule_timeout_interruptible(HZ / 10);
>
> OTOH here it is not annoying because it should only happen when RCU-tasks
> is used, which should be rare.

Glad you like it!

I will likely also add checks for other things needing the current CPU.

Thanx, Paul

> Thanks.
>
> > + flush_signals(current);
> > + rcu_read_lock();
> > + list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
> > + rcu_tasks_holdout_list) {
> > + if (ACCESS_ONCE(t->rcu_tasks_holdout)) {
> > + if (t->rcu_tasks_nvcsw ==
> > + ACCESS_ONCE(t->nvcsw) &&
> > + ACCESS_ONCE(t->on_rq))
> > + continue;
> > + ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
> > + }
> > + list_del_rcu(&t->rcu_tasks_holdout_list);
> > + put_task_struct(t);
> > + }
> > + rcu_read_unlock();
> > + }
> > +
> > + /*
> > + * Because ->on_rq and ->nvcsw are not guaranteed
> > + * to have full memory barriers prior to them in the
> > + * schedule() path, memory reordering on other CPUs could
> > + * cause their RCU-tasks read-side critical sections to
> > + * extend past the end of the grace period. However,
> > + * because these ->nvcsw updates are carried out with
> > + * interrupts disabled, we can use synchronize_sched()
> > + * to force the needed ordering on all such CPUs.
> > + *
> > + * This synchronize_sched() also confines all
> > + * ->rcu_tasks_holdout accesses to be within the grace
> > + * period, avoiding the need for memory barriers for
> > + * ->rcu_tasks_holdout accesses.
> > + */
> > + synchronize_sched();
> > +
> > + /* Invoke the callbacks. */
> > + while (list) {
> > + next = list->next;
> > + local_bh_disable();
> > + list->func(list);
> > + local_bh_enable();
> > + list = next;
> > + cond_resched();
> > + }
> > + }
> > +}
> > +
> > +/* Spawn rcu_tasks_kthread() at boot time. */
> > +static int __init rcu_spawn_tasks_kthread(void)
> > +{
> > + struct task_struct __maybe_unused *t;
> > +
> > + t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
> > + BUG_ON(IS_ERR(t));
> > + return 0;
> > +}
> > +early_initcall(rcu_spawn_tasks_kthread);
> > +
> > +#endif /* #ifdef CONFIG_TASKS_RCU */
> > --
> > 1.8.1.5
> >
>
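
The grace-period wait in the quoted rcu_tasks_kthread() boils down to two
steps: snapshot each runnable task's voluntary-context-switch count, then
poll until every such task has either switched voluntarily or left the
runqueue. A minimal userspace model of that logic, assuming a hypothetical
struct model_task and gp_*() helpers (not part of the patch) and leaving
out the task-list walk, the t != current check, reference counting, and
both synchronize_sched() calls, could be sketched as:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the task_struct fields that the patch reads. */
struct model_task {
	unsigned long nvcsw;		/* Voluntary context-switch count. */
	bool on_rq;			/* Is the task runnable? */
	bool is_idle;			/* Is this an idle task? */
	unsigned long snap_nvcsw;	/* Models ->rcu_tasks_nvcsw. */
	bool holdout;			/* Models ->rcu_tasks_holdout. */
};

/*
 * Start of grace period: mark every runnable non-idle task as a holdout
 * and snapshot its ->nvcsw.  Returns the number of holdouts.
 */
static size_t gp_mark_holdouts(struct model_task *tasks, size_t n)
{
	size_t nholdouts = 0;

	assert(tasks || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct model_task *t = &tasks[i];

		t->holdout = false;
		if (t->on_rq && !t->is_idle) {
			t->snap_nvcsw = t->nvcsw;
			t->holdout = true;
			nholdouts++;
		}
	}
	return nholdouts;
}

/*
 * One polling pass: a task stays a holdout only while its ->nvcsw is
 * unchanged and it is still runnable.  Returns the number remaining.
 */
static size_t gp_scan_holdouts(struct model_task *tasks, size_t n)
{
	size_t remaining = 0;

	assert(tasks || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct model_task *t = &tasks[i];

		if (!t->holdout)
			continue;	/* Already passed a quiescent state. */
		if (t->snap_nvcsw == t->nvcsw && t->on_rq) {
			remaining++;
			continue;	/* Still a holdout. */
		}
		t->holdout = false;
	}
	return remaining;
}
```

In this model the grace period ends when gp_scan_holdouts() returns zero,
which corresponds to rcu_tasks_holdouts becoming empty in the patch.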

2014-08-01 02:11:27

by Paul E. McKenney

Subject: Re: [PATCH v3 tip/core/rcu 1/9] rcu: Add call_rcu_tasks()

On Fri, Aug 01, 2014 at 09:31:37AM +0800, Lai Jiangshan wrote:
> On 08/01/2014 05:55 AM, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <[email protected]>
> >
> > This commit adds a new RCU-tasks flavor of RCU, which provides
> > call_rcu_tasks(). This RCU flavor's quiescent states are voluntary
> > context switch (not preemption!), userspace execution, and the idle loop.
> > Note that unlike other RCU flavors, these quiescent states occur in tasks,
> > not necessarily CPUs. Includes fixes from Steven Rostedt.
> >
> > This RCU flavor is assumed to have very infrequent latency-tolerant
> > updaters. This assumption permits significant simplifications, including
> > a single global callback list protected by a single global lock, along
> > with a single linked list containing all tasks that have not yet passed
> > through a quiescent state. If experience shows this assumption to be
> > incorrect, the required additional complexity will be added.
> >
> > Suggested-by: Steven Rostedt <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > include/linux/init_task.h | 9 +++
> > include/linux/rcupdate.h | 36 ++++++++++
> > include/linux/sched.h | 23 ++++---
> > init/Kconfig | 10 +++
> > kernel/rcu/tiny.c | 2 +
> > kernel/rcu/tree.c | 2 +
> > kernel/rcu/update.c | 171 ++++++++++++++++++++++++++++++++++++++++++++++
> > 7 files changed, 242 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> > index 6df7f9fe0d01..78715ea7c30c 100644
> > --- a/include/linux/init_task.h
> > +++ b/include/linux/init_task.h
> > @@ -124,6 +124,14 @@ extern struct group_info init_groups;
> > #else
> > #define INIT_TASK_RCU_PREEMPT(tsk)
> > #endif
> > +#ifdef CONFIG_TASKS_RCU
> > +#define INIT_TASK_RCU_TASKS(tsk) \
> > + .rcu_tasks_holdout = false, \
> > + .rcu_tasks_holdout_list = \
> > + LIST_HEAD_INIT(tsk.rcu_tasks_holdout_list),
> > +#else
> > +#define INIT_TASK_RCU_TASKS(tsk)
> > +#endif
> >
> > extern struct cred init_cred;
> >
> > @@ -231,6 +239,7 @@ extern struct task_group root_task_group;
> > INIT_FTRACE_GRAPH \
> > INIT_TRACE_RECURSION \
> > INIT_TASK_RCU_PREEMPT(tsk) \
> > + INIT_TASK_RCU_TASKS(tsk) \
> > INIT_CPUSET_SEQ(tsk) \
> > INIT_RT_MUTEXES(tsk) \
> > INIT_VTIME(tsk) \
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 6a94cc8b1ca0..829efc99df3e 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -197,6 +197,26 @@ void call_rcu_sched(struct rcu_head *head,
> >
> > void synchronize_sched(void);
> >
> > +/**
> > + * call_rcu_tasks() - Queue an RCU callback for invocation after a task-based grace period
> > + * @head: structure to be used for queueing the RCU updates.
> > + * @func: actual callback function to be invoked after the grace period
> > + *
> > + * The callback function will be invoked some time after a full grace
> > + * period elapses, in other words after all currently executing RCU
> > + * read-side critical sections have completed. call_rcu_tasks() assumes
> > + * that the read-side critical sections end at a voluntary context
> > + * switch (not a preemption!), entry into idle, or transition to usermode
> > + * execution. As such, there are no read-side primitives analogous to
> > + * rcu_read_lock() and rcu_read_unlock() because this primitive is intended
> > + * to determine that all tasks have passed through a safe state, not so
> > + * much for data-structure synchronization.
> > + *
> > + * See the description of call_rcu() for more detailed information on
> > + * memory ordering guarantees.
> > + */
> > +void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
> > +
> > #ifdef CONFIG_PREEMPT_RCU
> >
> > void __rcu_read_lock(void);
> > @@ -294,6 +314,22 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
> > rcu_irq_exit(); \
> > } while (0)
> >
> > +/*
> > + * Note a voluntary context switch for RCU-tasks benefit. This is a
> > + * macro rather than an inline function to avoid #include hell.
> > + */
> > +#ifdef CONFIG_TASKS_RCU
> > +#define rcu_note_voluntary_context_switch(t) \
> > + do { \
> > + preempt_disable(); /* Exclude synchronize_sched(); */ \
> > + if (ACCESS_ONCE((t)->rcu_tasks_holdout)) \
> > + ACCESS_ONCE((t)->rcu_tasks_holdout) = 0; \
> > + preempt_enable(); \
>
> Why is the preempt_disable() needed here? The comments in rcu_tasks_kthread()
> don't persuade me. Maybe it could be removed?

The synchronize_sched() near the end of the main loop in rcu_tasks_kthread()
might well have obsoleted this; I will take a closer look.

Thanx, Paul
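
As a point of reference for the question above, the task-side half of the
protocol is just a test-then-clear of the holdout flag, wrapped in a
do-while(0) macro that compiles to nothing when the feature is off. The
userspace sketch below assumes a hypothetical struct vcs_task, a
MODEL_TASKS_RCU switch standing in for CONFIG_TASKS_RCU, and a userspace
approximation of ACCESS_ONCE(); it deliberately leaves out the
preempt_disable()/preempt_enable() pair, which has no userspace equivalent
and whose necessity is the open question here:

```c
#include <assert.h>

/* Userspace approximation of ACCESS_ONCE(); relies on GNU __typeof__. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for the one task_struct field involved. */
struct vcs_task {
	int rcu_tasks_holdout;
};

#define MODEL_TASKS_RCU 1	/* Hypothetical stand-in for CONFIG_TASKS_RCU. */

#if MODEL_TASKS_RCU
/* Read first and store only if the flag is set, as the patch's macro does. */
#define model_note_voluntary_context_switch(t) \
	do { \
		if (ACCESS_ONCE((t)->rcu_tasks_holdout)) \
			ACCESS_ONCE((t)->rcu_tasks_holdout) = 0; \
	} while (0)
#else
#define model_note_voluntary_context_switch(t) do { } while (0)
#endif

/* Model a voluntary context switch and report the resulting flag value. */
static int model_context_switch(struct vcs_task *t)
{
	assert(t);
	model_note_voluntary_context_switch(t);
	return t->rcu_tasks_holdout;
}
```

One plausible reason for reading the flag before writing it is that the
common case, a task that is not a holdout, then performs no store to the
task structure, though the patch itself does not state this.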