2011-02-23 01:39:28

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 0/14] Preview of RCU patches for 2.6.39

Hello!

This patchset fixes a few bugs and introduces some of the infrastructure
for TREE_RCU priority boosting. If testing goes well, TREE_RCU priority
boosting itself might make it as well. The patches are as follows:

1. call __rcu_read_unlock() in exit_rcu for tiny RCU to preserve
debug state (from Lai Jiangshan).
2. Get rid of duplicate sched.h include from rcutorture.c
(from Jesper Juhl).
3. Add documentation saying which RCU flavor to choose.
4. Remove dead code from DEBUG_OBJECTS_RCU_HEAD implementation
(from Amerigo Wang).
5. Document transitivity for memory barriers.
6. Remove conditional compilation for RCU CPU stall warnings.
(These can now be controlled by boot/module parameters.)
7. Decrease memory-barrier usage based on semi-formal proof.
Expedited RCU has invalidated an assumption that the old
dyntick-idle interface depended on, and here is a fix. I
am still working on a lighter-weight fix, but safety first!
8. Merge TREE_PREEMPT_RCU blocked_tasks[] lists, which is a first
step towards TREE_RCU priority boosting.
9. Update documentation to reflect blocked_tasks[] merge.
10. move TREE_RCU from softirq to kthread, which is a second step
towards TREE_RCU priority boosting.

For a testing-only version of this patchset from git, please see the
following subject-to-rebase branch:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu.git rcu/testing

I am more confident in the first five of the above patches, which are
available at:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu.git rcu/next

Thanx, Paul


2011-02-23 01:39:53

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 03/11] rcu: add documentation saying which RCU flavor to choose

Reported-by: Paul Mackerras <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/RCU/whatisRCU.txt | 31 +++++++++++++++++++++++++++++++
1 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index cfaac34..6ef6926 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -849,6 +849,37 @@ All: lockdep-checked RCU-protected pointer access
See the comment headers in the source code (or the docbook generated
from them) for more information.

+However, given that there are no fewer than four families of RCU APIs
+in the Linux kernel, how do you choose which one to use? The following
+list can be helpful:
+
+a. Will readers need to block? If so, you need SRCU.
+
+b. What about the -rt patchset? If readers would need to block
+ in a non-rt kernel, you need SRCU. If readers would block
+ in a -rt kernel, but not in a non-rt kernel, SRCU is not
+ necessary.
+
+c. Do you need to treat NMI handlers, hardirq handlers,
+ and code segments with preemption disabled (whether
+ via preempt_disable(), local_irq_save(), local_bh_disable(),
+ or some other mechanism) as if they were explicit RCU readers?
+ If so, you need RCU-sched.
+
+d. Do you need RCU grace periods to complete even in the face
+ of softirq monopolization of one or more of the CPUs? For
+ example, is your code subject to network-based denial-of-service
+ attacks? If so, you need RCU-bh.
+
+e. Is your workload too update-intensive for normal use of
+ RCU, but inappropriate for other synchronization mechanisms?
+ If so, consider SLAB_DESTROY_BY_RCU. But please be careful!
+
+f. Otherwise, use RCU.
+
+Of course, this all assumes that you have determined that RCU is in fact
+the right tool for your job.
+

8. ANSWERS TO QUICK QUIZZES

--
1.7.3.2
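
For concreteness, the reader-side primitives corresponding to the list above
can be sketched as follows (the srcu_struct instance and the reader_*()
helpers are invented purely for illustration; error handling and updater
code are omitted):

#include <linux/rcupdate.h>	/* rcu_read_lock() and friends */
#include <linux/srcu.h>		/* SRCU, for readers that must block */

static struct srcu_struct example_srcu;	/* init_srcu_struct(&example_srcu) before use */

static void reader_rcu(void)		/* item f: ordinary RCU */
{
	rcu_read_lock();
	/* ... access rcu_dereference()-protected data, no blocking ... */
	rcu_read_unlock();
}

static void reader_rcu_bh(void)		/* item d: RCU-bh */
{
	rcu_read_lock_bh();
	/* ... */
	rcu_read_unlock_bh();
}

static void reader_rcu_sched(void)	/* item c: RCU-sched */
{
	rcu_read_lock_sched();
	/* ... also implied by preempt_disable(), irq and NMI handlers ... */
	rcu_read_unlock_sched();
}

static void reader_srcu(void)		/* items a and b: SRCU, readers may block */
{
	int idx = srcu_read_lock(&example_srcu);
	/* ... may sleep here ... */
	srcu_read_unlock(&example_srcu, idx);
}

The corresponding update-side primitives (synchronize_rcu(), synchronize_rcu_bh(),
synchronize_sched(), and synchronize_srcu()) pair with these readers flavor
by flavor.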

2011-02-23 01:39:57

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 14/14] rcu: Add boosting to TREE_PREEMPT_RCU tracing

From: Paul E. McKenney <[email protected]>

The new trace output includes the total number of tasks boosted, the
number boosted on behalf of normal and of expedited grace periods, and
statistics on attempts to initiate boosting that failed for various reasons.

Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcutree.h | 19 +++++++++++
kernel/rcutree_plugin.h | 37 ++++++++++++++++++++-
kernel/rcutree_trace.c | 82 +++++++++++++++++++++++++++++++++++++++++++---
3 files changed, 130 insertions(+), 8 deletions(-)

diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 4c820a8..292fed7 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -152,6 +152,25 @@ struct rcu_node {
wait_queue_head_t boost_wq;
/* Wait queue on which to park the boost */
/* kthread. */
+ unsigned long n_tasks_boosted;
+ /* Total number of tasks boosted. */
+ unsigned long n_exp_boosts;
+ /* Number of tasks boosted for expedited GP. */
+ unsigned long n_normal_boosts;
+ /* Number of tasks boosted for normal GP. */
+ unsigned long n_balk_blkd_tasks;
+ /* Refused to boost: no blocked tasks. */
+ unsigned long n_balk_exp_gp_tasks;
+ /* Refused to boost: nothing blocking GP. */
+ unsigned long n_balk_boost_tasks;
+ /* Refused to boost: already boosting. */
+ unsigned long n_balk_notblocked;
+ /* Refused to boost: RCU RS CS still running. */
+ unsigned long n_balk_notyet;
+ /* Refused to boost: not yet time. */
+ unsigned long n_balk_nos;
+ /* Refused to boost: not sure why, though. */
+ /* This can happen due to race conditions. */
#endif /* #ifdef CONFIG_RCU_BOOST */
struct task_struct *node_kthread_task;
/* kthread that takes care of this rcu_node */
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 1242949..381f117 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -1071,6 +1071,33 @@ static void __init __rcu_init_preempt(void)

#include "rtmutex_common.h"

+#ifdef CONFIG_RCU_TRACE
+
+static void rcu_initiate_boost_trace(struct rcu_node *rnp)
+{
+ if (list_empty(&rnp->blkd_tasks))
+ rnp->n_balk_blkd_tasks++;
+ else if (rnp->exp_tasks == NULL && rnp->gp_tasks == NULL)
+ rnp->n_balk_exp_gp_tasks++;
+ else if (rnp->gp_tasks != NULL && rnp->boost_tasks != NULL)
+ rnp->n_balk_boost_tasks++;
+ else if (rnp->gp_tasks != NULL && rnp->qsmask != 0)
+ rnp->n_balk_notblocked++;
+ else if (rnp->gp_tasks != NULL &&
+ ULONG_CMP_GE(jiffies, rnp->boost_time))
+ rnp->n_balk_notyet++;
+ else
+ rnp->n_balk_nos++;
+}
+
+#else /* #ifdef CONFIG_RCU_TRACE */
+
+static void rcu_initiate_boost_trace(struct rcu_node *rnp)
+{
+}
+
+#endif /* #else #ifdef CONFIG_RCU_TRACE */
+
/*
* Carry out RCU priority boosting on the task indicated by ->exp_tasks
* or ->boost_tasks, advancing the pointer to the next task in the
@@ -1106,10 +1133,14 @@ static int rcu_boost(struct rcu_node *rnp)
* expedited grace period must boost all blocked tasks, including
* those blocking the pre-existing normal grace period.
*/
- if (rnp->exp_tasks != NULL)
+ if (rnp->exp_tasks != NULL) {
tb = rnp->exp_tasks;
- else
+ rnp->n_exp_boosts++;
+ } else {
tb = rnp->boost_tasks;
+ rnp->n_normal_boosts++;
+ }
+ rnp->n_tasks_boosted++;

/*
* We boost task t by manufacturing an rt_mutex that appears to
@@ -1209,6 +1240,8 @@ static void rcu_initiate_boost(struct rcu_node *rnp)
t = rnp->boost_kthread_task;
if (t != NULL)
wake_up_process(t);
+ } else {
+ rcu_initiate_boost_trace(rnp);
}
}

diff --git a/kernel/rcutree_trace.c b/kernel/rcutree_trace.c
index 1cedf94..5a1554c 100644
--- a/kernel/rcutree_trace.c
+++ b/kernel/rcutree_trace.c
@@ -157,6 +157,72 @@ static const struct file_operations rcudata_csv_fops = {
.release = single_release,
};

+#ifdef CONFIG_RCU_BOOST
+
+static void print_one_rcu_node_boost(struct seq_file *m, struct rcu_node *rnp)
+{
+ seq_printf(m, "%d:%d tasks=%c%c%c%c ntb=%lu neb=%lu nnb=%lu "
+ "j=%04x bt=%04x\n",
+ rnp->grplo, rnp->grphi,
+ "T."[list_empty(&rnp->blkd_tasks)],
+ "N."[!rnp->gp_tasks],
+ "E."[!rnp->exp_tasks],
+ "B."[!rnp->boost_tasks],
+ rnp->n_tasks_boosted, rnp->n_exp_boosts,
+ rnp->n_normal_boosts,
+ (int)(jiffies & 0xffff),
+ (int)(rnp->boost_time & 0xffff));
+ seq_printf(m, "%s: nt=%lu egt=%lu bt=%lu nb=%lu ny=%lu nos=%lu\n",
+ " balk",
+ rnp->n_balk_blkd_tasks,
+ rnp->n_balk_exp_gp_tasks,
+ rnp->n_balk_boost_tasks,
+ rnp->n_balk_notblocked,
+ rnp->n_balk_notyet,
+ rnp->n_balk_nos);
+}
+
+static int show_rcu_node_boost(struct seq_file *m, void *unused)
+{
+ struct rcu_node *rnp;
+
+ rcu_for_each_leaf_node(rcu_preempt_state, rnp)
+ print_one_rcu_node_boost(m, rnp);
+ return 0;
+}
+
+static int rcu_node_boost_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, show_rcu_node_boost, NULL);
+}
+
+static const struct file_operations rcu_node_boost_fops = {
+ .owner = THIS_MODULE,
+ .open = rcu_node_boost_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+/*
+ * Create the rcuboost debugfs entry. Note that the return value
+ * is the standard zero for success, non-zero for failure.
+ */
+static int rcu_boost_trace_create_file(struct dentry *rcudir)
+{
+ return IS_ERR_VALUE(debugfs_create_file("rcuboost", 0444, rcudir, NULL,
+ &rcu_node_boost_fops));
+}
+
+#else /* #ifdef CONFIG_RCU_BOOST */
+
+static int rcu_boost_trace_create_file(struct dentry *rcudir)
+{
+ return 0; /* There cannot be an error if we didn't create it! */
+}
+
+#endif /* #else #ifdef CONFIG_RCU_BOOST */
+
static void print_one_rcu_state(struct seq_file *m, struct rcu_state *rsp)
{
unsigned long gpnum;
@@ -302,31 +368,35 @@ static int __init rcutree_trace_init(void)
struct dentry *retval;

rcudir = debugfs_create_dir("rcu", NULL);
- if (!rcudir)
+ if (IS_ERR_VALUE(rcudir))
goto free_out;

retval = debugfs_create_file("rcudata", 0444, rcudir,
NULL, &rcudata_fops);
- if (!retval)
+ if (IS_ERR_VALUE(retval))
goto free_out;

retval = debugfs_create_file("rcudata.csv", 0444, rcudir,
NULL, &rcudata_csv_fops);
- if (!retval)
+ if (IS_ERR_VALUE(retval))
+ goto free_out;
+
+ retval = rcu_boost_trace_create_file(rcudir);
+ if (retval)
goto free_out;

retval = debugfs_create_file("rcugp", 0444, rcudir, NULL, &rcugp_fops);
- if (!retval)
+ if (IS_ERR_VALUE(retval))
goto free_out;

retval = debugfs_create_file("rcuhier", 0444, rcudir,
NULL, &rcuhier_fops);
- if (!retval)
+ if (IS_ERR_VALUE(retval))
goto free_out;

retval = debugfs_create_file("rcu_pending", 0444, rcudir,
NULL, &rcu_pending_fops);
- if (!retval)
+ if (IS_ERR_VALUE(retval))
goto free_out;
return 0;
free_out:
--
1.7.3.2
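
Given the seq_printf() format strings in rcutree_trace.c above, each rcu_node's
entry in the new debugfs "rcuboost" file comes out roughly as follows (the
counter values shown are invented for illustration only):

0:5 tasks=TN.B ntb=12 neb=3 nnb=9 j=c864 bt=c86f
 balk: nt=0 egt=6 bt=0 nb=4 ny=2 nos=0

The first line gives the CPU range, the blocked-task state characters, and the
boost counts, along with the low-order 16 bits of jiffies and of the
->boost_time at which boosting becomes permissible; the " balk" line breaks
down the reasons for which boosting was not initiated.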

2011-02-23 01:39:58

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 01/11] rcu: call __rcu_read_unlock() in exit_rcu for tiny RCU

From: Lai Jiangshan <[email protected]>

Using __rcu_read_unlock() in place of rcu_read_unlock() leaves any debug
state as it really should be, namely with the lock still held.

Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcutiny_plugin.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/rcutiny_plugin.h b/kernel/rcutiny_plugin.h
index 015abae..3cb8e36 100644
--- a/kernel/rcutiny_plugin.h
+++ b/kernel/rcutiny_plugin.h
@@ -852,7 +852,7 @@ void exit_rcu(void)
if (t->rcu_read_lock_nesting == 0)
return;
t->rcu_read_lock_nesting = 1;
- rcu_read_unlock();
+ __rcu_read_unlock();
}

#else /* #ifdef CONFIG_TINY_PREEMPT_RCU */
--
1.7.3.2

2011-02-23 01:40:01

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 05/11] rcu: add comment saying why DEBUG_OBJECTS_RCU_HEAD depends on PREEMPT.

The build will break if you change the Kconfig to allow
DEBUG_OBJECTS_RCU_HEAD and !PREEMPT, so document the reasoning
near where the breakage would occur.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcupdate.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
index afd21d1..f3240e9 100644
--- a/kernel/rcupdate.c
+++ b/kernel/rcupdate.c
@@ -214,6 +214,11 @@ static int rcuhead_fixup_free(void *addr, enum debug_obj_state state)
* Ensure that queued callbacks are all executed.
* If we detect that we are nested in a RCU read-side critical
* section, we should simply fail, otherwise we would deadlock.
+ * Note that the machinery to reliably determine whether
+ * or not we are in an RCU read-side critical section
+ * exists only in the preemptible RCU implementations
+ * (TINY_PREEMPT_RCU and TREE_PREEMPT_RCU), which is why
+ * DEBUG_OBJECTS_RCU_HEAD is disallowed if !PREEMPT.
*/
if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
irqs_disabled()) {
--
1.7.3.2

2011-02-23 01:40:35

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 12/14] rcu: priority boosting for TREE_PREEMPT_RCU

From: Paul E. McKenney <[email protected]>

Add priority boosting for TREE_PREEMPT_RCU, similar to that for
TINY_PREEMPT_RCU. This is enabled by the default-off RCU_BOOST
kernel configuration parameter. The priority to which to boost preempted
RCU readers is controlled by the RCU_BOOST_PRIO kernel configuration
parameter (defaulting to real-time priority 1), and the time to wait
before boosting the readers blocking a given grace period is controlled
by the RCU_BOOST_DELAY kernel configuration parameter (defaulting to
500 milliseconds).

Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
init/Kconfig | 2 +-
kernel/rcutree.c | 105 ++++++++++++------
kernel/rcutree.h | 28 +++++-
kernel/rcutree_plugin.h | 279 +++++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 364 insertions(+), 50 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index e11bc79..1855ef0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -460,7 +460,7 @@ config TREE_RCU_TRACE

config RCU_BOOST
bool "Enable RCU priority boosting"
- depends on RT_MUTEXES && TINY_PREEMPT_RCU
+ depends on RT_MUTEXES && PREEMPT_RCU
default n
help
This option boosts the priority of preempted RCU readers that
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 2241f28..2d317b2 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -81,6 +81,8 @@ DEFINE_PER_CPU(struct rcu_data, rcu_sched_data);
struct rcu_state rcu_bh_state = RCU_STATE_INITIALIZER(rcu_bh_state);
DEFINE_PER_CPU(struct rcu_data, rcu_bh_data);

+static struct rcu_state *rcu_state;
+
int rcu_scheduler_active __read_mostly;
EXPORT_SYMBOL_GPL(rcu_scheduler_active);

@@ -92,7 +94,7 @@ static DEFINE_PER_CPU(char, rcu_cpu_has_work);
static char rcu_kthreads_spawnable;

static void rcu_node_kthread_setaffinity(struct rcu_node *rnp);
-static void invoke_rcu_kthread(void);
+static void invoke_rcu_cpu_kthread(void);

#define RCU_KTHREAD_PRIO 1 /* RT priority for per-CPU kthreads. */

@@ -789,6 +791,7 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
rnp->completed = rsp->completed;
rsp->signaled = RCU_SIGNAL_INIT; /* force_quiescent_state OK. */
rcu_start_gp_per_cpu(rsp, rnp, rdp);
+ rcu_preempt_boost_start_gp(rnp);
raw_spin_unlock_irqrestore(&rnp->lock, flags);
return;
}
@@ -824,6 +827,7 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
rnp->completed = rsp->completed;
if (rnp == rdp->mynode)
rcu_start_gp_per_cpu(rsp, rnp, rdp);
+ rcu_preempt_boost_start_gp(rnp);
raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
}

@@ -880,7 +884,7 @@ rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
return;
}
rnp->qsmask &= ~mask;
- if (rnp->qsmask != 0 || rcu_preempted_readers(rnp)) {
+ if (rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) {

/* Other bits still set at this level, so done. */
raw_spin_unlock_irqrestore(&rnp->lock, flags);
@@ -1183,7 +1187,7 @@ static void rcu_do_batch(struct rcu_state *rsp, struct rcu_data *rdp)

/* Re-raise the RCU softirq if there are callbacks remaining. */
if (cpu_has_callbacks_ready_to_invoke(rdp))
- invoke_rcu_kthread();
+ invoke_rcu_cpu_kthread();
}

/*
@@ -1229,7 +1233,7 @@ void rcu_check_callbacks(int cpu, int user)
}
rcu_preempt_check_callbacks(cpu);
if (rcu_pending(cpu))
- invoke_rcu_kthread();
+ invoke_rcu_cpu_kthread();
}

#ifdef CONFIG_SMP
@@ -1237,6 +1241,8 @@ void rcu_check_callbacks(int cpu, int user)
/*
* Scan the leaf rcu_node structures, processing dyntick state for any that
* have not yet encountered a quiescent state, using the function specified.
+ * Also initiate boosting for any threads blocked on the root rcu_node.
+ *
* The caller must have suppressed start of new grace periods.
*/
static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
@@ -1255,6 +1261,7 @@ static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
return;
}
if (rnp->qsmask == 0) {
+ rcu_initiate_boost(rnp);
raw_spin_unlock_irqrestore(&rnp->lock, flags);
continue;
}
@@ -1273,6 +1280,11 @@ static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
}
raw_spin_unlock_irqrestore(&rnp->lock, flags);
}
+ rnp = rcu_get_root(rsp);
+ raw_spin_lock_irqsave(&rnp->lock, flags);
+ if (rnp->qsmask == 0)
+ rcu_initiate_boost(rnp);
+ raw_spin_unlock_irqrestore(&rnp->lock, flags);
}

/*
@@ -1408,7 +1420,7 @@ static void rcu_process_callbacks(void)
* Wake up the current CPU's kthread. This replaces raise_softirq()
* in earlier versions of RCU.
*/
-static void invoke_rcu_kthread(void)
+static void invoke_rcu_cpu_kthread(void)
{
unsigned long flags;
wait_queue_head_t *q;
@@ -1427,24 +1439,33 @@ static void invoke_rcu_kthread(void)
}

/*
+ * Wake up the specified per-rcu_node-structure kthread.
+ * The caller must hold ->lock.
+ */
+static void invoke_rcu_node_kthread(struct rcu_node *rnp)
+{
+ struct task_struct *t;
+
+ t = rnp->node_kthread_task;
+ if (t != NULL)
+ wake_up_process(t);
+}
+
+/*
* Timer handler to initiate the waking up of per-CPU kthreads that
* have yielded the CPU due to excess numbers of RCU callbacks.
+ * We wake up the per-rcu_node kthread, which in turn will wake up
+ * the booster kthread.
*/
static void rcu_cpu_kthread_timer(unsigned long arg)
{
unsigned long flags;
- struct rcu_data *rdp = (struct rcu_data *)arg;
+ struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, arg);
struct rcu_node *rnp = rdp->mynode;
- struct task_struct *t;

raw_spin_lock_irqsave(&rnp->lock, flags);
rnp->wakemask |= rdp->grpmask;
- t = rnp->node_kthread_task;
- if (t == NULL) {
- raw_spin_unlock_irqrestore(&rnp->lock, flags);
- return;
- }
- wake_up_process(t);
+ invoke_rcu_node_kthread(rnp);
raw_spin_unlock_irqrestore(&rnp->lock, flags);
}

@@ -1454,13 +1475,12 @@ static void rcu_cpu_kthread_timer(unsigned long arg)
* remain preempted. Either way, we restore our real-time priority
* before returning.
*/
-static void rcu_yield(int cpu)
+static void rcu_yield(void (*f)(unsigned long), unsigned long arg)
{
- struct rcu_data *rdp = per_cpu_ptr(rcu_sched_state.rda, cpu);
struct sched_param sp;
struct timer_list yield_timer;

- setup_timer(&yield_timer, rcu_cpu_kthread_timer, (unsigned long)rdp);
+ setup_timer(&yield_timer, f, arg);
mod_timer(&yield_timer, jiffies + 2);
sp.sched_priority = 0;
sched_setscheduler_nocheck(current, SCHED_NORMAL, &sp);
@@ -1484,12 +1504,13 @@ static void rcu_yield(int cpu)
*/
static int rcu_cpu_kthread_should_stop(int cpu)
{
- while (cpu_is_offline(cpu) || smp_processor_id() != cpu) {
+ while (cpu_is_offline(cpu) ||
+ !cpumask_equal(&current->cpus_allowed, cpumask_of(cpu))) {
if (kthread_should_stop())
return 1;
local_bh_enable();
- schedule_timeout_uninterruptible(1);
- if (smp_processor_id() != cpu)
+ schedule_timeout_interruptible(1);
+ if (!cpumask_equal(&current->cpus_allowed, cpumask_of(cpu)))
set_cpus_allowed_ptr(current, cpumask_of(cpu));
local_bh_disable();
}
@@ -1529,7 +1550,7 @@ static int rcu_cpu_kthread(void *arg)
else
spincnt = 0;
if (spincnt > 10) {
- rcu_yield(cpu);
+ rcu_yield(rcu_cpu_kthread_timer, (unsigned long)cpu);
spincnt = 0;
}
}
@@ -1581,6 +1602,7 @@ static int rcu_node_kthread(void *arg)
mask = rnp->wakemask;
rnp->wakemask = 0;
raw_spin_unlock_irqrestore(&rnp->lock, flags);
+ rcu_initiate_boost(rnp);
for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1) {
if ((mask & 0x1) == 0)
continue;
@@ -1620,6 +1642,7 @@ static void rcu_node_kthread_setaffinity(struct rcu_node *rnp)
if (mask & 01)
cpumask_set_cpu(cpu, cm);
set_cpus_allowed_ptr(rnp->node_kthread_task, cm);
+ rcu_boost_kthread_setaffinity(rnp, cm);
free_cpumask_var(cm);
}

@@ -1634,17 +1657,19 @@ static int __cpuinit rcu_spawn_one_node_kthread(struct rcu_state *rsp,
struct task_struct *t;

if (!rcu_kthreads_spawnable ||
- rnp->qsmaskinit == 0 ||
- rnp->node_kthread_task != NULL)
+ rnp->qsmaskinit == 0)
return 0;
- t = kthread_create(rcu_node_kthread, (void *)rnp, "rcun%d", rnp_index);
- if (IS_ERR(t))
- return PTR_ERR(t);
- rnp->node_kthread_task = t;
- wake_up_process(t);
- sp.sched_priority = 99;
- sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
- return 0;
+ if (rnp->node_kthread_task == NULL) {
+ t = kthread_create(rcu_node_kthread, (void *)rnp,
+ "rcun%d", rnp_index);
+ if (IS_ERR(t))
+ return PTR_ERR(t);
+ rnp->node_kthread_task = t;
+ wake_up_process(t);
+ sp.sched_priority = 99;
+ sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
+ }
+ return rcu_spawn_one_boost_kthread(rsp, rnp, rnp_index);
}

/*
@@ -1662,10 +1687,16 @@ static int __init rcu_spawn_kthreads(void)
if (cpu_online(cpu))
(void)rcu_spawn_one_cpu_kthread(cpu);
}
- rcu_for_each_leaf_node(&rcu_sched_state, rnp) {
- init_waitqueue_head(&rnp->node_wq);
- (void)rcu_spawn_one_node_kthread(&rcu_sched_state, rnp);
- }
+ rnp = rcu_get_root(rcu_state);
+ init_waitqueue_head(&rnp->node_wq);
+ rcu_init_boost_waitqueue(rnp);
+ (void)rcu_spawn_one_node_kthread(rcu_state, rnp);
+ if (NUM_RCU_NODES > 1)
+ rcu_for_each_leaf_node(rcu_state, rnp) {
+ init_waitqueue_head(&rnp->node_wq);
+ rcu_init_boost_waitqueue(rnp);
+ (void)rcu_spawn_one_node_kthread(rcu_state, rnp);
+ }
return 0;
}
early_initcall(rcu_spawn_kthreads);
@@ -2071,14 +2102,14 @@ static void __cpuinit rcu_online_cpu(int cpu)

static void __cpuinit rcu_online_kthreads(int cpu)
{
- struct rcu_data *rdp = per_cpu_ptr(rcu_sched_state.rda, cpu);
+ struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
struct rcu_node *rnp = rdp->mynode;

/* Fire up the incoming CPU's kthread and leaf rcu_node kthread. */
if (rcu_kthreads_spawnable) {
(void)rcu_spawn_one_cpu_kthread(cpu);
if (rnp->node_kthread_task == NULL)
- (void)rcu_spawn_one_node_kthread(&rcu_sched_state, rnp);
+ (void)rcu_spawn_one_node_kthread(rcu_state, rnp);
}
}

@@ -2089,7 +2120,7 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
unsigned long action, void *hcpu)
{
long cpu = (long)hcpu;
- struct rcu_data *rdp = per_cpu_ptr(rcu_sched_state.rda, cpu);
+ struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
struct rcu_node *rnp = rdp->mynode;

switch (action) {
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index c021380..4c820a8 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -135,6 +135,24 @@ struct rcu_node {
/* if there is no such task. If there */
/* is no current expedited grace period, */
/* then there can cannot be any such task. */
+#ifdef CONFIG_RCU_BOOST
+ struct list_head *boost_tasks;
+ /* Pointer to first task that needs to be */
+ /* priority boosted, or NULL if no priority */
+ /* boosting is needed for this rcu_node */
+ /* structure. If there are no tasks */
+ /* queued on this rcu_node structure that */
+ /* are blocking the current grace period, */
+ /* there can be no such task. */
+ unsigned long boost_time;
+ /* When to start boosting (jiffies). */
+ struct task_struct *boost_kthread_task;
+ /* kthread that takes care of priority */
+ /* boosting for this rcu_node structure. */
+ wait_queue_head_t boost_wq;
+ /* Wait queue on which to park the boost */
+ /* kthread. */
+#endif /* #ifdef CONFIG_RCU_BOOST */
struct task_struct *node_kthread_task;
/* kthread that takes care of this rcu_node */
/* structure, for example, awakening the */
@@ -365,7 +383,7 @@ DECLARE_PER_CPU(struct rcu_data, rcu_preempt_data);
static void rcu_bootup_announce(void);
long rcu_batches_completed(void);
static void rcu_preempt_note_context_switch(int cpu);
-static int rcu_preempted_readers(struct rcu_node *rnp);
+static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp);
#ifdef CONFIG_HOTPLUG_CPU
static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp,
unsigned long flags);
@@ -392,5 +410,13 @@ static void __cpuinit rcu_preempt_init_percpu_data(int cpu);
static void rcu_preempt_send_cbs_to_online(void);
static void __init __rcu_init_preempt(void);
static void rcu_needs_cpu_flush(void);
+static void __init rcu_init_boost_waitqueue(struct rcu_node *rnp);
+static void rcu_initiate_boost(struct rcu_node *rnp);
+static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp,
+ cpumask_var_t cm);
+static void rcu_preempt_boost_start_gp(struct rcu_node *rnp);
+static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
+ struct rcu_node *rnp,
+ int rnp_index);

#endif /* #ifndef RCU_TREE_NONCORE */
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index b9bd69a..1242949 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -66,6 +66,7 @@ static void __init rcu_bootup_announce_oddness(void)

struct rcu_state rcu_preempt_state = RCU_STATE_INITIALIZER(rcu_preempt_state);
DEFINE_PER_CPU(struct rcu_data, rcu_preempt_data);
+static struct rcu_state *rcu_state = &rcu_preempt_state;

static int rcu_preempted_readers_exp(struct rcu_node *rnp);

@@ -179,6 +180,10 @@ static void rcu_preempt_note_context_switch(int cpu)
if ((rnp->qsmask & rdp->grpmask) && rnp->gp_tasks != NULL) {
list_add(&t->rcu_node_entry, rnp->gp_tasks->prev);
rnp->gp_tasks = &t->rcu_node_entry;
+#ifdef CONFIG_RCU_BOOST
+ if (rnp->boost_tasks != NULL)
+ rnp->boost_tasks = rnp->gp_tasks;
+#endif /* #ifdef CONFIG_RCU_BOOST */
} else {
list_add(&t->rcu_node_entry, &rnp->blkd_tasks);
if (rnp->qsmask & rdp->grpmask)
@@ -218,7 +223,7 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
* for the specified rcu_node structure. If the caller needs a reliable
* answer, it must hold the rcu_node's ->lock.
*/
-static int rcu_preempted_readers(struct rcu_node *rnp)
+static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
{
return rnp->gp_tasks != NULL;
}
@@ -236,7 +241,7 @@ static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp, unsigned long flags)
unsigned long mask;
struct rcu_node *rnp_p;

- if (rnp->qsmask != 0 || rcu_preempted_readers(rnp)) {
+ if (rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) {
raw_spin_unlock_irqrestore(&rnp->lock, flags);
return; /* Still need more quiescent states! */
}
@@ -325,7 +330,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
break;
raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
}
- empty = !rcu_preempted_readers(rnp);
+ empty = !rcu_preempt_blocked_readers_cgp(rnp);
empty_exp = !rcu_preempted_readers_exp(rnp);
smp_mb(); /* ensure expedited fastpath sees end of RCU c-s. */
np = rcu_next_node_entry(t, rnp);
@@ -334,6 +339,10 @@ static void rcu_read_unlock_special(struct task_struct *t)
rnp->gp_tasks = np;
if (&t->rcu_node_entry == rnp->exp_tasks)
rnp->exp_tasks = np;
+#ifdef CONFIG_RCU_BOOST
+ if (&t->rcu_node_entry == rnp->boost_tasks)
+ rnp->boost_tasks = np;
+#endif /* #ifdef CONFIG_RCU_BOOST */
t->rcu_blocked_node = NULL;

/*
@@ -346,6 +355,15 @@ static void rcu_read_unlock_special(struct task_struct *t)
else
rcu_report_unblock_qs_rnp(rnp, flags);

+#ifdef CONFIG_RCU_BOOST
+ /* Unboost if we were boosted. */
+ if (special & RCU_READ_UNLOCK_BOOSTED) {
+ t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BOOSTED;
+ rt_mutex_unlock(t->rcu_boost_mutex);
+ t->rcu_boost_mutex = NULL;
+ }
+#endif /* #ifdef CONFIG_RCU_BOOST */
+
/*
* If this was the last task on the expedited lists,
* then we need to report up the rcu_node hierarchy.
@@ -391,7 +409,7 @@ static void rcu_print_detail_task_stall_rnp(struct rcu_node *rnp)
unsigned long flags;
struct task_struct *t;

- if (!rcu_preempted_readers(rnp))
+ if (!rcu_preempt_blocked_readers_cgp(rnp))
return;
raw_spin_lock_irqsave(&rnp->lock, flags);
t = list_entry(rnp->gp_tasks,
@@ -430,7 +448,7 @@ static void rcu_print_task_stall(struct rcu_node *rnp)
{
struct task_struct *t;

- if (!rcu_preempted_readers(rnp))
+ if (!rcu_preempt_blocked_readers_cgp(rnp))
return;
t = list_entry(rnp->gp_tasks,
struct task_struct, rcu_node_entry);
@@ -460,7 +478,7 @@ static void rcu_preempt_stall_reset(void)
*/
static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
{
- WARN_ON_ONCE(rcu_preempted_readers(rnp));
+ WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp));
if (!list_empty(&rnp->blkd_tasks))
rnp->gp_tasks = rnp->blkd_tasks.next;
WARN_ON_ONCE(rnp->qsmask);
@@ -509,7 +527,7 @@ static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
* absolutely necessary, but this is a good performance/complexity
* tradeoff.
*/
- if (rcu_preempted_readers(rnp))
+ if (rcu_preempt_blocked_readers_cgp(rnp))
retval |= RCU_OFL_TASKS_NORM_GP;
if (rcu_preempted_readers_exp(rnp))
retval |= RCU_OFL_TASKS_EXP_GP;
@@ -525,8 +543,20 @@ static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
rnp_root->gp_tasks = rnp->gp_tasks;
if (&t->rcu_node_entry == rnp->exp_tasks)
rnp_root->exp_tasks = rnp->exp_tasks;
+#ifdef CONFIG_RCU_BOOST
+ if (&t->rcu_node_entry == rnp->boost_tasks)
+ rnp_root->boost_tasks = rnp->boost_tasks;
+#endif /* #ifdef CONFIG_RCU_BOOST */
raw_spin_unlock(&rnp_root->lock); /* irqs still disabled */
}
+
+#ifdef CONFIG_RCU_BOOST
+ /* In case root is being boosted and leaf is not. */
+ if (rnp_root->boost_tasks != NULL &&
+ rnp_root->boost_tasks != rnp_root->gp_tasks)
+ rnp_root->boost_tasks = rnp_root->gp_tasks;
+#endif /* #ifdef CONFIG_RCU_BOOST */
+
rnp->gp_tasks = NULL;
rnp->exp_tasks = NULL;
return retval;
@@ -684,6 +714,7 @@ sync_rcu_preempt_exp_init(struct rcu_state *rsp, struct rcu_node *rnp)
raw_spin_lock(&rnp->lock); /* irqs already disabled */
if (!list_empty(&rnp->blkd_tasks)) {
rnp->exp_tasks = rnp->blkd_tasks.next;
+ rcu_initiate_boost(rnp);
must_wait = 1;
}
raw_spin_unlock(&rnp->lock); /* irqs remain disabled */
@@ -830,6 +861,8 @@ void exit_rcu(void)

#else /* #ifdef CONFIG_TREE_PREEMPT_RCU */

+static struct rcu_state *rcu_state = &rcu_sched_state;
+
/*
* Tell them what RCU they are running.
*/
@@ -870,7 +903,7 @@ static void rcu_preempt_note_context_switch(int cpu)
* Because preemptable RCU does not exist, there are never any preempted
* RCU readers.
*/
-static int rcu_preempted_readers(struct rcu_node *rnp)
+static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
{
return 0;
}
@@ -1034,6 +1067,230 @@ static void __init __rcu_init_preempt(void)

#endif /* #else #ifdef CONFIG_TREE_PREEMPT_RCU */

+#ifdef CONFIG_RCU_BOOST
+
+#include "rtmutex_common.h"
+
+/*
+ * Carry out RCU priority boosting on the task indicated by ->exp_tasks
+ * or ->boost_tasks, advancing the pointer to the next task in the
+ * ->blkd_tasks list.
+ *
+ * Note that irqs must be enabled: boosting the task can block.
+ * Returns 1 if there are more tasks needing to be boosted.
+ */
+static int rcu_boost(struct rcu_node *rnp)
+{
+ unsigned long flags;
+ struct rt_mutex mtx;
+ struct task_struct *t;
+ struct list_head *tb;
+
+ if (rnp->exp_tasks == NULL && rnp->boost_tasks == NULL)
+ return 0; /* Nothing left to boost. */
+
+ raw_spin_lock_irqsave(&rnp->lock, flags);
+
+ /*
+ * Recheck under the lock: all tasks in need of boosting
+ * might exit their RCU read-side critical sections on their own.
+ */
+ if (rnp->exp_tasks == NULL && rnp->boost_tasks == NULL) {
+ raw_spin_unlock_irqrestore(&rnp->lock, flags);
+ return 0;
+ }
+
+ /*
+ * Preferentially boost tasks blocking expedited grace periods.
+ * This cannot starve the normal grace periods because a second
+ * expedited grace period must boost all blocked tasks, including
+ * those blocking the pre-existing normal grace period.
+ */
+ if (rnp->exp_tasks != NULL)
+ tb = rnp->exp_tasks;
+ else
+ tb = rnp->boost_tasks;
+
+ /*
+ * We boost task t by manufacturing an rt_mutex that appears to
+ * be held by task t. We leave a pointer to that rt_mutex where
+ * task t can find it, and task t will release the mutex when it
+ * exits its outermost RCU read-side critical section. Then
+ * simply acquiring this artificial rt_mutex will boost task
+ * t's priority. (Thanks to tglx for suggesting this approach!)
+ *
+ * Note that task t must acquire rnp->lock to remove itself from
+ * the ->blkd_tasks list, which it will do from exit() if from
+ * nowhere else. We therefore are guaranteed that task t will
+ * stay around at least until we drop rnp->lock. Note that
+ * rnp->lock also resolves races between our priority boosting
+ * and task t's exiting its outermost RCU read-side critical
+ * section.
+ */
+ t = container_of(tb, struct task_struct, rcu_node_entry);
+ rt_mutex_init_proxy_locked(&mtx, t);
+ t->rcu_boost_mutex = &mtx;
+ t->rcu_read_unlock_special |= RCU_READ_UNLOCK_BOOSTED;
+ raw_spin_unlock_irqrestore(&rnp->lock, flags);
+ rt_mutex_lock(&mtx); /* Side effect: boosts task t's priority. */
+ rt_mutex_unlock(&mtx); /* Keep lockdep happy. */
+
+ return rnp->exp_tasks != NULL || rnp->boost_tasks != NULL;
+}
+
+/*
+ * Timer handler to initiate waking up of boost kthreads that
+ * have yielded the CPU due to excessive numbers of tasks to
+ * boost. We wake up the per-rcu_node kthread, which in turn
+ * will wake up the booster kthread.
+ */
+static void rcu_boost_kthread_timer(unsigned long arg)
+{
+ unsigned long flags;
+ struct rcu_node *rnp = (struct rcu_node *)arg;
+
+ raw_spin_lock_irqsave(&rnp->lock, flags);
+ invoke_rcu_node_kthread(rnp);
+ raw_spin_unlock_irqrestore(&rnp->lock, flags);
+}
+
+/*
+ * Priority-boosting kthread. One per leaf rcu_node and one for the
+ * root rcu_node.
+ */
+static int rcu_boost_kthread(void *arg)
+{
+ struct rcu_node *rnp = (struct rcu_node *)arg;
+ int spincnt = 0;
+ int more2boost;
+
+ for (;;) {
+ wait_event_interruptible(rnp->boost_wq, rnp->boost_tasks ||
+ rnp->exp_tasks ||
+ kthread_should_stop());
+ if (kthread_should_stop())
+ break;
+ more2boost = rcu_boost(rnp);
+ if (more2boost)
+ spincnt++;
+ else
+ spincnt = 0;
+ if (spincnt > 10) {
+ rcu_yield(rcu_boost_kthread_timer, (unsigned long)rnp);
+ spincnt = 0;
+ }
+ }
+ return 0;
+}
+
+/*
+ * Check to see if it is time to start boosting RCU readers that are
+ * blocking the current grace period, and, if so, tell the per-rcu_node
+ * kthread to start boosting them. If there is an expedited grace
+ * period in progress, it is always time to boost.
+ *
+ * The caller must hold rnp->lock.
+ */
+static void rcu_initiate_boost(struct rcu_node *rnp)
+{
+ struct task_struct *t;
+
+ if (!rcu_preempt_blocked_readers_cgp(rnp) && rnp->exp_tasks == NULL) {
+ rnp->n_balk_exp_gp_tasks++;
+ return;
+ }
+ if (rnp->exp_tasks != NULL ||
+ (rnp->gp_tasks != NULL &&
+ rnp->boost_tasks == NULL &&
+ rnp->qsmask == 0 &&
+ ULONG_CMP_GE(jiffies, rnp->boost_time))) {
+ if (rnp->exp_tasks == NULL)
+ rnp->boost_tasks = rnp->gp_tasks;
+ t = rnp->boost_kthread_task;
+ if (t != NULL)
+ wake_up_process(t);
+ }
+}
+
+static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp,
+ cpumask_var_t cm)
+{
+ set_cpus_allowed_ptr(rnp->boost_kthread_task, cm);
+}
+
+#define RCU_BOOST_DELAY_JIFFIES DIV_ROUND_UP(CONFIG_RCU_BOOST_DELAY * HZ, 1000)
+
+/*
+ * Do priority-boost accounting for the start of a new grace period.
+ */
+static void rcu_preempt_boost_start_gp(struct rcu_node *rnp)
+{
+ rnp->boost_time = jiffies + RCU_BOOST_DELAY_JIFFIES;
+}
+
+/*
+ * Initialize the RCU-boost waitqueue.
+ */
+static void __init rcu_init_boost_waitqueue(struct rcu_node *rnp)
+{
+ init_waitqueue_head(&rnp->boost_wq);
+}
+
+/*
+ * Create an RCU-boost kthread for the specified node if one does not
+ * already exist. We only create this kthread for preemptible RCU.
+ * Returns zero if all is well, a negated errno otherwise.
+ */
+static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
+ struct rcu_node *rnp,
+ int rnp_index)
+{
+ struct sched_param sp;
+ struct task_struct *t;
+
+ if (&rcu_preempt_state != rsp)
+ return 0;
+ if (rnp->boost_kthread_task != NULL)
+ return 0;
+ t = kthread_create(rcu_boost_kthread, (void *)rnp,
+ "rcub%d", rnp_index);
+ if (IS_ERR(t))
+ return PTR_ERR(t);
+ rnp->boost_kthread_task = t;
+ wake_up_process(t);
+ sp.sched_priority = RCU_KTHREAD_PRIO;
+ sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
+ return 0;
+}
+
+#else /* #ifdef CONFIG_RCU_BOOST */
+
+static void rcu_initiate_boost(struct rcu_node *rnp)
+{
+}
+
+static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp,
+ cpumask_var_t cm)
+{
+}
+
+static void rcu_preempt_boost_start_gp(struct rcu_node *rnp)
+{
+}
+
+static void __init rcu_init_boost_waitqueue(struct rcu_node *rnp)
+{
+}
+
+static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
+ struct rcu_node *rnp,
+ int rnp_index)
+{
+ return 0;
+}
+
+#endif /* #else #ifdef CONFIG_RCU_BOOST */
+
#ifndef CONFIG_SMP

void synchronize_sched_expedited(void)
@@ -1206,8 +1463,8 @@ static DEFINE_PER_CPU(unsigned long, rcu_dyntick_holdoff);
*
* Because it is not legal to invoke rcu_process_callbacks() with irqs
* disabled, we do one pass of force_quiescent_state(), then do a
- * invoke_rcu_kthread() to cause rcu_process_callbacks() to be invoked later.
- * The per-cpu rcu_dyntick_drain variable controls the sequencing.
+ * invoke_rcu_cpu_kthread() to cause rcu_process_callbacks() to be invoked
+ * later. The per-cpu rcu_dyntick_drain variable controls the sequencing.
*/
int rcu_needs_cpu(int cpu)
{
@@ -1257,7 +1514,7 @@ int rcu_needs_cpu(int cpu)

/* If RCU callbacks are still pending, RCU still needs this CPU. */
if (c)
- invoke_rcu_kthread();
+ invoke_rcu_cpu_kthread();
return c;
}

--
1.7.3.2

2011-02-23 01:40:34

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 10/11] rcu: Update documentation to reflect blocked_tasks[] merge

From: Paul E. McKenney <[email protected]>

Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/RCU/trace.txt | 29 ++++++++++++++++++-----------
1 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt
index e731ad2..5a704ff 100644
--- a/Documentation/RCU/trace.txt
+++ b/Documentation/RCU/trace.txt
@@ -166,14 +166,14 @@ o "gpnum" is the number of grace periods that have started. It is
The output of "cat rcu/rcuhier" looks as follows, with very long lines:

c=6902 g=6903 s=2 jfq=3 j=72c7 nfqs=13142/nfqsng=0(13142) fqlh=6
-1/1 .>. 0:127 ^0
-3/3 .>. 0:35 ^0 0/0 .>. 36:71 ^1 0/0 .>. 72:107 ^2 0/0 .>. 108:127 ^3
-3/3f .>. 0:5 ^0 2/3 .>. 6:11 ^1 0/0 .>. 12:17 ^2 0/0 .>. 18:23 ^3 0/0 .>. 24:29 ^4 0/0 .>. 30:35 ^5 0/0 .>. 36:41 ^0 0/0 .>. 42:47 ^1 0/0 .>. 48:53 ^2 0/0 .>. 54:59 ^3 0/0 .>. 60:65 ^4 0/0 .>. 66:71 ^5 0/0 .>. 72:77 ^0 0/0 .>. 78:83 ^1 0/0 .>. 84:89 ^2 0/0 .>. 90:95 ^3 0/0 .>. 96:101 ^4 0/0 .>. 102:107 ^5 0/0 .>. 108:113 ^0 0/0 .>. 114:119 ^1 0/0 .>. 120:125 ^2 0/0 .>. 126:127 ^3
+1/1 ..>. 0:127 ^0
+3/3 ..>. 0:35 ^0 0/0 ..>. 36:71 ^1 0/0 ..>. 72:107 ^2 0/0 ..>. 108:127 ^3
+3/3f ..>. 0:5 ^0 2/3 ..>. 6:11 ^1 0/0 ..>. 12:17 ^2 0/0 ..>. 18:23 ^3 0/0 ..>. 24:29 ^4 0/0 ..>. 30:35 ^5 0/0 ..>. 36:41 ^0 0/0 ..>. 42:47 ^1 0/0 ..>. 48:53 ^2 0/0 ..>. 54:59 ^3 0/0 ..>. 60:65 ^4 0/0 ..>. 66:71 ^5 0/0 ..>. 72:77 ^0 0/0 ..>. 78:83 ^1 0/0 ..>. 84:89 ^2 0/0 ..>. 90:95 ^3 0/0 ..>. 96:101 ^4 0/0 ..>. 102:107 ^5 0/0 ..>. 108:113 ^0 0/0 ..>. 114:119 ^1 0/0 ..>. 120:125 ^2 0/0 ..>. 126:127 ^3
rcu_bh:
c=-226 g=-226 s=1 jfq=-5701 j=72c7 nfqs=88/nfqsng=0(88) fqlh=0
-0/1 .>. 0:127 ^0
-0/3 .>. 0:35 ^0 0/0 .>. 36:71 ^1 0/0 .>. 72:107 ^2 0/0 .>. 108:127 ^3
-0/3f .>. 0:5 ^0 0/3 .>. 6:11 ^1 0/0 .>. 12:17 ^2 0/0 .>. 18:23 ^3 0/0 .>. 24:29 ^4 0/0 .>. 30:35 ^5 0/0 .>. 36:41 ^0 0/0 .>. 42:47 ^1 0/0 .>. 48:53 ^2 0/0 .>. 54:59 ^3 0/0 .>. 60:65 ^4 0/0 .>. 66:71 ^5 0/0 .>. 72:77 ^0 0/0 .>. 78:83 ^1 0/0 .>. 84:89 ^2 0/0 .>. 90:95 ^3 0/0 .>. 96:101 ^4 0/0 .>. 102:107 ^5 0/0 .>. 108:113 ^0 0/0 .>. 114:119 ^1 0/0 .>. 120:125 ^2 0/0 .>. 126:127 ^3
+0/1 ..>. 0:127 ^0
+0/3 ..>. 0:35 ^0 0/0 ..>. 36:71 ^1 0/0 ..>. 72:107 ^2 0/0 ..>. 108:127 ^3
+0/3f ..>. 0:5 ^0 0/3 ..>. 6:11 ^1 0/0 ..>. 12:17 ^2 0/0 ..>. 18:23 ^3 0/0 ..>. 24:29 ^4 0/0 ..>. 30:35 ^5 0/0 ..>. 36:41 ^0 0/0 ..>. 42:47 ^1 0/0 ..>. 48:53 ^2 0/0 ..>. 54:59 ^3 0/0 ..>. 60:65 ^4 0/0 ..>. 66:71 ^5 0/0 ..>. 72:77 ^0 0/0 ..>. 78:83 ^1 0/0 ..>. 84:89 ^2 0/0 ..>. 90:95 ^3 0/0 ..>. 96:101 ^4 0/0 ..>. 102:107 ^5 0/0 ..>. 108:113 ^0 0/0 ..>. 114:119 ^1 0/0 ..>. 120:125 ^2 0/0 ..>. 126:127 ^3

This is once again split into "rcu_sched" and "rcu_bh" portions,
and CONFIG_TREE_PREEMPT_RCU kernels will again have an additional
@@ -232,13 +232,20 @@ o Each element of the form "1/1 0:127 ^0" represents one struct
current grace period.

o The characters separated by the ">" indicate the state
- of the blocked-tasks lists. A "T" preceding the ">"
+ of the blocked-tasks lists. A "G" preceding the ">"
indicates that at least one task blocked in an RCU
read-side critical section blocks the current grace
- period, while a "." preceding the ">" indicates otherwise.
- The character following the ">" indicates similarly for
- the next grace period. A "T" should appear in this
- field only for rcu-preempt.
+ period, while an "E" preceding the ">" indicates that
+ at least one task blocked in an RCU read-side critical
+ section blocks the current expedited grace period.
+ A "T" character following the ">" indicates that at
+ least one task is blocked within an RCU read-side
+ critical section, regardless of whether any current
+ grace period (expedited or normal) is inconvenienced.
+ A "." character appears if the corresponding condition
+ does not hold, so that "..>." indicates that no tasks
+ are blocked. In contrast, "GE>T" indicates maximal
+ inconvenience from blocked tasks.

o The numbers separated by the ":" are the range of CPUs
served by this struct rcu_node. This can be helpful
--
1.7.3.2

2011-02-23 01:41:02

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 02/11] rcutorture: Get rid of duplicate sched.h include

From: Jesper Juhl <[email protected]>

linux/sched.h is included twice in kernel/rcutorture.c - once is enough.

Signed-off-by: Jesper Juhl <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcutorture.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
index 89613f9..c224da4 100644
--- a/kernel/rcutorture.c
+++ b/kernel/rcutorture.c
@@ -47,7 +47,6 @@
#include <linux/srcu.h>
#include <linux/slab.h>
#include <asm/byteorder.h>
-#include <linux/sched.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Paul E. McKenney <[email protected]> and "
--
1.7.3.2

2011-02-23 01:41:26

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 13/14] rcu: eliminate unused boosting statistics

From: Paul E. McKenney <[email protected]>

The n_rcu_torture_boost_allocerror and n_rcu_torture_boost_afferror
statistics are not actually incremented anymore, so eliminate them.

Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcutorture.c | 4 +---
1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
index c224da4..68f434f 100644
--- a/kernel/rcutorture.c
+++ b/kernel/rcutorture.c
@@ -1067,7 +1067,7 @@ rcu_torture_printk(char *page)
cnt += sprintf(&page[cnt], "%s%s ", torture_type, TORTURE_FLAG);
cnt += sprintf(&page[cnt],
"rtc: %p ver: %ld tfle: %d rta: %d rtaf: %d rtf: %d "
- "rtmbe: %d rtbke: %ld rtbre: %ld rtbae: %ld rtbafe: %ld "
+ "rtmbe: %d rtbke: %ld rtbre: %ld "
"rtbf: %ld rtb: %ld nt: %ld",
rcu_torture_current,
rcu_torture_current_version,
@@ -1078,8 +1078,6 @@ rcu_torture_printk(char *page)
atomic_read(&n_rcu_torture_mberror),
n_rcu_torture_boost_ktrerror,
n_rcu_torture_boost_rterror,
- n_rcu_torture_boost_allocerror,
- n_rcu_torture_boost_afferror,
n_rcu_torture_boost_failure,
n_rcu_torture_boosts,
n_rcu_torture_timers);
--
1.7.3.2

2011-02-23 01:41:28

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 07/11] rcu: Remove conditional compilation for RCU CPU stall warnings

The RCU CPU stall warnings can now be controlled using the
rcu_cpu_stall_suppress boot-time parameter or via the same parameter
from sysfs. There is therefore no longer any reason to have
kernel config parameters for this feature. This commit therefore
removes the RCU_CPU_STALL_DETECTOR and RCU_CPU_STALL_DETECTOR_RUNNABLE
kernel config parameters. The RCU_CPU_STALL_TIMEOUT parameter remains
to allow the timeout to be tuned and the RCU_CPU_STALL_VERBOSE parameter
remains to allow task-stall information to be suppressed if desired.

Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/RCU/00-INDEX | 2 +-
Documentation/RCU/stallwarn.txt | 23 +++++++++++++----------
kernel/rcutree.c | 26 +-------------------------
kernel/rcutree.h | 12 ------------
kernel/rcutree_plugin.h | 12 ------------
lib/Kconfig.debug | 30 ++----------------------------
6 files changed, 17 insertions(+), 88 deletions(-)

diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
index 71b6f50..1d7a885 100644
--- a/Documentation/RCU/00-INDEX
+++ b/Documentation/RCU/00-INDEX
@@ -21,7 +21,7 @@ rcu.txt
RTFP.txt
- List of RCU papers (bibliography) going back to 1980.
stallwarn.txt
- - RCU CPU stall warnings (CONFIG_RCU_CPU_STALL_DETECTOR)
+ - RCU CPU stall warnings (module parameter rcu_cpu_stall_suppress)
torture.txt
- RCU Torture Test Operation (CONFIG_RCU_TORTURE_TEST)
trace.txt
diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
index 862c08e..4e95920 100644
--- a/Documentation/RCU/stallwarn.txt
+++ b/Documentation/RCU/stallwarn.txt
@@ -1,22 +1,25 @@
Using RCU's CPU Stall Detector

-The CONFIG_RCU_CPU_STALL_DETECTOR kernel config parameter enables
-RCU's CPU stall detector, which detects conditions that unduly delay
-RCU grace periods. The stall detector's idea of what constitutes
-"unduly delayed" is controlled by a set of C preprocessor macros:
+The rcu_cpu_stall_suppress module parameter enables RCU's CPU stall
+detector, which detects conditions that unduly delay RCU grace periods.
+This module parameter enables CPU stall detection by default, but
+may be overridden via boot-time parameter or at runtime via sysfs.
+The stall detector's idea of what constitutes "unduly delayed" is
+controlled by a set of kernel configuration variables and cpp macros:

-RCU_SECONDS_TILL_STALL_CHECK
+CONFIG_RCU_CPU_STALL_TIMEOUT

- This macro defines the period of time that RCU will wait from
- the beginning of a grace period until it issues an RCU CPU
- stall warning. This time period is normally ten seconds.
+ This kernel configuration parameter defines the period of time
+ that RCU will wait from the beginning of a grace period until it
+ issues an RCU CPU stall warning. This time period is normally
+ ten seconds.

RCU_SECONDS_TILL_STALL_RECHECK

This macro defines the period of time that RCU will wait after
issuing a stall warning until it issues another stall warning
- for the same stall. This time period is normally set to thirty
- seconds.
+ for the same stall. This time period is normally set to three
+ times the check interval plus thirty seconds.

RCU_STALL_RAT_DELAY

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index d0ddfea..5513129 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -140,10 +140,8 @@ module_param(blimit, int, 0);
module_param(qhimark, int, 0);
module_param(qlowmark, int, 0);

-#ifdef CONFIG_RCU_CPU_STALL_DETECTOR
-int rcu_cpu_stall_suppress __read_mostly = RCU_CPU_STALL_SUPPRESS_INIT;
+int rcu_cpu_stall_suppress __read_mostly;
module_param(rcu_cpu_stall_suppress, int, 0644);
-#endif /* #ifdef CONFIG_RCU_CPU_STALL_DETECTOR */

static void force_quiescent_state(struct rcu_state *rsp, int relaxed);
static int rcu_pending(int cpu);
@@ -450,8 +448,6 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)

#endif /* #else #ifdef CONFIG_NO_HZ */

-#ifdef CONFIG_RCU_CPU_STALL_DETECTOR
-
int rcu_cpu_stall_suppress __read_mostly;

static void record_gp_stall_check_time(struct rcu_state *rsp)
@@ -587,26 +583,6 @@ static void __init check_cpu_stall_init(void)
atomic_notifier_chain_register(&panic_notifier_list, &rcu_panic_block);
}

-#else /* #ifdef CONFIG_RCU_CPU_STALL_DETECTOR */
-
-static void record_gp_stall_check_time(struct rcu_state *rsp)
-{
-}
-
-static void check_cpu_stall(struct rcu_state *rsp, struct rcu_data *rdp)
-{
-}
-
-void rcu_cpu_stall_reset(void)
-{
-}
-
-static void __init check_cpu_stall_init(void)
-{
-}
-
-#endif /* #else #ifdef CONFIG_RCU_CPU_STALL_DETECTOR */
-
/*
* Update CPU-local rcu_data state to record the newly noticed grace period.
* This is used both when we started the grace period and when we notice
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index e8f057e..e1a6663 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -254,7 +254,6 @@ struct rcu_data {
#endif /* #else #ifdef CONFIG_NO_HZ */

#define RCU_JIFFIES_TILL_FORCE_QS 3 /* for rsp->jiffies_force_qs */
-#ifdef CONFIG_RCU_CPU_STALL_DETECTOR

#ifdef CONFIG_PROVE_RCU
#define RCU_STALL_DELAY_DELTA (5 * HZ)
@@ -272,13 +271,6 @@ struct rcu_data {
/* scheduling clock irq */
/* before ratting on them. */

-#ifdef CONFIG_RCU_CPU_STALL_DETECTOR_RUNNABLE
-#define RCU_CPU_STALL_SUPPRESS_INIT 0
-#else
-#define RCU_CPU_STALL_SUPPRESS_INIT 1
-#endif
-
-#endif /* #ifdef CONFIG_RCU_CPU_STALL_DETECTOR */

/*
* RCU global state, including node hierarchy. This hierarchy is
@@ -325,12 +317,10 @@ struct rcu_state {
/* due to lock unavailable. */
unsigned long n_force_qs_ngp; /* Number of calls leaving */
/* due to no GP active. */
-#ifdef CONFIG_RCU_CPU_STALL_DETECTOR
unsigned long gp_start; /* Time at which GP started, */
/* but in jiffies. */
unsigned long jiffies_stall; /* Time at which to check */
/* for CPU stalls. */
-#endif /* #ifdef CONFIG_RCU_CPU_STALL_DETECTOR */
char *name; /* Name of structure. */
};

@@ -366,11 +356,9 @@ static int rcu_preempted_readers(struct rcu_node *rnp);
static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp,
unsigned long flags);
#endif /* #ifdef CONFIG_HOTPLUG_CPU */
-#ifdef CONFIG_RCU_CPU_STALL_DETECTOR
static void rcu_print_detail_task_stall(struct rcu_state *rsp);
static void rcu_print_task_stall(struct rcu_node *rnp);
static void rcu_preempt_stall_reset(void);
-#endif /* #ifdef CONFIG_RCU_CPU_STALL_DETECTOR */
static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp);
#ifdef CONFIG_HOTPLUG_CPU
static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index a363871..38426ef 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -54,10 +54,6 @@ static void __init rcu_bootup_announce_oddness(void)
#ifdef CONFIG_RCU_TORTURE_TEST_RUNNABLE
printk(KERN_INFO "\tRCU torture testing starts during boot.\n");
#endif
-#ifndef CONFIG_RCU_CPU_STALL_DETECTOR
- printk(KERN_INFO
- "\tRCU-based detection of stalled CPUs is disabled.\n");
-#endif
#if defined(CONFIG_TREE_PREEMPT_RCU) && !defined(CONFIG_RCU_CPU_STALL_VERBOSE)
printk(KERN_INFO "\tVerbose stalled-CPUs detection is disabled.\n");
#endif
@@ -356,8 +352,6 @@ void __rcu_read_unlock(void)
}
EXPORT_SYMBOL_GPL(__rcu_read_unlock);

-#ifdef CONFIG_RCU_CPU_STALL_DETECTOR
-
#ifdef CONFIG_RCU_CPU_STALL_VERBOSE

/*
@@ -430,8 +424,6 @@ static void rcu_preempt_stall_reset(void)
rcu_preempt_state.jiffies_stall = jiffies + ULONG_MAX / 2;
}

-#endif /* #ifdef CONFIG_RCU_CPU_STALL_DETECTOR */
-
/*
* Check that the list of blocked tasks for the newly completed grace
* period is in fact empty. It is a serious bug to complete a grace
@@ -862,8 +854,6 @@ static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp, unsigned long flags)

#endif /* #ifdef CONFIG_HOTPLUG_CPU */

-#ifdef CONFIG_RCU_CPU_STALL_DETECTOR
-
/*
* Because preemptable RCU does not exist, we never have to check for
* tasks blocked within RCU read-side critical sections.
@@ -888,8 +878,6 @@ static void rcu_preempt_stall_reset(void)
{
}

-#endif /* #ifdef CONFIG_RCU_CPU_STALL_DETECTOR */
-
/*
* Because there is no preemptable RCU, there can be no readers blocked,
* so there is no need to check for blocked tasks. So check only for
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 28b42b9..9296cc7 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -862,22 +862,9 @@ config RCU_TORTURE_TEST_RUNNABLE
Say N here if you want the RCU torture tests to start only
after being manually enabled via /proc.

-config RCU_CPU_STALL_DETECTOR
- bool "Check for stalled CPUs delaying RCU grace periods"
- depends on TREE_RCU || TREE_PREEMPT_RCU
- default y
- help
- This option causes RCU to printk information on which
- CPUs are delaying the current grace period, but only when
- the grace period extends for excessive time periods.
-
- Say N if you want to disable such checks.
-
- Say Y if you are unsure.
-
config RCU_CPU_STALL_TIMEOUT
int "RCU CPU stall timeout in seconds"
- depends on RCU_CPU_STALL_DETECTOR
+ depends on TREE_RCU || TREE_PREEMPT_RCU
range 3 300
default 60
help
@@ -886,22 +873,9 @@ config RCU_CPU_STALL_TIMEOUT
RCU grace period persists, additional CPU stall warnings are
printed at more widely spaced intervals.

-config RCU_CPU_STALL_DETECTOR_RUNNABLE
- bool "RCU CPU stall checking starts automatically at boot"
- depends on RCU_CPU_STALL_DETECTOR
- default y
- help
- If set, start checking for RCU CPU stalls immediately on
- boot. Otherwise, RCU CPU stall checking must be manually
- enabled.
-
- Say Y if you are unsure.
-
- Say N if you wish to suppress RCU CPU stall checking during boot.
-
config RCU_CPU_STALL_VERBOSE
bool "Print additional per-task information for RCU_CPU_STALL_DETECTOR"
- depends on RCU_CPU_STALL_DETECTOR && TREE_PREEMPT_RCU
+ depends on TREE_PREEMPT_RCU
default y
help
This option causes RCU to printk detailed per-task information
--
1.7.3.2

2011-02-23 01:41:51

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 06/11] smp: Document transitivity for memory barriers.

Transitivity is guaranteed only for full memory barriers (smp_mb()).

Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/memory-barriers.txt | 58 +++++++++++++++++++++++++++++++++++++
1 files changed, 58 insertions(+), 0 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 631ad2f..f0d3a80 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -21,6 +21,7 @@ Contents:
- SMP barrier pairing.
- Examples of memory barrier sequences.
- Read memory barriers vs load speculation.
+ - Transitivity

(*) Explicit kernel barriers.

@@ -959,6 +960,63 @@ the speculation will be cancelled and the value reloaded:
retrieved : : +-------+


+TRANSITIVITY
+------------
+
+Transitivity is a deeply intuitive notion about ordering that is not
+always provided by real computer systems. The following example
+demonstrates transitivity (also called "cumulativity"):
+
+ CPU 1 CPU 2 CPU 3
+ ======================= ======================= =======================
+ { X = 0, Y = 0 }
+ STORE X=1 LOAD X STORE Y=1
+ <general barrier> <general barrier>
+ LOAD Y LOAD X
+
+Suppose that CPU 2's load from X returns 1 and its load from Y returns 0.
+This indicates that CPU 2's load from X in some sense follows CPU 1's
+store to X and that CPU 2's load from Y in some sense preceded CPU 3's
+store to Y. The question is then "Can CPU 3's load from X return 0?"
+
+Because CPU 2's load from X in some sense came after CPU 1's store, it
+is natural to expect that CPU 3's load from X must therefore return 1.
+This expectation is an example of transitivity: if a load executing on
+CPU A follows a load from the same variable executing on CPU B, then
+CPU A's load must either return the same value that CPU B's load did,
+or must return some later value.
+
+In the Linux kernel, use of general memory barriers guarantees
+transitivity. Therefore, in the above example, if CPU 2's load from X
+returns 1 and its load from Y returns 0, then CPU 3's load from X must
+also return 1.
+
+However, transitivity is -not- guaranteed for read or write barriers.
+For example, suppose that CPU 2's general barrier in the above example
+is changed to a read barrier as shown below:
+
+ CPU 1 CPU 2 CPU 3
+ ======================= ======================= =======================
+ { X = 0, Y = 0 }
+ STORE X=1 LOAD X STORE Y=1
+ <read barrier> <general barrier>
+ LOAD Y LOAD X
+
+This substitution destroys transitivity: in this example, it is perfectly
+legal for CPU 2's load from X to return 1, its load from Y to return 0,
+and CPU 3's load from X to return 0.
+
+The key point is that although CPU 2's read barrier orders its pair
+of loads, it does not guarantee to order CPU 1's store. Therefore, if
+this example runs on a system where CPUs 1 and 2 share a store buffer
+or a level of cache, CPU 2 might have early access to CPU 1's writes.
+General barriers are therefore required to ensure that all CPUs agree
+on the combined order of CPU 1's and CPU 2's accesses.
+
+To reiterate, if your code requires transitivity, use general barriers
+throughout.
+
+
========================
EXPLICIT KERNEL BARRIERS
========================
--
1.7.3.2
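
As an aside, the three-CPU example above can be exercised from userspace.
The sketch below is illustrative only: it assumes that C11's
atomic_thread_fence(memory_order_seq_cst) is an acceptable stand-in for the
kernel's general barrier, and a single run will rarely hit the interesting
interleaving, so treat it as a restatement of the litmus test rather than a
proof of anything.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int X, Y;
static int r_x2, r_y2, r_x3;

static void *cpu1(void *arg)
{
        (void)arg;
        atomic_store_explicit(&X, 1, memory_order_relaxed);     /* STORE X=1 */
        return NULL;
}

static void *cpu2(void *arg)
{
        (void)arg;
        r_x2 = atomic_load_explicit(&X, memory_order_relaxed);  /* LOAD X */
        atomic_thread_fence(memory_order_seq_cst);              /* <general barrier> */
        r_y2 = atomic_load_explicit(&Y, memory_order_relaxed);  /* LOAD Y */
        return NULL;
}

static void *cpu3(void *arg)
{
        (void)arg;
        atomic_store_explicit(&Y, 1, memory_order_relaxed);     /* STORE Y=1 */
        atomic_thread_fence(memory_order_seq_cst);              /* <general barrier> */
        r_x3 = atomic_load_explicit(&X, memory_order_relaxed);  /* LOAD X */
        return NULL;
}

int main(void)
{
        pthread_t t1, t2, t3;

        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_create(&t2, NULL, cpu2, NULL);
        pthread_create(&t3, NULL, cpu3, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        pthread_join(t3, NULL);

        /* With full barriers, r_x2 == 1 && r_y2 == 0 implies r_x3 == 1. */
        printf("x2=%d y2=%d x3=%d\n", r_x2, r_y2, r_x3);
        return 0;
}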

2011-02-23 01:41:56

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

From: Paul E. McKenney <[email protected]>

If RCU priority boosting is to be meaningful, callback invocation must
be boosted in addition to preempted RCU readers. Otherwise, in the presence
of CPU-bound real-time threads, the grace period ends, but the callbacks don't
get invoked. If the callbacks don't get invoked, the associated memory
doesn't get freed, so the system is still subject to OOM.

But it is not reasonable to priority-boost RCU_SOFTIRQ, so this commit
moves the callback invocations to a kthread, which can be boosted easily.

Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/interrupt.h | 1 -
include/trace/events/irq.h | 3 +-
kernel/rcutree.c | 324 ++++++++++++++++++++++++++++++++++-
kernel/rcutree.h | 8 +
kernel/rcutree_plugin.h | 4 +-
tools/perf/util/trace-event-parse.c | 1 -
6 files changed, 331 insertions(+), 10 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 79d0c4f..ed47deb 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -385,7 +385,6 @@ enum
TASKLET_SOFTIRQ,
SCHED_SOFTIRQ,
HRTIMER_SOFTIRQ,
- RCU_SOFTIRQ, /* Preferable RCU should always be the last softirq */

NR_SOFTIRQS
};
diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h
index 1c09820..ae045ca 100644
--- a/include/trace/events/irq.h
+++ b/include/trace/events/irq.h
@@ -20,8 +20,7 @@ struct softirq_action;
softirq_name(BLOCK_IOPOLL), \
softirq_name(TASKLET), \
softirq_name(SCHED), \
- softirq_name(HRTIMER), \
- softirq_name(RCU))
+ softirq_name(HRTIMER))

/**
* irq_handler_entry - called immediately before the irq action handler
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 0ac1cc0..2241f28 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -47,6 +47,8 @@
#include <linux/mutex.h>
#include <linux/time.h>
#include <linux/kernel_stat.h>
+#include <linux/wait.h>
+#include <linux/kthread.h>

#include "rcutree.h"

@@ -82,6 +84,18 @@ DEFINE_PER_CPU(struct rcu_data, rcu_bh_data);
int rcu_scheduler_active __read_mostly;
EXPORT_SYMBOL_GPL(rcu_scheduler_active);

+/* Control variables for per-CPU and per-rcu_node kthreads. */
+
+static DEFINE_PER_CPU(struct task_struct *, rcu_cpu_kthread_task);
+static DEFINE_PER_CPU(wait_queue_head_t, rcu_cpu_wq);
+static DEFINE_PER_CPU(char, rcu_cpu_has_work);
+static char rcu_kthreads_spawnable;
+
+static void rcu_node_kthread_setaffinity(struct rcu_node *rnp);
+static void invoke_rcu_kthread(void);
+
+#define RCU_KTHREAD_PRIO 1 /* RT priority for per-CPU kthreads. */
+
/*
* Return true if an RCU grace period is in progress. The ACCESS_ONCE()s
* permit this function to be invoked without holding the root rcu_node
@@ -1018,6 +1032,12 @@ static void __rcu_offline_cpu(int cpu, struct rcu_state *rsp)
struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
struct rcu_node *rnp;

+ /* Stop the CPU's kthread. */
+ if (per_cpu(rcu_cpu_kthread_task, cpu) != NULL) {
+ kthread_stop(per_cpu(rcu_cpu_kthread_task, cpu));
+ per_cpu(rcu_cpu_kthread_task, cpu) = NULL;
+ }
+
/* Exclude any attempts to start a new grace period. */
raw_spin_lock_irqsave(&rsp->onofflock, flags);

@@ -1054,6 +1074,18 @@ static void __rcu_offline_cpu(int cpu, struct rcu_state *rsp)
raw_spin_unlock_irqrestore(&rnp->lock, flags);
if (need_report & RCU_OFL_TASKS_EXP_GP)
rcu_report_exp_rnp(rsp, rnp);
+
+ /*
+ * If there are no more online CPUs for this rcu_node structure,
+ * kill the rcu_node structure's kthread. Otherwise, adjust its
+ * affinity.
+ */
+ if (rnp->node_kthread_task != NULL &&
+ rnp->qsmaskinit == 0) {
+ kthread_stop(rnp->node_kthread_task);
+ rnp->node_kthread_task = NULL;
+ } else
+ rcu_node_kthread_setaffinity(rnp);
}

/*
@@ -1151,7 +1183,7 @@ static void rcu_do_batch(struct rcu_state *rsp, struct rcu_data *rdp)

/* Re-raise the RCU softirq if there are callbacks remaining. */
if (cpu_has_callbacks_ready_to_invoke(rdp))
- raise_softirq(RCU_SOFTIRQ);
+ invoke_rcu_kthread();
}

/*
@@ -1197,7 +1229,7 @@ void rcu_check_callbacks(int cpu, int user)
}
rcu_preempt_check_callbacks(cpu);
if (rcu_pending(cpu))
- raise_softirq(RCU_SOFTIRQ);
+ invoke_rcu_kthread();
}

#ifdef CONFIG_SMP
@@ -1361,7 +1393,7 @@ __rcu_process_callbacks(struct rcu_state *rsp, struct rcu_data *rdp)
/*
* Do softirq processing for the current CPU.
*/
-static void rcu_process_callbacks(struct softirq_action *unused)
+static void rcu_process_callbacks(void)
{
__rcu_process_callbacks(&rcu_sched_state,
&__get_cpu_var(rcu_sched_data));
@@ -1372,6 +1404,272 @@ static void rcu_process_callbacks(struct softirq_action *unused)
rcu_needs_cpu_flush();
}

+/*
+ * Wake up the current CPU's kthread. This replaces raise_softirq()
+ * in earlier versions of RCU.
+ */
+static void invoke_rcu_kthread(void)
+{
+ unsigned long flags;
+ wait_queue_head_t *q;
+ int cpu;
+
+ local_irq_save(flags);
+ cpu = smp_processor_id();
+ if (per_cpu(rcu_cpu_kthread_task, cpu) == NULL) {
+ local_irq_restore(flags);
+ return;
+ }
+ per_cpu(rcu_cpu_has_work, cpu) = 1;
+ q = &per_cpu(rcu_cpu_wq, cpu);
+ wake_up(q);
+ local_irq_restore(flags);
+}
+
+/*
+ * Timer handler to initiate the waking up of per-CPU kthreads that
+ * have yielded the CPU due to excess numbers of RCU callbacks.
+ */
+static void rcu_cpu_kthread_timer(unsigned long arg)
+{
+ unsigned long flags;
+ struct rcu_data *rdp = (struct rcu_data *)arg;
+ struct rcu_node *rnp = rdp->mynode;
+ struct task_struct *t;
+
+ raw_spin_lock_irqsave(&rnp->lock, flags);
+ rnp->wakemask |= rdp->grpmask;
+ t = rnp->node_kthread_task;
+ if (t == NULL) {
+ raw_spin_unlock_irqrestore(&rnp->lock, flags);
+ return;
+ }
+ wake_up_process(t);
+ raw_spin_unlock_irqrestore(&rnp->lock, flags);
+}
+
+/*
+ * Drop to non-real-time priority and yield, but only after posting a
+ * timer that will cause us to regain our real-time priority if we
+ * remain preempted. Either way, we restore our real-time priority
+ * before returning.
+ */
+static void rcu_yield(int cpu)
+{
+ struct rcu_data *rdp = per_cpu_ptr(rcu_sched_state.rda, cpu);
+ struct sched_param sp;
+ struct timer_list yield_timer;
+
+ setup_timer(&yield_timer, rcu_cpu_kthread_timer, (unsigned long)rdp);
+ mod_timer(&yield_timer, jiffies + 2);
+ sp.sched_priority = 0;
+ sched_setscheduler_nocheck(current, SCHED_NORMAL, &sp);
+ schedule();
+ sp.sched_priority = RCU_KTHREAD_PRIO;
+ sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
+ del_timer(&yield_timer);
+}
+
+/*
+ * Handle cases where the rcu_cpu_kthread() ends up on the wrong CPU.
+ * This can happen while the corresponding CPU is either coming online
+ * or going offline. We cannot wait until the CPU is fully online
+ * before starting the kthread, because the various notifier functions
+ * can wait for RCU grace periods. So we park rcu_cpu_kthread() until
+ * the corresponding CPU is online.
+ *
+ * Return 1 if the kthread needs to stop, 0 otherwise.
+ *
+ * Caller must disable bh. This function can momentarily enable it.
+ */
+static int rcu_cpu_kthread_should_stop(int cpu)
+{
+ while (cpu_is_offline(cpu) || smp_processor_id() != cpu) {
+ if (kthread_should_stop())
+ return 1;
+ local_bh_enable();
+ schedule_timeout_uninterruptible(1);
+ if (smp_processor_id() != cpu)
+ set_cpus_allowed_ptr(current, cpumask_of(cpu));
+ local_bh_disable();
+ }
+ return 0;
+}
+
+/*
+ * Per-CPU kernel thread that invokes RCU callbacks. This replaces the
+ * earlier RCU softirq.
+ */
+static int rcu_cpu_kthread(void *arg)
+{
+ int cpu = (int)(long)arg;
+ unsigned long flags;
+ int spincnt = 0;
+ wait_queue_head_t *wqp = &per_cpu(rcu_cpu_wq, cpu);
+ char work;
+ char *workp = &per_cpu(rcu_cpu_has_work, cpu);
+
+ for (;;) {
+ wait_event_interruptible(*wqp,
+ *workp != 0 || kthread_should_stop());
+ local_bh_disable();
+ if (rcu_cpu_kthread_should_stop(cpu)) {
+ local_bh_enable();
+ break;
+ }
+ local_irq_save(flags);
+ work = *workp;
+ *workp = 0;
+ local_irq_restore(flags);
+ if (work)
+ rcu_process_callbacks();
+ local_bh_enable();
+ if (*workp != 0)
+ spincnt++;
+ else
+ spincnt = 0;
+ if (spincnt > 10) {
+ rcu_yield(cpu);
+ spincnt = 0;
+ }
+ }
+ return 0;
+}
+
+/*
+ * Spawn a per-CPU kthread, setting up affinity and priority.
+ */
+static int __cpuinit rcu_spawn_one_cpu_kthread(int cpu)
+{
+ struct sched_param sp;
+ struct task_struct *t;
+
+ if (!rcu_kthreads_spawnable ||
+ per_cpu(rcu_cpu_kthread_task, cpu) != NULL)
+ return 0;
+ t = kthread_create(rcu_cpu_kthread, (void *)(long)cpu, "rcuc%d", cpu);
+ if (IS_ERR(t))
+ return PTR_ERR(t);
+ kthread_bind(t, cpu);
+ WARN_ON_ONCE(per_cpu(rcu_cpu_kthread_task, cpu) != NULL);
+ per_cpu(rcu_cpu_kthread_task, cpu) = t;
+ wake_up_process(t);
+ sp.sched_priority = RCU_KTHREAD_PRIO;
+ sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
+ return 0;
+}
+
+/*
+ * Per-rcu_node kthread, which is in charge of waking up the per-CPU
+ * kthreads when needed.
+ */
+static int rcu_node_kthread(void *arg)
+{
+ int cpu;
+ unsigned long flags;
+ unsigned long mask;
+ struct rcu_node *rnp = (struct rcu_node *)arg;
+ struct sched_param sp;
+ struct task_struct *t;
+
+ for (;;) {
+ wait_event_interruptible(rnp->node_wq, rnp->wakemask != 0 ||
+ kthread_should_stop());
+ if (kthread_should_stop())
+ break;
+ raw_spin_lock_irqsave(&rnp->lock, flags);
+ mask = rnp->wakemask;
+ rnp->wakemask = 0;
+ raw_spin_unlock_irqrestore(&rnp->lock, flags);
+ for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1) {
+ if ((mask & 0x1) == 0)
+ continue;
+ preempt_disable();
+ per_cpu(rcu_cpu_has_work, cpu) = 1;
+ t = per_cpu(rcu_cpu_kthread_task, cpu);
+ if (t == NULL) {
+ preempt_enable();
+ continue;
+ }
+ sp.sched_priority = RCU_KTHREAD_PRIO;
+ sched_setscheduler_nocheck(t, cpu, &sp);
+ wake_up_process(t);
+ preempt_enable();
+ }
+ }
+ return 0;
+}
+
+/*
+ * Set the per-rcu_node kthread's affinity to cover all CPUs that are
+ * served by the rcu_node in question.
+ */
+static void rcu_node_kthread_setaffinity(struct rcu_node *rnp)
+{
+ cpumask_var_t cm;
+ int cpu;
+ unsigned long mask = rnp->qsmaskinit;
+
+ if (rnp->node_kthread_task == NULL ||
+ rnp->qsmaskinit == 0)
+ return;
+ if (!alloc_cpumask_var(&cm, GFP_KERNEL))
+ return;
+ cpumask_clear(cm);
+ for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1)
+ if (mask & 01)
+ cpumask_set_cpu(cpu, cm);
+ set_cpus_allowed_ptr(rnp->node_kthread_task, cm);
+ free_cpumask_var(cm);
+}
+
+/*
+ * Spawn a per-rcu_node kthread, setting priority and affinity.
+ */
+static int __cpuinit rcu_spawn_one_node_kthread(struct rcu_state *rsp,
+ struct rcu_node *rnp)
+{
+ int rnp_index = rnp - &rsp->node[0];
+ struct sched_param sp;
+ struct task_struct *t;
+
+ if (!rcu_kthreads_spawnable ||
+ rnp->qsmaskinit == 0 ||
+ rnp->node_kthread_task != NULL)
+ return 0;
+ t = kthread_create(rcu_node_kthread, (void *)rnp, "rcun%d", rnp_index);
+ if (IS_ERR(t))
+ return PTR_ERR(t);
+ rnp->node_kthread_task = t;
+ wake_up_process(t);
+ sp.sched_priority = 99;
+ sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
+ return 0;
+}
+
+/*
+ * Spawn all kthreads -- called as soon as the scheduler is running.
+ */
+static int __init rcu_spawn_kthreads(void)
+{
+ int cpu;
+ struct rcu_node *rnp;
+
+ rcu_kthreads_spawnable = 1;
+ for_each_possible_cpu(cpu) {
+ init_waitqueue_head(&per_cpu(rcu_cpu_wq, cpu));
+ per_cpu(rcu_cpu_has_work, cpu) = 0;
+ if (cpu_online(cpu))
+ (void)rcu_spawn_one_cpu_kthread(cpu);
+ }
+ rcu_for_each_leaf_node(&rcu_sched_state, rnp) {
+ init_waitqueue_head(&rnp->node_wq);
+ (void)rcu_spawn_one_node_kthread(&rcu_sched_state, rnp);
+ }
+ return 0;
+}
+early_initcall(rcu_spawn_kthreads);
+
static void
__call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
struct rcu_state *rsp)
@@ -1771,6 +2069,19 @@ static void __cpuinit rcu_online_cpu(int cpu)
rcu_preempt_init_percpu_data(cpu);
}

+static void __cpuinit rcu_online_kthreads(int cpu)
+{
+ struct rcu_data *rdp = per_cpu_ptr(rcu_sched_state.rda, cpu);
+ struct rcu_node *rnp = rdp->mynode;
+
+ /* Fire up the incoming CPU's kthread and leaf rcu_node kthread. */
+ if (rcu_kthreads_spawnable) {
+ (void)rcu_spawn_one_cpu_kthread(cpu);
+ if (rnp->node_kthread_task == NULL)
+ (void)rcu_spawn_one_node_kthread(&rcu_sched_state, rnp);
+ }
+}
+
/*
* Handle CPU online/offline notification events.
*/
@@ -1778,11 +2089,17 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
unsigned long action, void *hcpu)
{
long cpu = (long)hcpu;
+ struct rcu_data *rdp = per_cpu_ptr(rcu_sched_state.rda, cpu);
+ struct rcu_node *rnp = rdp->mynode;

switch (action) {
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
rcu_online_cpu(cpu);
+ rcu_online_kthreads(cpu);
+ break;
+ case CPU_ONLINE:
+ rcu_node_kthread_setaffinity(rnp);
break;
case CPU_DYING:
case CPU_DYING_FROZEN:
@@ -1923,7 +2240,6 @@ void __init rcu_init(void)
rcu_init_one(&rcu_sched_state, &rcu_sched_data);
rcu_init_one(&rcu_bh_state, &rcu_bh_data);
__rcu_init_preempt();
- open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);

/*
* We don't need protection against CPU-hotplug here because
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 5a439c1..c021380 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -111,6 +111,7 @@ struct rcu_node {
/* elements that need to drain to allow the */
/* current expedited grace period to */
/* complete (only for TREE_PREEMPT_RCU). */
+ unsigned long wakemask; /* CPUs whose kthread needs to be awakened. */
unsigned long qsmaskinit;
/* Per-GP initial value for qsmask & expmask. */
unsigned long grpmask; /* Mask to apply to parent qsmask. */
@@ -134,6 +135,13 @@ struct rcu_node {
/* if there is no such task. If there */
/* is no current expedited grace period, */
/* then there cannot be any such task. */
+ struct task_struct *node_kthread_task;
+ /* kthread that takes care of this rcu_node */
+ /* structure, for example, awakening the */
+ /* per-CPU kthreads as needed. */
+ wait_queue_head_t node_wq;
+ /* Wait queue on which to park the per-node */
+ /* kthread. */
} ____cacheline_internodealigned_in_smp;

/*
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 774f010..b9bd69a 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -1206,7 +1206,7 @@ static DEFINE_PER_CPU(unsigned long, rcu_dyntick_holdoff);
*
* Because it is not legal to invoke rcu_process_callbacks() with irqs
* disabled, we do one pass of force_quiescent_state(), then do a
- * raise_softirq() to cause rcu_process_callbacks() to be invoked later.
+ * invoke_rcu_kthread() to cause rcu_process_callbacks() to be invoked later.
* The per-cpu rcu_dyntick_drain variable controls the sequencing.
*/
int rcu_needs_cpu(int cpu)
@@ -1257,7 +1257,7 @@ int rcu_needs_cpu(int cpu)

/* If RCU callbacks are still pending, RCU still needs this CPU. */
if (c)
- raise_softirq(RCU_SOFTIRQ);
+ invoke_rcu_kthread();
return c;
}

diff --git a/tools/perf/util/trace-event-parse.c b/tools/perf/util/trace-event-parse.c
index 73a0222..043c085 100644
--- a/tools/perf/util/trace-event-parse.c
+++ b/tools/perf/util/trace-event-parse.c
@@ -2187,7 +2187,6 @@ static const struct flag flags[] = {
{ "TASKLET_SOFTIRQ", 6 },
{ "SCHED_SOFTIRQ", 7 },
{ "HRTIMER_SOFTIRQ", 8 },
- { "RCU_SOFTIRQ", 9 },

{ "HRTIMER_NORESTART", 0 },
{ "HRTIMER_RESTART", 1 },
--
1.7.3.2
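
At its core, the conversion above replaces raise_softirq() with the classic
pattern of a per-CPU work flag plus a wait queue: invoke_rcu_kthread() marks
work and wakes the kthread, and rcu_cpu_kthread() sleeps until woken and then
processes callbacks. The userspace sketch below models only that pattern; the
pthread condition-variable scaffolding and the names are assumptions made for
illustration and do not correspond to kernel APIs.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wq = PTHREAD_COND_INITIALIZER;
static bool has_work;
static bool should_stop;

/* Plays the role of invoke_rcu_kthread(): note the work, wake the worker. */
static void invoke_worker(void)
{
        pthread_mutex_lock(&lock);
        has_work = true;
        pthread_cond_signal(&wq);
        pthread_mutex_unlock(&lock);
}

/* Plays the role of rcu_cpu_kthread(): sleep until woken, then process. */
static void *worker(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        for (;;) {
                while (!has_work && !should_stop)
                        pthread_cond_wait(&wq, &lock);
                if (!has_work && should_stop)
                        break;
                has_work = false;
                pthread_mutex_unlock(&lock);
                printf("processing callbacks\n"); /* stands in for rcu_process_callbacks() */
                pthread_mutex_lock(&lock);
        }
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, worker, NULL);
        invoke_worker();
        pthread_mutex_lock(&lock);
        should_stop = true; /* plays the role of kthread_stop() */
        pthread_cond_signal(&wq);
        pthread_mutex_unlock(&lock);
        pthread_join(t, NULL);
        return 0;
}

The real code additionally drops priority and yields when it has spun too
long (rcu_yield()) and parks itself while its CPU is offline; those details
are omitted here.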

2011-02-23 01:41:55

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH RFC tip/core/rcu 08/11] rcu: Decrease memory-barrier usage based on semi-formal proof

Commit d09b62d fixed grace-period synchronization, but left some smp_mb()
invocations in rcu_process_callbacks() that are no longer needed; sheer
paranoia had prevented them from being removed. This commit removes
them and provides a proof of correctness in their absence. It also adds
a memory barrier to rcu_report_qs_rsp() immediately before the update to
rsp->completed in order to handle the theoretical possibility that the
compiler or CPU might move massive quantities of code into a lock-based
critical section. This also proves that the sheer paranoia was not
entirely unjustified, at least from a theoretical point of view.

In addition, the old dyntick-idle synchronization depended on the fact
that grace periods were many milliseconds in duration, so that it could
be assumed that no dyntick-idle CPU could reorder a memory reference
across an entire grace period. Unfortunately for this design, the
addition of expedited grace periods breaks this assumption, which has
the side-effect of requiring atomic operations in the
functions that track dyntick-idle state for RCU. (There is some hope
that the algorithms used in user-level RCU might be applied here, but
some work is required to handle the NMIs that user-space applications
can happily ignore.)

This proof assumes that neither compiler nor CPU will allow a lock
acquisition and release to be reordered, as doing so can result in
deadlock. The proof is as follows:

1. A given CPU declares a quiescent state under the protection of
its leaf rcu_node's lock.

2. If there is more than one level of rcu_node hierarchy, the
last CPU to declare a quiescent state will also acquire the
->lock of the next rcu_node up in the hierarchy, but only
after releasing the lower level's lock. The acquisition of this
lock clearly cannot occur prior to the acquisition of the leaf
node's lock.

3. Step 2 repeats until we reach the root rcu_node structure.
Please note again that only one lock is held at a time through
this process. The acquisition of the root rcu_node's ->lock
must occur after the release of the leaf rcu_node's ->lock.

4. At this point, we set the ->completed field in the rcu_state
structure in rcu_report_qs_rsp(). However, if the rcu_node
hierarchy contains only one rcu_node, then in theory the code
preceding the quiescent state could leak into the critical
section. We therefore precede the update of ->completed with a
memory barrier. All CPUs will therefore agree that any updates
preceding any report of a quiescent state will have happened
before the update of ->completed.

5. Regardless of whether a new grace period is needed, rcu_start_gp()
will propagate the new value of ->completed to all of the leaf
rcu_node structures, under the protection of each rcu_node's ->lock.
If a new grace period is needed immediately, this propagation
will occur in the same critical section that ->completed was
set in, but, courtesy of the memory barrier in step #4 above, it is
still seen to follow any pre-quiescent-state activity.

6. When a given CPU invokes __rcu_process_gp_end(), it becomes
aware of the end of the old grace period and therefore makes
any RCU callbacks that were waiting on that grace period eligible
for invocation.

If this CPU is the same one that detected the end of the grace
period, and if there is but a single rcu_node in the hierarchy,
we will still be in the single critical section. In this case,
the memory barrier in step #4 guarantees that all callbacks will
be seen to execute after each CPU's quiescent state.

On the other hand, if this is a different CPU, it will acquire
the leaf rcu_node's ->lock, and will again be serialized after
each CPU's quiescent state for the old grace period.

On the strength of this proof, this commit therefore removes the memory
barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp().
The effect is to reduce the number of memory barriers by one and to
reduce the frequency of execution from about once per scheduling tick
per CPU to once per grace period.

Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/RCU/trace.txt | 48 +++++++---------
kernel/rcutree.c | 130 ++++++++++++++++++------------------------
kernel/rcutree.h | 9 +--
kernel/rcutree_plugin.h | 7 +-
kernel/rcutree_trace.c | 12 ++--
5 files changed, 88 insertions(+), 118 deletions(-)

diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt
index 6a8c73f..e731ad2 100644
--- a/Documentation/RCU/trace.txt
+++ b/Documentation/RCU/trace.txt
@@ -21,23 +21,23 @@ rcu_pending() function decided that there was core RCU work to do).
The output of "cat rcu/rcudata" looks as follows:

rcu_sched:
- 0 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=10951/1 dn=0 df=1101 of=0 ri=36 ql=0 b=10
- 1 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=16117/1 dn=0 df=1015 of=0 ri=0 ql=0 b=10
- 2 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=1445/1 dn=0 df=1839 of=0 ri=0 ql=0 b=10
- 3 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=6681/1 dn=0 df=1545 of=0 ri=0 ql=0 b=10
- 4 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=1003/1 dn=0 df=1992 of=0 ri=0 ql=0 b=10
- 5 c=17829 g=17830 pq=1 pqc=17829 qp=1 dt=3887/1 dn=0 df=3331 of=0 ri=4 ql=2 b=10
- 6 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=859/1 dn=0 df=3224 of=0 ri=0 ql=0 b=10
- 7 c=17829 g=17830 pq=0 pqc=17829 qp=1 dt=3761/1 dn=0 df=1818 of=0 ri=0 ql=2 b=10
+ 0 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=10951/1/0 df=1101 of=0 ri=36 ql=0 b=10
+ 1 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=16117/1/0 df=1015 of=0 ri=0 ql=0 b=10
+ 2 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=1445/1/0 df=1839 of=0 ri=0 ql=0 b=10
+ 3 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=6681/1/0 df=1545 of=0 ri=0 ql=0 b=10
+ 4 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=1003/1/0 df=1992 of=0 ri=0 ql=0 b=10
+ 5 c=17829 g=17830 pq=1 pqc=17829 qp=1 dt=3887/1/0 df=3331 of=0 ri=4 ql=2 b=10
+ 6 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=859/1/0 df=3224 of=0 ri=0 ql=0 b=10
+ 7 c=17829 g=17830 pq=0 pqc=17829 qp=1 dt=3761/1/0 df=1818 of=0 ri=0 ql=2 b=10
rcu_bh:
- 0 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=10951/1 dn=0 df=0 of=0 ri=0 ql=0 b=10
- 1 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=16117/1 dn=0 df=13 of=0 ri=0 ql=0 b=10
- 2 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=1445/1 dn=0 df=15 of=0 ri=0 ql=0 b=10
- 3 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=6681/1 dn=0 df=9 of=0 ri=0 ql=0 b=10
- 4 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=1003/1 dn=0 df=15 of=0 ri=0 ql=0 b=10
- 5 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=3887/1 dn=0 df=15 of=0 ri=0 ql=0 b=10
- 6 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=859/1 dn=0 df=15 of=0 ri=0 ql=0 b=10
- 7 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=3761/1 dn=0 df=15 of=0 ri=0 ql=0 b=10
+ 0 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=10951/1/0 df=0 of=0 ri=0 ql=0 b=10
+ 1 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=16117/1/0 df=13 of=0 ri=0 ql=0 b=10
+ 2 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=1445/1/0 df=15 of=0 ri=0 ql=0 b=10
+ 3 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=6681/1/0 df=9 of=0 ri=0 ql=0 b=10
+ 4 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=1003/1/0 df=15 of=0 ri=0 ql=0 b=10
+ 5 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=3887/1/0 df=15 of=0 ri=0 ql=0 b=10
+ 6 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=859/1/0 df=15 of=0 ri=0 ql=0 b=10
+ 7 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=3761/1/0 df=15 of=0 ri=0 ql=0 b=10

The first section lists the rcu_data structures for rcu_sched, the second
for rcu_bh. Note that CONFIG_TREE_PREEMPT_RCU kernels will have an
@@ -85,18 +85,10 @@ o "qp" indicates that RCU still expects a quiescent state from

o "dt" is the current value of the dyntick counter that is incremented
when entering or leaving dynticks idle state, either by the
- scheduler or by irq. The number after the "/" is the interrupt
- nesting depth when in dyntick-idle state, or one greater than
- the interrupt-nesting depth otherwise.
-
- This field is displayed only for CONFIG_NO_HZ kernels.
-
-o "dn" is the current value of the dyntick counter that is incremented
- when entering or leaving dynticks idle state via NMI. If both
- the "dt" and "dn" values are even, then this CPU is in dynticks
- idle mode and may be ignored by RCU. If either of these two
- counters is odd, then RCU must be alert to the possibility of
- an RCU read-side critical section running on this CPU.
+ scheduler or by irq. The number after the first "/" is the
+ interrupt nesting depth when in dyntick-idle state, or one
+ greater than the interrupt-nesting depth otherwise. The number
+ after the second "/" is the NMI nesting depth.

This field is displayed only for CONFIG_NO_HZ kernels.

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 5513129..90104a1 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -128,7 +128,7 @@ void rcu_note_context_switch(int cpu)
#ifdef CONFIG_NO_HZ
DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
.dynticks_nesting = 1,
- .dynticks = 1,
+ .dynticks = ATOMIC_INIT(1),
};
#endif /* #ifdef CONFIG_NO_HZ */

@@ -262,13 +262,25 @@ void rcu_enter_nohz(void)
unsigned long flags;
struct rcu_dynticks *rdtp;

- smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
local_irq_save(flags);
rdtp = &__get_cpu_var(rcu_dynticks);
- rdtp->dynticks++;
- rdtp->dynticks_nesting--;
- WARN_ON_ONCE(rdtp->dynticks & 0x1);
+ if (--rdtp->dynticks_nesting) {
+ local_irq_restore(flags);
+ return;
+ }
+ /* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
+ smp_mb__before_atomic_inc(); /* See above. */
+ atomic_inc(&rdtp->dynticks);
+ smp_mb__after_atomic_inc(); /* Force ordering with next sojourn. */
+ WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
local_irq_restore(flags);
+
+ /* If the interrupt queued a callback, get out of dyntick mode. */
+ if (in_irq() &&
+ (__get_cpu_var(rcu_sched_data).nxtlist ||
+ __get_cpu_var(rcu_bh_data).nxtlist ||
+ rcu_preempt_needs_cpu(smp_processor_id())))
+ set_need_resched();
}

/*
@@ -284,11 +296,16 @@ void rcu_exit_nohz(void)

local_irq_save(flags);
rdtp = &__get_cpu_var(rcu_dynticks);
- rdtp->dynticks++;
- rdtp->dynticks_nesting++;
- WARN_ON_ONCE(!(rdtp->dynticks & 0x1));
+ if (rdtp->dynticks_nesting++) {
+ local_irq_restore(flags);
+ return;
+ }
+ smp_mb__before_atomic_inc(); /* Force ordering w/previous sojourn. */
+ atomic_inc(&rdtp->dynticks);
+ /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
+ smp_mb__after_atomic_inc(); /* See above. */
+ WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
local_irq_restore(flags);
- smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
}

/**
@@ -302,11 +319,15 @@ void rcu_nmi_enter(void)
{
struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);

- if (rdtp->dynticks & 0x1)
+ if (rdtp->dynticks_nmi_nesting == 0 &&
+ (atomic_read(&rdtp->dynticks) & 0x1))
return;
- rdtp->dynticks_nmi++;
- WARN_ON_ONCE(!(rdtp->dynticks_nmi & 0x1));
- smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
+ rdtp->dynticks_nmi_nesting++;
+ smp_mb__before_atomic_inc(); /* Force delay from prior write. */
+ atomic_inc(&rdtp->dynticks);
+ /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
+ smp_mb__after_atomic_inc(); /* See above. */
+ WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
}

/**
@@ -320,11 +341,14 @@ void rcu_nmi_exit(void)
{
struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);

- if (rdtp->dynticks & 0x1)
+ if (rdtp->dynticks_nmi_nesting == 0 ||
+ --rdtp->dynticks_nmi_nesting != 0)
return;
- smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
- rdtp->dynticks_nmi++;
- WARN_ON_ONCE(rdtp->dynticks_nmi & 0x1);
+ /* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
+ smp_mb__before_atomic_inc(); /* See above. */
+ atomic_inc(&rdtp->dynticks);
+ smp_mb__after_atomic_inc(); /* Force delay to next write. */
+ WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
}

/**
@@ -335,13 +359,7 @@ void rcu_nmi_exit(void)
*/
void rcu_irq_enter(void)
{
- struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
-
- if (rdtp->dynticks_nesting++)
- return;
- rdtp->dynticks++;
- WARN_ON_ONCE(!(rdtp->dynticks & 0x1));
- smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
+ rcu_exit_nohz();
}

/**
@@ -353,18 +371,7 @@ void rcu_irq_enter(void)
*/
void rcu_irq_exit(void)
{
- struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
-
- if (--rdtp->dynticks_nesting)
- return;
- smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
- rdtp->dynticks++;
- WARN_ON_ONCE(rdtp->dynticks & 0x1);
-
- /* If the interrupt queued a callback, get out of dyntick mode. */
- if (__get_cpu_var(rcu_sched_data).nxtlist ||
- __get_cpu_var(rcu_bh_data).nxtlist)
- set_need_resched();
+ rcu_enter_nohz();
}

#ifdef CONFIG_SMP
@@ -376,19 +383,8 @@ void rcu_irq_exit(void)
*/
static int dyntick_save_progress_counter(struct rcu_data *rdp)
{
- int ret;
- int snap;
- int snap_nmi;
-
- snap = rdp->dynticks->dynticks;
- snap_nmi = rdp->dynticks->dynticks_nmi;
- smp_mb(); /* Order sampling of snap with end of grace period. */
- rdp->dynticks_snap = snap;
- rdp->dynticks_nmi_snap = snap_nmi;
- ret = ((snap & 0x1) == 0) && ((snap_nmi & 0x1) == 0);
- if (ret)
- rdp->dynticks_fqs++;
- return ret;
+ rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
+ return 0;
}

/*
@@ -399,16 +395,11 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp)
*/
static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
{
- long curr;
- long curr_nmi;
- long snap;
- long snap_nmi;
+ unsigned long curr;
+ unsigned long snap;

- curr = rdp->dynticks->dynticks;
- snap = rdp->dynticks_snap;
- curr_nmi = rdp->dynticks->dynticks_nmi;
- snap_nmi = rdp->dynticks_nmi_snap;
- smp_mb(); /* force ordering with cpu entering/leaving dynticks. */
+ curr = (unsigned long)atomic_add_return(0, &rdp->dynticks->dynticks);
+ snap = (unsigned long)rdp->dynticks_snap;

/*
* If the CPU passed through or entered a dynticks idle phase with
@@ -418,8 +409,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
* read-side critical section that started before the beginning
* of the current RCU grace period.
*/
- if ((curr != snap || (curr & 0x1) == 0) &&
- (curr_nmi != snap_nmi || (curr_nmi & 0x1) == 0)) {
+ if ((curr & 0x1) == 0 || ULONG_CMP_GE(curr, snap + 2)) {
rdp->dynticks_fqs++;
return 1;
}
@@ -841,6 +831,12 @@ static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
__releases(rcu_get_root(rsp)->lock)
{
WARN_ON_ONCE(!rcu_gp_in_progress(rsp));
+
+ /*
+ * Ensure that all grace-period and pre-grace-period activity
+ * is seen before the assignment to rsp->completed.
+ */
+ smp_mb(); /* See above block comment. */
rsp->completed = rsp->gpnum;
rsp->signaled = RCU_GP_IDLE;
rcu_start_gp(rsp, flags); /* releases root node's rnp->lock. */
@@ -1367,25 +1363,11 @@ __rcu_process_callbacks(struct rcu_state *rsp, struct rcu_data *rdp)
*/
static void rcu_process_callbacks(struct softirq_action *unused)
{
- /*
- * Memory references from any prior RCU read-side critical sections
- * executed by the interrupted code must be seen before any RCU
- * grace-period manipulations below.
- */
- smp_mb(); /* See above block comment. */
-
__rcu_process_callbacks(&rcu_sched_state,
&__get_cpu_var(rcu_sched_data));
__rcu_process_callbacks(&rcu_bh_state, &__get_cpu_var(rcu_bh_data));
rcu_preempt_process_callbacks();

- /*
- * Memory references from any later RCU read-side critical sections
- * executed by the interrupted code must be seen after any RCU
- * grace-period manipulations above.
- */
- smp_mb(); /* See above block comment. */
-
/* If we are last CPU on way to dyntick-idle mode, accelerate it. */
rcu_needs_cpu_flush();
}
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index e1a6663..bd891de 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -84,11 +84,9 @@
* Dynticks per-CPU state.
*/
struct rcu_dynticks {
- int dynticks_nesting; /* Track nesting level, sort of. */
- int dynticks; /* Even value for dynticks-idle, else odd. */
- int dynticks_nmi; /* Even value for either dynticks-idle or */
- /* not in nmi handler, else odd. So this */
- /* remains even for nmi from irq handler. */
+ int dynticks_nesting; /* Track irq/process nesting level. */
+ int dynticks_nmi_nesting; /* Track NMI nesting level. */
+ atomic_t dynticks; /* Even value for dynticks-idle, else odd. */
};

/*
@@ -218,7 +216,6 @@ struct rcu_data {
/* 3) dynticks interface. */
struct rcu_dynticks *dynticks; /* Shared per-CPU dynticks state. */
int dynticks_snap; /* Per-GP tracking for dynticks. */
- int dynticks_nmi_snap; /* Per-GP tracking for dynticks_nmi. */
#endif /* #ifdef CONFIG_NO_HZ */

/* 4) reasons this CPU needed to be kicked by force_quiescent_state */
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 38426ef..764b5fc 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -1182,7 +1182,6 @@ int rcu_needs_cpu(int cpu)
{
int c = 0;
int snap;
- int snap_nmi;
int thatcpu;

/* Check for being in the holdoff period. */
@@ -1193,10 +1192,10 @@ int rcu_needs_cpu(int cpu)
for_each_online_cpu(thatcpu) {
if (thatcpu == cpu)
continue;
- snap = per_cpu(rcu_dynticks, thatcpu).dynticks;
- snap_nmi = per_cpu(rcu_dynticks, thatcpu).dynticks_nmi;
+ snap = atomic_add_return(0, &per_cpu(rcu_dynticks,
+ thatcpu).dynticks);
smp_mb(); /* Order sampling of snap with end of grace period. */
- if (((snap & 0x1) != 0) || ((snap_nmi & 0x1) != 0)) {
+ if ((snap & 0x1) != 0) {
per_cpu(rcu_dyntick_drain, cpu) = 0;
per_cpu(rcu_dyntick_holdoff, cpu) = jiffies - 1;
return rcu_needs_cpu_quick_check(cpu);
diff --git a/kernel/rcutree_trace.c b/kernel/rcutree_trace.c
index c8e9785..4a21ca5 100644
--- a/kernel/rcutree_trace.c
+++ b/kernel/rcutree_trace.c
@@ -57,10 +57,10 @@ static void print_one_rcu_data(struct seq_file *m, struct rcu_data *rdp)
rdp->passed_quiesc, rdp->passed_quiesc_completed,
rdp->qs_pending);
#ifdef CONFIG_NO_HZ
- seq_printf(m, " dt=%d/%d dn=%d df=%lu",
- rdp->dynticks->dynticks,
+ seq_printf(m, " dt=%d/%d/%d df=%lu",
+ atomic_read(&rdp->dynticks->dynticks),
rdp->dynticks->dynticks_nesting,
- rdp->dynticks->dynticks_nmi,
+ rdp->dynticks->dynticks_nmi_nesting,
rdp->dynticks_fqs);
#endif /* #ifdef CONFIG_NO_HZ */
seq_printf(m, " of=%lu ri=%lu", rdp->offline_fqs, rdp->resched_ipi);
@@ -115,9 +115,9 @@ static void print_one_rcu_data_csv(struct seq_file *m, struct rcu_data *rdp)
rdp->qs_pending);
#ifdef CONFIG_NO_HZ
seq_printf(m, ",%d,%d,%d,%lu",
- rdp->dynticks->dynticks,
+ atomic_read(&rdp->dynticks->dynticks),
rdp->dynticks->dynticks_nesting,
- rdp->dynticks->dynticks_nmi,
+ rdp->dynticks->dynticks_nmi_nesting,
rdp->dynticks_fqs);
#endif /* #ifdef CONFIG_NO_HZ */
seq_printf(m, ",%lu,%lu", rdp->offline_fqs, rdp->resched_ipi);
@@ -130,7 +130,7 @@ static int show_rcudata_csv(struct seq_file *m, void *unused)
{
seq_puts(m, "\"CPU\",\"Online?\",\"c\",\"g\",\"pq\",\"pqc\",\"pq\",");
#ifdef CONFIG_NO_HZ
- seq_puts(m, "\"dt\",\"dt nesting\",\"dn\",\"df\",");
+ seq_puts(m, "\"dt\",\"dt nesting\",\"dt NMI nesting\",\"df\",");
#endif /* #ifdef CONFIG_NO_HZ */
seq_puts(m, "\"of\",\"ri\",\"ql\",\"b\",\"ci\",\"co\",\"ca\"\n");
#ifdef CONFIG_TREE_PREEMPT_RCU
--
1.7.3.2
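
The even/odd dyntick counter protocol described above can be modeled in
userspace as follows. This is a sketch under stated assumptions: C11 seq_cst
fences stand in for smp_mb__before_atomic_inc()/smp_mb__after_atomic_inc(),
the helper names are invented, and the plain "curr >= snap + 2" comparison
ignores the counter wrap that the kernel's ULONG_CMP_GE() handles.

#include <stdatomic.h>
#include <stdio.h>

static atomic_ulong dynticks;

/* Enter dyntick-idle: counter becomes even. */
static void enter_idle(void)
{
        atomic_thread_fence(memory_order_seq_cst);
        atomic_fetch_add(&dynticks, 1);
        atomic_thread_fence(memory_order_seq_cst);
}

/* Leave dyntick-idle: counter becomes odd again. */
static void exit_idle(void)
{
        atomic_thread_fence(memory_order_seq_cst);
        atomic_fetch_add(&dynticks, 1);
        atomic_thread_fence(memory_order_seq_cst);
}

/* Remote snapshot, as in dyntick_save_progress_counter(). */
static unsigned long snapshot(void)
{
        return atomic_load(&dynticks);
}

/* As in rcu_implicit_dynticks_qs(): idle now, or passed through idle. */
static int in_or_through_idle(unsigned long snap)
{
        unsigned long curr = atomic_load(&dynticks);

        return (curr & 0x1) == 0 || curr >= snap + 2;
}

int main(void)
{
        unsigned long snap;

        atomic_store(&dynticks, 1); /* odd: the modeled CPU is not idle */
        snap = snapshot();
        printf("quiescent? %d\n", in_or_through_idle(snap)); /* 0: still odd */
        enter_idle();
        printf("quiescent? %d\n", in_or_through_idle(snap)); /* 1: now even */
        exit_idle();
        printf("quiescent? %d\n", in_or_through_idle(snap)); /* 1: snap + 2 */
        return 0;
}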

2011-02-23 01:41:53

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH RFC tip/core/rcu 09/11] rcu: merge TREE_PREEMPT_RCU blocked_tasks[] lists

From: Paul E. McKenney <[email protected]>

Combine the current TREE_PREEMPT_RCU ->blocked_tasks[] lists in the
rcu_node structure into a single ->blkd_tasks list with ->gp_tasks
and ->exp_tasks tail pointers. This is in preparation for RCU priority
boosting, which will add a third dimension to the combinatorial explosion
in the ->blocked_tasks[] case, but simply a third pointer in the new
->blkd_tasks case.

Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcutree.c | 5 +-
kernel/rcutree.h | 21 ++++--
kernel/rcutree_plugin.h | 163 ++++++++++++++++++++++++++++-------------------
kernel/rcutree_trace.c | 11 +--
4 files changed, 117 insertions(+), 83 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 90104a1..0ac1cc0 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1901,10 +1901,7 @@ static void __init rcu_init_one(struct rcu_state *rsp,
j / rsp->levelspread[i - 1];
}
rnp->level = i;
- INIT_LIST_HEAD(&rnp->blocked_tasks[0]);
- INIT_LIST_HEAD(&rnp->blocked_tasks[1]);
- INIT_LIST_HEAD(&rnp->blocked_tasks[2]);
- INIT_LIST_HEAD(&rnp->blocked_tasks[3]);
+ INIT_LIST_HEAD(&rnp->blkd_tasks);
}
}

diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index bd891de..5a439c1 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -107,7 +107,7 @@ struct rcu_node {
/* an rcu_data structure, otherwise, each */
/* bit corresponds to a child rcu_node */
/* structure. */
- unsigned long expmask; /* Groups that have ->blocked_tasks[] */
+ unsigned long expmask; /* Groups that have ->blkd_tasks */
/* elements that need to drain to allow the */
/* current expedited grace period to */
/* complete (only for TREE_PREEMPT_RCU). */
@@ -120,11 +120,20 @@ struct rcu_node {
u8 grpnum; /* CPU/group number for next level up. */
u8 level; /* root is at level 0. */
struct rcu_node *parent;
- struct list_head blocked_tasks[4];
- /* Tasks blocked in RCU read-side critsect. */
- /* Grace period number (->gpnum) x blocked */
- /* by tasks on the (x & 0x1) element of the */
- /* blocked_tasks[] array. */
+ struct list_head blkd_tasks;
+ /* Tasks blocked in RCU read-side critical */
+ /* section. Tasks are placed at the head */
+ /* of this list and age towards the tail. */
+ struct list_head *gp_tasks;
+ /* Pointer to the first task blocking the */
+ /* current grace period, or NULL if there */
+ /* is no such task. */
+ struct list_head *exp_tasks;
+ /* Pointer to the first task blocking the */
+ /* current expedited grace period, or NULL */
+ /* if there is no such task. If there */
+ /* is no current expedited grace period, */
+ /* then there cannot be any such task. */
} ____cacheline_internodealigned_in_smp;

/*
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 764b5fc..774f010 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -130,12 +130,12 @@ static void rcu_preempt_qs(int cpu)
* We have entered the scheduler, and the current task might soon be
* context-switched away from. If this task is in an RCU read-side
* critical section, we will no longer be able to rely on the CPU to
- * record that fact, so we enqueue the task on the appropriate entry
- * of the blocked_tasks[] array. The task will dequeue itself when
- * it exits the outermost enclosing RCU read-side critical section.
- * Therefore, the current grace period cannot be permitted to complete
- * until the blocked_tasks[] entry indexed by the low-order bit of
- * rnp->gpnum empties.
+ * record that fact, so we enqueue the task on the blkd_tasks list.
+ * The task will dequeue itself when it exits the outermost enclosing
+ * RCU read-side critical section. Therefore, the current grace period
+ * cannot be permitted to complete until the blkd_tasks list entries
+ * predating the current grace period drain, in other words, until
+ * rnp->gp_tasks becomes NULL.
*
* Caller must disable preemption.
*/
@@ -143,7 +143,6 @@ static void rcu_preempt_note_context_switch(int cpu)
{
struct task_struct *t = current;
unsigned long flags;
- int phase;
struct rcu_data *rdp;
struct rcu_node *rnp;

@@ -165,15 +164,26 @@ static void rcu_preempt_note_context_switch(int cpu)
* (i.e., this CPU has not yet passed through a quiescent
* state for the current grace period), then as long
* as that task remains queued, the current grace period
- * cannot end.
+ * cannot end. Note that there is some uncertainty as
+ * to exactly when the current grace period started.
+ * We take a conservative approach, which can result
+ * in unnecessarily waiting on tasks that started very
+ * slightly after the current grace period began. C'est
+ * la vie!!!
*
* But first, note that the current CPU must still be
* on line!
*/
WARN_ON_ONCE((rdp->grpmask & rnp->qsmaskinit) == 0);
WARN_ON_ONCE(!list_empty(&t->rcu_node_entry));
- phase = (rnp->gpnum + !(rnp->qsmask & rdp->grpmask)) & 0x1;
- list_add(&t->rcu_node_entry, &rnp->blocked_tasks[phase]);
+ if ((rnp->qsmask & rdp->grpmask) && rnp->gp_tasks != NULL) {
+ list_add(&t->rcu_node_entry, rnp->gp_tasks->prev);
+ rnp->gp_tasks = &t->rcu_node_entry;
+ } else {
+ list_add(&t->rcu_node_entry, &rnp->blkd_tasks);
+ if (rnp->qsmask & rdp->grpmask)
+ rnp->gp_tasks = &t->rcu_node_entry;
+ }
raw_spin_unlock_irqrestore(&rnp->lock, flags);
}

@@ -210,10 +220,7 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
*/
static int rcu_preempted_readers(struct rcu_node *rnp)
{
- int phase = rnp->gpnum & 0x1;
-
- return !list_empty(&rnp->blocked_tasks[phase]) ||
- !list_empty(&rnp->blocked_tasks[phase + 2]);
+ return rnp->gp_tasks != NULL;
}

/*
@@ -253,6 +260,21 @@ static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp, unsigned long flags)
}

/*
+ * Advance a ->blkd_tasks-list pointer to the next entry, instead
+ * returning NULL if at the end of the list.
+ */
+static struct list_head *rcu_next_node_entry(struct task_struct *t,
+ struct rcu_node *rnp)
+{
+ struct list_head *np;
+
+ np = t->rcu_node_entry.next;
+ if (np == &rnp->blkd_tasks)
+ np = NULL;
+ return np;
+}
+
+/*
* Handle special cases during rcu_read_unlock(), such as needing to
* notify RCU core processing or task having blocked during the RCU
* read-side critical section.
@@ -262,6 +284,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
int empty;
int empty_exp;
unsigned long flags;
+ struct list_head *np;
struct rcu_node *rnp;
int special;

@@ -305,7 +328,12 @@ static void rcu_read_unlock_special(struct task_struct *t)
empty = !rcu_preempted_readers(rnp);
empty_exp = !rcu_preempted_readers_exp(rnp);
smp_mb(); /* ensure expedited fastpath sees end of RCU c-s. */
+ np = rcu_next_node_entry(t, rnp);
list_del_init(&t->rcu_node_entry);
+ if (&t->rcu_node_entry == rnp->gp_tasks)
+ rnp->gp_tasks = np;
+ if (&t->rcu_node_entry == rnp->exp_tasks)
+ rnp->exp_tasks = np;
t->rcu_blocked_node = NULL;

/*
@@ -361,18 +389,16 @@ EXPORT_SYMBOL_GPL(__rcu_read_unlock);
static void rcu_print_detail_task_stall_rnp(struct rcu_node *rnp)
{
unsigned long flags;
- struct list_head *lp;
- int phase;
struct task_struct *t;

- if (rcu_preempted_readers(rnp)) {
- raw_spin_lock_irqsave(&rnp->lock, flags);
- phase = rnp->gpnum & 0x1;
- lp = &rnp->blocked_tasks[phase];
- list_for_each_entry(t, lp, rcu_node_entry)
- sched_show_task(t);
- raw_spin_unlock_irqrestore(&rnp->lock, flags);
- }
+ if (!rcu_preempted_readers(rnp))
+ return;
+ raw_spin_lock_irqsave(&rnp->lock, flags);
+ t = list_entry(rnp->gp_tasks,
+ struct task_struct, rcu_node_entry);
+ list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry)
+ sched_show_task(t);
+ raw_spin_unlock_irqrestore(&rnp->lock, flags);
}

/*
@@ -402,16 +428,14 @@ static void rcu_print_detail_task_stall(struct rcu_state *rsp)
*/
static void rcu_print_task_stall(struct rcu_node *rnp)
{
- struct list_head *lp;
- int phase;
struct task_struct *t;

- if (rcu_preempted_readers(rnp)) {
- phase = rnp->gpnum & 0x1;
- lp = &rnp->blocked_tasks[phase];
- list_for_each_entry(t, lp, rcu_node_entry)
- printk(" P%d", t->pid);
- }
+ if (!rcu_preempted_readers(rnp))
+ return;
+ t = list_entry(rnp->gp_tasks,
+ struct task_struct, rcu_node_entry);
+ list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry)
+ printk(" P%d", t->pid);
}

/*
@@ -430,10 +454,15 @@ static void rcu_preempt_stall_reset(void)
* period that still has RCU readers blocked! This function must be
* invoked -before- updating this rnp's ->gpnum, and the rnp's ->lock
* must be held by the caller.
+ *
+ * Also, if there are blocked tasks on the list, they automatically
+ * block the newly created grace period, so set up ->gp_tasks accordingly.
*/
static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
{
WARN_ON_ONCE(rcu_preempted_readers(rnp));
+ if (!list_empty(&rnp->blkd_tasks))
+ rnp->gp_tasks = rnp->blkd_tasks.next;
WARN_ON_ONCE(rnp->qsmask);
}

@@ -457,45 +486,49 @@ static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
struct rcu_node *rnp,
struct rcu_data *rdp)
{
- int i;
struct list_head *lp;
struct list_head *lp_root;
int retval = 0;
struct rcu_node *rnp_root = rcu_get_root(rsp);
- struct task_struct *tp;
+ struct task_struct *t;

if (rnp == rnp_root) {
WARN_ONCE(1, "Last CPU thought to be offlined?");
return 0; /* Shouldn't happen: at least one CPU online. */
}
- WARN_ON_ONCE(rnp != rdp->mynode &&
- (!list_empty(&rnp->blocked_tasks[0]) ||
- !list_empty(&rnp->blocked_tasks[1]) ||
- !list_empty(&rnp->blocked_tasks[2]) ||
- !list_empty(&rnp->blocked_tasks[3])));
+
+ /* If we are on an internal node, complain bitterly. */
+ WARN_ON_ONCE(rnp != rdp->mynode);

/*
- * Move tasks up to root rcu_node. Rely on the fact that the
- * root rcu_node can be at most one ahead of the rest of the
- * rcu_nodes in terms of gp_num value. This fact allows us to
- * move the blocked_tasks[] array directly, element by element.
+ * Move tasks up to root rcu_node. Don't try to get fancy for
+ * this corner-case operation -- just put this node's tasks
+ * at the head of the root node's list, and update the root node's
+ * ->gp_tasks and ->exp_tasks pointers to those of this node's,
+ * if non-NULL. This might result in waiting for more tasks than
+ * absolutely necessary, but this is a good performance/complexity
+ * tradeoff.
*/
if (rcu_preempted_readers(rnp))
retval |= RCU_OFL_TASKS_NORM_GP;
if (rcu_preempted_readers_exp(rnp))
retval |= RCU_OFL_TASKS_EXP_GP;
- for (i = 0; i < 4; i++) {
- lp = &rnp->blocked_tasks[i];
- lp_root = &rnp_root->blocked_tasks[i];
- while (!list_empty(lp)) {
- tp = list_entry(lp->next, typeof(*tp), rcu_node_entry);
- raw_spin_lock(&rnp_root->lock); /* irqs already disabled */
- list_del(&tp->rcu_node_entry);
- tp->rcu_blocked_node = rnp_root;
- list_add(&tp->rcu_node_entry, lp_root);
- raw_spin_unlock(&rnp_root->lock); /* irqs remain disabled */
- }
+ lp = &rnp->blkd_tasks;
+ lp_root = &rnp_root->blkd_tasks;
+ while (!list_empty(lp)) {
+ t = list_entry(lp->next, typeof(*t), rcu_node_entry);
+ raw_spin_lock(&rnp_root->lock); /* irqs already disabled */
+ list_del(&t->rcu_node_entry);
+ t->rcu_blocked_node = rnp_root;
+ list_add(&t->rcu_node_entry, lp_root);
+ if (&t->rcu_node_entry == rnp->gp_tasks)
+ rnp_root->gp_tasks = rnp->gp_tasks;
+ if (&t->rcu_node_entry == rnp->exp_tasks)
+ rnp_root->exp_tasks = rnp->exp_tasks;
+ raw_spin_unlock(&rnp_root->lock); /* irqs still disabled */
}
+ rnp->gp_tasks = NULL;
+ rnp->exp_tasks = NULL;
return retval;
}

@@ -586,8 +619,7 @@ static DEFINE_MUTEX(sync_rcu_preempt_exp_mutex);
*/
static int rcu_preempted_readers_exp(struct rcu_node *rnp)
{
- return !list_empty(&rnp->blocked_tasks[2]) ||
- !list_empty(&rnp->blocked_tasks[3]);
+ return rnp->exp_tasks != NULL;
}

/*
@@ -647,12 +679,13 @@ static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp)
static void
sync_rcu_preempt_exp_init(struct rcu_state *rsp, struct rcu_node *rnp)
{
- int must_wait;
+ int must_wait = 0;

raw_spin_lock(&rnp->lock); /* irqs already disabled */
- list_splice_init(&rnp->blocked_tasks[0], &rnp->blocked_tasks[2]);
- list_splice_init(&rnp->blocked_tasks[1], &rnp->blocked_tasks[3]);
- must_wait = rcu_preempted_readers_exp(rnp);
+ if (!list_empty(&rnp->blkd_tasks)) {
+ rnp->exp_tasks = rnp->blkd_tasks.next;
+ must_wait = 1;
+ }
raw_spin_unlock(&rnp->lock); /* irqs remain disabled */
if (!must_wait)
rcu_report_exp_rnp(rsp, rnp);
@@ -661,9 +694,7 @@ sync_rcu_preempt_exp_init(struct rcu_state *rsp, struct rcu_node *rnp)
/*
* Wait for an rcu-preempt grace period, but expedite it. The basic idea
* is to invoke synchronize_sched_expedited() to push all the tasks to
- * the ->blocked_tasks[] lists, move all entries from the first set of
- * ->blocked_tasks[] lists to the second set, and finally wait for this
- * second set to drain.
+ * the ->blkd_tasks lists and wait for this list to drain.
*/
void synchronize_rcu_expedited(void)
{
@@ -695,7 +726,7 @@ void synchronize_rcu_expedited(void)
if ((ACCESS_ONCE(sync_rcu_preempt_exp_count) - snap) > 0)
goto unlock_mb_ret; /* Others did our work for us. */

- /* force all RCU readers onto blocked_tasks[]. */
+ /* force all RCU readers onto ->blkd_tasks lists. */
synchronize_sched_expedited();

raw_spin_lock_irqsave(&rsp->onofflock, flags);
@@ -707,7 +738,7 @@ void synchronize_rcu_expedited(void)
raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
}

- /* Snapshot current state of ->blocked_tasks[] lists. */
+ /* Snapshot current state of ->blkd_tasks lists. */
rcu_for_each_leaf_node(rsp, rnp)
sync_rcu_preempt_exp_init(rsp, rnp);
if (NUM_RCU_NODES > 1)
@@ -715,7 +746,7 @@ void synchronize_rcu_expedited(void)

raw_spin_unlock_irqrestore(&rsp->onofflock, flags);

- /* Wait for snapshotted ->blocked_tasks[] lists to drain. */
+ /* Wait for snapshotted ->blkd_tasks lists to drain. */
rnp = rcu_get_root(rsp);
wait_event(sync_rcu_preempt_exp_wq,
sync_rcu_preempt_exp_done(rnp));
diff --git a/kernel/rcutree_trace.c b/kernel/rcutree_trace.c
index 4a21ca5..1cedf94 100644
--- a/kernel/rcutree_trace.c
+++ b/kernel/rcutree_trace.c
@@ -161,7 +161,6 @@ static void print_one_rcu_state(struct seq_file *m, struct rcu_state *rsp)
{
unsigned long gpnum;
int level = 0;
- int phase;
struct rcu_node *rnp;

gpnum = rsp->gpnum;
@@ -178,13 +177,11 @@ static void print_one_rcu_state(struct seq_file *m, struct rcu_state *rsp)
seq_puts(m, "\n");
level = rnp->level;
}
- phase = gpnum & 0x1;
- seq_printf(m, "%lx/%lx %c%c>%c%c %d:%d ^%d ",
+ seq_printf(m, "%lx/%lx %c%c>%c %d:%d ^%d ",
rnp->qsmask, rnp->qsmaskinit,
- "T."[list_empty(&rnp->blocked_tasks[phase])],
- "E."[list_empty(&rnp->blocked_tasks[phase + 2])],
- "T."[list_empty(&rnp->blocked_tasks[!phase])],
- "E."[list_empty(&rnp->blocked_tasks[!phase + 2])],
+ ".G"[rnp->gp_tasks != NULL],
+ ".E"[rnp->exp_tasks != NULL],
+ ".T"[!list_empty(&rnp->blkd_tasks)],
rnp->grplo, rnp->grphi, rnp->grpnum);
}
seq_puts(m, "\n");
--
1.7.3.2
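
For readers trying to picture the new layout, the following deliberately
static userspace sketch shows a single ->blkd_tasks list with ->gp_tasks and
->exp_tasks boundary pointers. The task layout and pid values are invented
for illustration; the kernel uses struct list_head entries embedded in
task_struct, and the insertion rules live in rcu_preempt_note_context_switch()
in the diff above.

#include <stdio.h>

struct blkd_task {
        int pid;
        struct blkd_task *next; /* toward the tail, i.e. toward older tasks */
};

int main(void)
{
        /* Newest task at the head, aging toward the tail. */
        struct blkd_task t3 = { 303, NULL }; /* oldest blocked task */
        struct blkd_task t2 = { 202, &t3 };
        struct blkd_task t1 = { 101, &t2 };  /* newest blocked task */

        struct blkd_task *blkd_tasks = &t1;  /* the single list */
        struct blkd_task *gp_tasks = &t2;    /* first task blocking current GP */
        struct blkd_task *exp_tasks = &t3;   /* first task blocking expedited GP */
        struct blkd_task *t;

        /* The normal grace period waits on gp_tasks and everything older. */
        for (t = gp_tasks; t != NULL; t = t->next)
                printf("normal GP waits on P%d\n", t->pid);

        /* The expedited grace period waits on exp_tasks and everything older. */
        for (t = exp_tasks; t != NULL; t = t->next)
                printf("expedited GP waits on P%d\n", t->pid);

        (void)blkd_tasks;
        return 0;
}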

2011-02-23 01:41:52

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH RFC tip/core/rcu 04/11] rcupdate: remove dead code

From: Amerigo Wang <[email protected]>

DEBUG_OBJECTS_RCU_HEAD depends on PREEMPT, so #ifndef CONFIG_PREEMPT
is totally useless in kernel/rcupdate.c.

Signed-off-by: WANG Cong <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcupdate.c | 5 -----
1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
index a23a57a..afd21d1 100644
--- a/kernel/rcupdate.c
+++ b/kernel/rcupdate.c
@@ -215,10 +215,6 @@ static int rcuhead_fixup_free(void *addr, enum debug_obj_state state)
* If we detect that we are nested in a RCU read-side critical
* section, we should simply fail, otherwise we would deadlock.
*/
-#ifndef CONFIG_PREEMPT
- WARN_ON(1);
- return 0;
-#else
if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
irqs_disabled()) {
WARN_ON(1);
@@ -229,7 +225,6 @@ static int rcuhead_fixup_free(void *addr, enum debug_obj_state state)
rcu_barrier_bh();
debug_object_free(head, &rcuhead_debug_descr);
return 1;
-#endif
default:
return 0;
}
--
1.7.3.2

2011-02-23 02:44:30

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Tue, Feb 22, 2011 at 05:39:40PM -0800, Paul E. McKenney wrote:
> +static int rcu_node_kthread(void *arg)
> +{
> + int cpu;
> + unsigned long flags;
> + unsigned long mask;
> + struct rcu_node *rnp = (struct rcu_node *)arg;
> + struct sched_param sp;
> + struct task_struct *t;
> +
> + for (;;) {
> + wait_event_interruptible(rnp->node_wq, rnp->wakemask != 0 ||
> + kthread_should_stop());
> + if (kthread_should_stop())
> + break;
> + raw_spin_lock_irqsave(&rnp->lock, flags);
> + mask = rnp->wakemask;
> + rnp->wakemask = 0;
> + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> + for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1) {

I may be confused, but shouldn't it be mask >>= 1 instead?

rnp->wakemask is the union of the rdp->grpmask bits of the CPU(s) for which we woke
that node thread up. Those masks start from bit 0, so what you want with the check
below is to test whether the next CPU in the group range is in the wakeup mask, by
shifting to the right.

No?

> + if ((mask & 0x1) == 0)
> + continue;
> + preempt_disable();
> + per_cpu(rcu_cpu_has_work, cpu) = 1;
> + t = per_cpu(rcu_cpu_kthread_task, cpu);
> + if (t == NULL) {
> + preempt_enable();
> + continue;
> + }
> + sp.sched_priority = RCU_KTHREAD_PRIO;
> + sched_setscheduler_nocheck(t, cpu, &sp);
> + wake_up_process(t);
> + preempt_enable();
> + }
> + }
> + return 0;
> +}
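
The point can be demonstrated standalone; the wakemask value and CPU range
below are invented for illustration. With bit 0 of wakemask corresponding to
rnp->grplo, the loop has to shift right so that each iteration tests the bit
for the next CPU; shifting left moves away from the low-order bits it is
supposed to examine.

#include <stdio.h>

int main(void)
{
        unsigned long wakemask = 0x5; /* suppose CPUs grplo and grplo+2 woke us */
        int grplo = 4, grphi = 7;
        unsigned long mask = wakemask;
        int cpu;

        for (cpu = grplo; cpu <= grphi; cpu++, mask >>= 1)
                if (mask & 0x1)
                        printf("wake per-CPU kthread for CPU %d\n", cpu);
        /* Prints CPU 4 and CPU 6; with mask <<= 1 it would print only CPU 4. */
        return 0;
}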

2011-02-23 03:06:50

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu: Add boosting to TREE_PREEMPT_RCU tracing

On 02/23/2011 09:39 AM, Paul E. McKenney wrote:
> rcudir = debugfs_create_dir("rcu", NULL);
> - if (!rcudir)
> + if (IS_ERR_VALUE(rcudir))
> goto free_out;

if !defined(CONFIG_DEBUG_FS)
        debugfs_create_xxx() returns ERR_PTR(-ENODEV);
else
        debugfs_create_xxx() returns NULL when it fails.

Since TREE_RCU_TRACE selects DEBUG_FS, "if (!rcudir)" is correct, so
we don't need to change it.

>
> retval = debugfs_create_file("rcudata", 0444, rcudir,
> NULL, &rcudata_fops);
> - if (!retval)
> + if (IS_ERR_VALUE(retval))
> goto free_out;
>
> retval = debugfs_create_file("rcudata.csv", 0444, rcudir,
> NULL, &rcudata_csv_fops);
> - if (!retval)
> + if (IS_ERR_VALUE(retval))
> + goto free_out;
> +
> + retval = rcu_boost_trace_create_file(rcudir);
> + if (retval)
> goto free_out;
>
> retval = debugfs_create_file("rcugp", 0444, rcudir, NULL, &rcugp_fops);
> - if (!retval)
> + if (IS_ERR_VALUE(retval))
> goto free_out;
>
> retval = debugfs_create_file("rcuhier", 0444, rcudir,
> NULL, &rcuhier_fops);
> - if (!retval)
> + if (IS_ERR_VALUE(retval))
> goto free_out;
>
> retval = debugfs_create_file("rcu_pending", 0444, rcudir,
> NULL, &rcu_pending_fops);
> - if (!retval)
> + if (IS_ERR_VALUE(retval))
> goto free_out;
> return 0;
> free_out:
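
The two return conventions described above can be modeled in a few lines.
The stub below is purely an illustrative assumption, not the kernel's debugfs
implementation; it only shows why the existing NULL check is sufficient when
DEBUG_FS is built in.

#include <stdio.h>
#include <errno.h>

#define CONFIG_DEBUG_FS 1 /* TREE_RCU_TRACE selects DEBUG_FS */

static int dummy_dentry;

static void *debugfs_create_dir_model(const char *name, int fail)
{
        (void)name;
#ifdef CONFIG_DEBUG_FS
        return fail ? NULL : (void *)&dummy_dentry; /* NULL on failure */
#else
        return (void *)(long)-ENODEV; /* stub: always ERR_PTR(-ENODEV) */
#endif
}

int main(void)
{
        void *rcudir = debugfs_create_dir_model("rcu", 1);

        if (!rcudir) /* sufficient check when DEBUG_FS is built in */
                printf("debugfs directory creation failed\n");
        return 0;
}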

2011-02-23 03:09:50

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Tue, Feb 22, 2011 at 05:39:40PM -0800, Paul E. McKenney wrote:
> +/*
> + * Set the per-rcu_node kthread's affinity to cover all CPUs that are
> + * served by the rcu_node in question.
> + */
> +static void rcu_node_kthread_setaffinity(struct rcu_node *rnp)
> +{
> + cpumask_var_t cm;
> + int cpu;
> + unsigned long mask = rnp->qsmaskinit;
> +
> + if (rnp->node_kthread_task == NULL ||
> + rnp->qsmaskinit == 0)
> + return;
> + if (!alloc_cpumask_var(&cm, GFP_KERNEL))
> + return;
> + cpumask_clear(cm);
> + for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1)

There seem to be the same shift problem here.

> + if (mask & 01)
> + cpumask_set_cpu(cpu, cm);
> + set_cpus_allowed_ptr(rnp->node_kthread_task, cm);
> + free_cpumask_var(cm);
> +}

2011-02-23 03:23:38

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 05/11] rcu: add comment saying why DEBUG_OBJECTS_RCU_HEAD depends on PREEMPT.

On Tue, 2011-02-22 at 17:39 -0800, Paul E. McKenney wrote:
> The build will break if you change the Kconfig to allow
> DEBUG_OBJECTS_RCU_HEAD and !PREEMPT, so document the reasoning
> near where the breakage would occur.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> kernel/rcupdate.c | 5 +++++
> 1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> index afd21d1..f3240e9 100644
> --- a/kernel/rcupdate.c
> +++ b/kernel/rcupdate.c
> @@ -214,6 +214,11 @@ static int rcuhead_fixup_free(void *addr, enum debug_obj_state state)
> * Ensure that queued callbacks are all executed.
> * If we detect that we are nested in a RCU read-side critical
> * section, we should simply fail, otherwise we would deadlock.
> + * Note that the machinery to reliably determine whether
> + * or not we are in an RCU read-side critical section
> + * exists only in the preemptible RCU implementations
> + * (TINY_PREEMPT_RCU and TREE_PREEMPT_RCU), which is why
> + * DEBUG_OBJECTS_RCU_HEAD is disallowed if !PREEMPT.
> */

Shouldn't this comment also be in the kconfig where
DEBUG_OBJECTS_RCU_HEAD is defined?

-- Steve

> if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
> irqs_disabled()) {

2011-02-23 03:29:06

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 06/11] smp: Document transitivity for memory barriers.

On Tue, 2011-02-22 at 17:39 -0800, Paul E. McKenney wrote:
> Transitivity is guaranteed only for full memory barriers (smp_mb()).
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> Documentation/memory-barriers.txt | 58 +++++++++++++++++++++++++++++++++++++
> 1 files changed, 58 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 631ad2f..f0d3a80 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -21,6 +21,7 @@ Contents:
> - SMP barrier pairing.
> - Examples of memory barrier sequences.
> - Read memory barriers vs load speculation.
> + - Transitivity
>
> (*) Explicit kernel barriers.
>
> @@ -959,6 +960,63 @@ the speculation will be cancelled and the value reloaded:
> retrieved : : +-------+
>
>
> +TRANSITIVITY
> +------------
> +
> +Transitivity is a deeply intuitive notion about ordering that is not
> +always provided by real computer systems. The following example
> +demonstrates transitivity (also called "cumulativity"):
> +
> + CPU 1 CPU 2 CPU 3
> + ======================= ======================= =======================
> + { X = 0, Y = 0 }
> + STORE X=1 LOAD X STORE Y=1
> + <general barrier> <general barrier>
> + LOAD Y LOAD X
> +
> +Suppose that CPU 2's load from X returns 1 and its load from Y returns 0.
> +This indicates that CPU 2's load from X in some sense follows CPU 1's
> +store to X and that CPU 2's load from Y in some sense preceded CPU 3's
> +store to Y. The question is then "Can CPU 3's load from X return 0?"
> +
> +Because CPU 2's load from X in some sense came after CPU 1's store, it
> +is natural to expect that CPU 3's load from X must therefore return 1.
> +This expectation is an example of transitivity: if a load executing on
> +CPU A follows a load from the same variable executing on CPU B, then
> +CPU A's load must either return the same value that CPU B's load did,
> +or must return some later value.
> +
> +In the Linux kernel, use of general memory barriers guarantees
> +transitivity. Therefore, in the above example, if CPU 2's load from X
> +returns 1 and its load from Y returns 0, then CPU 3's load from X must
> +also return 1.
> +
> +However, transitivity is -not- guaranteed for read or write barriers.
> +For example, suppose that CPU 2's general barrier in the above example
> +is changed to a read barrier as shown below:
> +
> + CPU 1 CPU 2 CPU 3
> + ======================= ======================= =======================
> + { X = 0, Y = 0 }
> + STORE X=1 LOAD X STORE Y=1
> + <read barrier> <general barrier>
> + LOAD Y LOAD X
> +
> +This substitution destroys transitivity: in this example, it is perfectly
> +legal for CPU 2's load from X to return 1, its load from Y to return 0,
> +and CPU 3's load from X to return 0.
> +
> +The key point is that although CPU 2's read barrier orders its pair
> +of loads, it does not guarantee to order CPU 1's store. Therefore, if
> +this example runs on a system where CPUs 1 and 2 share a store buffer
> +or a level of cache, CPU 2 might have early access to CPU 1's writes.
> +General barriers are therefore required to ensure that all CPUs agree
> +on the combined order of CPU 1's and CPU 2's accesses.

Sounds like someone had a fun time debugging their code.

> +
> +To reiterate, if your code requires transitivity, use general barriers
> +throughout.

I expect that your code is the only code in the kernel that actually
requires transitivity ;-)

-- Steve

> +
> +
> ========================
> EXPLICIT KERNEL BARRIERS
> ========================
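
Expressed as kernel-style C, the three-CPU example from the patch reads roughly
as follows (a sketch only; X, Y and the r* variables are illustrative):

	int X, Y;
	int r1, r2, r3;		/* values observed by the loads */

	void cpu1(void)		/* CPU 1 */
	{
		X = 1;
	}

	void cpu2(void)		/* CPU 2 */
	{
		r1 = X;
		smp_mb();	/* general barrier */
		r2 = Y;
	}

	void cpu3(void)		/* CPU 3 */
	{
		Y = 1;
		smp_mb();	/* general barrier */
		r3 = X;
	}

With both barriers being smp_mb(), the outcome r1 == 1 && r2 == 0 && r3 == 0 is
forbidden; weaken CPU 2's barrier to smp_rmb() and that outcome becomes legal,
which is exactly the loss of transitivity the patch describes.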

2011-02-23 06:20:48

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 06/11] smp: Document transitivity for memory barriers.

On 02/23/2011 11:29 AM, Steven Rostedt wrote:
> On Tue, 2011-02-22 at 17:39 -0800, Paul E. McKenney wrote:
>> Transitivity is guaranteed only for full memory barriers (smp_mb()).
>>
>> Signed-off-by: Paul E. McKenney <[email protected]>
>> ---
>> Documentation/memory-barriers.txt | 58 +++++++++++++++++++++++++++++++++++++
>> 1 files changed, 58 insertions(+), 0 deletions(-)
>>
>> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
>> index 631ad2f..f0d3a80 100644
>> --- a/Documentation/memory-barriers.txt
>> +++ b/Documentation/memory-barriers.txt
>> @@ -21,6 +21,7 @@ Contents:
>> - SMP barrier pairing.
>> - Examples of memory barrier sequences.
>> - Read memory barriers vs load speculation.
>> + - Transitivity
>>
>> (*) Explicit kernel barriers.
>>
>> @@ -959,6 +960,63 @@ the speculation will be cancelled and the value reloaded:
>> retrieved : : +-------+
>>
>>
>> +TRANSITIVITY
>> +------------
>> +
>> +Transitivity is a deeply intuitive notion about ordering that is not
>> +always provided by real computer systems. The following example
>> +demonstrates transitivity (also called "cumulativity"):
>> +
>> + CPU 1 CPU 2 CPU 3
>> + ======================= ======================= =======================
>> + { X = 0, Y = 0 }
>> + STORE X=1 LOAD X STORE Y=1
>> + <general barrier> <general barrier>
>> + LOAD Y LOAD X
>> +
>> +Suppose that CPU 2's load from X returns 1 and its load from Y returns 0.
>> +This indicates that CPU 2's load from X in some sense follows CPU 1's
>> +store to X and that CPU 2's load from Y in some sense preceded CPU 3's
>> +store to Y. The question is then "Can CPU 3's load from X return 0?"
>> +
>> +Because CPU 2's load from X in some sense came after CPU 1's store, it
>> +is natural to expect that CPU 3's load from X must therefore return 1.
>> +This expectation is an example of transitivity: if a load executing on
>> +CPU A follows a load from the same variable executing on CPU B, then
>> +CPU A's load must either return the same value that CPU B's load did,
>> +or must return some later value.
>> +
>> +In the Linux kernel, use of general memory barriers guarantees
>> +transitivity. Therefore, in the above example, if CPU 2's load from X
>> +returns 1 and its load from Y returns 0, then CPU 3's load from X must
>> +also return 1.
>> +
>> +However, transitivity is -not- guaranteed for read or write barriers.
>> +For example, suppose that CPU 2's general barrier in the above example
>> +is changed to a read barrier as shown below:
>> +
>> + CPU 1 CPU 2 CPU 3
>> + ======================= ======================= =======================
>> + { X = 0, Y = 0 }
>> + STORE X=1 LOAD X STORE Y=1
>> + <read barrier> <general barrier>
>> + LOAD Y LOAD X
>> +
>> +This substitution destroys transitivity: in this example, it is perfectly
>> +legal for CPU 2's load from X to return 1, its load from Y to return 0,
>> +and CPU 3's load from X to return 0.
>> +
>> +The key point is that although CPU 2's read barrier orders its pair
>> +of loads, it does not guarantee to order CPU 1's store. Therefore, if
>> +this example runs on a system where CPUs 1 and 2 share a store buffer
>> +or a level of cache, CPU 2 might have early access to CPU 1's writes.
>> +General barriers are therefore required to ensure that all CPUs agree
>> +on the combined order of CPU 1's and CPU 2's accesses.
>
> Sounds like someone had a fun time debugging their code.
>
>> +
>> +To reiterate, if your code requires transitivity, use general barriers
>> +throughout.
>
> I expect that your code is the only code in the kernel that actually
> requires transitivity ;-)
>

Maybe, but my RCURING also requires transitivity; I asked Paul for advice
a year ago when I was writing the patch. Good documentation for it!

Thanks,
Lai

2011-02-23 13:59:11

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 05/11] rcu: add comment saying why DEBUG_OBJECTS_RCU_HEAD depends on PREEMPT.

* Steven Rostedt ([email protected]) wrote:
> On Tue, 2011-02-22 at 17:39 -0800, Paul E. McKenney wrote:
> > The build will break if you change the Kconfig to allow
> > DEBUG_OBJECTS_RCU_HEAD and !PREEMPT, so document the reasoning
> > near where the breakage would occur.
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > kernel/rcupdate.c | 5 +++++
> > 1 files changed, 5 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> > index afd21d1..f3240e9 100644
> > --- a/kernel/rcupdate.c
> > +++ b/kernel/rcupdate.c
> > @@ -214,6 +214,11 @@ static int rcuhead_fixup_free(void *addr, enum debug_obj_state state)
> > * Ensure that queued callbacks are all executed.
> > * If we detect that we are nested in a RCU read-side critical
> > * section, we should simply fail, otherwise we would deadlock.
> > + * Note that the machinery to reliably determine whether
> > + * or not we are in an RCU read-side critical section
> > + * exists only in the preemptible RCU implementations
> > + * (TINY_PREEMPT_RCU and TREE_PREEMPT_RCU), which is why
> > + * DEBUG_OBJECTS_RCU_HEAD is disallowed if !PREEMPT.
> > */
>
> Shouldn't this comment also be in the kconfig where
> DEBUG_OBJECTS_RCU_HEAD is defined?

hrm, but this is a "rcuhead_fixup_init" : it does not need to always
succeed. It's just that when it is safe to recover from an error
situation, it does it. We could do:

#ifdef CONFIG_PREEMPT
/*
* Ensure that queued callbacks are all executed.
* If we detect that we are nested in a RCU read-side
* critical
* section, we should simply fail, otherwise we would
* deadlock.
*/
if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
irqs_disabled()) {
WARN_ON(1);
return 0;
}
rcu_barrier();
rcu_barrier_sched();
rcu_barrier_bh();
debug_object_init(head, &rcuhead_debug_descr);
return 1;
#else
return 0;
#endif

instead, no ?

Mathieu

>
> -- Steve
>
> > if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
> > irqs_disabled()) {
>
>

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2011-02-23 14:02:09

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

* Paul E. McKenney ([email protected]) wrote:
[...]
> + for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1)
> + if (mask & 01)

Wow, octal notation! ;)

Maybe we could consider 0x1 here.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2011-02-23 14:11:46

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 05/11] rcu: add comment saying why DEBUG_OBJECTS_RCU_HEAD depends on PREEMPT.

On Wed, 2011-02-23 at 08:59 -0500, Mathieu Desnoyers wrote:
> * Steven Rostedt ([email protected]) wrote:
> > On Tue, 2011-02-22 at 17:39 -0800, Paul E. McKenney wrote:
> > > The build will break if you change the Kconfig to allow
> > > DEBUG_OBJECTS_RCU_HEAD and !PREEMPT, so document the reasoning
> > > near where the breakage would occur.
> > >
> > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > ---
> > > kernel/rcupdate.c | 5 +++++
> > > 1 files changed, 5 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> > > index afd21d1..f3240e9 100644
> > > --- a/kernel/rcupdate.c
> > > +++ b/kernel/rcupdate.c
> > > @@ -214,6 +214,11 @@ static int rcuhead_fixup_free(void *addr, enum debug_obj_state state)
> > > * Ensure that queued callbacks are all executed.
> > > * If we detect that we are nested in a RCU read-side critical
> > > * section, we should simply fail, otherwise we would deadlock.
> > > + * Note that the machinery to reliably determine whether
> > > + * or not we are in an RCU read-side critical section
> > > + * exists only in the preemptible RCU implementations
> > > + * (TINY_PREEMPT_RCU and TREE_PREEMPT_RCU), which is why
> > > + * DEBUG_OBJECTS_RCU_HEAD is disallowed if !PREEMPT.
> > > */
> >
> > Shouldn't this comment also be in the kconfig where
> > DEBUG_OBJECTS_RCU_HEAD is defined?
>
> hrm, but this is a "rcuhead_fixup_init" : it does not need to always

Looks like it was in rcuhead_fixup_free to me.

> succeed. It's just that when it is safe to recover from an error
> situation, it does it. We could do:
>
> #ifdef CONFIG_PREEMPT
> /*
> * Ensure that queued callbacks are all executed.
> * If we detect that we are nested in a RCU read-side
> * critical
> * section, we should simply fail, otherwise we would
> * deadlock.
> */
> if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
> irqs_disabled()) {
> WARN_ON(1);
> return 0;
> }
> rcu_barrier();
> rcu_barrier_sched();
> rcu_barrier_bh();
> debug_object_init(head, &rcuhead_debug_descr);
> return 1;
> #else
> return 0;
> #endif
>
> instead, no ?

The point is that this entire block of code is wrapped in #ifdef
CONFIG_DEBUG_OBJECTS_RCU_HEAD, and that config depends on PREEMPT. Thus
you will never have a case where #ifdef CONFIG_PREEMPT is false.

-- Steve

2011-02-23 14:36:53

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 04/11] rcupdate: remove dead code

* Paul E. McKenney ([email protected]) wrote:
> From: Amerigo Wang <[email protected]>
>
> DEBUG_OBJECTS_RCU_HEAD depends on PREEMPT, so #ifndef CONFIG_PREEMPT
> is totally useless in kernel/rcupdate.c.

Why does it depend on PREEMPT exactly ? Only for the fixup ?

Thanks,

Mathieu

>
> Signed-off-by: WANG Cong <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> kernel/rcupdate.c | 5 -----
> 1 files changed, 0 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> index a23a57a..afd21d1 100644
> --- a/kernel/rcupdate.c
> +++ b/kernel/rcupdate.c
> @@ -215,10 +215,6 @@ static int rcuhead_fixup_free(void *addr, enum debug_obj_state state)
> * If we detect that we are nested in a RCU read-side critical
> * section, we should simply fail, otherwise we would deadlock.
> */
> -#ifndef CONFIG_PREEMPT
> - WARN_ON(1);
> - return 0;
> -#else
> if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
> irqs_disabled()) {
> WARN_ON(1);
> @@ -229,7 +225,6 @@ static int rcuhead_fixup_free(void *addr, enum debug_obj_state state)
> rcu_barrier_bh();
> debug_object_free(head, &rcuhead_debug_descr);
> return 1;
> -#endif
> default:
> return 0;
> }
> --
> 1.7.3.2
>

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2011-02-23 14:37:51

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 05/11] rcu: add comment saying why DEBUG_OBJECTS_RCU_HEAD depends on PREEMPT.

* Steven Rostedt ([email protected]) wrote:
> On Wed, 2011-02-23 at 08:59 -0500, Mathieu Desnoyers wrote:
> > * Steven Rostedt ([email protected]) wrote:
> > > On Tue, 2011-02-22 at 17:39 -0800, Paul E. McKenney wrote:
> > > > The build will break if you change the Kconfig to allow
> > > > DEBUG_OBJECTS_RCU_HEAD and !PREEMPT, so document the reasoning
> > > > near where the breakage would occur.
> > > >
> > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > > ---
> > > > kernel/rcupdate.c | 5 +++++
> > > > 1 files changed, 5 insertions(+), 0 deletions(-)
> > > >
> > > > diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> > > > index afd21d1..f3240e9 100644
> > > > --- a/kernel/rcupdate.c
> > > > +++ b/kernel/rcupdate.c
> > > > @@ -214,6 +214,11 @@ static int rcuhead_fixup_free(void *addr, enum debug_obj_state state)
> > > > * Ensure that queued callbacks are all executed.
> > > > * If we detect that we are nested in a RCU read-side critical
> > > > * section, we should simply fail, otherwise we would deadlock.
> > > > + * Note that the machinery to reliably determine whether
> > > > + * or not we are in an RCU read-side critical section
> > > > + * exists only in the preemptible RCU implementations
> > > > + * (TINY_PREEMPT_RCU and TREE_PREEMPT_RCU), which is why
> > > > + * DEBUG_OBJECTS_RCU_HEAD is disallowed if !PREEMPT.
> > > > */
> > >
> > > Shouldn't this comment also be in the kconfig where
> > > DEBUG_OBJECTS_RCU_HEAD is defined?
> >
> > hrm, but this is a "rcuhead_fixup_init" : it does not need to always
>
> Looks like it was in rcuhead_fixup_free to me.

OK.

>
> > succeed. It's just that when it is safe to recover from an error
> > situation, it does it. We could do:
> >
> > #ifdef CONFIG_PREEMPT
> > /*
> > * Ensure that queued callbacks are all executed.
> > * If we detect that we are nested in a RCU read-side
> > * critical
> > * section, we should simply fail, otherwise we would
> > * deadlock.
> > */
> > if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
> > irqs_disabled()) {
> > WARN_ON(1);
> > return 0;
> > }
> > rcu_barrier();
> > rcu_barrier_sched();
> > rcu_barrier_bh();
> > debug_object_init(head, &rcuhead_debug_descr);
> > return 1;
> > #else
> > return 0;
> > #endif
> >
> > instead, no ?
>
> The point is that this entire block of code is wrapped in #ifdef
> CONFIG_DEBUG_OBJECTS_RCU_HEAD, and that config depends on PREEMPT. Thus
> you will never have a case where #ifdef CONFIG_PREEMPT is false.

Why does CONFIG_DEBUG_OBJECTS_RCU_HEAD depend on PREEMPT again ?

Mathieu

>
> -- Steve
>
>

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2011-02-23 14:42:19

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, 2011-02-23 at 09:02 -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney ([email protected]) wrote:
> [...]
> > + for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1)
> > + if (mask & 01)
>
> Wow, octal notation! ;)

Heh

>
> Maybe we could consider 0x1 here.

Why even the conversion? mask & 1 should be good enough.

Perhaps 1UL to match the unsigned long.

-- Steve

2011-02-23 14:55:19

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 05/11] rcu: add comment saying why DEBUG_OBJECTS_RCU_HEAD depends on PREEMPT.

On Wed, 2011-02-23 at 08:59 -0500, Mathieu Desnoyers wrote:

> #ifdef CONFIG_PREEMPT
> /*
> * Ensure that queued callbacks are all executed.
> * If we detect that we are nested in a RCU read-side
> * critical
> * section, we should simply fail, otherwise we would
> * deadlock.
> */
> if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
> irqs_disabled()) {
> WARN_ON(1);
> return 0;
> }
> rcu_barrier();
> rcu_barrier_sched();
> rcu_barrier_bh();
> debug_object_init(head, &rcuhead_debug_descr);
> return 1;
> #else
> return 0;
> #endif
>
> instead, no ?

BTW, if you do end up doing such a thing...

#ifndef CONFIG_PREEMPT
return 0;
#endif

before all that would be much cleaner.

Just make sure that any macros/functions defined only under PREEMPT are
defined as nops in !CONFIG_PREEMPT

-- Steve

2011-02-23 15:02:09

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 05/11] rcu: add comment saying why DEBUG_OBJECTS_RCU_HEAD depends on PREEMPT.

* Steven Rostedt ([email protected]) wrote:
> On Wed, 2011-02-23 at 08:59 -0500, Mathieu Desnoyers wrote:
>
> > #ifdef CONFIG_PREEMPT
> > /*
> > * Ensure that queued callbacks are all executed.
> > * If we detect that we are nested in a RCU read-side
> > * critical
> > * section, we should simply fail, otherwise we would
> > * deadlock.
> > */
> > if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
> > irqs_disabled()) {
> > WARN_ON(1);
> > return 0;
> > }
> > rcu_barrier();
> > rcu_barrier_sched();
> > rcu_barrier_bh();
> > debug_object_init(head, &rcuhead_debug_descr);
> > return 1;
> > #else
> > return 0;
> > #endif
> >
> > instead, no ?
>
> BTW, if you do end up doing such a thing...
>
> #ifndef CONFIG_PREEMPT
> return 0;
> #endif
>
> before all that would be much cleaner.
>
> Just make sure that any macros/functions defined only under PREEMPT are
> defined as nops in !CONFIG_PREEMPT

Good point, thanks !

Mathieu

>
> -- Steve
>
>

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2011-02-23 15:11:39

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, Feb 23, 2011 at 03:44:24AM +0100, Frederic Weisbecker wrote:
> On Tue, Feb 22, 2011 at 05:39:40PM -0800, Paul E. McKenney wrote:
> > +static int rcu_node_kthread(void *arg)
> > +{
> > + int cpu;
> > + unsigned long flags;
> > + unsigned long mask;
> > + struct rcu_node *rnp = (struct rcu_node *)arg;
> > + struct sched_param sp;
> > + struct task_struct *t;
> > +
> > + for (;;) {
> > + wait_event_interruptible(rnp->node_wq, rnp->wakemask != 0 ||
> > + kthread_should_stop());
> > + if (kthread_should_stop())
> > + break;
> > + raw_spin_lock_irqsave(&rnp->lock, flags);
> > + mask = rnp->wakemask;
> > + rnp->wakemask = 0;
> > + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > + for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1) {
>
> I may be confused, but shouldn't it be mask >>= 1 instead?

You are not confused, but I sure was! ;-)

> rnp->wakemask is the unioned rdp->grpmask of the cpu(s) for which we woke that
> node thread up. Those masks start from bit 0, so what you want with the check
> below is to see whether the next CPU in the group range is in the wakeup mask
> by shifting to the right.
>
> No?

Not only are you quite correct, but this bug might well explain the
slowdown in grace-period latency that I was seeing in tests.

Thank you very much!!!

Thanx, Paul

> > + if ((mask & 0x1) == 0)
> > + continue;
> > + preempt_disable();
> > + per_cpu(rcu_cpu_has_work, cpu) = 1;
> > + t = per_cpu(rcu_cpu_kthread_task, cpu);
> > + if (t == NULL) {
> > + preempt_enable();
> > + continue;
> > + }
> > + sp.sched_priority = RCU_KTHREAD_PRIO;
> > + sched_setscheduler_nocheck(t, cpu, &sp);
> > + wake_up_process(t);
> > + preempt_enable();
> > + }
> > + }
> > + return 0;
> > +}
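
Folding in Frederic's fix, the wakeup loop in rcu_node_kthread() would walk the
mask as in the sketch below (again only an illustration of the quoted code; note
also that the second argument of sched_setscheduler_nocheck() is the scheduling
policy, so the "cpu" passed there presumably wants to be SCHED_FIFO):

	for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask >>= 1) {
		if ((mask & 1UL) == 0)	/* bit 0 tracks rnp->grplo */
			continue;
		preempt_disable();
		per_cpu(rcu_cpu_has_work, cpu) = 1;
		t = per_cpu(rcu_cpu_kthread_task, cpu);
		if (t == NULL) {
			preempt_enable();
			continue;
		}
		sp.sched_priority = RCU_KTHREAD_PRIO;
		sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
		wake_up_process(t);
		preempt_enable();
	}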

2011-02-23 15:12:23

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, Feb 23, 2011 at 04:09:43AM +0100, Frederic Weisbecker wrote:
> On Tue, Feb 22, 2011 at 05:39:40PM -0800, Paul E. McKenney wrote:
> > +/*
> > + * Set the per-rcu_node kthread's affinity to cover all CPUs that are
> > + * served by the rcu_node in question.
> > + */
> > +static void rcu_node_kthread_setaffinity(struct rcu_node *rnp)
> > +{
> > + cpumask_var_t cm;
> > + int cpu;
> > + unsigned long mask = rnp->qsmaskinit;
> > +
> > + if (rnp->node_kthread_task == NULL ||
> > + rnp->qsmaskinit == 0)
> > + return;
> > + if (!alloc_cpumask_var(&cm, GFP_KERNEL))
> > + return;
> > + cpumask_clear(cm);
> > + for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1)
>
> There seems to be the same shift problem here.

Right you are, thank you again!!!

Thanx, Paul

> > + if (mask & 01)
> > + cpumask_set_cpu(cpu, cm);
> > + set_cpus_allowed_ptr(rnp->node_kthread_task, cm);
> > + free_cpumask_var(cm);
> > +}

2011-02-23 15:13:24

by Mathieu Desnoyers

[permalink] [raw]
Subject: [PATCH] debug rcu head support !PREEMPT config

Remove DEBUG_RCU_HEAD dependency on PREEMPT config. Handle the inability to
detect if within a RCU read-side critical section by never performing any
attempt to recover from a failure situation in the fixup handlers. Just print
the warnings.

This patch is only compile-tested.

Signed-off-by: Mathieu Desnoyers <[email protected]>
---
kernel/rcupdate.c | 17 +++++++++++++++++
lib/Kconfig.debug | 2 +-
2 files changed, 18 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/lib/Kconfig.debug
===================================================================
--- linux-2.6-lttng.orig/lib/Kconfig.debug
+++ linux-2.6-lttng/lib/Kconfig.debug
@@ -313,7 +313,7 @@ config DEBUG_OBJECTS_WORK

config DEBUG_OBJECTS_RCU_HEAD
bool "Debug RCU callbacks objects"
- depends on DEBUG_OBJECTS && PREEMPT
+ depends on DEBUG_OBJECTS
help
Enable this to turn on debugging of RCU list heads (call_rcu() usage).

Index: linux-2.6-lttng/kernel/rcupdate.c
===================================================================
--- linux-2.6-lttng.orig/kernel/rcupdate.c
+++ linux-2.6-lttng/kernel/rcupdate.c
@@ -142,7 +142,14 @@ static int rcuhead_fixup_init(void *addr
* Ensure that queued callbacks are all executed.
* If we detect that we are nested in a RCU read-side critical
* section, we should simply fail, otherwise we would deadlock.
+ * In !PREEMPT configurations, there is no way to tell if we are
+ * in a RCU read-side critical section or not, so we never
+ * attempt any fixup and just print a warning.
*/
+#ifndef CONFIG_PREEMPT
+ WARN_ON(1);
+ return 0;
+#endif
if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
irqs_disabled()) {
WARN_ON(1);
@@ -184,7 +191,14 @@ static int rcuhead_fixup_activate(void *
* Ensure that queued callbacks are all executed.
* If we detect that we are nested in a RCU read-side critical
* section, we should simply fail, otherwise we would deadlock.
+ * In !PREEMPT configurations, there is no way to tell if we are
+ * in a RCU read-side critical section or not, so we never
+ * attempt any fixup and just print a warning.
*/
+#ifndef CONFIG_PREEMPT
+ WARN_ON(1);
+ return 0;
+#endif
if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
irqs_disabled()) {
WARN_ON(1);
@@ -214,6 +228,9 @@ static int rcuhead_fixup_free(void *addr
* Ensure that queued callbacks are all executed.
* If we detect that we are nested in a RCU read-side critical
* section, we should simply fail, otherwise we would deadlock.
+ * In !PREEMPT configurations, there is no way to tell if we are
+ * in a RCU read-side critical section or not, so we never
+ * attempt any fixup and just print a warning.
*/
#ifndef CONFIG_PREEMPT
WARN_ON(1);
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2011-02-23 15:14:49

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 06/11] smp: Document transitivity for memory barriers.

On Wed, Feb 23, 2011 at 02:21:17PM +0800, Lai Jiangshan wrote:
> On 02/23/2011 11:29 AM, Steven Rostedt wrote:
> > On Tue, 2011-02-22 at 17:39 -0800, Paul E. McKenney wrote:
> >> Transitivity is guaranteed only for full memory barriers (smp_mb()).
> >>
> >> Signed-off-by: Paul E. McKenney <[email protected]>
> >> ---
> >> Documentation/memory-barriers.txt | 58 +++++++++++++++++++++++++++++++++++++
> >> 1 files changed, 58 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> >> index 631ad2f..f0d3a80 100644
> >> --- a/Documentation/memory-barriers.txt
> >> +++ b/Documentation/memory-barriers.txt
> >> @@ -21,6 +21,7 @@ Contents:
> >> - SMP barrier pairing.
> >> - Examples of memory barrier sequences.
> >> - Read memory barriers vs load speculation.
> >> + - Transitivity
> >>
> >> (*) Explicit kernel barriers.
> >>
> >> @@ -959,6 +960,63 @@ the speculation will be cancelled and the value reloaded:
> >> retrieved : : +-------+
> >>
> >>
> >> +TRANSITIVITY
> >> +------------
> >> +
> >> +Transitivity is a deeply intuitive notion about ordering that is not
> >> +always provided by real computer systems. The following example
> >> +demonstrates transitivity (also called "cumulativity"):
> >> +
> >> + CPU 1 CPU 2 CPU 3
> >> + ======================= ======================= =======================
> >> + { X = 0, Y = 0 }
> >> + STORE X=1 LOAD X STORE Y=1
> >> + <general barrier> <general barrier>
> >> + LOAD Y LOAD X
> >> +
> >> +Suppose that CPU 2's load from X returns 1 and its load from Y returns 0.
> >> +This indicates that CPU 2's load from X in some sense follows CPU 1's
> >> +store to X and that CPU 2's load from Y in some sense preceded CPU 3's
> >> +store to Y. The question is then "Can CPU 3's load from X return 0?"
> >> +
> >> +Because CPU 2's load from X in some sense came after CPU 1's store, it
> >> +is natural to expect that CPU 3's load from X must therefore return 1.
> >> +This expectation is an example of transitivity: if a load executing on
> >> +CPU A follows a load from the same variable executing on CPU B, then
> >> +CPU A's load must either return the same value that CPU B's load did,
> >> +or must return some later value.
> >> +
> >> +In the Linux kernel, use of general memory barriers guarantees
> >> +transitivity. Therefore, in the above example, if CPU 2's load from X
> >> +returns 1 and its load from Y returns 0, then CPU 3's load from X must
> >> +also return 1.
> >> +
> >> +However, transitivity is -not- guaranteed for read or write barriers.
> >> +For example, suppose that CPU 2's general barrier in the above example
> >> +is changed to a read barrier as shown below:
> >> +
> >> + CPU 1 CPU 2 CPU 3
> >> + ======================= ======================= =======================
> >> + { X = 0, Y = 0 }
> >> + STORE X=1 LOAD X STORE Y=1
> >> + <read barrier> <general barrier>
> >> + LOAD Y LOAD X
> >> +
> >> +This substitution destroys transitivity: in this example, it is perfectly
> >> +legal for CPU 2's load from X to return 1, its load from Y to return 0,
> >> +and CPU 3's load from X to return 0.
> >> +
> >> +The key point is that although CPU 2's read barrier orders its pair
> >> +of loads, it does not guarantee to order CPU 1's store. Therefore, if
> >> +this example runs on a system where CPUs 1 and 2 share a store buffer
> >> +or a level of cache, CPU 2 might have early access to CPU 1's writes.
> >> +General barriers are therefore required to ensure that all CPUs agree
> >> +on the combined order of CPU 1's and CPU 2's accesses.
> >
> > Sounds like someone had a fun time debugging their code.
> >
> >> +
> >> +To reiterate, if your code requires transitivity, use general barriers
> >> +throughout.
> >
> > I expect that your code is the only code in the kernel that actually
> > requires transitivity ;-)
>
> Maybe, but my RCURING also requires transitivity; I asked Paul for advice
> a year ago when I was writing the patch. Good documentation for it!

Glad you like it!

By the way, what finally got me to get my act together and document
this was a group of patches that implicitly assumed that smp_rmb()
and smp_wmb() provide transitivity...

So, no, it is not just Lai and myself. ;-)

Thanx, Paul

2011-02-23 15:27:12

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] debug rcu head support !PREEMPT config

On Wed, 2011-02-23 at 10:13 -0500, Mathieu Desnoyers wrote:
> Remove DEBUG_RCU_HEAD dependency on PREEMPT config. Handle the inability to
> detect if within a RCU read-side critical section by never performing any
> attempt to recover from a failure situation in the fixup handlers. Just print
> the warnings.
>
> This patch is only compile-tested.
>
> Signed-off-by: Mathieu Desnoyers <[email protected]>
> ---
> kernel/rcupdate.c | 17 +++++++++++++++++
> lib/Kconfig.debug | 2 +-
> 2 files changed, 18 insertions(+), 1 deletion(-)
>
> Index: linux-2.6-lttng/lib/Kconfig.debug
> ===================================================================
> --- linux-2.6-lttng.orig/lib/Kconfig.debug
> +++ linux-2.6-lttng/lib/Kconfig.debug
> @@ -313,7 +313,7 @@ config DEBUG_OBJECTS_WORK
>
> config DEBUG_OBJECTS_RCU_HEAD
> bool "Debug RCU callbacks objects"
> - depends on DEBUG_OBJECTS && PREEMPT
> + depends on DEBUG_OBJECTS
> help
> Enable this to turn on debugging of RCU list heads (call_rcu() usage).
>
> Index: linux-2.6-lttng/kernel/rcupdate.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/rcupdate.c
> +++ linux-2.6-lttng/kernel/rcupdate.c
> @@ -142,7 +142,14 @@ static int rcuhead_fixup_init(void *addr
> * Ensure that queued callbacks are all executed.
> * If we detect that we are nested in a RCU read-side critical
> * section, we should simply fail, otherwise we would deadlock.
> + * In !PREEMPT configurations, there is no way to tell if we are
> + * in a RCU read-side critical section or not, so we never
> + * attempt any fixup and just print a warning.
> */
> +#ifndef CONFIG_PREEMPT
> + WARN_ON(1);
> + return 0;
> +#endif
> if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
> irqs_disabled()) {
> WARN_ON(1);
> @@ -184,7 +191,14 @@ static int rcuhead_fixup_activate(void *
> * Ensure that queued callbacks are all executed.
> * If we detect that we are nested in a RCU read-side critical
> * section, we should simply fail, otherwise we would deadlock.
> + * In !PREEMPT configurations, there is no way to tell if we are
> + * in a RCU read-side critical section or not, so we never
> + * attempt any fixup and just print a warning.
> */
> +#ifndef CONFIG_PREEMPT
> + WARN_ON(1);
> + return 0;
> +#endif
> if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
> irqs_disabled()) {
> WARN_ON(1);
> @@ -214,6 +228,9 @@ static int rcuhead_fixup_free(void *addr
> * Ensure that queued callbacks are all executed.
> * If we detect that we are nested in a RCU read-side critical
> * section, we should simply fail, otherwise we would deadlock.
> + * In !PREEMPT configurations, there is no way to tell if we are
> + * in a RCU read-side critical section or not, so we never
> + * attempt any fixup and just print a warning.
> */
> #ifndef CONFIG_PREEMPT
> WARN_ON(1);

Hmm, I wonder if s/WARN_ON/WARN_ON_ONCE/g is in order. Why spam the
console if it happens to trigger all the time?

-- Steve

2011-02-23 15:37:31

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH] debug rcu head support !PREEMPT config

* Steven Rostedt ([email protected]) wrote:
> On Wed, 2011-02-23 at 10:13 -0500, Mathieu Desnoyers wrote:
> > Remove DEBUG_RCU_HEAD dependency on PREEMPT config. Handle the inability to
> > detect if within a RCU read-side critical section by never performing any
> > attempt to recover from a failure situation in the fixup handlers. Just print
> > the warnings.
> >
> > This patch is only compile-tested.
> >
> > Signed-off-by: Mathieu Desnoyers <[email protected]>
> > ---
> > kernel/rcupdate.c | 17 +++++++++++++++++
> > lib/Kconfig.debug | 2 +-
> > 2 files changed, 18 insertions(+), 1 deletion(-)
> >
> > Index: linux-2.6-lttng/lib/Kconfig.debug
> > ===================================================================
> > --- linux-2.6-lttng.orig/lib/Kconfig.debug
> > +++ linux-2.6-lttng/lib/Kconfig.debug
> > @@ -313,7 +313,7 @@ config DEBUG_OBJECTS_WORK
> >
> > config DEBUG_OBJECTS_RCU_HEAD
> > bool "Debug RCU callbacks objects"
> > - depends on DEBUG_OBJECTS && PREEMPT
> > + depends on DEBUG_OBJECTS
> > help
> > Enable this to turn on debugging of RCU list heads (call_rcu() usage).
> >
> > Index: linux-2.6-lttng/kernel/rcupdate.c
> > ===================================================================
> > --- linux-2.6-lttng.orig/kernel/rcupdate.c
> > +++ linux-2.6-lttng/kernel/rcupdate.c
> > @@ -142,7 +142,14 @@ static int rcuhead_fixup_init(void *addr
> > * Ensure that queued callbacks are all executed.
> > * If we detect that we are nested in a RCU read-side critical
> > * section, we should simply fail, otherwise we would deadlock.
> > + * In !PREEMPT configurations, there is no way to tell if we are
> > + * in a RCU read-side critical section or not, so we never
> > + * attempt any fixup and just print a warning.
> > */
> > +#ifndef CONFIG_PREEMPT
> > + WARN_ON(1);
> > + return 0;
> > +#endif
> > if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
> > irqs_disabled()) {
> > WARN_ON(1);
> > @@ -184,7 +191,14 @@ static int rcuhead_fixup_activate(void *
> > * Ensure that queued callbacks are all executed.
> > * If we detect that we are nested in a RCU read-side critical
> > * section, we should simply fail, otherwise we would deadlock.
> > + * In !PREEMPT configurations, there is no way to tell if we are
> > + * in a RCU read-side critical section or not, so we never
> > + * attempt any fixup and just print a warning.
> > */
> > +#ifndef CONFIG_PREEMPT
> > + WARN_ON(1);
> > + return 0;
> > +#endif
> > if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
> > irqs_disabled()) {
> > WARN_ON(1);
> > @@ -214,6 +228,9 @@ static int rcuhead_fixup_free(void *addr
> > * Ensure that queued callbacks are all executed.
> > * If we detect that we are nested in a RCU read-side critical
> > * section, we should simply fail, otherwise we would deadlock.
> > + * In !PREEMPT configurations, there is no way to tell if we are
> > + * in a RCU read-side critical section or not, so we never
> > + * attempt any fixup and just print a warning.
> > */
> > #ifndef CONFIG_PREEMPT
> > WARN_ON(1);
>
> Hmm, I wonder if s/WARN_ON/WARN_ON_ONCE/g is in order. Why spam the
> console if it happens to trigger all the time?

The system should die pretty soon anyway due to list corruption, so I
don't think it's a problem in practice.

Thanks,

Mathieu

>
> -- Steve
>
>

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2011-02-23 16:18:32

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Tue, Feb 22, 2011 at 05:39:40PM -0800, Paul E. McKenney wrote:
> +/*
> + * Wake up the current CPU's kthread. This replaces raise_softirq()
> + * in earlier versions of RCU.
> + */
> +static void invoke_rcu_kthread(void)
> +{
> + unsigned long flags;
> + wait_queue_head_t *q;
> + int cpu;
> +
> + local_irq_save(flags);
> + cpu = smp_processor_id();
> + if (per_cpu(rcu_cpu_kthread_task, cpu) == NULL) {
> + local_irq_restore(flags);
> + return;
> + }
> + per_cpu(rcu_cpu_has_work, cpu) = 1;
> + q = &per_cpu(rcu_cpu_wq, cpu);

I see you make extensive use of per_cpu() accessors even for
local variables.

I tend to think it's better to use __get_cpu_var() for local
accesses and keep per_cpu() for remote accesses.

There are several reasons for that:

* __get_cpu_var() checks we are in a non-preemptible section,
per_cpu() doesn't. That may sound of a limited interest for code like the
above, but by the time code can move, and then we might lose track of some
things, etc...

* local accesses can be optimized by architectures. per_cpu() implies
finding the local cpu number, and dereference an array of cpu offsets with
that number to find the local cpu offset.
__get_cpu_var() does a direct access to __my_cpu_offset which is a nice
shortcut if the arch implements it.

* It makes code easier to review: we know that __get_cpu_var() is
for local accesses and per_cpu() for remote.

> + wake_up(q);
> + local_irq_restore(flags);
> +}
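
As a concrete illustration of the distinction Frederic is drawing (the flag name
is taken from the quoted patch, where it appears to be a per-CPU char; the rest
is illustrative):

	static DEFINE_PER_CPU(char, rcu_cpu_has_work);

	/* Remote access: the caller supplies the CPU number explicitly. */
	per_cpu(rcu_cpu_has_work, cpu) = 1;

	/*
	 * Local access: debug builds check that preemption is disabled, and
	 * the architecture can use its fast local-CPU addressing.
	 */
	__get_cpu_var(rcu_cpu_has_work) = 1;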

2011-02-23 16:31:13

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu: Add boosting to TREE_PREEMPT_RCU tracing

On Wed, Feb 23, 2011 at 11:07:20AM +0800, Lai Jiangshan wrote:
> On 02/23/2011 09:39 AM, Paul E. McKenney wrote:
> > rcudir = debugfs_create_dir("rcu", NULL);
> > - if (!rcudir)
> > + if (IS_ERR_VALUE(rcudir))
> > goto free_out;
>
> if !defined(CONFIG_DEBUG_FS)
> debugfs_create_xxx() returns ERR_PTR(-ENODEV);
> else
> debugfs_create_xxx() returns NULL when it fails.
>
> Since TREE_RCU_TRACE selects DEBUG_FS, "if (!rcudir)" is correct,
> we don't need to change it.

Yeah, but this really needs to be less fragile. And yes, I do appear
to have broken it in the NULL-pointer case. And who came up with the
API above, anyway? It is an accident waiting to happen.

All whining aside, I need something like the following, correct?

#define IS_ERR_VALUE_OR_NULL(x) \
unlikely((x) - 1 >= (unsigned long)(-MAX_ERRNO-1))

Thanx, Paul

> > retval = debugfs_create_file("rcudata", 0444, rcudir,
> > NULL, &rcudata_fops);
> > - if (!retval)
> > + if (IS_ERR_VALUE(retval))
> > goto free_out;
> >
> > retval = debugfs_create_file("rcudata.csv", 0444, rcudir,
> > NULL, &rcudata_csv_fops);
> > - if (!retval)
> > + if (IS_ERR_VALUE(retval))
> > + goto free_out;
> > +
> > + retval = rcu_boost_trace_create_file(rcudir);
> > + if (retval)
> > goto free_out;
> >
> > retval = debugfs_create_file("rcugp", 0444, rcudir, NULL, &rcugp_fops);
> > - if (!retval)
> > + if (IS_ERR_VALUE(retval))
> > goto free_out;
> >
> > retval = debugfs_create_file("rcuhier", 0444, rcudir,
> > NULL, &rcuhier_fops);
> > - if (!retval)
> > + if (IS_ERR_VALUE(retval))
> > goto free_out;
> >
> > retval = debugfs_create_file("rcu_pending", 0444, rcudir,
> > NULL, &rcu_pending_fops);
> > - if (!retval)
> > + if (IS_ERR_VALUE(retval))
> > goto free_out;
> > return 0;
> > free_out:
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
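
Fully parenthesised, and cast so that it can be applied to the dentry pointers
returned by the debugfs_create_*() calls, the check Paul sketches might look
like this (illustrative only; IS_ERR_VALUE_OR_NULL is not an existing macro):

	#include <linux/err.h>

	/* True for NULL and for ERR_PTR()-style error values. */
	#define IS_ERR_VALUE_OR_NULL(x) \
		unlikely((unsigned long)(x) - 1 >= (unsigned long)(-MAX_ERRNO - 1))

	rcudir = debugfs_create_dir("rcu", NULL);
	if (IS_ERR_VALUE_OR_NULL(rcudir))
		goto free_out;

This covers both conventions Lai points out: the ERR_PTR(-ENODEV) stubs used
when CONFIG_DEBUG_FS is off, and the NULL returned on failure when it is on.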

2011-02-23 16:41:45

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, 2011-02-23 at 17:16 +0100, Frederic Weisbecker wrote:
> On Tue, Feb 22, 2011 at 05:39:40PM -0800, Paul E. McKenney wrote:
> > +/*
> > + * Wake up the current CPU's kthread. This replaces raise_softirq()
> > + * in earlier versions of RCU.
> > + */
> > +static void invoke_rcu_kthread(void)
> > +{
> > + unsigned long flags;
> > + wait_queue_head_t *q;
> > + int cpu;
> > +
> > + local_irq_save(flags);
> > + cpu = smp_processor_id();
> > + if (per_cpu(rcu_cpu_kthread_task, cpu) == NULL) {
> > + local_irq_restore(flags);
> > + return;
> > + }
> > + per_cpu(rcu_cpu_has_work, cpu) = 1;
> > + q = &per_cpu(rcu_cpu_wq, cpu);
>
> I see you make extensive use of per_cpu() accessors even for
> local variables.
>
> I tend to think it's better to use __get_cpu_var() for local
> accesses and keep per_cpu() for remote accesses.
>
> There are several reasons for that:
>
> * __get_cpu_var() checks we are in a non-preemptible section,
> per_cpu() doesn't. That may sound of a limited interest for code like the
> above, but by the time code can move, and then we might lose track of some
> things, etc...

Ah, but so does smp_processor_id() ;-)

>
> * local accesses can be optimized by architectures. per_cpu() implies
> finding the local cpu number, and dereference an array of cpu offsets with
> that number to find the local cpu offset.
> __get_cpu_var() does a direct access to __my_cpu_offset which is a nice
> shortcut if the arch implements it.

True, but we could also argue that the multiple checks for being preempt
can also be an issue.

>
> * It makes code easier to review: we know that __get_cpu_var() is
> for local accesses and per_cpu() for remote.

This I'll agree with you.

In the past, I've thought about which one is better (per_cpu() vs
__get_cpu_var()).

But, that last point is a good one. Just knowing that this is for the
local CPU helps with the amount of info your mind needs to process when
looking at this code. And the mind needs all the help it can get when
reviewing this code ;-)

-- Steve

>
> > + wake_up(q);
> > + local_irq_restore(flags);
> > +}

2011-02-23 16:53:13

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Tue, Feb 22, 2011 at 05:39:40PM -0800, Paul E. McKenney wrote:
> +}
> +
> +/*
> + * Drop to non-real-time priority and yield, but only after posting a
> + * timer that will cause us to regain our real-time priority if we
> + * remain preempted. Either way, we restore our real-time priority
> + * before returning.
> + */
> +static void rcu_yield(int cpu)
> +{
> + struct rcu_data *rdp = per_cpu_ptr(rcu_sched_state.rda, cpu);
> + struct sched_param sp;
> + struct timer_list yield_timer;
> +
> + setup_timer(&yield_timer, rcu_cpu_kthread_timer, (unsigned long)rdp);
> + mod_timer(&yield_timer, jiffies + 2);
> + sp.sched_priority = 0;
> + sched_setscheduler_nocheck(current, SCHED_NORMAL, &sp);
> + schedule();
> + sp.sched_priority = RCU_KTHREAD_PRIO;
> + sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
> + del_timer(&yield_timer);
> +}
> +
> +/*
> + * Handle cases where the rcu_cpu_kthread() ends up on the wrong CPU.
> + * This can happen while the corresponding CPU is either coming online
> + * or going offline. We cannot wait until the CPU is fully online
> + * before starting the kthread, because the various notifier functions
> + * can wait for RCU grace periods. So we park rcu_cpu_kthread() until
> + * the corresponding CPU is online.
> + *
> + * Return 1 if the kthread needs to stop, 0 otherwise.
> + *
> + * Caller must disable bh. This function can momentarily enable it.
> + */
> +static int rcu_cpu_kthread_should_stop(int cpu)
> +{
> + while (cpu_is_offline(cpu) || smp_processor_id() != cpu) {
> + if (kthread_should_stop())
> + return 1;
> + local_bh_enable();
> + schedule_timeout_uninterruptible(1);

Why is it uninterruptible? Well, that doesn't change much anyway.
It can be a problem for kernel threads that sleep for a long time, because
of the hung task detector, but certainly not for 1 jiffy.

> + if (smp_processor_id() != cpu)
> + set_cpus_allowed_ptr(current, cpumask_of(cpu));
> + local_bh_disable();
> + }
> + return 0;
> +}
> +
> +/*
> + * Per-CPU kernel thread that invokes RCU callbacks. This replaces the
> + * earlier RCU softirq.
> + */
> +static int rcu_cpu_kthread(void *arg)
> +{
> + int cpu = (int)(long)arg;
> + unsigned long flags;
> + int spincnt = 0;
> + wait_queue_head_t *wqp = &per_cpu(rcu_cpu_wq, cpu);
> + char work;
> + char *workp = &per_cpu(rcu_cpu_has_work, cpu);
> +
> + for (;;) {
> + wait_event_interruptible(*wqp,
> + *workp != 0 || kthread_should_stop());
> + local_bh_disable();
> + if (rcu_cpu_kthread_should_stop(cpu)) {
> + local_bh_enable();
> + break;
> + }
> + local_irq_save(flags);
> + work = *workp;
> + *workp = 0;
> + local_irq_restore(flags);
> + if (work)
> + rcu_process_callbacks();
> + local_bh_enable();
> + if (*workp != 0)
> + spincnt++;
> + else
> + spincnt = 0;
> + if (spincnt > 10) {
> + rcu_yield(cpu);
> + spincnt = 0;
> + }
> + }
> + return 0;
> +}
> +
> +/*
> + * Per-rcu_node kthread, which is in charge of waking up the per-CPU
> + * kthreads when needed.
> + */
> +static int rcu_node_kthread(void *arg)
> +{
> + int cpu;
> + unsigned long flags;
> + unsigned long mask;
> + struct rcu_node *rnp = (struct rcu_node *)arg;
> + struct sched_param sp;
> + struct task_struct *t;
> +
> + for (;;) {
> + wait_event_interruptible(rnp->node_wq, rnp->wakemask != 0 ||
> + kthread_should_stop());
> + if (kthread_should_stop())
> + break;
> + raw_spin_lock_irqsave(&rnp->lock, flags);
> + mask = rnp->wakemask;
> + rnp->wakemask = 0;
> + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> + for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1) {
> + if ((mask & 0x1) == 0)
> + continue;
> + preempt_disable();
> + per_cpu(rcu_cpu_has_work, cpu) = 1;
> + t = per_cpu(rcu_cpu_kthread_task, cpu);
> + if (t == NULL) {
> + preempt_enable();
> + continue;
> + }
> + sp.sched_priority = RCU_KTHREAD_PRIO;
> + sched_setscheduler_nocheck(t, cpu, &sp);
> + wake_up_process(t);

My (mis?)understanding of the picture is that this node kthread is there to
wake up cpu threads that called rcu_yield(). But rcu_yield() doesn't actually
put the cpu thread to sleep; instead it switches it to SCHED_NORMAL,
to avoid starving the system with callbacks.

So I wonder if this wake_up_process() is actually relevant.
sched_setscheduler_nocheck() already handles the per sched policy rq migration
and the process is not sleeping.

That said, by that time the process may have gone to sleep, because if no other
SCHED_NORMAL task was there, it just continued and may have flushed
all the callbacks. So this wake_up_process() may actually wake up the task,
but it will go back to sleep right away due to the condition in the cpu
thread's wait_event_interruptible().

Right?

> + preempt_enable();
> + }
> + }
> + return 0;
> +}

2011-02-23 17:04:04

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

* Steven Rostedt ([email protected]) wrote:
> On Wed, 2011-02-23 at 17:16 +0100, Frederic Weisbecker wrote:
> > On Tue, Feb 22, 2011 at 05:39:40PM -0800, Paul E. McKenney wrote:
> > > +/*
> > > + * Wake up the current CPU's kthread. This replaces raise_softirq()
> > > + * in earlier versions of RCU.
> > > + */
> > > +static void invoke_rcu_kthread(void)
> > > +{
> > > + unsigned long flags;
> > > + wait_queue_head_t *q;
> > > + int cpu;
> > > +
> > > + local_irq_save(flags);
> > > + cpu = smp_processor_id();
> > > + if (per_cpu(rcu_cpu_kthread_task, cpu) == NULL) {
> > > + local_irq_restore(flags);
> > > + return;
> > > + }
> > > + per_cpu(rcu_cpu_has_work, cpu) = 1;
> > > + q = &per_cpu(rcu_cpu_wq, cpu);
> >
> > I see you make extensive use of per_cpu() accessors even for
> > local variables.
> >
> > I tend to think it's better to use __get_cpu_var() for local
> > accesses and keep per_cpu() for remote accesses.
> >
> > There are several reasons for that:
> >
> > * __get_cpu_var() checks we are in a non-preemptible section,
> > per_cpu() doesn't. That may sound of a limited interest for code like the
> > above, but by the time code can move, and then we might lose track of some
> > things, etc...
>
> Ah, but so does smp_processor_id() ;-)
>
> >
> > * local accesses can be optimized by architectures. per_cpu() implies
> > finding the local cpu number, and dereference an array of cpu offsets with
> > that number to find the local cpu offset.
> > __get_cpu_var() does a direct access to __my_cpu_offset which is a nice
> > shortcut if the arch implements it.

[Adding Christoph Lameter to CC list]

This is not quite true on x86_64 and s390 anymore. __get_cpu_var() now
uses a segment selector override to get the local CPU variable on x86.
See x86's percpu.h for details.

So even performance-wise, using __get_cpu_var() over per_cpu() should be
a win on widely used architectures nowadays, thanks to Christoph's work
on this_cpu accessors.

>
> True, but we could also argue that the multiple checks for being preempt
> can also be an issue.

At least on x86 preemption doesn't actually need to be disabled: selection
of the right per-cpu memory location is done atomically with the rest of
the instruction by the segment selector.

>
> >
> > * It makes code easier to review: we know that __get_cpu_var() is
> > for local accesses and per_cpu() for remote.
>
> This I'll agree with you.
>
> In the past, I've thought about which one is better (per_cpu() vs
> __get_cpu_var()).
>
> But, that last point is a good one. Just knowing that this is for the
> local CPU helps with the amount of info your mind needs to process when
> looking at this code. And the mind needs all the help it can get when
> reviewing this code ;-)
>

Agreed, better documentation of the code is also a win.

Thanks,

Mathieu

> -- Steve
>
> >
> > > + wake_up(q);
> > > + local_irq_restore(flags);
> > > +}
>
>

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2011-02-23 17:17:12

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, Feb 23, 2011 at 11:41:42AM -0500, Steven Rostedt wrote:
> On Wed, 2011-02-23 at 17:16 +0100, Frederic Weisbecker wrote:
> > I see you make extensive use of per_cpu() accessors even for
> > local variables.
> >
> > I tend to think it's better to use __get_cpu_var() for local
> > accesses and keep per_cpu() for remote accesses.
> >
> > There are several reasons for that:
> >
> > * __get_cpu_var() checks we are in a non-preemptible section,
> > per_cpu() doesn't. That may sound of a limited interest for code like the
> > above, but by the time code can move, and then we might lose track of some
> > things, etc...
>
> Ah, but so does smp_processor_id() ;-)

Ah, right :)

> >
> > * local accesses can be optimized by architectures. per_cpu() implies
> > finding the local cpu number, and dereference an array of cpu offsets with
> > that number to find the local cpu offset.
> > __get_cpu_var() does a direct access to __my_cpu_offset which is a nice
> > shortcut if the arch implements it.
>
> True, but we could also argue that the multiple checks for being preempt
> can also be an issue.

It's only in debugging code, so it's not really an issue.

> >
> > * It makes code easier to review: we know that __get_cpu_var() is
> > for local accesses and per_cpu() for remote.
>
> This I'll agree with you.
>
> In the past, I've thought about which one is better (per_cpu() vs
> __get_cpu_var()).
>
> But, that last point is a good one. Just knowing that this is for the
> local CPU helps with the amount of info your mind needs to process when
> looking at this code. And the mind needs all the help it can get when
> reviewing this code ;-)

Yeah :-)

And that becomes even better when you have the opportunity to
use get_cpu_var(). So that tells readers you are doing a local access
and the reason why you disable/enable preemption.

2011-02-23 17:34:21

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, Feb 23, 2011 at 12:03:56PM -0500, Mathieu Desnoyers wrote:
> * Steven Rostedt ([email protected]) wrote:
> > On Wed, 2011-02-23 at 17:16 +0100, Frederic Weisbecker wrote:
> > > On Tue, Feb 22, 2011 at 05:39:40PM -0800, Paul E. McKenney wrote:
> > > > +/*
> > > > + * Wake up the current CPU's kthread. This replaces raise_softirq()
> > > > + * in earlier versions of RCU.
> > > > + */
> > > > +static void invoke_rcu_kthread(void)
> > > > +{
> > > > + unsigned long flags;
> > > > + wait_queue_head_t *q;
> > > > + int cpu;
> > > > +
> > > > + local_irq_save(flags);
> > > > + cpu = smp_processor_id();
> > > > + if (per_cpu(rcu_cpu_kthread_task, cpu) == NULL) {
> > > > + local_irq_restore(flags);
> > > > + return;
> > > > + }
> > > > + per_cpu(rcu_cpu_has_work, cpu) = 1;
> > > > + q = &per_cpu(rcu_cpu_wq, cpu);
> > >
> > > I see you make extensive use of per_cpu() accessors even for
> > > local variables.
> > >
> > > I tend to think it's better to use __get_cpu_var() for local
> > > accesses and keep per_cpu() for remote accesses.
> > >
> > > There are several reasons for that:
> > >
> > > * __get_cpu_var() checks we are in a non-preemptible section,
> > > per_cpu() doesn't. That may sound of a limited interest for code like the
> > > above, but by the time code can move, and then we might lose track of some
> > > things, etc...
> >
> > Ah, but so does smp_processor_id() ;-)
> >
> > >
> > > * local accesses can be optimized by architectures. per_cpu() implies
> > > finding the local cpu number, and dereference an array of cpu offsets with
> > > that number to find the local cpu offset.
> > > __get_cpu_var() does a direct access to __my_cpu_offset which is a nice
> > > shortcut if the arch implements it.
>
> [Adding Christoph Lameter to CC list]
>
> This is not quite true on x86_64 and s390 anymore. __get_cpu_var() now
> uses a segment selector override to get the local CPU variable on x86.
> See x86's percpu.h for details.
>
> So even performance-wise, using __get_cpu_var() over per_cpu() should be
> a win on widely used architectures nowadays,

Looking at x86_64, it indeed optimizes further by overriding this_cpu_ptr().
It does the same as the generic this_cpu_ptr() on an
overridden my_cpu_offset, but it also saves a temporary store.

>
> >
> > True, but we could also argue that the multiple checks for being preempt
> > can also be an issue.
>
> At least on x86 preemption doesn't actually need to be disabled: selection
> of the right per-cpu memory location is done atomically with the rest of
> the instruction by the segment selector.

It depends on the case: you may still need to disable preemption if you use
your variable for more than just a quick op, which is often the case.

That holds up to this_cpu_add()-style ops, depending on what the arch is capable
of wrt. local atomicity.

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, 23 Feb 2011, Mathieu Desnoyers wrote:

> > > > +
> > > > + local_irq_save(flags);
> > > > + cpu = smp_processor_id();

Drop this line.

> > > > + if (per_cpu(rcu_cpu_kthread_task, cpu) == NULL) {

use this_cpu_read(rcu_cpu_kthread_task)

> > > > + local_irq_restore(flags);
> > > > + return;
> > > > + }
> > > > + per_cpu(rcu_cpu_has_work, cpu) = 1;

this_cpu_write(rcu_cpu_has_work, 1);

> This is not quite true on x86_64 and s390 anymore. __get_cpu_var() now
> uses a segment selector override to get the local CPU variable on x86.
> See x86's percpu.h for details.

__get_cpu_var cannot use a segment override since there are places where
the address of the variable is taken. One needs to use this_cpu_ops for
that.


> > True, but we could also argue that the multiple checks for being preempt
> > can also be an issue.
>
> > At least on x86 preemption doesn't actually need to be disabled: selection
> of the right per-cpu memory location is done atomically with the rest of
> the instruction by the segment selector.

Right.
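
Pulling Christoph's suggestions together, invoke_rcu_kthread() might end up
looking like the sketch below (based on the quoted patch; the wait queue still
needs __get_cpu_var() because its address is taken):

	static void invoke_rcu_kthread(void)
	{
		unsigned long flags;

		local_irq_save(flags);	/* irqs off, so we cannot migrate */
		if (this_cpu_read(rcu_cpu_kthread_task) == NULL) {
			local_irq_restore(flags);
			return;
		}
		this_cpu_write(rcu_cpu_has_work, 1);
		wake_up(&__get_cpu_var(rcu_cpu_wq));
		local_irq_restore(flags);
	}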

2011-02-23 17:49:15

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] debug rcu head support !PREEMPT config

On Wed, Feb 23, 2011 at 10:13:19AM -0500, Mathieu Desnoyers wrote:
> Remove DEBUG_RCU_HEAD dependency on PREEMPT config. Handle the inability to
> detect if within a RCU read-side critical section by never performing any
> attempt to recover from a failure situation in the fixup handlers. Just print
> the warnings.
>
> This patch is only compile-tested.

Queued with mods due to skew, thank you Mathieu!

Thanx, Paul

> Signed-off-by: Mathieu Desnoyers <[email protected]>
> ---
> kernel/rcupdate.c | 17 +++++++++++++++++
> lib/Kconfig.debug | 2 +-
> 2 files changed, 18 insertions(+), 1 deletion(-)
>
> Index: linux-2.6-lttng/lib/Kconfig.debug
> ===================================================================
> --- linux-2.6-lttng.orig/lib/Kconfig.debug
> +++ linux-2.6-lttng/lib/Kconfig.debug
> @@ -313,7 +313,7 @@ config DEBUG_OBJECTS_WORK
>
> config DEBUG_OBJECTS_RCU_HEAD
> bool "Debug RCU callbacks objects"
> - depends on DEBUG_OBJECTS && PREEMPT
> + depends on DEBUG_OBJECTS
> help
> Enable this to turn on debugging of RCU list heads (call_rcu() usage).
>
> Index: linux-2.6-lttng/kernel/rcupdate.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/rcupdate.c
> +++ linux-2.6-lttng/kernel/rcupdate.c
> @@ -142,7 +142,14 @@ static int rcuhead_fixup_init(void *addr
> * Ensure that queued callbacks are all executed.
> * If we detect that we are nested in a RCU read-side critical
> * section, we should simply fail, otherwise we would deadlock.
> + * In !PREEMPT configurations, there is no way to tell if we are
> + * in a RCU read-side critical section or not, so we never
> + * attempt any fixup and just print a warning.
> */
> +#ifndef CONFIG_PREEMPT
> + WARN_ON(1);
> + return 0;
> +#endif
> if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
> irqs_disabled()) {
> WARN_ON(1);
> @@ -184,7 +191,14 @@ static int rcuhead_fixup_activate(void *
> * Ensure that queued callbacks are all executed.
> * If we detect that we are nested in a RCU read-side critical
> * section, we should simply fail, otherwise we would deadlock.
> + * In !PREEMPT configurations, there is no way to tell if we are
> + * in a RCU read-side critical section or not, so we never
> + * attempt any fixup and just print a warning.
> */
> +#ifndef CONFIG_PREEMPT
> + WARN_ON(1);
> + return 0;
> +#endif
> if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
> irqs_disabled()) {
> WARN_ON(1);
> @@ -214,6 +228,9 @@ static int rcuhead_fixup_free(void *addr
> * Ensure that queued callbacks are all executed.
> * If we detect that we are nested in a RCU read-side critical
> * section, we should simply fail, otherwise we would deadlock.
> + * In !PREEMPT configurations, there is no way to tell if we are
> + * in a RCU read-side critical section or not, so we never
> + * attempt any fixup and just print a warning.
> */
> #ifndef CONFIG_PREEMPT
> WARN_ON(1);
> --
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com

2011-02-23 18:17:12

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, 2011-02-23 at 11:34 -0600, Christoph Lameter wrote:

> > > True, but we could also argue that the multiple checks for being preempt
> > > can also be an issue.
> >
> > At least on x86 preemption don't actually need to be disabled: selection
> > of the right per-cpu memory location is done atomically with the rest of
> > the instruction by the segment selector.
>
> Right.

But a test still needs to be made, because three accesses of this_cpu_*()
that get preempted and scheduled on another CPU can access a different
CPU var for each access. It does not matter how atomic the
this_cpu_*() code is.

IOW, the use of this_cpu_*() without preemption disabled is 99% of the
time a bug.
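
As a sketch of the failure mode (my_flag is a hypothetical per-cpu variable;
each individual op is atomic, yet the pair taken together is not):

#include <linux/percpu.h>
#include <linux/printk.h>

static DEFINE_PER_CPU(int, my_flag);	/* hypothetical */

static void buggy(void)
{
	this_cpu_write(my_flag, 1);		/* may execute on CPU 0 */
	/* preemption + migration can happen right here */
	if (this_cpu_read(my_flag) == 0)	/* may now read CPU 1's copy */
		pr_info("lost sight of our own update\n");
}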

-- Steve

2011-02-23 18:27:19

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

* Christoph Lameter ([email protected]) wrote:
> On Wed, 23 Feb 2011, Mathieu Desnoyers wrote:
>
> > > > > +
> > > > > + local_irq_save(flags);
> > > > > + cpu = smp_processor_id();
>
> Drop this line.
>
> > > > > + if (per_cpu(rcu_cpu_kthread_task, cpu) == NULL) {
>
> use this_cpu_read(rcu_cpu_kthread_task)
>
> > > > > + local_irq_restore(flags);
> > > > > + return;
> > > > > + }
> > > > > + per_cpu(rcu_cpu_has_work, cpu) = 1;
>
> this_cpu_write(rcu_cpu_has_work, 1);
>
> > This is not quite true on x86_64 and s390 anymore. __get_cpu_var() now
> > uses a segment selector override to get the local CPU variable on x86.
> > See x86's percpu.h for details.
>
> __get_cpu_var cannot use a segment override since there are places where
> the address of the variable is taken. One needs to use this_cpu_ops for
> that.

Ah, thanks for the clarification :)

>
>
> > > True, but we could also argue that the multiple checks for being preempt
> > > can also be an issue.
> >
> > At least on x86 preemption don't actually need to be disabled: selection
> > of the right per-cpu memory location is done atomically with the rest of
> > the instruction by the segment selector.
>
> Right.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, 23 Feb 2011, Steven Rostedt wrote:

> On Wed, 2011-02-23 at 11:34 -0600, Christoph Lameter wrote:
>
> > > > True, but we could also argue that the multiple checks for being preempt
> > > > can also be an issue.
> > >
> > > At least on x86 preemption don't actually need to be disabled: selection
> > > of the right per-cpu memory location is done atomically with the rest of
> > > the instruction by the segment selector.
> >
> > Right.
>
> But a test still needs to be made. Because three access of this_cpu_*()
> that gets preempted and scheduled on another CPU can access a different
> CPU var for each access. This does not matter how atomic the
> this_cpu_*() code is.

Right. If the kthread context can be rescheduled, then either preemption
needs to be disabled to guarantee that all three accesses hit the same per-cpu
area data, or the code needs to be changed in such a way that a this_cpu
RMW instruction can do the mods in one go.
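
A sketch of the two options, using a hypothetical per-cpu counter:

#include <linux/percpu.h>
#include <linux/preempt.h>

static DEFINE_PER_CPU(unsigned long, my_counter);	/* hypothetical */

static void fixed_by_pinning(void)
{
	unsigned long v;

	preempt_disable();			/* read and write hit the same CPU */
	v = __this_cpu_read(my_counter);
	if (v < 100)
		__this_cpu_write(my_counter, v + 1);
	preempt_enable();
}

static void fixed_by_single_rmw(void)
{
	this_cpu_inc(my_counter);		/* one RMW, atomic wrt preemption */
}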

2011-02-23 18:31:15

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] debug rcu head support !PREEMPT config

On Wed, Feb 23, 2011 at 10:37:26AM -0500, Mathieu Desnoyers wrote:
> * Steven Rostedt ([email protected]) wrote:
> > On Wed, 2011-02-23 at 10:13 -0500, Mathieu Desnoyers wrote:
> > > Remove DEBUG_RCU_HEAD dependency on PREEMPT config. Handle the unability to
> > > detect if within a RCU read-side critical section by never performing any
> > > attempt to recover from a failure situation in the fixup handlers. Just print
> > > the warnings.
> > >
> > > This patch is only compile-tested.
> > >
> > > Signed-off-by: Mathieu Desnoyers <[email protected]>
> > > ---
> > > kernel/rcupdate.c | 17 +++++++++++++++++
> > > lib/Kconfig.debug | 2 +-
> > > 2 files changed, 18 insertions(+), 1 deletion(-)
> > >
> > > Index: linux-2.6-lttng/lib/Kconfig.debug
> > > ===================================================================
> > > --- linux-2.6-lttng.orig/lib/Kconfig.debug
> > > +++ linux-2.6-lttng/lib/Kconfig.debug
> > > @@ -313,7 +313,7 @@ config DEBUG_OBJECTS_WORK
> > >
> > > config DEBUG_OBJECTS_RCU_HEAD
> > > bool "Debug RCU callbacks objects"
> > > - depends on DEBUG_OBJECTS && PREEMPT
> > > + depends on DEBUG_OBJECTS
> > > help
> > > Enable this to turn on debugging of RCU list heads (call_rcu() usage).
> > >
> > > Index: linux-2.6-lttng/kernel/rcupdate.c
> > > ===================================================================
> > > --- linux-2.6-lttng.orig/kernel/rcupdate.c
> > > +++ linux-2.6-lttng/kernel/rcupdate.c
> > > @@ -142,7 +142,14 @@ static int rcuhead_fixup_init(void *addr
> > > * Ensure that queued callbacks are all executed.
> > > * If we detect that we are nested in a RCU read-side critical
> > > * section, we should simply fail, otherwise we would deadlock.
> > > + * In !PREEMPT configurations, there is no way to tell if we are
> > > + * in a RCU read-side critical section or not, so we never
> > > + * attempt any fixup and just print a warning.
> > > */
> > > +#ifndef CONFIG_PREEMPT
> > > + WARN_ON(1);
> > > + return 0;
> > > +#endif
> > > if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
> > > irqs_disabled()) {
> > > WARN_ON(1);
> > > @@ -184,7 +191,14 @@ static int rcuhead_fixup_activate(void *
> > > * Ensure that queued callbacks are all executed.
> > > * If we detect that we are nested in a RCU read-side critical
> > > * section, we should simply fail, otherwise we would deadlock.
> > > + * In !PREEMPT configurations, there is no way to tell if we are
> > > + * in a RCU read-side critical section or not, so we never
> > > + * attempt any fixup and just print a warning.
> > > */
> > > +#ifndef CONFIG_PREEMPT
> > > + WARN_ON(1);
> > > + return 0;
> > > +#endif
> > > if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
> > > irqs_disabled()) {
> > > WARN_ON(1);
> > > @@ -214,6 +228,9 @@ static int rcuhead_fixup_free(void *addr
> > > * Ensure that queued callbacks are all executed.
> > > * If we detect that we are nested in a RCU read-side critical
> > > * section, we should simply fail, otherwise we would deadlock.
> > > + * In !PREEMPT configurations, there is no way to tell if we are
> > > + * in a RCU read-side critical section or not, so we never
> > > + * attempt any fixup and just print a warning.
> > > */
> > > #ifndef CONFIG_PREEMPT
> > > WARN_ON(1);
> >
> > Hmm, I wonder if s/WARN_ON/WARN_ON_ONCE/g is in order. Why spam the
> > console if it happens to trigger all the time?
>
> The system should die pretty soon anyway due to list corruption, so I
> don't think it's a problem in practice.

Well, it would add noise, so I added a patch converting the WARN_ON's
to WARN_ON_ONCE's.

Thanx, Paul

2011-02-23 18:32:49

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, 2011-02-23 at 12:29 -0600, Christoph Lameter wrote:

> Right if the kthread context can be rescheduled then either preemption
> needs to be disabled to guarantee that all three access the same per cpu
> area data or the code needs to be changed in such a way that a this_cpu
> RMW instructions can do the mods in one go.

Are you suggesting a this_cpu_atomic_inc()?

-- Steve

2011-02-23 18:38:19

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

* Steven Rostedt ([email protected]) wrote:
> On Wed, 2011-02-23 at 11:34 -0600, Christoph Lameter wrote:
>
> > > > True, but we could also argue that the multiple checks for being preempt
> > > > can also be an issue.
> > >
> > > At least on x86 preemption don't actually need to be disabled: selection
> > > of the right per-cpu memory location is done atomically with the rest of
> > > the instruction by the segment selector.
> >
> > Right.
>
> But a test still needs to be made. Because three access of this_cpu_*()
> that gets preempted and scheduled on another CPU can access a different
> CPU var for each access. This does not matter how atomic the
> this_cpu_*() code is.
>
> IOW, the use of this_cpu_*() without preemption disabled is 99% of the
> time a bug.

Agreed. Unless the algorithm is carefully crafted to allow being
migrated in between the ops (which is a non-trivial exercise), these
would be bugs.

As far as I am aware, there are at least two cases where leaving preemption
enabled makes sense. First, if we use cmpxchg or add_return, these can be
followed by tests on the return value, allowing preemptible fast paths that
tolerate migration between the per-cpu op and the following test (useful to
keep per-cpu counters that are summed into global counters when some
power-of-2 threshold value is reached in the low-order bits); a rough sketch
follows below. Second, using a CPU-ID field within an atomically updated
variable can also allow detection of migration between the operations. In
addition, there is the trivial single-update case, but that really does not
count as a non-trivial algorithm.
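
A rough sketch of the add_return counter case (all names are made up, and
the threshold is a power of 2): migration between the per-cpu add and the
test is harmless, because the return value was obtained atomically and the
global counter is itself atomic.

#include <linux/percpu.h>
#include <asm/atomic.h>

#define FOLD_THRESHOLD	128	/* power of 2, hypothetical */

static DEFINE_PER_CPU(unsigned long, local_count);
static atomic_long_t global_count = ATOMIC_LONG_INIT(0);

static void count_event(void)
{
	unsigned long c;

	c = this_cpu_add_return(local_count, 1);	/* per-cpu atomic RMW */
	if ((c & (FOLD_THRESHOLD - 1)) == 0)
		atomic_long_add(FOLD_THRESHOLD, &global_count);
}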

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2011-02-23 18:40:23

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH] debug rcu head support !PREEMPT config

* Paul E. McKenney ([email protected]) wrote:
> On Wed, Feb 23, 2011 at 10:37:26AM -0500, Mathieu Desnoyers wrote:
> > * Steven Rostedt ([email protected]) wrote:
> > > On Wed, 2011-02-23 at 10:13 -0500, Mathieu Desnoyers wrote:
> > > > Remove DEBUG_RCU_HEAD dependency on PREEMPT config. Handle the unability to
> > > > detect if within a RCU read-side critical section by never performing any
> > > > attempt to recover from a failure situation in the fixup handlers. Just print
> > > > the warnings.
> > > >
> > > > This patch is only compile-tested.
> > > >
> > > > Signed-off-by: Mathieu Desnoyers <[email protected]>
> > > > ---
> > > > kernel/rcupdate.c | 17 +++++++++++++++++
> > > > lib/Kconfig.debug | 2 +-
> > > > 2 files changed, 18 insertions(+), 1 deletion(-)
> > > >
> > > > Index: linux-2.6-lttng/lib/Kconfig.debug
> > > > ===================================================================
> > > > --- linux-2.6-lttng.orig/lib/Kconfig.debug
> > > > +++ linux-2.6-lttng/lib/Kconfig.debug
> > > > @@ -313,7 +313,7 @@ config DEBUG_OBJECTS_WORK
> > > >
> > > > config DEBUG_OBJECTS_RCU_HEAD
> > > > bool "Debug RCU callbacks objects"
> > > > - depends on DEBUG_OBJECTS && PREEMPT
> > > > + depends on DEBUG_OBJECTS
> > > > help
> > > > Enable this to turn on debugging of RCU list heads (call_rcu() usage).
> > > >
> > > > Index: linux-2.6-lttng/kernel/rcupdate.c
> > > > ===================================================================
> > > > --- linux-2.6-lttng.orig/kernel/rcupdate.c
> > > > +++ linux-2.6-lttng/kernel/rcupdate.c
> > > > @@ -142,7 +142,14 @@ static int rcuhead_fixup_init(void *addr
> > > > * Ensure that queued callbacks are all executed.
> > > > * If we detect that we are nested in a RCU read-side critical
> > > > * section, we should simply fail, otherwise we would deadlock.
> > > > + * In !PREEMPT configurations, there is no way to tell if we are
> > > > + * in a RCU read-side critical section or not, so we never
> > > > + * attempt any fixup and just print a warning.
> > > > */
> > > > +#ifndef CONFIG_PREEMPT
> > > > + WARN_ON(1);
> > > > + return 0;
> > > > +#endif
> > > > if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
> > > > irqs_disabled()) {
> > > > WARN_ON(1);
> > > > @@ -184,7 +191,14 @@ static int rcuhead_fixup_activate(void *
> > > > * Ensure that queued callbacks are all executed.
> > > > * If we detect that we are nested in a RCU read-side critical
> > > > * section, we should simply fail, otherwise we would deadlock.
> > > > + * In !PREEMPT configurations, there is no way to tell if we are
> > > > + * in a RCU read-side critical section or not, so we never
> > > > + * attempt any fixup and just print a warning.
> > > > */
> > > > +#ifndef CONFIG_PREEMPT
> > > > + WARN_ON(1);
> > > > + return 0;
> > > > +#endif
> > > > if (rcu_preempt_depth() != 0 || preempt_count() != 0 ||
> > > > irqs_disabled()) {
> > > > WARN_ON(1);
> > > > @@ -214,6 +228,9 @@ static int rcuhead_fixup_free(void *addr
> > > > * Ensure that queued callbacks are all executed.
> > > > * If we detect that we are nested in a RCU read-side critical
> > > > * section, we should simply fail, otherwise we would deadlock.
> > > > + * In !PREEMPT configurations, there is no way to tell if we are
> > > > + * in a RCU read-side critical section or not, so we never
> > > > + * attempt any fixup and just print a warning.
> > > > */
> > > > #ifndef CONFIG_PREEMPT
> > > > WARN_ON(1);
> > >
> > > Hmm, I wonder if s/WARN_ON/WARN_ON_ONCE/g is in order. Why spam the
> > > console if it happens to trigger all the time?
> >
> > The system should die pretty soon anyway due to list corruption, so I
> > don't think it's a problem in practice.
>
> Well, it would add noise, so I added a patch converting the WARN_ON's
> to WARN_ON_ONCE's.

Thinking about it, it makes a lot of sense. Thanks!

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2011-02-23 18:52:48

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, Feb 23, 2011 at 09:02:01AM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney ([email protected]) wrote:
> [...]
> > + for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1)
> > + if (mask & 01)
>
> Wow, octal notation! ;)

/me pauses to thank the octal-notation PDP-12, CDC 3300, CDC Cyber 73,
and PDP-11 systems for funding my college education. The two CDC
systems had ones-complement in the bargain. ;-)

> Maybe we could consider 0x1 here.

Good catch, fixed. The hex says "bitmask" to me.

Thanx, Paul

2011-02-23 19:06:14

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, Feb 23, 2011 at 05:50:46PM +0100, Frederic Weisbecker wrote:
> On Tue, Feb 22, 2011 at 05:39:40PM -0800, Paul E. McKenney wrote:
> > +}
> > +
> > +/*
> > + * Drop to non-real-time priority and yield, but only after posting a
> > + * timer that will cause us to regain our real-time priority if we
> > + * remain preempted. Either way, we restore our real-time priority
> > + * before returning.
> > + */
> > +static void rcu_yield(int cpu)
> > +{
> > + struct rcu_data *rdp = per_cpu_ptr(rcu_sched_state.rda, cpu);
> > + struct sched_param sp;
> > + struct timer_list yield_timer;
> > +
> > + setup_timer(&yield_timer, rcu_cpu_kthread_timer, (unsigned long)rdp);
> > + mod_timer(&yield_timer, jiffies + 2);
> > + sp.sched_priority = 0;
> > + sched_setscheduler_nocheck(current, SCHED_NORMAL, &sp);
> > + schedule();
> > + sp.sched_priority = RCU_KTHREAD_PRIO;
> > + sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
> > + del_timer(&yield_timer);
> > +}
> > +
> > +/*
> > + * Handle cases where the rcu_cpu_kthread() ends up on the wrong CPU.
> > + * This can happen while the corresponding CPU is either coming online
> > + * or going offline. We cannot wait until the CPU is fully online
> > + * before starting the kthread, because the various notifier functions
> > + * can wait for RCU grace periods. So we park rcu_cpu_kthread() until
> > + * the corresponding CPU is online.
> > + *
> > + * Return 1 if the kthread needs to stop, 0 otherwise.
> > + *
> > + * Caller must disable bh. This function can momentarily enable it.
> > + */
> > +static int rcu_cpu_kthread_should_stop(int cpu)
> > +{
> > + while (cpu_is_offline(cpu) || smp_processor_id() != cpu) {
> > + if (kthread_should_stop())
> > + return 1;
> > + local_bh_enable();
> > + schedule_timeout_uninterruptible(1);
>
> Why is it uninterruptible? Well that doesn't change much anyway.
> It can be a problem for long time sleeping kernel threads because of
> the hung task detector, but certainly not for 1 jiffy.

Yep, and the next patch does in fact change this to
schedule_timeout_interruptible().

Good eyes, though!

Thanx, Paul

> > + if (smp_processor_id() != cpu)
> > + set_cpus_allowed_ptr(current, cpumask_of(cpu));
> > + local_bh_disable();
> > + }
> > + return 0;
> > +}
> > +
> > +/*
> > + * Per-CPU kernel thread that invokes RCU callbacks. This replaces the
> > + * earlier RCU softirq.
> > + */
> > +static int rcu_cpu_kthread(void *arg)
> > +{
> > + int cpu = (int)(long)arg;
> > + unsigned long flags;
> > + int spincnt = 0;
> > + wait_queue_head_t *wqp = &per_cpu(rcu_cpu_wq, cpu);
> > + char work;
> > + char *workp = &per_cpu(rcu_cpu_has_work, cpu);
> > +
> > + for (;;) {
> > + wait_event_interruptible(*wqp,
> > + *workp != 0 || kthread_should_stop());
> > + local_bh_disable();
> > + if (rcu_cpu_kthread_should_stop(cpu)) {
> > + local_bh_enable();
> > + break;
> > + }
> > + local_irq_save(flags);
> > + work = *workp;
> > + *workp = 0;
> > + local_irq_restore(flags);
> > + if (work)
> > + rcu_process_callbacks();
> > + local_bh_enable();
> > + if (*workp != 0)
> > + spincnt++;
> > + else
> > + spincnt = 0;
> > + if (spincnt > 10) {
> > + rcu_yield(cpu);
> > + spincnt = 0;
> > + }
> > + }
> > + return 0;
> > +}
> > +
> > +/*
> > + * Per-rcu_node kthread, which is in charge of waking up the per-CPU
> > + * kthreads when needed.
> > + */
> > +static int rcu_node_kthread(void *arg)
> > +{
> > + int cpu;
> > + unsigned long flags;
> > + unsigned long mask;
> > + struct rcu_node *rnp = (struct rcu_node *)arg;
> > + struct sched_param sp;
> > + struct task_struct *t;
> > +
> > + for (;;) {
> > + wait_event_interruptible(rnp->node_wq, rnp->wakemask != 0 ||
> > + kthread_should_stop());
> > + if (kthread_should_stop())
> > + break;
> > + raw_spin_lock_irqsave(&rnp->lock, flags);
> > + mask = rnp->wakemask;
> > + rnp->wakemask = 0;
> > + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > + for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1) {
> > + if ((mask & 0x1) == 0)
> > + continue;
> > + preempt_disable();
> > + per_cpu(rcu_cpu_has_work, cpu) = 1;
> > + t = per_cpu(rcu_cpu_kthread_task, cpu);
> > + if (t == NULL) {
> > + preempt_enable();
> > + continue;
> > + }
> > + sp.sched_priority = RCU_KTHREAD_PRIO;
> > + sched_setscheduler_nocheck(t, cpu, &sp);
> > + wake_up_process(t);
>
> My (mis?)understanding of the picture is this node kthread is there to
> wake up cpu threads that called rcu_yield(). But actually rcu_yield()
> doesn't make the cpu thread sleeping, instead it switches to SCHED_NORMAL,
> to avoid starving the system with callbacks.
>
> So I wonder if this wake_up_process() is actually relevant.
> sched_setscheduler_nocheck() already handles the per sched policy rq migration
> and the process is not sleeping.
>
> That said, by the time the process may have gone to sleep, because if no other
> SCHED_NORMAL task was there, it has just continued and may have flushed
> every callbacks. So this wake_up_process() may actually wake up the task
> but it will sleep again right away due to the condition in wait_event_interruptible()
> of the cpu thread.
>
> Right?
>
> > + preempt_enable();
> > + }
> > + }
> > + return 0;
> > +}

2011-02-23 19:10:48

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, Feb 23, 2011 at 11:34:32AM -0600, Christoph Lameter wrote:
> On Wed, 23 Feb 2011, Mathieu Desnoyers wrote:
>
> > > > > +
> > > > > + local_irq_save(flags);
> > > > > + cpu = smp_processor_id();
>
> Drop this line.
>
> > > > > + if (per_cpu(rcu_cpu_kthread_task, cpu) == NULL) {
>
> use this_cpu_read(rcu_cpu_kthread_task)
>
> > > > > + local_irq_restore(flags);
> > > > > + return;
> > > > > + }
> > > > > + per_cpu(rcu_cpu_has_work, cpu) = 1;
>
> this_cpu_write(rcu_cpu_has_work, 1);

I have made these changes, thank you!

These do introduce redundant preempt_disable()/preempt_enable() calls, but
this is not on a fastpath, so should be OK, and the improved readability
is certainly nice. The read and the write do need to happen on the same
CPU, FWIW.

> > This is not quite true on x86_64 and s390 anymore. __get_cpu_var() now
> > uses a segment selector override to get the local CPU variable on x86.
> > See x86's percpu.h for details.
>
> __get_cpu_var cannot use a segment override since there are places where
> the address of the variable is taken. One needs to use this_cpu_ops for
> that.

Thanks for the info!

Thanx, Paul

> > > True, but we could also argue that the multiple checks for being preempt
> > > can also be an issue.
> >
> > At least on x86 preemption don't actually need to be disabled: selection
> > of the right per-cpu memory location is done atomically with the rest of
> > the instruction by the segment selector.
>
> Right.

2011-02-23 19:15:21

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, Feb 23, 2011 at 11:06:01AM -0800, Paul E. McKenney wrote:
> On Wed, Feb 23, 2011 at 05:50:46PM +0100, Frederic Weisbecker wrote:
> > On Tue, Feb 22, 2011 at 05:39:40PM -0800, Paul E. McKenney wrote:
> > > +}
> > > +
> > > +/*
> > > + * Drop to non-real-time priority and yield, but only after posting a
> > > + * timer that will cause us to regain our real-time priority if we
> > > + * remain preempted. Either way, we restore our real-time priority
> > > + * before returning.
> > > + */
> > > +static void rcu_yield(int cpu)
> > > +{
> > > + struct rcu_data *rdp = per_cpu_ptr(rcu_sched_state.rda, cpu);
> > > + struct sched_param sp;
> > > + struct timer_list yield_timer;
> > > +
> > > + setup_timer(&yield_timer, rcu_cpu_kthread_timer, (unsigned long)rdp);
> > > + mod_timer(&yield_timer, jiffies + 2);
> > > + sp.sched_priority = 0;
> > > + sched_setscheduler_nocheck(current, SCHED_NORMAL, &sp);
> > > + schedule();
> > > + sp.sched_priority = RCU_KTHREAD_PRIO;
> > > + sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
> > > + del_timer(&yield_timer);
> > > +}
> > > +
> > > +/*
> > > + * Handle cases where the rcu_cpu_kthread() ends up on the wrong CPU.
> > > + * This can happen while the corresponding CPU is either coming online
> > > + * or going offline. We cannot wait until the CPU is fully online
> > > + * before starting the kthread, because the various notifier functions
> > > + * can wait for RCU grace periods. So we park rcu_cpu_kthread() until
> > > + * the corresponding CPU is online.
> > > + *
> > > + * Return 1 if the kthread needs to stop, 0 otherwise.
> > > + *
> > > + * Caller must disable bh. This function can momentarily enable it.
> > > + */
> > > +static int rcu_cpu_kthread_should_stop(int cpu)
> > > +{
> > > + while (cpu_is_offline(cpu) || smp_processor_id() != cpu) {
> > > + if (kthread_should_stop())
> > > + return 1;
> > > + local_bh_enable();
> > > + schedule_timeout_uninterruptible(1);
> >
> > Why is it uninterruptible? Well that doesn't change much anyway.
> > It can be a problem for long time sleeping kernel threads because of
> > the hung task detector, but certainly not for 1 jiffy.
>
> Yep, and the next patch does in fact change this to
> schedule_timeout_interruptible().
>
> Good eyes, though!
>
> Thanx, Paul

Ok.

Don't forget what I wrote below ;)


>
> > > + if (smp_processor_id() != cpu)
> > > + set_cpus_allowed_ptr(current, cpumask_of(cpu));
> > > + local_bh_disable();
> > > + }
> > > + return 0;
> > > +}
> > > +
> > > +/*
> > > + * Per-CPU kernel thread that invokes RCU callbacks. This replaces the
> > > + * earlier RCU softirq.
> > > + */
> > > +static int rcu_cpu_kthread(void *arg)
> > > +{
> > > + int cpu = (int)(long)arg;
> > > + unsigned long flags;
> > > + int spincnt = 0;
> > > + wait_queue_head_t *wqp = &per_cpu(rcu_cpu_wq, cpu);
> > > + char work;
> > > + char *workp = &per_cpu(rcu_cpu_has_work, cpu);
> > > +
> > > + for (;;) {
> > > + wait_event_interruptible(*wqp,
> > > + *workp != 0 || kthread_should_stop());
> > > + local_bh_disable();
> > > + if (rcu_cpu_kthread_should_stop(cpu)) {
> > > + local_bh_enable();
> > > + break;
> > > + }
> > > + local_irq_save(flags);
> > > + work = *workp;
> > > + *workp = 0;
> > > + local_irq_restore(flags);
> > > + if (work)
> > > + rcu_process_callbacks();
> > > + local_bh_enable();
> > > + if (*workp != 0)
> > > + spincnt++;
> > > + else
> > > + spincnt = 0;
> > > + if (spincnt > 10) {
> > > + rcu_yield(cpu);
> > > + spincnt = 0;
> > > + }
> > > + }
> > > + return 0;
> > > +}
> > > +
> > > +/*
> > > + * Per-rcu_node kthread, which is in charge of waking up the per-CPU
> > > + * kthreads when needed.
> > > + */
> > > +static int rcu_node_kthread(void *arg)
> > > +{
> > > + int cpu;
> > > + unsigned long flags;
> > > + unsigned long mask;
> > > + struct rcu_node *rnp = (struct rcu_node *)arg;
> > > + struct sched_param sp;
> > > + struct task_struct *t;
> > > +
> > > + for (;;) {
> > > + wait_event_interruptible(rnp->node_wq, rnp->wakemask != 0 ||
> > > + kthread_should_stop());
> > > + if (kthread_should_stop())
> > > + break;
> > > + raw_spin_lock_irqsave(&rnp->lock, flags);
> > > + mask = rnp->wakemask;
> > > + rnp->wakemask = 0;
> > > + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > > + for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1) {
> > > + if ((mask & 0x1) == 0)
> > > + continue;
> > > + preempt_disable();
> > > + per_cpu(rcu_cpu_has_work, cpu) = 1;
> > > + t = per_cpu(rcu_cpu_kthread_task, cpu);
> > > + if (t == NULL) {
> > > + preempt_enable();
> > > + continue;
> > > + }
> > > + sp.sched_priority = RCU_KTHREAD_PRIO;
> > > + sched_setscheduler_nocheck(t, cpu, &sp);
> > > + wake_up_process(t);
> >
> > My (mis?)understanding of the picture is this node kthread is there to
> > wake up cpu threads that called rcu_yield(). But actually rcu_yield()
> > doesn't make the cpu thread sleeping, instead it switches to SCHED_NORMAL,
> > to avoid starving the system with callbacks.
> >
> > So I wonder if this wake_up_process() is actually relevant.
> > sched_setscheduler_nocheck() already handles the per sched policy rq migration
> > and the process is not sleeping.
> >
> > That said, by the time the process may have gone to sleep, because if no other
> > SCHED_NORMAL task was there, it has just continued and may have flushed
> > every callbacks. So this wake_up_process() may actually wake up the task
> > but it will sleep again right away due to the condition in wait_event_interruptible()
> > of the cpu thread.
> >
> > Right?
> >
> > > + preempt_enable();
> > > + }
> > > + }
> > > + return 0;
> > > +}

2011-02-23 19:17:06

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, Feb 23, 2011 at 12:29:59PM -0600, Christoph Lameter wrote:
> On Wed, 23 Feb 2011, Steven Rostedt wrote:
>
> > On Wed, 2011-02-23 at 11:34 -0600, Christoph Lameter wrote:
> >
> > > > > True, but we could also argue that the multiple checks for being preempt
> > > > > can also be an issue.
> > > >
> > > > At least on x86 preemption don't actually need to be disabled: selection
> > > > of the right per-cpu memory location is done atomically with the rest of
> > > > the instruction by the segment selector.
> > >
> > > Right.
> >
> > But a test still needs to be made. Because three access of this_cpu_*()
> > that gets preempted and scheduled on another CPU can access a different
> > CPU var for each access. This does not matter how atomic the
> > this_cpu_*() code is.
>
> Right if the kthread context can be rescheduled then either preemption
> needs to be disabled to guarantee that all three access the same per cpu
> area data or the code needs to be changed in such a way that a this_cpu
> RMW instructions can do the mods in one go.

Reschedules can happen in this case due to CPU hotplug events. :-(

Thanx, Paul

Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, 23 Feb 2011, Steven Rostedt wrote:

> On Wed, 2011-02-23 at 12:29 -0600, Christoph Lameter wrote:
>
> > Right if the kthread context can be rescheduled then either preemption
> > needs to be disabled to guarantee that all three access the same per cpu
> > area data or the code needs to be changed in such a way that a this_cpu
> > RMW instructions can do the mods in one go.
>
> Are you suggesting a this_cpu_atomic_inc()?

this_cpu_inc is already percpu atomic. On x86 it is an instruction that
cannot be interrupted nor preempted while in progress.

There is also this_cpu_cmpxchg() which could be used to do some more
advanced tricks.
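
For instance, a sketch of such a trick (hypothetical per-cpu flag) that
claims a per-cpu slot exactly once without disabling preemption:

#include <linux/percpu.h>
#include <linux/types.h>

static DEFINE_PER_CPU(int, claimed);	/* hypothetical */

static bool try_claim_this_cpu(void)
{
	/* Atomically flips 0 -> 1 on whichever CPU we happen to be on. */
	return this_cpu_cmpxchg(claimed, 0, 1) == 0;
}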

Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, 23 Feb 2011, Paul E. McKenney wrote:

> These do introduce redundant preempt_disable()/preempt_enable() calls, but
> this is not on a fastpath, so should be OK, and the improved readability
> is certainly nice. The read and the write do need to happen on the same
> CPU, FWIW.

this_cpu_xxx only use preempt_enable/disable() on platforms that do not
support per cpu atomic instructions. On x86 no preempt enable/disable will
be inserted.

You can also use the __this_cpu_xxx operations which never add preempt
disable/enable because they expect the caller to deal with preemption.
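
For the fragment under discussion, which already runs with irqs disabled,
that might look like this sketch:

	local_irq_save(flags);
	if (__this_cpu_read(rcu_cpu_kthread_task) == NULL) {
		local_irq_restore(flags);
		return;
	}
	__this_cpu_write(rcu_cpu_has_work, 1);
	/* ... rest of the function unchanged ... */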

2011-02-23 19:24:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, 2011-02-23 at 13:19 -0600, Christoph Lameter wrote:

> this_cpu_inc is already percpu atomic. On x86 it is an instruction that
> cannot be interrupted nor preempted while in progress.

On x86, yes, but is this true for all architectures? I guess the
fallback is implemented with local_irq_disable() which might be good
enough for some but not for NMI usage.

Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, 23 Feb 2011, Paul E. McKenney wrote:

> Reschedules can happen in this case due to CPU hotplug events. :-(

Lockout cpu hotplug instead of disabling interrupts? Systems that support
cpu hotplug in hardware are rare, and so the cpu hotplug locking may
become a no-op.

2011-02-23 19:35:15

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, 2011-02-23 at 20:23 +0100, Peter Zijlstra wrote:
> On Wed, 2011-02-23 at 13:19 -0600, Christoph Lameter wrote:
>
> > this_cpu_inc is already percpu atomic. On x86 it is an instruction that
> > cannot be interrupted nor preempted while in progress.
>
> On x86, yes, but is this true for all architectures? I guess the
> fallback is implemented with local_irq_disable() which might be good
> enough for some but not for NMI usage.

IIRC, those archs that cannot do an atomic inc or cmpxchg have NMIs
that are basically "the sh*t hit the fan, bail", where the NMIs are used
only to dump out the state of the system and not for something more advanced
that can do things like profiling.

-- Steve

2011-02-23 19:39:41

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, Feb 23, 2011 at 01:22:02PM -0600, Christoph Lameter wrote:
> On Wed, 23 Feb 2011, Paul E. McKenney wrote:
>
> > These do introduce redundant preempt_disable()/preempt_enable() calls, but
> > this is not on a fastpath, so should be OK, and the improved readability
> > is certainly nice. The read and the write do need to happen on the same
> > CPU, FWIW.
>
> this_cpu_xxx only use preempt_enable/disable() on platforms that do not
> support per cpu atomic instructions. On x86 no preempt enable/disable will
> be inserted.
>
> You can also use the __this_cpu_xxx operations which never add preempt
> disable/enable because they expect the caller to deal with preemption.

Those make a lot of sense in this case, as I need to use __get_cpu_var()
anyway.

Thank you for all the info!

Thanx, Paul

Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, 23 Feb 2011, Peter Zijlstra wrote:

> On Wed, 2011-02-23 at 13:19 -0600, Christoph Lameter wrote:
>
> > this_cpu_inc is already percpu atomic. On x86 it is an instruction that
> > cannot be interrupted nor preempted while in progress.
>
> On x86, yes, but is this true for all architectures? I guess the
> fallback is implemented with local_irq_disable() which might be good
> enough for some but not for NMI usage.

Fallback is to do preempt_disable(). But you can also disable interrupts
on fallback by putting an irqsafe_ prefix before them. It's just more
expensive and typically not needed.

There is no way to create an NMI-safe per-cpu atomicity emulation as far
as I can tell. It is possible to check at compile time whether the arch has
per-cpu atomic instructions available via CONFIG_CMPXCHG_LOCAL, which
allows optimized per-cpu atomic use for arches that satisfy the safety
requirements.

2011-02-23 20:15:40

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, Feb 23, 2011 at 08:23:50PM +0100, Peter Zijlstra wrote:
> On Wed, 2011-02-23 at 13:19 -0600, Christoph Lameter wrote:
>
> > this_cpu_inc is already percpu atomic. On x86 it is an instruction that
> > cannot be interrupted nor preempted while in progress.
>
> On x86, yes, but is this true for all architectures? I guess the
> fallback is implemented with local_irq_disable() which might be good
> enough for some but not for NMI usage.

Good point! Thankfully, no worries about NMI usage in this piece of
code. ;-)

Thanx, Paul

2011-02-23 20:41:21

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, Feb 23, 2011 at 08:13:35PM +0100, Frederic Weisbecker wrote:
> On Wed, Feb 23, 2011 at 11:06:01AM -0800, Paul E. McKenney wrote:
> > On Wed, Feb 23, 2011 at 05:50:46PM +0100, Frederic Weisbecker wrote:
> > > On Tue, Feb 22, 2011 at 05:39:40PM -0800, Paul E. McKenney wrote:
> > > > +}
> > > > +
> > > > +/*
> > > > + * Drop to non-real-time priority and yield, but only after posting a
> > > > + * timer that will cause us to regain our real-time priority if we
> > > > + * remain preempted. Either way, we restore our real-time priority
> > > > + * before returning.
> > > > + */
> > > > +static void rcu_yield(int cpu)
> > > > +{
> > > > + struct rcu_data *rdp = per_cpu_ptr(rcu_sched_state.rda, cpu);
> > > > + struct sched_param sp;
> > > > + struct timer_list yield_timer;
> > > > +
> > > > + setup_timer(&yield_timer, rcu_cpu_kthread_timer, (unsigned long)rdp);
> > > > + mod_timer(&yield_timer, jiffies + 2);
> > > > + sp.sched_priority = 0;
> > > > + sched_setscheduler_nocheck(current, SCHED_NORMAL, &sp);
> > > > + schedule();
> > > > + sp.sched_priority = RCU_KTHREAD_PRIO;
> > > > + sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
> > > > + del_timer(&yield_timer);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Handle cases where the rcu_cpu_kthread() ends up on the wrong CPU.
> > > > + * This can happen while the corresponding CPU is either coming online
> > > > + * or going offline. We cannot wait until the CPU is fully online
> > > > + * before starting the kthread, because the various notifier functions
> > > > + * can wait for RCU grace periods. So we park rcu_cpu_kthread() until
> > > > + * the corresponding CPU is online.
> > > > + *
> > > > + * Return 1 if the kthread needs to stop, 0 otherwise.
> > > > + *
> > > > + * Caller must disable bh. This function can momentarily enable it.
> > > > + */
> > > > +static int rcu_cpu_kthread_should_stop(int cpu)
> > > > +{
> > > > + while (cpu_is_offline(cpu) || smp_processor_id() != cpu) {
> > > > + if (kthread_should_stop())
> > > > + return 1;
> > > > + local_bh_enable();
> > > > + schedule_timeout_uninterruptible(1);
> > >
> > > Why is it uninterruptible? Well that doesn't change much anyway.
> > > It can be a problem for long time sleeping kernel threads because of
> > > the hung task detector, but certainly not for 1 jiffy.
> >
> > Yep, and the next patch does in fact change this to
> > schedule_timeout_interruptible().
> >
> > Good eyes, though!
> >
> > Thanx, Paul
>
> Ok.
>
> Don't forget what I wrote below ;)

But... But... I -already- forgot about it!!!

Thank you for the reminder. ;-)

> > > > + if (smp_processor_id() != cpu)
> > > > + set_cpus_allowed_ptr(current, cpumask_of(cpu));
> > > > + local_bh_disable();
> > > > + }
> > > > + return 0;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Per-CPU kernel thread that invokes RCU callbacks. This replaces the
> > > > + * earlier RCU softirq.
> > > > + */
> > > > +static int rcu_cpu_kthread(void *arg)
> > > > +{
> > > > + int cpu = (int)(long)arg;
> > > > + unsigned long flags;
> > > > + int spincnt = 0;
> > > > + wait_queue_head_t *wqp = &per_cpu(rcu_cpu_wq, cpu);
> > > > + char work;
> > > > + char *workp = &per_cpu(rcu_cpu_has_work, cpu);
> > > > +
> > > > + for (;;) {
> > > > + wait_event_interruptible(*wqp,
> > > > + *workp != 0 || kthread_should_stop());
> > > > + local_bh_disable();
> > > > + if (rcu_cpu_kthread_should_stop(cpu)) {
> > > > + local_bh_enable();
> > > > + break;
> > > > + }
> > > > + local_irq_save(flags);
> > > > + work = *workp;
> > > > + *workp = 0;
> > > > + local_irq_restore(flags);
> > > > + if (work)
> > > > + rcu_process_callbacks();
> > > > + local_bh_enable();
> > > > + if (*workp != 0)
> > > > + spincnt++;
> > > > + else
> > > > + spincnt = 0;
> > > > + if (spincnt > 10) {
> > > > + rcu_yield(cpu);
> > > > + spincnt = 0;
> > > > + }
> > > > + }
> > > > + return 0;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Per-rcu_node kthread, which is in charge of waking up the per-CPU
> > > > + * kthreads when needed.
> > > > + */
> > > > +static int rcu_node_kthread(void *arg)
> > > > +{
> > > > + int cpu;
> > > > + unsigned long flags;
> > > > + unsigned long mask;
> > > > + struct rcu_node *rnp = (struct rcu_node *)arg;
> > > > + struct sched_param sp;
> > > > + struct task_struct *t;
> > > > +
> > > > + for (;;) {
> > > > + wait_event_interruptible(rnp->node_wq, rnp->wakemask != 0 ||
> > > > + kthread_should_stop());
> > > > + if (kthread_should_stop())
> > > > + break;
> > > > + raw_spin_lock_irqsave(&rnp->lock, flags);
> > > > + mask = rnp->wakemask;
> > > > + rnp->wakemask = 0;
> > > > + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > > > + for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1) {
> > > > + if ((mask & 0x1) == 0)
> > > > + continue;
> > > > + preempt_disable();
> > > > + per_cpu(rcu_cpu_has_work, cpu) = 1;
> > > > + t = per_cpu(rcu_cpu_kthread_task, cpu);
> > > > + if (t == NULL) {
> > > > + preempt_enable();
> > > > + continue;
> > > > + }
> > > > + sp.sched_priority = RCU_KTHREAD_PRIO;
> > > > + sched_setscheduler_nocheck(t, cpu, &sp);
> > > > + wake_up_process(t);
> > >
> > > My (mis?)understanding of the picture is this node kthread is there to
> > > wake up cpu threads that called rcu_yield(). But actually rcu_yield()
> > > doesn't make the cpu thread sleeping, instead it switches to SCHED_NORMAL,
> > > to avoid starving the system with callbacks.

Indeed. My original plan was to make the per-CPU kthreads do RCU priority
boosting, but this turned out to be a non-starter. I apparently failed
to make all the necessary adjustments when backing away from this plan.

> > > So I wonder if this wake_up_process() is actually relevant.
> > > sched_setscheduler_nocheck() already handles the per sched policy rq migration
> > > and the process is not sleeping.
> > >
> > > That said, by the time the process may have gone to sleep, because if no other
> > > SCHED_NORMAL task was there, it has just continued and may have flushed
> > > every callbacks. So this wake_up_process() may actually wake up the task
> > > but it will sleep again right away due to the condition in wait_event_interruptible()
> > > of the cpu thread.

But in that case, someone would have called invoke_rcu_cpu_kthread(),
which would do its own wake_up().

> > > Right?

Yep, removed the redundant wake_up_process(). Thank you!

Hmmm... And that second argument to sched_setscheduler_nocheck() is
bogus as well... Should be SCHED_FIFO.
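
A sketch of how the quoted wakeup path might read with both fixes applied
(redundant wake-up dropped, policy argument corrected):

			sp.sched_priority = RCU_KTHREAD_PRIO;
			sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
			/* No wake_up_process(t) here: invoke_rcu_cpu_kthread()
			 * already wakes the per-CPU kthread when queuing work. */
			preempt_enable();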

Thanx, Paul

> > > > + preempt_enable();
> > > > + }
> > > > + }
> > > > + return 0;
> > > > +}

2011-02-23 20:45:38

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Wed, Feb 23, 2011 at 01:24:34PM -0600, Christoph Lameter wrote:
> On Wed, 23 Feb 2011, Paul E. McKenney wrote:
>
> > Reschedules can happen in this case due to CPU hotplug events. :-(
>
> Lockout cpu hotplug instead of disabling interrupts? Systems that support
> cpu hotplug in hardware are rare and so the cpu hotplug locking may be
> become noop.

Good point!

However, the local_irq_save() is doing two things here: (1) blocking
CPU hotplug and (2) synchronizing with concurrent invocations of
invoke_rcu_cpu_kthread() from irq handlers.

Thanx, Paul

2011-02-25 08:29:45

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 01/11] rcu: call __rcu_read_unlock() in exit_rcu for tiny RCU

On 02/23/2011 09:39 AM, Paul E. McKenney wrote:
>
> diff --git a/kernel/rcutiny_plugin.h b/kernel/rcutiny_plugin.h
> index 015abae..3cb8e36 100644
> --- a/kernel/rcutiny_plugin.h
> +++ b/kernel/rcutiny_plugin.h

My original patch was just for tree RCU. It is not merged.
I forgot the tiny RCU case then; this change is needed for tree & tiny RCU.

Thanks,
Lai

2011-02-25 08:29:34

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On 02/23/2011 09:39 AM, Paul E. McKenney wrote:
> From: Paul E. McKenney <[email protected]>
>
> If RCU priority boosting is to be meaningful, callback invocation must
> be boosted in addition to preempted RCU readers. Otherwise, in presence
> of CPU real-time threads, the grace period ends, but the callbacks don't
> get invoked. If the callbacks don't get invoked, the associated memory
> doesn't get freed, so the system is still subject to OOM.
>
> But it is not reasonable to priority-boost RCU_SOFTIRQ, so this commit
> moves the callback invocations to a kthread, which can be boosted easily.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> include/linux/interrupt.h | 1 -
> include/trace/events/irq.h | 3 +-
> kernel/rcutree.c | 324 ++++++++++++++++++++++++++++++++++-
> kernel/rcutree.h | 8 +
> kernel/rcutree_plugin.h | 4 +-
> tools/perf/util/trace-event-parse.c | 1 -
> 6 files changed, 331 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index 79d0c4f..ed47deb 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -385,7 +385,6 @@ enum
> TASKLET_SOFTIRQ,
> SCHED_SOFTIRQ,
> HRTIMER_SOFTIRQ,
> - RCU_SOFTIRQ, /* Preferable RCU should always be the last softirq */
>
> NR_SOFTIRQS
> };
> diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h
> index 1c09820..ae045ca 100644
> --- a/include/trace/events/irq.h
> +++ b/include/trace/events/irq.h
> @@ -20,8 +20,7 @@ struct softirq_action;
> softirq_name(BLOCK_IOPOLL), \
> softirq_name(TASKLET), \
> softirq_name(SCHED), \
> - softirq_name(HRTIMER), \
> - softirq_name(RCU))
> + softirq_name(HRTIMER))
>
> /**
> * irq_handler_entry - called immediately before the irq action handler
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 0ac1cc0..2241f28 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -47,6 +47,8 @@
> #include <linux/mutex.h>
> #include <linux/time.h>
> #include <linux/kernel_stat.h>
> +#include <linux/wait.h>
> +#include <linux/kthread.h>
>
> #include "rcutree.h"
>
> @@ -82,6 +84,18 @@ DEFINE_PER_CPU(struct rcu_data, rcu_bh_data);
> int rcu_scheduler_active __read_mostly;
> EXPORT_SYMBOL_GPL(rcu_scheduler_active);
>
> +/* Control variables for per-CPU and per-rcu_node kthreads. */

I think "per-leaf-rcu_node" is better. It seems that only the leaf rcu_node
of rcu_sched are used for the rcu_node kthreads, and they also serve the
other rcu domains (rcu_bh, rcu_preempt)? I think we need to add some
comments for it.

> +/*
> + * Timer handler to initiate the waking up of per-CPU kthreads that
> + * have yielded the CPU due to excess numbers of RCU callbacks.
> + */
> +static void rcu_cpu_kthread_timer(unsigned long arg)
> +{
> + unsigned long flags;
> + struct rcu_data *rdp = (struct rcu_data *)arg;
> + struct rcu_node *rnp = rdp->mynode;
> + struct task_struct *t;
> +
> + raw_spin_lock_irqsave(&rnp->lock, flags);
> + rnp->wakemask |= rdp->grpmask;

I think there is no reason that the rnp->lock also protects the
rnp->node_kthread_task. "raw_spin_unlock_irqrestore(&rnp->lock, flags);"
can be moved up here.

> + t = rnp->node_kthread_task;
> + if (t == NULL) {
> + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> + return;
> + }
> + wake_up_process(t);
> + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +}
> +
> +/*
> + * Drop to non-real-time priority and yield, but only after posting a
> + * timer that will cause us to regain our real-time priority if we
> + * remain preempted. Either way, we restore our real-time priority
> + * before returning.
> + */
> +static void rcu_yield(int cpu)
> +{
> + struct rcu_data *rdp = per_cpu_ptr(rcu_sched_state.rda, cpu);
> + struct sched_param sp;
> + struct timer_list yield_timer;
> +
> + setup_timer(&yield_timer, rcu_cpu_kthread_timer, (unsigned long)rdp);
> + mod_timer(&yield_timer, jiffies + 2);
> + sp.sched_priority = 0;
> + sched_setscheduler_nocheck(current, SCHED_NORMAL, &sp);
> + schedule();
> + sp.sched_priority = RCU_KTHREAD_PRIO;
> + sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
> + del_timer(&yield_timer);
> +}
> +
> +/*
> + * Handle cases where the rcu_cpu_kthread() ends up on the wrong CPU.
> + * This can happen while the corresponding CPU is either coming online
> + * or going offline. We cannot wait until the CPU is fully online
> + * before starting the kthread, because the various notifier functions
> + * can wait for RCU grace periods. So we park rcu_cpu_kthread() until
> + * the corresponding CPU is online.
> + *
> + * Return 1 if the kthread needs to stop, 0 otherwise.
> + *
> + * Caller must disable bh. This function can momentarily enable it.
> + */
> +static int rcu_cpu_kthread_should_stop(int cpu)
> +{
> + while (cpu_is_offline(cpu) || smp_processor_id() != cpu) {
> + if (kthread_should_stop())
> + return 1;
> + local_bh_enable();
> + schedule_timeout_uninterruptible(1);
> + if (smp_processor_id() != cpu)
> + set_cpus_allowed_ptr(current, cpumask_of(cpu));

The current task is PF_THREAD_BOUND,
Why do "set_cpus_allowed_ptr(current, cpumask_of(cpu));" ?

> + local_bh_disable();
> + }
> + return 0;
> +}
> +
> +/*
> + * Per-CPU kernel thread that invokes RCU callbacks. This replaces the
> + * earlier RCU softirq.
> + */
> +static int rcu_cpu_kthread(void *arg)
> +{
> + int cpu = (int)(long)arg;
> + unsigned long flags;
> + int spincnt = 0;
> + wait_queue_head_t *wqp = &per_cpu(rcu_cpu_wq, cpu);
> + char work;
> + char *workp = &per_cpu(rcu_cpu_has_work, cpu);
> +
> + for (;;) {
> + wait_event_interruptible(*wqp,
> + *workp != 0 || kthread_should_stop());
> + local_bh_disable();
> + if (rcu_cpu_kthread_should_stop(cpu)) {
> + local_bh_enable();
> + break;
> + }
> + local_irq_save(flags);
> + work = *workp;
> + *workp = 0;
> + local_irq_restore(flags);
> + if (work)
> + rcu_process_callbacks();
> + local_bh_enable();
> + if (*workp != 0)
> + spincnt++;
> + else
> + spincnt = 0;
> + if (spincnt > 10) {

"10" is a magic number here.

> + rcu_yield(cpu);
> + spincnt = 0;
> + }
> + }
> + return 0;
> +}
> +


> +/*
> + * Per-rcu_node kthread, which is in charge of waking up the per-CPU
> + * kthreads when needed.
> + */
> +static int rcu_node_kthread(void *arg)
> +{
> + int cpu;
> + unsigned long flags;
> + unsigned long mask;
> + struct rcu_node *rnp = (struct rcu_node *)arg;
> + struct sched_param sp;
> + struct task_struct *t;
> +
> + for (;;) {
> + wait_event_interruptible(rnp->node_wq, rnp->wakemask != 0 ||
> + kthread_should_stop());
> + if (kthread_should_stop())
> + break;
> + raw_spin_lock_irqsave(&rnp->lock, flags);
> + mask = rnp->wakemask;
> + rnp->wakemask = 0;
> + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> + for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1) {
> + if ((mask & 0x1) == 0)
> + continue;
> + preempt_disable();
> + per_cpu(rcu_cpu_has_work, cpu) = 1;
> + t = per_cpu(rcu_cpu_kthread_task, cpu);
> + if (t == NULL) {
> + preempt_enable();
> + continue;
> + }

Obviously preempt_disable() is not for protecting remote percpu data.
Is it for disabling cpu hotplug? I am afraid that @t may go away
and become invalid.

2011-02-25 19:40:15

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 01/11] rcu: call __rcu_read_unlock() in exit_rcu for tiny RCU

On Fri, Feb 25, 2011 at 04:29:46PM +0800, Lai Jiangshan wrote:
> On 02/23/2011 09:39 AM, Paul E. McKenney wrote:
> >
> > diff --git a/kernel/rcutiny_plugin.h b/kernel/rcutiny_plugin.h
> > index 015abae..3cb8e36 100644
> > --- a/kernel/rcutiny_plugin.h
> > +++ b/kernel/rcutiny_plugin.h
>
> My original patch was just for tree RCU. It is not merged.
> I forgot the tiny RCU case then, this change is needed for tree & tiny RCU.

Good catch, I have applied the TREE_RCU patch. I intend to submit
both for the next merge window.

Thank you for checking!

Thanx, Paul

2011-02-25 20:32:29

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Fri, Feb 25, 2011 at 04:17:58PM +0800, Lai Jiangshan wrote:
> On 02/23/2011 09:39 AM, Paul E. McKenney wrote:
> > From: Paul E. McKenney <[email protected]>
> >
> > If RCU priority boosting is to be meaningful, callback invocation must
> > be boosted in addition to preempted RCU readers. Otherwise, in presence
> > of CPU real-time threads, the grace period ends, but the callbacks don't
> > get invoked. If the callbacks don't get invoked, the associated memory
> > doesn't get freed, so the system is still subject to OOM.
> >
> > But it is not reasonable to priority-boost RCU_SOFTIRQ, so this commit
> > moves the callback invocations to a kthread, which can be boosted easily.
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > include/linux/interrupt.h | 1 -
> > include/trace/events/irq.h | 3 +-
> > kernel/rcutree.c | 324 ++++++++++++++++++++++++++++++++++-
> > kernel/rcutree.h | 8 +
> > kernel/rcutree_plugin.h | 4 +-
> > tools/perf/util/trace-event-parse.c | 1 -
> > 6 files changed, 331 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> > index 79d0c4f..ed47deb 100644
> > --- a/include/linux/interrupt.h
> > +++ b/include/linux/interrupt.h
> > @@ -385,7 +385,6 @@ enum
> > TASKLET_SOFTIRQ,
> > SCHED_SOFTIRQ,
> > HRTIMER_SOFTIRQ,
> > - RCU_SOFTIRQ, /* Preferable RCU should always be the last softirq */
> >
> > NR_SOFTIRQS
> > };
> > diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h
> > index 1c09820..ae045ca 100644
> > --- a/include/trace/events/irq.h
> > +++ b/include/trace/events/irq.h
> > @@ -20,8 +20,7 @@ struct softirq_action;
> > softirq_name(BLOCK_IOPOLL), \
> > softirq_name(TASKLET), \
> > softirq_name(SCHED), \
> > - softirq_name(HRTIMER), \
> > - softirq_name(RCU))
> > + softirq_name(HRTIMER))
> >
> > /**
> > * irq_handler_entry - called immediately before the irq action handler
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index 0ac1cc0..2241f28 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -47,6 +47,8 @@
> > #include <linux/mutex.h>
> > #include <linux/time.h>
> > #include <linux/kernel_stat.h>
> > +#include <linux/wait.h>
> > +#include <linux/kthread.h>
> >
> > #include "rcutree.h"
> >
> > @@ -82,6 +84,18 @@ DEFINE_PER_CPU(struct rcu_data, rcu_bh_data);
> > int rcu_scheduler_active __read_mostly;
> > EXPORT_SYMBOL_GPL(rcu_scheduler_active);
> >
> > +/* Control variables for per-CPU and per-rcu_node kthreads. */
>
> I think "per-leaf-rcu_node" is better. It seems that only the leaf rcu_node
> of rcu_sched are used for rcu_node kthreads and they also serve for
> other rcu domains(rcu_bh, rcu_preempt)? I think we need to add some
> comments for it.

There is a per-root_rcu_node kthread that is added with priority boosting.

Good point on the scope of the kthreads. I have changed the above
comment to read:

/*
* Control variables for per-CPU and per-rcu_node kthreads. These
* handle all flavors of RCU.
*/

Seem reasonable?

> > +/*
> > + * Timer handler to initiate the waking up of per-CPU kthreads that
> > + * have yielded the CPU due to excess numbers of RCU callbacks.
> > + */
> > +static void rcu_cpu_kthread_timer(unsigned long arg)
> > +{
> > + unsigned long flags;
> > + struct rcu_data *rdp = (struct rcu_data *)arg;
> > + struct rcu_node *rnp = rdp->mynode;
> > + struct task_struct *t;
> > +
> > + raw_spin_lock_irqsave(&rnp->lock, flags);
> > + rnp->wakemask |= rdp->grpmask;
>
> I think there is no reason that the rnp->lock also protects the
> rnp->node_kthread_task. "raw_spin_unlock_irqrestore(&rnp->lock, flags);"
> can be moved up here.

If I am not too confused, the lock needs to cover the statements below
in order to correctly handle races with concurrent CPU-hotplug operations.

> > + t = rnp->node_kthread_task;
> > + if (t == NULL) {
> > + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > + return;
> > + }
> > + wake_up_process(t);
> > + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > +}
> > +
> > +/*
> > + * Drop to non-real-time priority and yield, but only after posting a
> > + * timer that will cause us to regain our real-time priority if we
> > + * remain preempted. Either way, we restore our real-time priority
> > + * before returning.
> > + */
> > +static void rcu_yield(int cpu)
> > +{
> > + struct rcu_data *rdp = per_cpu_ptr(rcu_sched_state.rda, cpu);
> > + struct sched_param sp;
> > + struct timer_list yield_timer;
> > +
> > + setup_timer(&yield_timer, rcu_cpu_kthread_timer, (unsigned long)rdp);
> > + mod_timer(&yield_timer, jiffies + 2);
> > + sp.sched_priority = 0;
> > + sched_setscheduler_nocheck(current, SCHED_NORMAL, &sp);
> > + schedule();
> > + sp.sched_priority = RCU_KTHREAD_PRIO;
> > + sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
> > + del_timer(&yield_timer);
> > +}
> > +
> > +/*
> > + * Handle cases where the rcu_cpu_kthread() ends up on the wrong CPU.
> > + * This can happen while the corresponding CPU is either coming online
> > + * or going offline. We cannot wait until the CPU is fully online
> > + * before starting the kthread, because the various notifier functions
> > + * can wait for RCU grace periods. So we park rcu_cpu_kthread() until
> > + * the corresponding CPU is online.
> > + *
> > + * Return 1 if the kthread needs to stop, 0 otherwise.
> > + *
> > + * Caller must disable bh. This function can momentarily enable it.
> > + */
> > +static int rcu_cpu_kthread_should_stop(int cpu)
> > +{
> > + while (cpu_is_offline(cpu) || smp_processor_id() != cpu) {
> > + if (kthread_should_stop())
> > + return 1;
> > + local_bh_enable();
> > + schedule_timeout_uninterruptible(1);
> > + if (smp_processor_id() != cpu)
> > + set_cpus_allowed_ptr(current, cpumask_of(cpu));
>
> The current task is PF_THREAD_BOUND,
> Why do "set_cpus_allowed_ptr(current, cpumask_of(cpu));" ?

Because I have seen CPU hotplug operations unbind PF_THREAD_BOUND threads.
In addition, I end up having to spawn the kthread at CPU_UP_PREPARE time,
at which point the thread must run unbound because its CPU isn't online
yet. I cannot invoke kthread_create() within the stop-machine handler
(right?). I cannot wait until CPU_ONLINE time because that results in
hangs when other CPU notifiers wait for grace periods.

Yes, I did find out about the hangs the hard way. Why do you ask? ;-)

Please feel free to suggest improvements in the header comment above
for rcu_cpu_kthread_should_stop(), which is my apparently insufficient
attempt to explain this.

> > + local_bh_disable();
> > + }
> > + return 0;
> > +}
> > +
> > +/*
> > + * Per-CPU kernel thread that invokes RCU callbacks. This replaces the
> > + * earlier RCU softirq.
> > + */
> > +static int rcu_cpu_kthread(void *arg)
> > +{
> > + int cpu = (int)(long)arg;
> > + unsigned long flags;
> > + int spincnt = 0;
> > + wait_queue_head_t *wqp = &per_cpu(rcu_cpu_wq, cpu);
> > + char work;
> > + char *workp = &per_cpu(rcu_cpu_has_work, cpu);
> > +
> > + for (;;) {
> > + wait_event_interruptible(*wqp,
> > + *workp != 0 || kthread_should_stop());
> > + local_bh_disable();
> > + if (rcu_cpu_kthread_should_stop(cpu)) {
> > + local_bh_enable();
> > + break;
> > + }
> > + local_irq_save(flags);
> > + work = *workp;
> > + *workp = 0;
> > + local_irq_restore(flags);
> > + if (work)
> > + rcu_process_callbacks();
> > + local_bh_enable();
> > + if (*workp != 0)
> > + spincnt++;
> > + else
> > + spincnt = 0;
> > + if (spincnt > 10) {
>
> "10" is a magic number here.

It is indeed. Suggestions for a cpp macro name to hide it behind?
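
For example (the name is entirely open to bikeshedding):

/*
 * Hypothetical name: consecutive passes through the rcu_cpu_kthread()
 * loop with work still pending before we yield the CPU.
 */
#define RCU_KTHREAD_MAX_SPINS 10

The test above would then read "if (spincnt > RCU_KTHREAD_MAX_SPINS)".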

> > + rcu_yield(cpu);
> > + spincnt = 0;
> > + }
> > + }
> > + return 0;
> > +}
> > +
>
>
> > +/*
> > + * Per-rcu_node kthread, which is in charge of waking up the per-CPU
> > + * kthreads when needed.
> > + */
> > +static int rcu_node_kthread(void *arg)
> > +{
> > + int cpu;
> > + unsigned long flags;
> > + unsigned long mask;
> > + struct rcu_node *rnp = (struct rcu_node *)arg;
> > + struct sched_param sp;
> > + struct task_struct *t;
> > +
> > + for (;;) {
> > + wait_event_interruptible(rnp->node_wq, rnp->wakemask != 0 ||
> > + kthread_should_stop());
> > + if (kthread_should_stop())
> > + break;
> > + raw_spin_lock_irqsave(&rnp->lock, flags);
> > + mask = rnp->wakemask;
> > + rnp->wakemask = 0;
> > + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > + for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1) {
> > + if ((mask & 0x1) == 0)
> > + continue;
> > + preempt_disable();
> > + per_cpu(rcu_cpu_has_work, cpu) = 1;
> > + t = per_cpu(rcu_cpu_kthread_task, cpu);
> > + if (t == NULL) {
> > + preempt_enable();
> > + continue;
> > + }
>
> Obviously preempt_disable() is not for protecting remote percpu data.
> Is it for disabling cpu hotplug? I am afraid the @t may leave
> and become invalid.

Indeed, acquiring the rnp->lock is safer, except that I don't trust
calling sched_setscheduler_nocheck() in that state. So I need to check
for the CPU being online after the preempt_disable(). This means that
I ignore requests to do work after CPU_DYING time, but that is OK because
force_quiescent_state() will figure out that the CPU is in fact offline.
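
That is, roughly the following shape for that part of the loop (a sketch
only; the rest of the loop body is unchanged):

		preempt_disable();
		per_cpu(rcu_cpu_has_work, cpu) = 1;
		t = per_cpu(rcu_cpu_kthread_task, cpu);
		if (t == NULL || !cpu_online(cpu)) {
			/*
			 * The CPU (or its kthread) is gone or not yet up;
			 * force_quiescent_state() will handle an offline
			 * CPU, so just skip the wakeup.
			 */
			preempt_enable();
			continue;
		}
		/* ... wake up and boost t as before ... */
		preempt_enable();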

Make sense?

In any case, good catch!!!

Thanx, Paul

2011-02-28 03:28:41

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On 02/26/2011 04:32 AM, Paul E. McKenney wrote:
>>> +/*
>>> + * Handle cases where the rcu_cpu_kthread() ends up on the wrong CPU.
>>> + * This can happen while the corresponding CPU is either coming online
>>> + * or going offline. We cannot wait until the CPU is fully online
>>> + * before starting the kthread, because the various notifier functions
>>> + * can wait for RCU grace periods. So we park rcu_cpu_kthread() until
>>> + * the corresponding CPU is online.
>>> + *
>>> + * Return 1 if the kthread needs to stop, 0 otherwise.
>>> + *
>>> + * Caller must disable bh. This function can momentarily enable it.
>>> + */
>>> +static int rcu_cpu_kthread_should_stop(int cpu)
>>> +{
>>> + while (cpu_is_offline(cpu) || smp_processor_id() != cpu) {
>>> + if (kthread_should_stop())
>>> + return 1;
>>> + local_bh_enable();
>>> + schedule_timeout_uninterruptible(1);
>>> + if (smp_processor_id() != cpu)
>>> + set_cpus_allowed_ptr(current, cpumask_of(cpu));
>>
>> The current task is PF_THREAD_BOUND,
>> Why do "set_cpus_allowed_ptr(current, cpumask_of(cpu));" ?
>
> Because I have seen CPU hotplug operations unbind PF_THREAD_BOUND threads.
> In addition, I end up having to spawn the kthread at CPU_UP_PREPARE time,
> at which point the thread must run unbound because its CPU isn't online
> yet. I cannot invoke kthread_create() within the stop-machine handler
> (right?). I cannot wait until CPU_ONLINE time because that results in
> hangs when other CPU notifiers wait for grace periods.
>
> Yes, I did find out about the hangs the hard way. Why do you ask? ;-)

The current task is PF_THREAD_BOUND, so "set_cpus_allowed_ptr(current, cpumask_of(cpu))"
will do nothing even if it runs on the wrong CPU.

If the task runs on the wrong CPU, we have no API to force/migrate the task
to the bound CPU when the CPU comes online. But wake_up_process() has
a side effect that it will move a sleeping task to the correct online CPU.
"schedule_timeout_uninterruptible(1);" will call
wake_up_process() on timeout, so it will do everything you need.

But "set_cpus_allowed_ptr(current, cpumask_of(cpu));" will do nothing.

The code is a little nasty, I think. The solution I would prefer:
give rcu_cpu_notify a suitable priority, and wake up the kthread
in the notifier.
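
Something like this direction (untested sketch; the wakeup placement and
the exact priority are only illustrative):

static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
				    unsigned long action, void *hcpu)
{
	long cpu = (long)hcpu;
	struct task_struct *t;

	switch (action) {
	case CPU_ONLINE:
	case CPU_ONLINE_FROZEN:
		/* The CPU is online now, so its kthread can run bound. */
		t = per_cpu(rcu_cpu_kthread_task, cpu);
		if (t != NULL)
			wake_up_process(t);
		break;
	/* ... existing cases unchanged ... */
	}
	return NOTIFY_OK;
}

registered with a .priority high enough that it runs before the notifiers
that wait for grace periods.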

Steven, any suggestions? I know very little about the scheduler.

>
> Please feel free to suggest improvements in the header comment above
> for rcu_cpu_kthread_should_stop(), which is my apparently insufficient
> attempt to explain this.
>
>>> + local_bh_disable();
>>> + }
>>> + return 0;
>>> +}
>>> +
>>> +/*
>>> + * Per-CPU kernel thread that invokes RCU callbacks. This replaces the
>>> + * earlier RCU softirq.
>>> + */
>>> +static int rcu_cpu_kthread(void *arg)
>>> +{
>>> + int cpu = (int)(long)arg;
>>> + unsigned long flags;
>>> + int spincnt = 0;
>>> + wait_queue_head_t *wqp = &per_cpu(rcu_cpu_wq, cpu);
>>> + char work;
>>> + char *workp = &per_cpu(rcu_cpu_has_work, cpu);
>>> +
>>> + for (;;) {
>>> + wait_event_interruptible(*wqp,
>>> + *workp != 0 || kthread_should_stop());
>>> + local_bh_disable();
>>> + if (rcu_cpu_kthread_should_stop(cpu)) {
>>> + local_bh_enable();
>>> + break;
>>> + }
>>> + local_irq_save(flags);
>>> + work = *workp;
>>> + *workp = 0;
>>> + local_irq_restore(flags);
>>> + if (work)
>>> + rcu_process_callbacks();
>>> + local_bh_enable();
>>> + if (*workp != 0)
>>> + spincnt++;
>>> + else
>>> + spincnt = 0;
>>> + if (spincnt > 10) {
>>
>> "10" is a magic number here.
>
> It is indeed. Suggestions for a cpp macro name to hide it behind?
>
>>> + rcu_yield(cpu);
>>> + spincnt = 0;
>>> + }
>>> + }
>>> + return 0;
>>> +}
>>> +
>>
>>
>>> +/*
>>> + * Per-rcu_node kthread, which is in charge of waking up the per-CPU
>>> + * kthreads when needed.
>>> + */
>>> +static int rcu_node_kthread(void *arg)
>>> +{
>>> + int cpu;
>>> + unsigned long flags;
>>> + unsigned long mask;
>>> + struct rcu_node *rnp = (struct rcu_node *)arg;
>>> + struct sched_param sp;
>>> + struct task_struct *t;
>>> +
>>> + for (;;) {
>>> + wait_event_interruptible(rnp->node_wq, rnp->wakemask != 0 ||
>>> + kthread_should_stop());
>>> + if (kthread_should_stop())
>>> + break;
>>> + raw_spin_lock_irqsave(&rnp->lock, flags);
>>> + mask = rnp->wakemask;
>>> + rnp->wakemask = 0;
>>> + raw_spin_unlock_irqrestore(&rnp->lock, flags);
>>> + for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1) {
>>> + if ((mask & 0x1) == 0)
>>> + continue;
>>> + preempt_disable();
>>> + per_cpu(rcu_cpu_has_work, cpu) = 1;
>>> + t = per_cpu(rcu_cpu_kthread_task, cpu);
>>> + if (t == NULL) {
>>> + preempt_enable();
>>> + continue;
>>> + }
>>
>> Obviously preempt_disable() is not for protecting remote percpu data.
>> Is it for disabling cpu hotplug? I am afraid the @t may leave
>> and become invalid.
>
> Indeed, acquiring the rnp->lock is safer, except that I don't trust
> calling sched_setscheduler_nocheck() in that state. So I need to check
> for the CPU being online after the preempt_disable(). This means that
> I ignore requests to do work after CPU_DYING time, but that is OK because
> force_quiescent_state() will figure out that the CPU is in fact offline.
>
> Make sense?
>

Yes.

Another:

#if CONFIG_HOTPLUG_CPU
get_task_struct() when set bit in wakemask
put_task_struct() when clear bit in wakemask
#endif
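
That is, roughly (untested; the helper names are made up for illustration):

#ifdef CONFIG_HOTPLUG_CPU
/* Pin cpu's kthread while its bit is set in rnp->wakemask. */
static void rcu_cpu_kthread_pin(int cpu)
{
	get_task_struct(per_cpu(rcu_cpu_kthread_task, cpu));
}

/* Drop the reference once rcu_node_kthread() has consumed the bit. */
static void rcu_cpu_kthread_unpin(struct task_struct *t)
{
	put_task_struct(t);
}
#else /* #ifdef CONFIG_HOTPLUG_CPU */
static void rcu_cpu_kthread_pin(int cpu)
{
}
static void rcu_cpu_kthread_unpin(struct task_struct *t)
{
}
#endif /* #else #ifdef CONFIG_HOTPLUG_CPU */

so that @t cannot go away between the bit being set and the wakeup.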




2011-02-28 09:48:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Mon, 2011-02-28 at 11:29 +0800, Lai Jiangshan wrote:
> >>> +static int rcu_cpu_kthread_should_stop(int cpu)
> >>> +{
> >>> + while (cpu_is_offline(cpu) || smp_processor_id() != cpu) {
> >>> + if (kthread_should_stop())
> >>> + return 1;
> >>> + local_bh_enable();
> >>> + schedule_timeout_uninterruptible(1);
> >>> + if (smp_processor_id() != cpu)
> >>> + set_cpus_allowed_ptr(current, cpumask_of(cpu));
> >>
> >> The current task is PF_THREAD_BOUND,
> >> Why do "set_cpus_allowed_ptr(current, cpumask_of(cpu));" ?
> >
> > Because I have seen CPU hotplug operations unbind PF_THREAD_BOUND threads.

Correct, but that's on unplug; the rest of the story seems to be about plug,
so just detach the thread on down/offline and let it die when it's done.

> > In addition, I end up having to spawn the kthread at CPU_UP_PREPARE time,

Sure, that's a common time to create such threads :-), you can
kthread_create()+kthread_bind() in UP_PREPARE, just don't wake them yet.
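
That is, something along these lines at UP_PREPARE (a sketch; the thread
name is just an example):

	t = kthread_create(rcu_cpu_kthread, (void *)(long)cpu, "rcuc%d", cpu);
	if (!IS_ERR(t)) {
		kthread_bind(t, cpu);	/* fine even though the CPU is not online yet */
		per_cpu(rcu_cpu_kthread_task, cpu) = t;
		/* No wake_up_process() here. */
	}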

> > at which point the thread must run unbound because its CPU isn't online
> > yet. I cannot invoke kthread_create() within the stop-machine handler
> > (right?).

No you can not ;-)

> I cannot wait until CPU_ONLINE time because that results in
> > hangs when other CPU notifiers wait for grace periods.
> >
> > Yes, I did find out about the hangs the hard way. Why do you ask? ;-)

Right, so I assume that whoever needs the thread will:

1) wake the thread,
2) only do so after the cpu is actually online, how else could it be
executing code? :-)

2011-02-28 23:51:45

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Mon, Feb 28, 2011 at 11:29:44AM +0800, Lai Jiangshan wrote:
> On 02/26/2011 04:32 AM, Paul E. McKenney wrote:
> >>> +/*
> >>> + * Handle cases where the rcu_cpu_kthread() ends up on the wrong CPU.
> >>> + * This can happen while the corresponding CPU is either coming online
> >>> + * or going offline. We cannot wait until the CPU is fully online
> >>> + * before starting the kthread, because the various notifier functions
> >>> + * can wait for RCU grace periods. So we park rcu_cpu_kthread() until
> >>> + * the corresponding CPU is online.
> >>> + *
> >>> + * Return 1 if the kthread needs to stop, 0 otherwise.
> >>> + *
> >>> + * Caller must disable bh. This function can momentarily enable it.
> >>> + */
> >>> +static int rcu_cpu_kthread_should_stop(int cpu)
> >>> +{
> >>> + while (cpu_is_offline(cpu) || smp_processor_id() != cpu) {
> >>> + if (kthread_should_stop())
> >>> + return 1;
> >>> + local_bh_enable();
> >>> + schedule_timeout_uninterruptible(1);
> >>> + if (smp_processor_id() != cpu)
> >>> + set_cpus_allowed_ptr(current, cpumask_of(cpu));
> >>
> >> The current task is PF_THREAD_BOUND,
> >> Why do "set_cpus_allowed_ptr(current, cpumask_of(cpu));" ?
> >
> > Because I have seen CPU hotplug operations unbind PF_THREAD_BOUND threads.
> > In addition, I end up having to spawn the kthread at CPU_UP_PREPARE time,
> > at which point the thread must run unbound because its CPU isn't online
> > yet. I cannot invoke kthread_create() within the stop-machine handler
> > (right?). I cannot wait until CPU_ONLINE time because that results in
> > hangs when other CPU notifiers wait for grace periods.
> >
> > Yes, I did find out about the hangs the hard way. Why do you ask? ;-)
>
> The current task is PF_THREAD_BOUND, so "set_cpus_allowed_ptr(current, cpumask_of(cpu))"
> will do nothing even if it runs on the wrong CPU.

You lost me on this one.

Looking at set_cpus_allowed_ptr()...

The "again" loop won't happen because the task is already running.
The CPU is online, so the cpumask_intersects() check won't kick
us out. We are working with the current task, so the check for
PF_THREAD_BOUND, current, and cpumask_equal() won't kick us out.
If the old and new cpumasks had been the same, we would not have called
set_cpus_allowed_ptr() in the first place. So we should get to
the call to migrate_task().

What am I missing here?

> If the task runs on the wrong CPU, we have no API to force/migrate the task
> to the bound CPU when the CPU comes online. But wake_up_process() has
> a side effect that it will move a sleeping task to the correct online CPU.
> "schedule_timeout_uninterruptible(1);" will call
> wake_up_process() on timeout, so it will do everything you need.
>
> But "set_cpus_allowed_ptr(current, cpumask_of(cpu));" will do nothing.
>
> The code is a little nasty, I think. The solution I would prefer:
> give rcu_cpu_notify a suitable priority, and wake up the kthread
> in the notifier.

I will be using both belt and suspenders on this one -- too much can
go wrong given slight adjustments in scheduler, CPU hotplug, and so on.

But speaking of paranoia, I should add a check of smp_processor_id()
vs. the local variable "cpu", shouldn't I?

> Steven, any suggestions? I know very little about the scheduler.
>
> >
> > Please feel free to suggest improvements in the header comment above
> > for rcu_cpu_kthread_should_stop(), which is my apparently insufficient
> > attempt to explain this.
> >
> >>> + local_bh_disable();
> >>> + }
> >>> + return 0;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Per-CPU kernel thread that invokes RCU callbacks. This replaces the
> >>> + * earlier RCU softirq.
> >>> + */
> >>> +static int rcu_cpu_kthread(void *arg)
> >>> +{
> >>> + int cpu = (int)(long)arg;
> >>> + unsigned long flags;
> >>> + int spincnt = 0;
> >>> + wait_queue_head_t *wqp = &per_cpu(rcu_cpu_wq, cpu);
> >>> + char work;
> >>> + char *workp = &per_cpu(rcu_cpu_has_work, cpu);
> >>> +
> >>> + for (;;) {
> >>> + wait_event_interruptible(*wqp,
> >>> + *workp != 0 || kthread_should_stop());
> >>> + local_bh_disable();
> >>> + if (rcu_cpu_kthread_should_stop(cpu)) {
> >>> + local_bh_enable();
> >>> + break;
> >>> + }
> >>> + local_irq_save(flags);
> >>> + work = *workp;
> >>> + *workp = 0;
> >>> + local_irq_restore(flags);
> >>> + if (work)
> >>> + rcu_process_callbacks();
> >>> + local_bh_enable();
> >>> + if (*workp != 0)
> >>> + spincnt++;
> >>> + else
> >>> + spincnt = 0;
> >>> + if (spincnt > 10) {
> >>
> >> "10" is a magic number here.
> >
> > It is indeed. Suggestions for a cpp macro name to hide it behind?
> >
> >>> + rcu_yield(cpu);
> >>> + spincnt = 0;
> >>> + }
> >>> + }
> >>> + return 0;
> >>> +}
> >>> +
> >>
> >>
> >>> +/*
> >>> + * Per-rcu_node kthread, which is in charge of waking up the per-CPU
> >>> + * kthreads when needed.
> >>> + */
> >>> +static int rcu_node_kthread(void *arg)
> >>> +{
> >>> + int cpu;
> >>> + unsigned long flags;
> >>> + unsigned long mask;
> >>> + struct rcu_node *rnp = (struct rcu_node *)arg;
> >>> + struct sched_param sp;
> >>> + struct task_struct *t;
> >>> +
> >>> + for (;;) {
> >>> + wait_event_interruptible(rnp->node_wq, rnp->wakemask != 0 ||
> >>> + kthread_should_stop());
> >>> + if (kthread_should_stop())
> >>> + break;
> >>> + raw_spin_lock_irqsave(&rnp->lock, flags);
> >>> + mask = rnp->wakemask;
> >>> + rnp->wakemask = 0;
> >>> + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> >>> + for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1) {
> >>> + if ((mask & 0x1) == 0)
> >>> + continue;
> >>> + preempt_disable();
> >>> + per_cpu(rcu_cpu_has_work, cpu) = 1;
> >>> + t = per_cpu(rcu_cpu_kthread_task, cpu);
> >>> + if (t == NULL) {
> >>> + preempt_enable();
> >>> + continue;
> >>> + }
> >>
> >> Obviously preempt_disable() is not for protecting remote percpu data.
> >> Is it for disabling cpu hotplug? I am afraid the @t may leave
> >> and become invalid.
> >
> > Indeed, acquiring the rnp->lock is safer, except that I don't trust
> > calling sched_setscheduler_nocheck() in that state. So I need to check
> > for the CPU being online after the preempt_disable(). This means that
> > I ignore requests to do work after CPU_DYING time, but that is OK because
> > force_quiescent_state() will figure out that the CPU is in fact offline.
> >
> > Make sense?
>
> Yes.

Good, I will take that approach.

> Another:
>
> #if CONFIG_HOTPLUG_CPU
> get_task_struct() when set bit in wakemask
> put_task_struct() when clear bit in wakemask
> #endif

Good point, but I will pass on the added #ifdef. ;-)

Thanx, Paul

2011-03-01 00:13:12

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Mon, Feb 28, 2011 at 10:47:17AM +0100, Peter Zijlstra wrote:
> On Mon, 2011-02-28 at 11:29 +0800, Lai Jiangshan wrote:
> > >>> +static int rcu_cpu_kthread_should_stop(int cpu)
> > >>> +{
> > >>> + while (cpu_is_offline(cpu) || smp_processor_id() != cpu) {
> > >>> + if (kthread_should_stop())
> > >>> + return 1;
> > >>> + local_bh_enable();
> > >>> + schedule_timeout_uninterruptible(1);
> > >>> + if (smp_processor_id() != cpu)
> > >>> + set_cpus_allowed_ptr(current, cpumask_of(cpu));
> > >>
> > >> The current task is PF_THREAD_BOUND,
> > >> Why do "set_cpus_allowed_ptr(current, cpumask_of(cpu));" ?
> > >
> > > Because I have seen CPU hotplug operations unbind PF_THREAD_BOUND threads.
>
> Correct, but that's on unplug; the rest of the story seems to be about plug,
> so just detach the thread on down/offline and let it die when it's done.
>
> > > In addition, I end up having to spawn the kthread at CPU_UP_PREPARE time,
>
> Sure, that's a common time to create such threads :-), you can
> kthread_create()+kthread_bind() in UP_PREPARE, just don't wake them yet.

I am OK doing the sched_setscheduler_nocheck() in UP_PREPARE, correct?

But yes, I can have the CPU_STARTING notifier wake up any kthreads that
the current CPU might have caused to be created.

> > > at which point the thread must run unbound because its CPU isn't online
> > > yet. I cannot invoke kthread_create() within the stop-machine handler
> > > (right?).
>
> No you can not ;-)

Glad I am maintaining at least a shred of sanity. ;-)

> > I cannot wait until CPU_ONLINE time because that results in
> > > hangs when other CPU notifiers wait for grace periods.
> > >
> > > Yes, I did find out about the hangs the hard way. Why do you ask? ;-)
>
> Right, so I assume that whoever needs the thread will:
>
> 1) wake the thread,
> 2) only do so after the cpu is actually online, how else could it be
> executing code? :-)

Ah, there is the rub -- I am using wait_event(), so I need to wake up the
kthread once before anyone uses it (or at least concurrently with anyone
using it). Which I can presumably do from the CPU_STARTING notifier.

Make sense, or am I still missing something?

Thanx, Paul

2011-03-01 14:36:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Mon, 2011-02-28 at 16:13 -0800, Paul E. McKenney wrote:

> I am OK doing the sched_setscheduler_nocheck() in UP_PREPARE, correct?

Yeah, it should be perfectly fine to call that.

> Ah, there is the rub -- I am using wait_event(), so I need to wake up the
> kthread once before anyone uses it (or at least concurrently with anyone
> using it). Which I can presumably do from the CPU_STARTING notifier.

Right, so your kthread is doing:

static int rcu_cpu_kthread()
{
	for (;;) {
		wait_event_interruptible();

		/* do stuff */

	}
	return 0;
}

Which means that all folks wanting to make use of this already need to
do a wakeup. So I don't see any reason to do that first wakeup from
CPU_STARTING.

wait_event() will only actually wait if the condition is false, in the
start-up case above it will find the condition true and fall right
through to do stuff.


2011-03-02 00:07:19

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Tue, Mar 01, 2011 at 03:38:11PM +0100, Peter Zijlstra wrote:
> On Mon, 2011-02-28 at 16:13 -0800, Paul E. McKenney wrote:
>
> > I am OK doing the sched_setscheduler_nocheck() in UP_PREPARE, correct?
>
> Yeah, it should be perfectly fine to call that.

Cool!

> > Ah, there is the rub -- I am using wait_event(), so I need to wake up the
> > kthread once before anyone uses it (or at least concurrently with anyone
> > using it). Which I can presumably do from the CPU_STARTING notifier.
>
> Right, so your kthread is doing:
>
> static int rcu_cpu_kthread()
> {
> 	for (;;) {
> 		wait_event_interruptible();
>
> 		/* do stuff */
>
> 	}
> 	return 0;
> }
>
> Which means that all folks wanting to make use of this already need to
> do a wakeup. So I don't see any reason to do that first wakeup from
> CPU_STARTING.

That is good to hear, because doing so seems to result in abject failure.

> wait_event() will only actually wait if the condition is false, in the
> start-up case above it will find the condition true and fall right
> through to do stuff.

So as long as it is OK to call sched_setscheduler_nocheck() before
the kthread is first awakened, we should be OK.

Thanx, Paul

2011-03-02 01:51:43

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On 03/01/2011 07:51 AM, Paul E. McKenney wrote:
> On Mon, Feb 28, 2011 at 11:29:44AM +0800, Lai Jiangshan wrote:
>> On 02/26/2011 04:32 AM, Paul E. McKenney wrote:
>>>>> +/*
>>>>> + * Handle cases where the rcu_cpu_kthread() ends up on the wrong CPU.
>>>>> + * This can happen while the corresponding CPU is either coming online
>>>>> + * or going offline. We cannot wait until the CPU is fully online
>>>>> + * before starting the kthread, because the various notifier functions
>>>>> + * can wait for RCU grace periods. So we park rcu_cpu_kthread() until
>>>>> + * the corresponding CPU is online.
>>>>> + *
>>>>> + * Return 1 if the kthread needs to stop, 0 otherwise.
>>>>> + *
>>>>> + * Caller must disable bh. This function can momentarily enable it.
>>>>> + */
>>>>> +static int rcu_cpu_kthread_should_stop(int cpu)
>>>>> +{
>>>>> + while (cpu_is_offline(cpu) || smp_processor_id() != cpu) {
>>>>> + if (kthread_should_stop())
>>>>> + return 1;
>>>>> + local_bh_enable();
>>>>> + schedule_timeout_uninterruptible(1);
>>>>> + if (smp_processor_id() != cpu)
>>>>> + set_cpus_allowed_ptr(current, cpumask_of(cpu));
>>>>
>>>> The current task is PF_THREAD_BOUND,
>>>> Why do "set_cpus_allowed_ptr(current, cpumask_of(cpu));" ?
>>>
>>> Because I have seen CPU hotplug operations unbind PF_THREAD_BOUND threads.
>>> In addition, I end up having to spawn the kthread at CPU_UP_PREPARE time,
>>> at which point the thread must run unbound because its CPU isn't online
>>> yet. I cannot invoke kthread_create() within the stop-machine handler
>>> (right?). I cannot wait until CPU_ONLINE time because that results in
>>> hangs when other CPU notifiers wait for grace periods.
>>>
>>> Yes, I did find out about the hangs the hard way. Why do you ask? ;-)
>>
>> The current task is PF_THREAD_BOUND, so "set_cpus_allowed_ptr(current, cpumask_of(cpu))"
>> will do nothing even if it runs on the wrong CPU.
>
> You lost me on this one.
>
> Looking at set_cpus_allowed_ptr()...
>
> The "again" loop won't happen because the task is already running.
> The CPU is online, so the cpumask_intersects() check won't kick
> us out. We are working with the current task, so the check for
> PF_THREAD_BOUND, current, and cpumask_equal() won't kick us out.
> If the old and new cpumasks had been the same, we would not have called
> set_cpus_allowed_ptr() in the first place. So we should get to
> the call to migrate_task().
>
> What am I missing here?

You're right. I forgot that the current task can change its own cpumask even when PF_THREAD_BOUND is set.

2011-03-02 22:41:15

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 11/11] rcu: move TREE_RCU from softirq to kthread

On Tue, Mar 01, 2011 at 04:07:03PM -0800, Paul E. McKenney wrote:
> On Tue, Mar 01, 2011 at 03:38:11PM +0100, Peter Zijlstra wrote:
> > On Mon, 2011-02-28 at 16:13 -0800, Paul E. McKenney wrote:

[ . . . ]

> > wait_event() will only actually wait if the condition is false, in the
> > start-up case above it will find the condition true and fall right
> > through to do stuff.
>
> So as long as it is OK to call sched_setscheduler_nocheck() before
> the kthread is first awakened, we should be OK.

So the softlockup code kills me when I do it this way. Or can kill me.
From what I can see, the problem is that it might be the case that no one
needs a particular RCU kthread for some time. In addition, if the kthread
has not yet done the wait_event(), wake_up() won't know to wake it up.
The flag gets set, and all zero of the tasks blocked on the event's
queue are awakened, but the not-yet-awakened task still sleeps because
it has not yet done the wait_event().

So I am falling back to waking each thread up just after creation so
that they can be safely parked regardless of who needs which kthread at
what time in the future.
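
In code terms, roughly this shape at CPU_UP_PREPARE time (a sketch only;
the helper name and thread name are illustrative):

static int __cpuinit rcu_spawn_one_cpu_kthread(int cpu)
{
	struct sched_param sp;
	struct task_struct *t;

	t = kthread_create(rcu_cpu_kthread, (void *)(long)cpu, "rcuc%d", cpu);
	if (IS_ERR(t))
		return PTR_ERR(t);
	per_cpu(rcu_cpu_kthread_task, cpu) = t;
	sp.sched_priority = RCU_KTHREAD_PRIO;
	sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
	/*
	 * One-time wakeup: until the kthread has run far enough to block
	 * in wait_event_interruptible(), it is not on the wait queue, so
	 * a wake_up() there cannot reach it.  After this first wakeup it
	 * parks itself and ordinary wakeups behave as expected.  Binding
	 * to the CPU happens later, once the CPU is actually online.
	 */
	wake_up_process(t);
	return 0;
}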

On the other hand, making these changes and testing them helped me find
and fix a number of subtle and not-so-subtle bugs in the TREE_RCU priority
boosting code, so I most certainly cannot argue with the outcome! ;-)

Thanx, Paul

2011-03-24 03:43:56

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 01/11] rcu: call __rcu_read_unlock() in exit_rcu for tiny RCU

On 02/26/2011 03:40 AM, Paul E. McKenney wrote:
> On Fri, Feb 25, 2011 at 04:29:46PM +0800, Lai Jiangshan wrote:
>> On 02/23/2011 09:39 AM, Paul E. McKenney wrote:
>>>
>>> diff --git a/kernel/rcutiny_plugin.h b/kernel/rcutiny_plugin.h
>>> index 015abae..3cb8e36 100644
>>> --- a/kernel/rcutiny_plugin.h
>>> +++ b/kernel/rcutiny_plugin.h
>>
>> My original patch was just for tree RCU. It is not merged.
>> I forgot the tiny RCU case then, this change is needed for tree & tiny RCU.
>
> Good catch, I have applied the TREE_RCU patch. I intend to submit
> both for the next merge window.
>

Hi, Paul

It is still not merged, and it is not in your git tree.

I'm writing some patches; to avoid conflicts with your code,
which branch should I base them on?

Is origin/rcu/next OK?

Thanks,
Lai

2011-03-24 13:07:27

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 01/11] rcu: call __rcu_read_unlock() in exit_rcu for tiny RCU

On Thu, Mar 24, 2011 at 11:45:48AM +0800, Lai Jiangshan wrote:
> On 02/26/2011 03:40 AM, Paul E. McKenney wrote:
> > On Fri, Feb 25, 2011 at 04:29:46PM +0800, Lai Jiangshan wrote:
> >> On 02/23/2011 09:39 AM, Paul E. McKenney wrote:
> >>>
> >>> diff --git a/kernel/rcutiny_plugin.h b/kernel/rcutiny_plugin.h
> >>> index 015abae..3cb8e36 100644
> >>> --- a/kernel/rcutiny_plugin.h
> >>> +++ b/kernel/rcutiny_plugin.h
> >>
> >> My original patch was just for tree RCU. It is not merged.
> >> I forgot the tiny RCU case then, this change is needed for tree & tiny RCU.
> >
> > Good catch, I have applied the TREE_RCU patch. I intend to submit
> > both for the next merge window.
> >
>
> Hi, Paul
>
> It is still not merged, and it is not in your git tree.

Indeed it is not yet in my external git tree. Here is my normal work
flow:

1. Develop and test continuously.

2. Based on testing, review, etc., determine which commits are
ready for the next merge window.

3. Midway through the -rc releases, push the commits targeted
to the next merge window to rcu/next. Push any additional
commits to rcu/testing.

4. Towards the last -rc, rebase, retest, wait for the -next
tree to take the changes, and, if all is well, send Ingo
a pull request.

Exposing your changes to -next during the merge window would cause
pointless irritation due to random conflicts with rapidly changing
code.

> I'm writing some patches; to avoid conflicts with your code,
> which branch should I base them on?

But, yes, it would be much easier for us to handle your changes if
I post them via -rcu. So I have created a new branch rcu/kfree_rcu
that is -not- consumed by -next. The rcu/next and rcu/testing
branches include your base kfree_rcu() patch, but not the uses of
it. These branches are based on 2.6.38, and I will move them ahead
at -rc3 or -rc4.

These currently contain your original patches, but I will be pulling
your latest set.

> Is origin/rcu/next OK?

origin/rcu/kfree_rcu for the moment to avoid messing up -next. But
it will be origin/rcu/next after rebasing onto -rc3 or -rc4.

Thanx, Paul

2011-03-25 02:30:54

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 01/11] rcu: call __rcu_read_unlock() in exit_rcu for tiny RCU

On Thu, Mar 24, 2011 at 06:07:12AM -0700, Paul E. McKenney wrote:
> On Thu, Mar 24, 2011 at 11:45:48AM +0800, Lai Jiangshan wrote:
> > On 02/26/2011 03:40 AM, Paul E. McKenney wrote:
> > > On Fri, Feb 25, 2011 at 04:29:46PM +0800, Lai Jiangshan wrote:
> > >> On 02/23/2011 09:39 AM, Paul E. McKenney wrote:
> > >>>
> > >>> diff --git a/kernel/rcutiny_plugin.h b/kernel/rcutiny_plugin.h
> > >>> index 015abae..3cb8e36 100644
> > >>> --- a/kernel/rcutiny_plugin.h
> > >>> +++ b/kernel/rcutiny_plugin.h
> > >>
> > >> My original patch was just for tree RCU. It is not merged.
> > >> I forgot the tiny RCU case then, this change is needed for tree & tiny RCU.
> > >
> > > Good catch, I have applied the TREE_RCU patch. I intend to submit
> > > both for the next merge window.
> > >
> >
> > Hi, Paul
> >
> > It is still not merged, and it is not in your git tree.
>
> Indeed it is not yet in my external git tree. Here is my normal work
> flow:
>
> 1. Develop and test continuously.
>
> 2. Based on testing, review, etc., determine which commits are
> ready for the next merge window.
>
> 3. Midway through the -rc releases, push the commits targeted
> to the next merge window to rcu/next. Push any additional
> commits to rcu/testing.
>
> 4. Towards the last -rc, rebase, retest, wait for the -next
> tree to take the changes, and, if all is well, send Ingo
> a pull request.
>
> Exposing your changes to -next during the merge window would cause
> pointless irritation due to random conflicts with rapidly changing
> code.
>
> > I'm writing some patches; to avoid conflicts with your code,
> > which branch should I base them on?
>
> But, yes, it would be much easier for us to handle your changes if
> I post them via -rcu. So I have created a new branch rcu/kfree_rcu
> that is -not- consumed by -next. The rcu/next and rcu/testing
> branches include your base kfree_rcu() patch, but not the uses of
> it. These branches are based on 2.6.38, and I will move them ahead
> at -rc3 or -rc4.
>
> These currently contain your original patches, but I will be pulling
> your latest set.

Which are now available at rcu/kfree_rcu.

Thanx, Paul

> > Is origin/rcu/next OK?
>
> origin/rcu/kfree_rcu for the moment to avoid messing up -next. But
> it will be origin/rcu/next after rebasing onto -rc3 or -rc4.
>
> Thanx, Paul