Hello!
This patch series is a second attempt to fix the idle-loop uses of RCU;
see https://lkml.org/lkml/2012/2/1/741 for v1. Where the first series
attempted to drive rcu_idle_enter() and rcu_idle_exit() further down
into the Linux kernel's multitude of idle loops, this series instead
marks specific idle-loop operations containing RCU read-side critical
sections, as suggested by Nicolas Pitre and Steven Rostedt. The
possibility of code shared between the idle loop and non-idle code also
requires the ability to nest rcu_idle_enter() calls. The individual
patches are as follows:
1. Allow nesting of rcu_idle_enter() and rcu_idle_exit().
2. Add an RCU_NONIDLE() macro to enclose idle-loop code that
contains RCU read-side critical sections.
3. Use RCU_NONIDLE() to protect cpuidle_idle_call()'s tracepoints.
This patchset has the distinct advantage of avoiding touching any
architecture-specific code. ;-)
Thanx, Paul
------------------------------------------------------------------------
b/drivers/cpuidle/cpuidle.c | 12 ++++++++----
b/include/linux/rcupdate.h | 27 +++++++++++++++++++++++++++
b/kernel/rcu.h | 18 +++++++++++++++++-
b/kernel/rcutiny.c | 16 ++++++++++++----
b/kernel/rcutree.c | 19 +++++++++++++------
kernel/rcutiny.c | 2 ++
kernel/rcutree.c | 2 ++
7 files changed, 81 insertions(+), 15 deletions(-)
From: "Paul E. McKenney" <[email protected]>
RCU, RCU-bh, and RCU-sched read-side critical sections are forbidden
in the inner idle loop, that is, between the rcu_idle_enter() and the
rcu_idle_exit() -- RCU will happily ignore any such read-side critical
sections. However, things like powertop need tracepoints in the inner
idle loop.
This commit therefore provides an RCU_NONIDLE() macro that can be used to
wrap code in the idle loop that requires RCU read-side critical sections.
Suggested-by: Steven Rostedt <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rcupdate.h | 27 +++++++++++++++++++++++++++
kernel/rcutiny.c | 2 ++
kernel/rcutree.c | 2 ++
3 files changed, 31 insertions(+), 0 deletions(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 6ee663c..9372174 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -190,6 +190,33 @@ extern void rcu_idle_exit(void);
extern void rcu_irq_enter(void);
extern void rcu_irq_exit(void);
+/**
+ * RCU_NONIDLE - Indicate idle-loop code that needs RCU readers
+ * @a: Code that RCU needs to pay attention to.
+ *
+ * RCU, RCU-bh, and RCU-sched read-side critical sections are forbidden
+ * in the inner idle loop, that is, between the rcu_idle_enter() and
+ * the rcu_idle_exit() -- RCU will happily ignore any such read-side
+ * critical sections. However, things like powertop need tracepoints
+ * in the inner idle loop.
+ *
+ * This macro provides the way out: RCU_NONIDLE(do_something_with_RCU())
+ * will tell RCU that it needs to pay attention, invoke its argument
+ * (in this example, a call to the do_something_with_RCU() function),
+ * and then tell RCU to go back to ignoring this CPU. It is permissible
+ * to nest RCU_NONIDLE() wrappers, but the nesting level is currently
+ * quite limited. If deeper nesting is required, it will be necessary
+ * to adjust DYNTICK_TASK_NESTING_VALUE accordingly.
+ *
+ * This macro may be used from process-level code only.
+ */
+#define RCU_NONIDLE(a) \
+ do { \
+ rcu_idle_exit(); \
+ do { a; } while (0); \
+ rcu_idle_enter(); \
+ } while (0)
+
/*
* Infrastructure to implement the synchronize_() primitives in
* TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
diff --git a/kernel/rcutiny.c b/kernel/rcutiny.c
index 5098aea..b4142a9 100644
--- a/kernel/rcutiny.c
+++ b/kernel/rcutiny.c
@@ -97,6 +97,7 @@ void rcu_idle_enter(void)
rcu_idle_enter_common(oldval);
local_irq_restore(flags);
}
+EXPORT_SYMBOL_GPL(rcu_idle_enter);
/*
* Exit an interrupt handler towards idle.
@@ -153,6 +154,7 @@ void rcu_idle_exit(void)
rcu_idle_exit_common(oldval);
local_irq_restore(flags);
}
+EXPORT_SYMBOL_GPL(rcu_idle_exit);
/*
* Enter an interrupt handler, moving away from idle.
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index e81298a..25111d3 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -402,6 +402,7 @@ void rcu_idle_enter(void)
rcu_idle_enter_common(rdtp, oldval);
local_irq_restore(flags);
}
+EXPORT_SYMBOL_GPL(rcu_idle_enter);
/**
* rcu_irq_exit - inform RCU that current CPU is exiting irq towards idle
@@ -493,6 +494,7 @@ void rcu_idle_exit(void)
rcu_idle_exit_common(rdtp, oldval);
local_irq_restore(flags);
}
+EXPORT_SYMBOL_GPL(rcu_idle_exit);
/**
* rcu_irq_enter - inform RCU that current CPU is entering irq away from idle
--
1.7.8
From: "Paul E. McKenney" <[email protected]>
The cpuidle_idle_call() function is invoked in the inner idle loop,
after the call to rcu_idle_enter() and before the call to rcu_idle_exit().
This means that RCU is ignoring the CPU at this point. Unfortunately,
cpuidle_idle_call() nevertheless contains tracepoints (important ones
used by powertop) that expect RCU to be paying attention. The consequences
of using RCU read-side critical sections on CPUs that RCU is ignoring
can be severe, including random corruption of random memory.
Therefore, this commit uses the new RCU_NONIDLE() macro to let RCU
do its job with respect to the tracepoints.
Suggested-by: Nicolas Pitre <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Kevin Hilman <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Trinabh Gupta <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Deepthi Dharwar <[email protected]>
---
drivers/cpuidle/cpuidle.c | 12 ++++++++----
1 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 59f4261..cd8a553 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -94,13 +94,17 @@ int cpuidle_idle_call(void)
target_state = &drv->states[next_state];
- trace_power_start(POWER_CSTATE, next_state, dev->cpu);
- trace_cpu_idle(next_state, dev->cpu);
+ RCU_NONIDLE(
+ trace_power_start(POWER_CSTATE, next_state, dev->cpu);
+ trace_cpu_idle(next_state, dev->cpu);
+ );
entered_state = target_state->enter(dev, drv, next_state);
- trace_power_end(dev->cpu);
- trace_cpu_idle(PWR_EVENT_EXIT, dev->cpu);
+ RCU_NONIDLE(
+ trace_power_end(dev->cpu);
+ trace_cpu_idle(PWR_EVENT_EXIT, dev->cpu);
+ );
if (entered_state >= 0) {
/* Update cpuidle counters */
--
1.7.8
From: "Paul E. McKenney" <[email protected]>
Use of RCU in the idle loop is incorrect, yet quite a few instances of
just that have made their way into mainline, primarily event tracing.
The problem with RCU read-side critical sections on CPUs that RCU believes
to be idle is that RCU is completely ignoring the CPU, along with any
attempted RCU read-side critical sections.
The approaches of eliminating the offending uses and of pushing the
definition of idle down beyond the offending uses have both proved
impractical. The new approach is to encapsulate offending uses of RCU
with rcu_idle_exit() and rcu_idle_enter(), but this requires nesting
for code that is invoked both during idle and during normal execution.
Therefore, this commit modifies rcu_idle_enter() and rcu_idle_exit() to
permit nesting.
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu.h | 18 +++++++++++++++++-
kernel/rcutiny.c | 16 ++++++++++++----
kernel/rcutree.c | 19 +++++++++++++------
3 files changed, 42 insertions(+), 11 deletions(-)
diff --git a/kernel/rcu.h b/kernel/rcu.h
index 30876f4..0d0f9e8 100644
--- a/kernel/rcu.h
+++ b/kernel/rcu.h
@@ -33,8 +33,24 @@
* Process-level increment to ->dynticks_nesting field. This allows for
* architectures that use half-interrupts and half-exceptions from
* process context.
+ *
+ * DYNTICK_TASK_NESTING_MASK is a three-bit field that counts the number
+ * of process-based reasons why RCU cannot consider the corresponding CPU
+ * to be idle, and DYNTICK_TASK_NESTING_VALUE is the value used to increment
+ * or decrement this three-bit field. The rest of the bits could in
+ * principle be used to count interrupts, but this would mean that a
+ * negative-one value in the interrupt field could incorrectly zero out
+ * the DYNTICK_TASK_NESTING_MASK field. We therefore provide a two-bit
+ * guard field defined by DYNTICK_TASK_MASK that is set to DYNTICK_TASK_FLAG
+ * upon initial exit from idle. The DYNTICK_TASK_EXIT_IDLE value is
+ * thus the combined value used upon initial exit from idle.
*/
-#define DYNTICK_TASK_NESTING (LLONG_MAX / 2 - 1)
+#define DYNTICK_TASK_NESTING_VALUE (LLONG_MAX / 8 + 1)
+#define DYNTICK_TASK_NESTING_MASK (LLONG_MAX - DYNTICK_TASK_NESTING_VALUE + 1)
+#define DYNTICK_TASK_FLAG ((DYNTICK_TASK_NESTING_VALUE / 8) * 2)
+#define DYNTICK_TASK_MASK ((DYNTICK_TASK_NESTING_VALUE / 8) * 3)
+#define DYNTICK_TASK_EXIT_IDLE (DYNTICK_TASK_NESTING_VALUE + \
+ DYNTICK_TASK_FLAG)
/*
* debug_rcu_head_queue()/debug_rcu_head_unqueue() are used internally
diff --git a/kernel/rcutiny.c b/kernel/rcutiny.c
index 4eb34fc..5098aea 100644
--- a/kernel/rcutiny.c
+++ b/kernel/rcutiny.c
@@ -53,7 +53,7 @@ static void __call_rcu(struct rcu_head *head,
#include "rcutiny_plugin.h"
-static long long rcu_dynticks_nesting = DYNTICK_TASK_NESTING;
+static long long rcu_dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
/* Common code for rcu_idle_enter() and rcu_irq_exit(), see kernel/rcutree.c. */
static void rcu_idle_enter_common(long long oldval)
@@ -88,7 +88,12 @@ void rcu_idle_enter(void)
local_irq_save(flags);
oldval = rcu_dynticks_nesting;
- rcu_dynticks_nesting = 0;
+ WARN_ON_ONCE((rcu_dynticks_nesting & DYNTICK_TASK_NESTING_MASK) == 0);
+ if ((rcu_dynticks_nesting & DYNTICK_TASK_NESTING_MASK) ==
+ DYNTICK_TASK_NESTING_VALUE)
+ rcu_dynticks_nesting = 0;
+ else
+ rcu_dynticks_nesting -= DYNTICK_TASK_NESTING_VALUE;
rcu_idle_enter_common(oldval);
local_irq_restore(flags);
}
@@ -140,8 +145,11 @@ void rcu_idle_exit(void)
local_irq_save(flags);
oldval = rcu_dynticks_nesting;
- WARN_ON_ONCE(oldval != 0);
- rcu_dynticks_nesting = DYNTICK_TASK_NESTING;
+ WARN_ON_ONCE(rcu_dynticks_nesting < 0);
+ if (rcu_dynticks_nesting & DYNTICK_TASK_NESTING_MASK)
+ rcu_dynticks_nesting += DYNTICK_TASK_NESTING_VALUE;
+ else
+ rcu_dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
rcu_idle_exit_common(oldval);
local_irq_restore(flags);
}
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index df0e3c1..e81298a 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -198,7 +198,7 @@ void rcu_note_context_switch(int cpu)
EXPORT_SYMBOL_GPL(rcu_note_context_switch);
DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
- .dynticks_nesting = DYNTICK_TASK_NESTING,
+ .dynticks_nesting = DYNTICK_TASK_EXIT_IDLE,
.dynticks = ATOMIC_INIT(1),
};
@@ -394,7 +394,11 @@ void rcu_idle_enter(void)
local_irq_save(flags);
rdtp = &__get_cpu_var(rcu_dynticks);
oldval = rdtp->dynticks_nesting;
- rdtp->dynticks_nesting = 0;
+ WARN_ON_ONCE((oldval & DYNTICK_TASK_NESTING_MASK) == 0);
+ if ((oldval & DYNTICK_TASK_NESTING_MASK) == DYNTICK_TASK_NESTING_VALUE)
+ rdtp->dynticks_nesting = 0;
+ else
+ rdtp->dynticks_nesting -= DYNTICK_TASK_NESTING_VALUE;
rcu_idle_enter_common(rdtp, oldval);
local_irq_restore(flags);
}
@@ -481,8 +485,11 @@ void rcu_idle_exit(void)
local_irq_save(flags);
rdtp = &__get_cpu_var(rcu_dynticks);
oldval = rdtp->dynticks_nesting;
- WARN_ON_ONCE(oldval != 0);
- rdtp->dynticks_nesting = DYNTICK_TASK_NESTING;
+ WARN_ON_ONCE(oldval < 0);
+ if (oldval & DYNTICK_TASK_NESTING_MASK)
+ rdtp->dynticks_nesting += DYNTICK_TASK_NESTING_VALUE;
+ else
+ rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
rcu_idle_exit_common(rdtp, oldval);
local_irq_restore(flags);
}
@@ -2253,7 +2260,7 @@ rcu_boot_init_percpu_data(int cpu, struct rcu_state *rsp)
rdp->qlen_lazy = 0;
rdp->qlen = 0;
rdp->dynticks = &per_cpu(rcu_dynticks, cpu);
- WARN_ON_ONCE(rdp->dynticks->dynticks_nesting != DYNTICK_TASK_NESTING);
+ WARN_ON_ONCE(rdp->dynticks->dynticks_nesting != DYNTICK_TASK_EXIT_IDLE);
WARN_ON_ONCE(atomic_read(&rdp->dynticks->dynticks) != 1);
rdp->cpu = cpu;
rdp->rsp = rsp;
@@ -2281,7 +2288,7 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp, int preemptible)
rdp->qlen_last_fqs_check = 0;
rdp->n_force_qs_snap = rsp->n_force_qs;
rdp->blimit = blimit;
- rdp->dynticks->dynticks_nesting = DYNTICK_TASK_NESTING;
+ rdp->dynticks->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
atomic_set(&rdp->dynticks->dynticks,
(atomic_read(&rdp->dynticks->dynticks) & ~0x1) + 1);
rcu_prepare_for_idle_init(cpu);
--
1.7.8
On Thu, Feb 02, 2012 at 05:12:08PM -0800, Paul E. McKenney wrote:
> This patch series is a second attempt to fix the idle-loop uses of RCU,
> see https://lkml.org/lkml/2012/2/1/741 for v1. Where the first series
> attempted to drive rcu_idle_enter() and rcu_idle_exit() further down
> into the Linux kernel's multitude of idle loops, this patch instead
> marks specific idle-loop operations containing RCU read-side critical
> sections, as suggested by Nicolas Pitre and Steven Rostedt. The possibility
> of code shared between idle and non-idle also requires the ability to nest
> rcu_idle_enter() calls. The individual patches are as follows:
>
> 1. Allow nesting of rcu_idle_enter() and rcu_idle_exit().
> 2. Add an RCU_NONIDLE() macro to enclose idle-loop code that
> contains RCU read-side critical sections.
> 3. Use RCU_NONIDLE() to protect cpuidle_idle_call()'s tracepoints.
>
> This patchset has the distinct advantage of avoiding touching any
> architecture-specific code. ;-)
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> b/drivers/cpuidle/cpuidle.c | 12 ++++++++----
> b/include/linux/rcupdate.h | 27 +++++++++++++++++++++++++++
> b/kernel/rcu.h | 18 +++++++++++++++++-
> b/kernel/rcutiny.c | 16 ++++++++++++----
> b/kernel/rcutree.c | 19 +++++++++++++------
> kernel/rcutiny.c | 2 ++
> kernel/rcutree.c | 2 ++
> 7 files changed, 81 insertions(+), 15 deletions(-)
Looks good to me. Much cleaner.
Reviewed-by: Josh Triplett <[email protected]>
On 02/03/2012 06:42 AM, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <[email protected]>
>
> The cpuidle_idle_call() function is invoked in the inner idle loop,
> after the call to rcu_idle_enter() and before the call to rcu_idle_exit().
> This means that RCU is ignoring the CPU at this point. Unfortunately,
> cpuidle_idle_call() nevertheless contains tracepoints (important ones
> used by powertop) that expect RCU to be paying attention. The consequences
> of using RCU read-side critical sections on CPUs that RCU is ignoring
> can be severe, including random corruption of random memory.
>
> Therefore, this commit uses the new RCU_NONIDLE() macro to let RCU
> do its job with respect to the tracepoints.
>
> Suggested-by: Nicolas Pitre <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> Cc: Kevin Hilman <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Trinabh Gupta <[email protected]>
> Cc: Arjan van de Ven <[email protected]>
> Cc: Deepthi Dharwar <[email protected]>
> ---
> drivers/cpuidle/cpuidle.c | 12 ++++++++----
> 1 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index 59f4261..cd8a553 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -94,13 +94,17 @@ int cpuidle_idle_call(void)
>
> target_state = &drv->states[next_state];
>
> - trace_power_start(POWER_CSTATE, next_state, dev->cpu);
> - trace_cpu_idle(next_state, dev->cpu);
> + RCU_NONIDLE(
> + trace_power_start(POWER_CSTATE, next_state, dev->cpu);
> + trace_cpu_idle(next_state, dev->cpu)
> + );
>
> entered_state = target_state->enter(dev, drv, next_state);
>
> - trace_power_end(dev->cpu);
> - trace_cpu_idle(PWR_EVENT_EXIT, dev->cpu);
> + RCU_NONIDLE(
> + trace_power_end(dev->cpu);
> + trace_cpu_idle(PWR_EVENT_EXIT, dev->cpu);
> + );
>
> if (entered_state >= 0) {
> /* Update cpuidle counters */
The cleaner design, especially moving away from architecture-specific
tweaking, is much appreciated. :)
Acked-by: Deepthi Dharwar <[email protected]>