2011-06-06 03:11:08

by Frederic Weisbecker

Subject: [PATCH 0/4] rcu: Detect rcu uses under extended quiescent state, and fix some

I found two such bugs, fixed by the last two patches, but I suspect more are
waiting to be detected under other configs and/or archs.

Frederic Weisbecker (4):
rcu: Detect uses of rcu read side in extended quiescent states
nohz: Split extended quiescent state handling from nohz switch
x86: Don't call idle notifier inside rcu extended QS
x86: Call exit_idle() after irq_enter()

arch/arm/kernel/process.c | 4 +-
arch/avr32/kernel/process.c | 4 +-
arch/blackfin/kernel/process.c | 4 +-
arch/microblaze/kernel/process.c | 4 +-
arch/mips/kernel/process.c | 4 +-
arch/powerpc/kernel/idle.c | 4 +-
arch/powerpc/platforms/iseries/setup.c | 8 +++---
arch/s390/kernel/process.c | 4 +-
arch/sh/kernel/idle.c | 2 +-
arch/sparc/kernel/process_64.c | 4 +-
arch/tile/kernel/process.c | 4 +-
arch/um/kernel/process.c | 4 +-
arch/unicore32/kernel/process.c | 4 +-
arch/x86/kernel/apic/apic.c | 6 ++--
arch/x86/kernel/apic/io_apic.c | 3 +-
arch/x86/kernel/cpu/mcheck/mce.c | 2 +-
arch/x86/kernel/cpu/mcheck/therm_throt.c | 2 +-
arch/x86/kernel/cpu/mcheck/threshold.c | 2 +-
arch/x86/kernel/irq.c | 5 +--
arch/x86/kernel/process_32.c | 4 +-
arch/x86/kernel/process_64.c | 7 +++++-
include/linux/lockdep.h | 8 ++++++-
include/linux/rcupdate.h | 20 ++++++++++++-----
include/linux/tick.h | 10 ++++++--
kernel/lockdep.c | 12 ++++++++--
kernel/pid.c | 2 +-
kernel/rcutree.c | 14 ++++++++++++
kernel/time/tick-sched.c | 34 ++++++++++++++++++++++++++++-
28 files changed, 129 insertions(+), 56 deletions(-)

--
1.7.5.2


2011-06-06 03:12:02

by Frederic Weisbecker

Subject: [PATCH 1/4] rcu: Detect uses of rcu read side in extended quiescent states

Detect uses of RCU that are not supposed to happen while we
are in an extended quiescent state.

This can happen, for example, if we use RCU between the time we
stop the tick and the time we restart it, or inside an irq that
didn't call rcu_irq_enter()/rcu_irq_exit(), or through some other
kind of RCU API misuse.
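
As an illustration (not part of the patch), here is a minimal sketch of the
kind of misuse this now catches; struct foo, gp and bad_idle_path() are
hypothetical:

	/* Hypothetical sketch: an RCU read side used after the tick was
	 * stopped, i.e. while the CPU sits in the RCU extended quiescent
	 * state. The rcu_dereference() below now triggers the new splat. */
	static struct foo __rcu *gp;	/* hypothetical RCU-protected pointer */

	static void bad_idle_path(void)
	{
		struct foo *p;

		tick_nohz_stop_sched_tick(1);	/* enters the extended QS */

		rcu_read_lock();
		p = rcu_dereference(gp);	/* warns: RCU_WARN_EXT_QS */
		(void)p;
		rcu_read_unlock();

		tick_nohz_restart_sched_tick();
	}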

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
---
include/linux/lockdep.h | 8 +++++++-
include/linux/rcupdate.h | 20 ++++++++++++++------
kernel/lockdep.c | 12 +++++++++---
kernel/pid.c | 2 +-
kernel/rcutree.c | 14 ++++++++++++++
5 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index ef820a3..452d547 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -547,8 +547,14 @@ do { \
# define might_lock_read(lock) do { } while (0)
#endif

+enum rcu_warn {
+ RCU_WARN_UNPROTECTED,
+ RCU_WARN_EXT_QS
+};
+
#ifdef CONFIG_PROVE_RCU
-extern void lockdep_rcu_dereference(const char *file, const int line);
+extern void
+lockdep_rcu_dereference(const char *file, const int line, enum rcu_warn type);
#endif

#endif /* __LINUX_LOCKDEP_H */
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 99f9aa7..651d90b 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -297,22 +297,29 @@ extern int rcu_my_thread_group_empty(void);
/**
* rcu_lockdep_assert - emit lockdep splat if specified condition not met
* @c: condition to check
+ * @t: type of the rcu problem detected
*/
-#define rcu_lockdep_assert(c) \
+#define rcu_lockdep_assert(c, t) \
do { \
static bool __warned; \
if (debug_lockdep_rcu_enabled() && !__warned && !(c)) { \
__warned = true; \
- lockdep_rcu_dereference(__FILE__, __LINE__); \
+ lockdep_rcu_dereference(__FILE__, __LINE__, t); \
} \
} while (0)

#else /* #ifdef CONFIG_PROVE_RCU */

-#define rcu_lockdep_assert(c) do { } while (0)
+#define rcu_lockdep_assert(c, t) do { } while (0)

#endif /* #else #ifdef CONFIG_PROVE_RCU */

+#if defined(CONFIG_PROVE_RCU) && defined(CONFIG_NO_HZ)
+extern bool rcu_check_extended_qs(void);
+#else
+static inline bool rcu_check_extended_qs(void) { return false; }
+#endif
+
/*
* Helper functions for rcu_dereference_check(), rcu_dereference_protected()
* and rcu_assign_pointer(). Some of these could be folded into their
@@ -338,14 +345,15 @@ extern int rcu_my_thread_group_empty(void);
#define __rcu_dereference_check(p, c, space) \
({ \
typeof(*p) *_________p1 = (typeof(*p)*__force )ACCESS_ONCE(p); \
- rcu_lockdep_assert(c); \
+ rcu_lockdep_assert(c, RCU_WARN_UNPROTECTED); \
+ rcu_lockdep_assert(!rcu_check_extended_qs(), RCU_WARN_EXT_QS); \
rcu_dereference_sparse(p, space); \
smp_read_barrier_depends(); \
((typeof(*p) __force __kernel *)(_________p1)); \
})
#define __rcu_dereference_protected(p, c, space) \
({ \
- rcu_lockdep_assert(c); \
+ rcu_lockdep_assert(c, RCU_WARN_UNPROTECTED); \
rcu_dereference_sparse(p, space); \
((typeof(*p) __force __kernel *)(p)); \
})
@@ -359,7 +367,7 @@ extern int rcu_my_thread_group_empty(void);
#define __rcu_dereference_index_check(p, c) \
({ \
typeof(p) _________p1 = ACCESS_ONCE(p); \
- rcu_lockdep_assert(c); \
+ rcu_lockdep_assert(c, RCU_WARN_UNPROTECTED); \
smp_read_barrier_depends(); \
(_________p1); \
})
diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 63437d0..eccfede 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -3982,7 +3982,8 @@ void lockdep_sys_exit(void)
}
}

-void lockdep_rcu_dereference(const char *file, const int line)
+void lockdep_rcu_dereference(const char *file, const int line,
+ enum rcu_warn type)
{
struct task_struct *curr = current;

@@ -3994,8 +3995,13 @@ void lockdep_rcu_dereference(const char *file, const int line)
printk("\n===================================================\n");
printk( "[ INFO: suspicious rcu_dereference_check() usage. ]\n");
printk( "---------------------------------------------------\n");
- printk("%s:%d invoked rcu_dereference_check() without protection!\n",
- file, line);
+ printk("%s:%d invoked rcu_dereference_check() ", file, line);
+
+ if (type == RCU_WARN_UNPROTECTED)
+ printk("without protection!\n");
+ else if (type == RCU_WARN_EXT_QS)
+ printk("while in RCU extended quiescent state!\n");
+
printk("\nother info that might help us debug this:\n\n");
printk("\nrcu_scheduler_active = %d, debug_locks = %d\n", rcu_scheduler_active, debug_locks);
lockdep_print_held_locks(curr);
diff --git a/kernel/pid.c b/kernel/pid.c
index 57a8346..87dd3c5 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -419,7 +419,7 @@ EXPORT_SYMBOL(pid_task);
*/
struct task_struct *find_task_by_pid_ns(pid_t nr, struct pid_namespace *ns)
{
- rcu_lockdep_assert(rcu_read_lock_held());
+ rcu_lockdep_assert(rcu_read_lock_held(), RCU_WARN_UNPROTECTED);
return pid_task(find_pid_ns(nr, ns), PIDTYPE_PID);
}

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 89419ff..992ec56 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -433,6 +433,20 @@ void rcu_irq_exit(void)
rcu_enter_nohz();
}

+#ifdef CONFIG_PROVE_RCU
+
+bool rcu_check_extended_qs(void)
+{
+ struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
+
+ if (atomic_read(&rdtp->dynticks) & 0x1)
+ return false;
+
+ return true;
+}
+
+#endif /* CONFIG_PROVE_RCU */
+
#ifdef CONFIG_SMP

/*
--
1.7.5.2

2011-06-06 03:11:17

by Frederic Weisbecker

Subject: [PATCH 2/4] nohz: Split extended quiescent state handling from nohz switch

It is assumed that RCU won't be used once we switch to tickless
mode, until we restart the tick. However this is not always
true, as on x86-64 where we dereference the idle notifiers after
the tick is stopped.

To prepare for fixing this, split the tickless mode switching
from the RCU extended quiescent state logic.
Make tick_nohz_stop/restart_sched_tick() RCU-agnostic but provide
a new pair of APIs, tick_nohz_enter/exit_idle(), that keep the
old behaviour by handling both the nohz mode and the RCU extended
quiescent state, then convert every arch to use these.

Archs that want to enter the extended QS at some custom point
can do it later by using tick_nohz_stop_sched_tick() and
rcu_enter_nohz() separately.
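
As an illustration, a hedged sketch of such a custom idle loop; the arch_*
helpers are hypothetical:

	/* Hypothetical arch idle loop: stop the tick first, run an idle
	 * notifier that may use RCU, and only then enter the extended QS,
	 * mirroring what the x86-64 fix later in this series does. */
	while (1) {
		bool stopped = tick_nohz_stop_sched_tick(1);

		arch_idle_enter_notify();	/* hypothetical, may use RCU */
		if (stopped)
			rcu_enter_nohz();	/* no RCU read side beyond here */
		while (!need_resched())
			arch_safe_halt();	/* hypothetical low-power wait */
		if (stopped)
			rcu_exit_nohz();
		arch_idle_exit_notify();	/* hypothetical, may use RCU */
		tick_nohz_restart_sched_tick();
		schedule();
	}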

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Russell King <[email protected]>
Cc: Hans-Christian Egtvedt <[email protected]>
Cc: Mike Frysinger <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: David Miller <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Guan Xuetao <[email protected]>
---
arch/arm/kernel/process.c | 4 +-
arch/avr32/kernel/process.c | 4 +-
arch/blackfin/kernel/process.c | 4 +-
arch/microblaze/kernel/process.c | 4 +-
arch/mips/kernel/process.c | 4 +-
arch/powerpc/kernel/idle.c | 4 +-
arch/powerpc/platforms/iseries/setup.c | 8 +++---
arch/s390/kernel/process.c | 4 +-
arch/sh/kernel/idle.c | 2 +-
arch/sparc/kernel/process_64.c | 4 +-
arch/tile/kernel/process.c | 4 +-
arch/um/kernel/process.c | 4 +-
arch/unicore32/kernel/process.c | 4 +-
arch/x86/kernel/process_32.c | 4 +-
arch/x86/kernel/process_64.c | 4 +-
include/linux/tick.h | 10 ++++++--
kernel/time/tick-sched.c | 34 ++++++++++++++++++++++++++++++-
17 files changed, 70 insertions(+), 36 deletions(-)

diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 5e1e541..27b68b0 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -182,7 +182,7 @@ void cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
leds_event(led_idle_start);
while (!need_resched()) {
#ifdef CONFIG_HOTPLUG_CPU
@@ -208,7 +208,7 @@ void cpu_idle(void)
}
}
leds_event(led_idle_end);
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/avr32/kernel/process.c b/arch/avr32/kernel/process.c
index ef5a2a0..e683a34 100644
--- a/arch/avr32/kernel/process.c
+++ b/arch/avr32/kernel/process.c
@@ -34,10 +34,10 @@ void cpu_idle(void)
{
/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched())
cpu_idle_sleep();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/blackfin/kernel/process.c b/arch/blackfin/kernel/process.c
index 6a660fa..8082a8f 100644
--- a/arch/blackfin/kernel/process.c
+++ b/arch/blackfin/kernel/process.c
@@ -88,10 +88,10 @@ void cpu_idle(void)
#endif
if (!idle)
idle = default_idle;
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched())
idle();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
index 968648a..1b295b2 100644
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -103,10 +103,10 @@ void cpu_idle(void)
if (!idle)
idle = default_idle;

- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched())
idle();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();

preempt_enable_no_resched();
schedule();
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index d2112d3..cdbfa52 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -56,7 +56,7 @@ void __noreturn cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched() && cpu_online(cpu)) {
#ifdef CONFIG_MIPS_MT_SMTC
extern void smtc_idle_loop_hook(void);
@@ -77,7 +77,7 @@ void __noreturn cpu_idle(void)
system_state == SYSTEM_BOOTING))
play_dead();
#endif
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c
index 39a2baa..1108260 100644
--- a/arch/powerpc/kernel/idle.c
+++ b/arch/powerpc/kernel/idle.c
@@ -56,7 +56,7 @@ void cpu_idle(void)

set_thread_flag(TIF_POLLING_NRFLAG);
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched() && !cpu_should_die()) {
ppc64_runlatch_off();

@@ -93,7 +93,7 @@ void cpu_idle(void)

HMT_medium();
ppc64_runlatch_on();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
if (cpu_should_die())
cpu_die();
diff --git a/arch/powerpc/platforms/iseries/setup.c b/arch/powerpc/platforms/iseries/setup.c
index c25a081..d40dcd9 100644
--- a/arch/powerpc/platforms/iseries/setup.c
+++ b/arch/powerpc/platforms/iseries/setup.c
@@ -562,7 +562,7 @@ static void yield_shared_processor(void)
static void iseries_shared_idle(void)
{
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched() && !hvlpevent_is_pending()) {
local_irq_disable();
ppc64_runlatch_off();
@@ -576,7 +576,7 @@ static void iseries_shared_idle(void)
}

ppc64_runlatch_on();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();

if (hvlpevent_is_pending())
process_iSeries_events();
@@ -592,7 +592,7 @@ static void iseries_dedicated_idle(void)
set_thread_flag(TIF_POLLING_NRFLAG);

while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
if (!need_resched()) {
while (!need_resched()) {
ppc64_runlatch_off();
@@ -609,7 +609,7 @@ static void iseries_dedicated_idle(void)
}

ppc64_runlatch_on();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 541a750..560cd94 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -90,10 +90,10 @@ static void default_idle(void)
void cpu_idle(void)
{
for (;;) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched())
default_idle();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/sh/kernel/idle.c b/arch/sh/kernel/idle.c
index 425d604..3957972 100644
--- a/arch/sh/kernel/idle.c
+++ b/arch/sh/kernel/idle.c
@@ -88,7 +88,7 @@ void cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();

while (!need_resched()) {
check_pgt_cache();
diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
index c158a95..5c36632 100644
--- a/arch/sparc/kernel/process_64.c
+++ b/arch/sparc/kernel/process_64.c
@@ -95,12 +95,12 @@ void cpu_idle(void)
set_thread_flag(TIF_POLLING_NRFLAG);

while(1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();

while (!need_resched() && !cpu_is_offline(cpu))
sparc64_yield(cpu);

- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();

preempt_enable_no_resched();

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index 9c45d8b..cc1bd4f 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -85,7 +85,7 @@ void cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched()) {
if (cpu_is_offline(cpu))
BUG(); /* no HOTPLUG_CPU */
@@ -105,7 +105,7 @@ void cpu_idle(void)
local_irq_enable();
current_thread_info()->status |= TS_POLLING;
}
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index fab4371..f1b3864 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -245,10 +245,10 @@ void default_idle(void)
if (need_resched())
schedule();

- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
nsecs = disable_timer();
idle_sleep(nsecs);
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
}
}

diff --git a/arch/unicore32/kernel/process.c b/arch/unicore32/kernel/process.c
index ba401df..e2df91a 100644
--- a/arch/unicore32/kernel/process.c
+++ b/arch/unicore32/kernel/process.c
@@ -55,7 +55,7 @@ void cpu_idle(void)
{
/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched()) {
local_irq_disable();
stop_critical_timings();
@@ -63,7 +63,7 @@ void cpu_idle(void)
local_irq_enable();
start_critical_timings();
}
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 8d12878..41e7d1b 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -97,7 +97,7 @@ void cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched()) {

check_pgt_cache();
@@ -112,7 +112,7 @@ void cpu_idle(void)
pm_idle();
start_critical_timings();
}
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 6c9dd92..3fe0883 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -120,7 +120,7 @@ void cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched()) {

rmb();
@@ -145,7 +145,7 @@ void cpu_idle(void)
__exit_idle();
}

- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/include/linux/tick.h b/include/linux/tick.h
index b232ccc..ff31a71 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -121,14 +121,18 @@ static inline int tick_oneshot_mode_active(void) { return 0; }
#endif /* !CONFIG_GENERIC_CLOCKEVENTS */

# ifdef CONFIG_NO_HZ
-extern void tick_nohz_stop_sched_tick(int inidle);
+extern bool tick_nohz_stop_sched_tick(int inidle);
extern void tick_nohz_restart_sched_tick(void);
+extern void tick_nohz_enter_idle(void);
+extern void tick_nohz_exit_idle(void);
extern ktime_t tick_nohz_get_sleep_length(void);
extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
# else
-static inline void tick_nohz_stop_sched_tick(int inidle) { }
-static inline void tick_nohz_restart_sched_tick(void) { }
+static inline bool tick_nohz_stop_sched_tick(int inidle) { return false; }
+static inline void tick_nohz_restart_sched_tick(void) { }
+static inline void tick_nohz_enter_idle(void) { }
+static inline void tick_nohz_exit_idle(void) { }
static inline ktime_t tick_nohz_get_sleep_length(void)
{
ktime_t len = { .tv64 = NSEC_PER_SEC/HZ };
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index d5097c4..9437af2 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -254,12 +254,13 @@ EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
* Called either from the idle loop or from irq_exit() when an idle period was
* just interrupted by an interrupt which did not cause a reschedule.
*/
-void tick_nohz_stop_sched_tick(int inidle)
+bool tick_nohz_stop_sched_tick(int inidle)
{
unsigned long seq, last_jiffies, next_jiffies, delta_jiffies, flags;
struct tick_sched *ts;
ktime_t last_update, expires, now;
struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
+ bool stopped = false;
u64 time_delta;
int cpu;

@@ -409,7 +410,7 @@ void tick_nohz_stop_sched_tick(int inidle)
ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
ts->tick_stopped = 1;
ts->idle_jiffies = last_jiffies;
- rcu_enter_nohz();
+ stopped = true;
}

ts->idle_sleeps++;
@@ -450,6 +451,22 @@ out:
ts->sleep_length = ktime_sub(dev->next_event, now);
end:
local_irq_restore(flags);
+
+ return stopped;
+}
+
+
+/**
+ * tick_nohz_enter_idle - stop the tick and enter extended quiescent state
+ *
+ * Most archs want to enter the RCU extended quiescent state right after
+ * they switch to nohz mode. Beware though: no RCU read-side use can be
+ * done until we call tick_nohz_exit_idle().
+ */
+void tick_nohz_enter_idle(void)
+{
+ if (tick_nohz_stop_sched_tick(1))
+ rcu_enter_nohz();
}

/**
@@ -552,6 +569,19 @@ void tick_nohz_restart_sched_tick(void)
local_irq_enable();
}

+/**
+ * tick_nohz_exit_idle - restart the tick and exit extended quiescent state
+ */
+void tick_nohz_exit_idle(void)
+{
+ struct tick_sched *ts = &__raw_get_cpu_var(tick_cpu_sched);
+
+ if (ts->tick_stopped)
+ rcu_exit_nohz();
+
+ tick_nohz_restart_sched_tick();
+}
+
static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now)
{
hrtimer_forward(&ts->sched_timer, now, tick_period);
--
1.7.5.2

2011-06-06 03:11:21

by Frederic Weisbecker

Subject: [PATCH 3/4] x86: Don't call idle notifier inside rcu extended QS

The idle notifier called from enter_idle() / __exit_idle() runs
inside the RCU extended quiescent state.

This results in the following warning:

[ 2.081278] ===================================================
[ 2.081281] [ INFO: suspicious rcu_dereference_check() usage. ]
[ 2.081282] ---------------------------------------------------
[ 2.081284] kernel/notifier.c:81 invoked rcu_dereference_check() while in RCU extended quiescent state!
[ 2.081286]
[ 2.081287] other info that might help us debug this:
[ 2.081288]
[ 2.081289]
[ 2.081289] rcu_scheduler_active = 1, debug_locks = 0
[ 2.081292] 1 lock held by kworker/0:0/0:
[ 2.081293] #0: (rcu_read_lock){.+.+..}, at: [<ffffffff81785c60>] __atomic_notifier_call_chain+0x0/0xa0
[ 2.081301]
[ 2.081302] stack backtrace:
[ 2.081305] Pid: 0, comm: kworker/0:0 Not tainted 3.0.0-rc1+ #103
[ 2.081307] Call Trace:
[ 2.081313] [<ffffffff8108e0e4>] lockdep_rcu_dereference+0xd4/0x100
[ 2.081316] [<ffffffff81785c4a>] notifier_call_chain+0x13a/0x150
[ 2.081319] [<ffffffff81785cc7>] __atomic_notifier_call_chain+0x67/0xa0
[ 2.081322] [<ffffffff81785c60>] ? notifier_call_chain+0x150/0x150
[ 2.081325] [<ffffffff81785d11>] atomic_notifier_call_chain+0x11/0x20
[ 2.081329] [<ffffffff8100afa0>] enter_idle+0x20/0x30
[ 2.081332] [<ffffffff8100b05e>] cpu_idle+0xae/0x100
[ 2.081335] [<ffffffff8177951f>] start_secondary+0x1cf/0x1d6

Fix this by entering the RCU extended QS later, after calling the
idle entry notifier, and by exiting it earlier, so that the idle
exit notifier also runs outside the extended QS.
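
The resulting ordering in the x86-64 inner idle loop, sketched from the
hunk below:

	/*
	 * enter_idle();	idle-entry notifier: may use RCU, still outside QS
	 * rcu_enter_nohz();	enter the RCU extended quiescent state
	 * pm_idle();		low-power wait, no RCU read side allowed
	 * rcu_exit_nohz();	leave the extended QS
	 * __exit_idle();	idle-exit notifier: RCU is usable again
	 */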

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
---
arch/x86/kernel/process_64.c | 11 ++++++++---
1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3fe0883..a8bd222 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -120,7 +120,7 @@ void cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_enter_idle();
+ tick_nohz_stop_sched_tick(1);
while (!need_resched()) {

rmb();
@@ -134,18 +134,23 @@ void cpu_idle(void)
*/
local_irq_disable();
enter_idle();
+ /*
+ * We can't enter extended QS before due to idle
+ * notifier that uses RCU.
+ */
+ rcu_enter_nohz();
/* Don't trace irqs off for idle */
stop_critical_timings();
pm_idle();
start_critical_timings();
-
+ rcu_exit_nohz();
/* In many cases the interrupt that ended idle
has already called exit_idle. But some idle
loops can be woken up without interrupt. */
__exit_idle();
}

- tick_nohz_exit_idle();
+ tick_nohz_restart_sched_tick();
preempt_enable_no_resched();
schedule();
preempt_disable();
--
1.7.5.2

2011-06-06 03:11:42

by Frederic Weisbecker

Subject: [PATCH 4/4] x86: Call exit_idle() after irq_enter()

exit_idle() calls the idle notifier, which uses RCU. But it
is called before irq_enter(), thus before we have exited the
RCU extended quiescent state when the interrupt arrives in idle.

Fix this by calling the idle exit notifier after irq_enter().
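
The resulting interrupt-entry ordering, as a sketch (irq_enter() is what
pulls the CPU out of the extended QS, via rcu_irq_enter()):

	/*
	 * irq_enter();		exits the RCU extended QS if the CPU was idle
	 * exit_idle();		the idle-exit notifier can now safely use RCU
	 * ...			interrupt handler proper
	 * irq_exit();		may re-enter the extended QS via rcu_irq_exit()
	 */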

This fixes the following warning:

[ 0.559953] ===================================================
[ 0.559954] [ INFO: suspicious rcu_dereference_check() usage. ]
[ 0.559956] ---------------------------------------------------
[ 0.559958] kernel/notifier.c:81 invoked rcu_dereference_check() while in RCU extended quiescent state!
[ 0.559960]
[ 0.559961] other info that might help us debug this:
[ 0.559961]
[ 0.559962]
[ 0.559963] rcu_scheduler_active = 1, debug_locks = 1
[ 0.559965] 1 lock held by kworker/0:0/0:
[ 0.559966] #0: (rcu_read_lock){......}, at: [<ffffffff81785ce0>] __atomic_notifier_call_chain+0x0/0xa0
[ 0.559976]
[ 0.559977] stack backtrace:
[ 0.559980] Pid: 0, comm: kworker/0:0 Not tainted 3.0.0-rc1+ #108
[ 0.559981] Call Trace:
[ 0.559983] <IRQ> [<ffffffff8108e176>] lockdep_rcu_dereference+0xb6/0xf0
[ 0.559990] [<ffffffff81785cca>] notifier_call_chain+0x13a/0x150
[ 0.559993] [<ffffffff81785d47>] __atomic_notifier_call_chain+0x67/0xa0
[ 0.559996] [<ffffffff81785ce0>] ? notifier_call_chain+0x150/0x150
[ 0.560000] [<ffffffff812a454e>] ? do_raw_spin_unlock+0x5e/0xb0
[ 0.560000] [<ffffffff81785d91>] atomic_notifier_call_chain+0x11/0x20
[ 0.560000] [<ffffffff8100af73>] exit_idle+0x43/0x50
[ 0.560000] [<ffffffff81027869>] smp_apic_timer_interrupt+0x39/0xa0
[ 0.560000] [<ffffffff81789e13>] apic_timer_interrupt+0x13/0x20
[ 0.560000] <EOI> [<ffffffff810301e6>] ? native_safe_halt+0x6/0x10
[ 0.560000] [<ffffffff8108fecd>] ? trace_hardirqs_on+0xd/0x10
[ 0.560000] [<ffffffff810130ac>] default_idle+0x2c/0x50
[ 0.560000] [<ffffffff810131c8>] amd_e400_idle+0x58/0x130
[ 0.560000] [<ffffffff8100b069>] cpu_idle+0xb9/0x110
[ 0.560000] [<ffffffff817795af>] start_secondary+0x1cf/0x1d6

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
---
arch/x86/kernel/apic/apic.c | 6 +++---
arch/x86/kernel/apic/io_apic.c | 3 ++-
arch/x86/kernel/cpu/mcheck/mce.c | 2 +-
arch/x86/kernel/cpu/mcheck/therm_throt.c | 2 +-
arch/x86/kernel/cpu/mcheck/threshold.c | 2 +-
arch/x86/kernel/irq.c | 5 ++---
6 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index b961af8..73b61ff 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -855,8 +855,8 @@ void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
* Besides, if we don't timer interrupts ignore the global
* interrupt lock, which is the WrongThing (tm) to do.
*/
- exit_idle();
irq_enter();
+ exit_idle();
local_apic_timer_interrupt();
irq_exit();

@@ -1789,8 +1789,8 @@ void smp_spurious_interrupt(struct pt_regs *regs)
{
u32 v;

- exit_idle();
irq_enter();
+ exit_idle();
/*
* Check if this really is a spurious interrupt and ACK it
* if it is a vectored one. Just in case...
@@ -1826,8 +1826,8 @@ void smp_error_interrupt(struct pt_regs *regs)
"Illegal register address", /* APIC Error Bit 7 */
};

- exit_idle();
irq_enter();
+ exit_idle();
/* First tickle the hardware, only then report what went on. -- REW */
v0 = apic_read(APIC_ESR);
apic_write(APIC_ESR, 0);
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index e529339..f60bc30 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2275,8 +2275,9 @@ asmlinkage void smp_irq_move_cleanup_interrupt(void)
unsigned vector, me;

ack_APIC_irq();
- exit_idle();
+
irq_enter();
+ exit_idle();

me = smp_processor_id();
for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index ff1ae9b..7ba9757 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -470,8 +470,8 @@ static inline void mce_get_rip(struct mce *m, struct pt_regs *regs)
asmlinkage void smp_mce_self_interrupt(struct pt_regs *regs)
{
ack_APIC_irq();
- exit_idle();
irq_enter();
+ exit_idle();
mce_notify_irq();
mce_schedule_work();
irq_exit();
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 27c6251..f6bbc64 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -396,8 +396,8 @@ static void (*smp_thermal_vector)(void) = unexpected_thermal_interrupt;

asmlinkage void smp_thermal_interrupt(struct pt_regs *regs)
{
- exit_idle();
irq_enter();
+ exit_idle();
inc_irq_stat(irq_thermal_count);
smp_thermal_vector();
irq_exit();
diff --git a/arch/x86/kernel/cpu/mcheck/threshold.c b/arch/x86/kernel/cpu/mcheck/threshold.c
index d746df2..aa578ca 100644
--- a/arch/x86/kernel/cpu/mcheck/threshold.c
+++ b/arch/x86/kernel/cpu/mcheck/threshold.c
@@ -19,8 +19,8 @@ void (*mce_threshold_vector)(void) = default_threshold_interrupt;

asmlinkage void smp_threshold_interrupt(void)
{
- exit_idle();
irq_enter();
+ exit_idle();
inc_irq_stat(irq_threshold_count);
mce_threshold_vector();
irq_exit();
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 6c0802e..b01c5ed 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -180,8 +180,8 @@ unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
unsigned vector = ~regs->orig_ax;
unsigned irq;

- exit_idle();
irq_enter();
+ exit_idle();

irq = __this_cpu_read(vector_irq[vector]);

@@ -208,9 +208,8 @@ void smp_x86_platform_ipi(struct pt_regs *regs)

ack_APIC_irq();

- exit_idle();
-
irq_enter();
+ exit_idle();

inc_irq_stat(x86_platform_ipis);

--
1.7.5.2

2011-06-06 03:20:50

by Frederic Weisbecker

Subject: [PATCH 2/4 v2] nohz: Split extended quiescent state handling from nohz switch

It is assumed that RCU won't be used once we switch to tickless
mode, until we restart the tick. However this is not always
true, as on x86-64 where we dereference the idle notifiers after
the tick is stopped.

To prepare for fixing this, split the tickless mode switching
from the RCU extended quiescent state logic.
Make tick_nohz_stop/restart_sched_tick() RCU-agnostic but provide
a new pair of APIs, tick_nohz_enter/exit_idle(), that keep the
old behaviour by handling both the nohz mode and the RCU extended
quiescent state, then convert every arch to use these.

Archs that want to enter the extended QS at some custom point
can do it later by using tick_nohz_stop_sched_tick() and
rcu_enter_nohz() separately.

v2: Remove rcu_exit_nohz() from tick_nohz_restart_sched_tick()
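
With v2 the tick helpers are fully RCU-agnostic and the extended-QS
transitions live only in the new wrappers. A minimal sketch of the
resulting contract:

	tick_nohz_enter_idle();	/* stop tick; rcu_enter_nohz() iff it stopped */
	/* ... idle: no RCU read-side use allowed in this window ... */
	tick_nohz_exit_idle();	/* rcu_exit_nohz() iff tick stopped; restart it */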

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Russell King <[email protected]>
Cc: Hans-Christian Egtvedt <[email protected]>
Cc: Mike Frysinger <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: David Miller <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Guan Xuetao <[email protected]>
---
arch/arm/kernel/process.c | 4 +-
arch/avr32/kernel/process.c | 4 +-
arch/blackfin/kernel/process.c | 4 +-
arch/microblaze/kernel/process.c | 4 +-
arch/mips/kernel/process.c | 4 +-
arch/powerpc/kernel/idle.c | 4 +-
arch/powerpc/platforms/iseries/setup.c | 8 +++---
arch/s390/kernel/process.c | 4 +-
arch/sh/kernel/idle.c | 2 +-
arch/sparc/kernel/process_64.c | 4 +-
arch/tile/kernel/process.c | 4 +-
arch/um/kernel/process.c | 4 +-
arch/unicore32/kernel/process.c | 4 +-
arch/x86/kernel/process_32.c | 4 +-
arch/x86/kernel/process_64.c | 4 +-
include/linux/tick.h | 10 ++++++--
kernel/time/tick-sched.c | 36 ++++++++++++++++++++++++++++---
17 files changed, 70 insertions(+), 38 deletions(-)

diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 5e1e541..27b68b0 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -182,7 +182,7 @@ void cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
leds_event(led_idle_start);
while (!need_resched()) {
#ifdef CONFIG_HOTPLUG_CPU
@@ -208,7 +208,7 @@ void cpu_idle(void)
}
}
leds_event(led_idle_end);
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/avr32/kernel/process.c b/arch/avr32/kernel/process.c
index ef5a2a0..e683a34 100644
--- a/arch/avr32/kernel/process.c
+++ b/arch/avr32/kernel/process.c
@@ -34,10 +34,10 @@ void cpu_idle(void)
{
/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched())
cpu_idle_sleep();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/blackfin/kernel/process.c b/arch/blackfin/kernel/process.c
index 6a660fa..8082a8f 100644
--- a/arch/blackfin/kernel/process.c
+++ b/arch/blackfin/kernel/process.c
@@ -88,10 +88,10 @@ void cpu_idle(void)
#endif
if (!idle)
idle = default_idle;
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched())
idle();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
index 968648a..1b295b2 100644
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -103,10 +103,10 @@ void cpu_idle(void)
if (!idle)
idle = default_idle;

- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched())
idle();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();

preempt_enable_no_resched();
schedule();
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index d2112d3..cdbfa52 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -56,7 +56,7 @@ void __noreturn cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched() && cpu_online(cpu)) {
#ifdef CONFIG_MIPS_MT_SMTC
extern void smtc_idle_loop_hook(void);
@@ -77,7 +77,7 @@ void __noreturn cpu_idle(void)
system_state == SYSTEM_BOOTING))
play_dead();
#endif
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c
index 39a2baa..1108260 100644
--- a/arch/powerpc/kernel/idle.c
+++ b/arch/powerpc/kernel/idle.c
@@ -56,7 +56,7 @@ void cpu_idle(void)

set_thread_flag(TIF_POLLING_NRFLAG);
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched() && !cpu_should_die()) {
ppc64_runlatch_off();

@@ -93,7 +93,7 @@ void cpu_idle(void)

HMT_medium();
ppc64_runlatch_on();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
if (cpu_should_die())
cpu_die();
diff --git a/arch/powerpc/platforms/iseries/setup.c b/arch/powerpc/platforms/iseries/setup.c
index c25a081..d40dcd9 100644
--- a/arch/powerpc/platforms/iseries/setup.c
+++ b/arch/powerpc/platforms/iseries/setup.c
@@ -562,7 +562,7 @@ static void yield_shared_processor(void)
static void iseries_shared_idle(void)
{
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched() && !hvlpevent_is_pending()) {
local_irq_disable();
ppc64_runlatch_off();
@@ -576,7 +576,7 @@ static void iseries_shared_idle(void)
}

ppc64_runlatch_on();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();

if (hvlpevent_is_pending())
process_iSeries_events();
@@ -592,7 +592,7 @@ static void iseries_dedicated_idle(void)
set_thread_flag(TIF_POLLING_NRFLAG);

while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
if (!need_resched()) {
while (!need_resched()) {
ppc64_runlatch_off();
@@ -609,7 +609,7 @@ static void iseries_dedicated_idle(void)
}

ppc64_runlatch_on();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 541a750..560cd94 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -90,10 +90,10 @@ static void default_idle(void)
void cpu_idle(void)
{
for (;;) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched())
default_idle();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/sh/kernel/idle.c b/arch/sh/kernel/idle.c
index 425d604..3957972 100644
--- a/arch/sh/kernel/idle.c
+++ b/arch/sh/kernel/idle.c
@@ -88,7 +88,7 @@ void cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();

while (!need_resched()) {
check_pgt_cache();
diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
index c158a95..5c36632 100644
--- a/arch/sparc/kernel/process_64.c
+++ b/arch/sparc/kernel/process_64.c
@@ -95,12 +95,12 @@ void cpu_idle(void)
set_thread_flag(TIF_POLLING_NRFLAG);

while(1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();

while (!need_resched() && !cpu_is_offline(cpu))
sparc64_yield(cpu);

- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();

preempt_enable_no_resched();

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index 9c45d8b..cc1bd4f 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -85,7 +85,7 @@ void cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched()) {
if (cpu_is_offline(cpu))
BUG(); /* no HOTPLUG_CPU */
@@ -105,7 +105,7 @@ void cpu_idle(void)
local_irq_enable();
current_thread_info()->status |= TS_POLLING;
}
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index fab4371..f1b3864 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -245,10 +245,10 @@ void default_idle(void)
if (need_resched())
schedule();

- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
nsecs = disable_timer();
idle_sleep(nsecs);
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
}
}

diff --git a/arch/unicore32/kernel/process.c b/arch/unicore32/kernel/process.c
index ba401df..e2df91a 100644
--- a/arch/unicore32/kernel/process.c
+++ b/arch/unicore32/kernel/process.c
@@ -55,7 +55,7 @@ void cpu_idle(void)
{
/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched()) {
local_irq_disable();
stop_critical_timings();
@@ -63,7 +63,7 @@ void cpu_idle(void)
local_irq_enable();
start_critical_timings();
}
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 8d12878..41e7d1b 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -97,7 +97,7 @@ void cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched()) {

check_pgt_cache();
@@ -112,7 +112,7 @@ void cpu_idle(void)
pm_idle();
start_critical_timings();
}
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 6c9dd92..3fe0883 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -120,7 +120,7 @@ void cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched()) {

rmb();
@@ -145,7 +145,7 @@ void cpu_idle(void)
__exit_idle();
}

- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/include/linux/tick.h b/include/linux/tick.h
index b232ccc..ff31a71 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -121,14 +121,18 @@ static inline int tick_oneshot_mode_active(void) { return 0; }
#endif /* !CONFIG_GENERIC_CLOCKEVENTS */

# ifdef CONFIG_NO_HZ
-extern void tick_nohz_stop_sched_tick(int inidle);
+extern bool tick_nohz_stop_sched_tick(int inidle);
extern void tick_nohz_restart_sched_tick(void);
+extern void tick_nohz_enter_idle(void);
+extern void tick_nohz_exit_idle(void);
extern ktime_t tick_nohz_get_sleep_length(void);
extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
# else
-static inline void tick_nohz_stop_sched_tick(int inidle) { }
-static inline void tick_nohz_restart_sched_tick(void) { }
+static inline bool tick_nohz_stop_sched_tick(int inidle) { return false; }
+static inline void tick_nohz_restart_sched_tick(void) { }
+static inline void tick_nohz_enter_idle(void) { }
+static inline void tick_nohz_exit_idle(void) { }
static inline ktime_t tick_nohz_get_sleep_length(void)
{
ktime_t len = { .tv64 = NSEC_PER_SEC/HZ };
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index d5097c4..8e81e9f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -254,12 +254,13 @@ EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
* Called either from the idle loop or from irq_exit() when an idle period was
* just interrupted by an interrupt which did not cause a reschedule.
*/
-void tick_nohz_stop_sched_tick(int inidle)
+bool tick_nohz_stop_sched_tick(int inidle)
{
unsigned long seq, last_jiffies, next_jiffies, delta_jiffies, flags;
struct tick_sched *ts;
ktime_t last_update, expires, now;
struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
+ bool stopped = false;
u64 time_delta;
int cpu;

@@ -409,7 +410,7 @@ void tick_nohz_stop_sched_tick(int inidle)
ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
ts->tick_stopped = 1;
ts->idle_jiffies = last_jiffies;
- rcu_enter_nohz();
+ stopped = true;
}

ts->idle_sleeps++;
@@ -450,6 +451,22 @@ out:
ts->sleep_length = ktime_sub(dev->next_event, now);
end:
local_irq_restore(flags);
+
+ return stopped;
+}
+
+
+/**
+ * tick_nohz_enter_idle - stop the tick and enter extended quiescent state
+ *
+ * Most archs want to enter the RCU extended quiescent state right after
+ * they switch to nohz mode. Beware though: no RCU read-side use can be
+ * done until we call tick_nohz_exit_idle().
+ */
+void tick_nohz_enter_idle(void)
+{
+ if (tick_nohz_stop_sched_tick(1))
+ rcu_enter_nohz();
}

/**
@@ -519,8 +536,6 @@ void tick_nohz_restart_sched_tick(void)

ts->inidle = 0;

- rcu_exit_nohz();
-
/* Update jiffies first */
select_nohz_load_balancer(0);
tick_do_update_jiffies64(now);
@@ -552,6 +567,19 @@ void tick_nohz_restart_sched_tick(void)
local_irq_enable();
}

+/**
+ * tick_nohz_exit_idle - restart the tick and exit extended quiescent state
+ */
+void tick_nohz_exit_idle(void)
+{
+ struct tick_sched *ts = &__raw_get_cpu_var(tick_cpu_sched);
+
+ if (ts->tick_stopped)
+ rcu_exit_nohz();
+
+ tick_nohz_restart_sched_tick();
+}
+
static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now)
{
hrtimer_forward(&ts->sched_timer, now, tick_period);
--
1.7.5.4

2011-06-06 03:21:12

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 2/4] nohz: Split extended quiescent state handling from nohz switch

It is assumed that rcu won't be used once we switch to tickless
mode and until we restart the tick. However this is not always
true, as in x86-64 where we dereference the idle notifiers after
the tick is stopped.

To prepare for fixing this, split the tickless mode switching and
RCU extended quiescent state logics.
Make tick_nohz_stop/restart_sched_tick() RCU agnostic but provide
a new pair of APIs tick_nohz_enter/exit_idle() that keep the
old behaviour by handling both the nohz mode and RCU extended
quiescent states, then convert every archs to use these.

Archs that want to switch to extended QS to some custom points
can do it later by using tick_nohz_stop_sched_tick() and
rcu_enter_nohz() seperately.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Russell King <[email protected]>
Cc: Hans-Christian Egtvedt <[email protected]>
Cc: Mike Frysinger <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: David Miller <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Guan Xuetao <[email protected]>
---
arch/arm/kernel/process.c | 4 +-
arch/avr32/kernel/process.c | 4 +-
arch/blackfin/kernel/process.c | 4 +-
arch/microblaze/kernel/process.c | 4 +-
arch/mips/kernel/process.c | 4 +-
arch/powerpc/kernel/idle.c | 4 +-
arch/powerpc/platforms/iseries/setup.c | 8 +++---
arch/s390/kernel/process.c | 4 +-
arch/sh/kernel/idle.c | 2 +-
arch/sparc/kernel/process_64.c | 4 +-
arch/tile/kernel/process.c | 4 +-
arch/um/kernel/process.c | 4 +-
arch/unicore32/kernel/process.c | 4 +-
arch/x86/kernel/process_32.c | 4 +-
arch/x86/kernel/process_64.c | 4 +-
include/linux/tick.h | 10 ++++++--
kernel/time/tick-sched.c | 36 ++++++++++++++++++++++++++++---
17 files changed, 70 insertions(+), 38 deletions(-)

diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 5e1e541..27b68b0 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -182,7 +182,7 @@ void cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
leds_event(led_idle_start);
while (!need_resched()) {
#ifdef CONFIG_HOTPLUG_CPU
@@ -208,7 +208,7 @@ void cpu_idle(void)
}
}
leds_event(led_idle_end);
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/avr32/kernel/process.c b/arch/avr32/kernel/process.c
index ef5a2a0..e683a34 100644
--- a/arch/avr32/kernel/process.c
+++ b/arch/avr32/kernel/process.c
@@ -34,10 +34,10 @@ void cpu_idle(void)
{
/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched())
cpu_idle_sleep();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/blackfin/kernel/process.c b/arch/blackfin/kernel/process.c
index 6a660fa..8082a8f 100644
--- a/arch/blackfin/kernel/process.c
+++ b/arch/blackfin/kernel/process.c
@@ -88,10 +88,10 @@ void cpu_idle(void)
#endif
if (!idle)
idle = default_idle;
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched())
idle();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
index 968648a..1b295b2 100644
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -103,10 +103,10 @@ void cpu_idle(void)
if (!idle)
idle = default_idle;

- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched())
idle();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();

preempt_enable_no_resched();
schedule();
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index d2112d3..cdbfa52 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -56,7 +56,7 @@ void __noreturn cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched() && cpu_online(cpu)) {
#ifdef CONFIG_MIPS_MT_SMTC
extern void smtc_idle_loop_hook(void);
@@ -77,7 +77,7 @@ void __noreturn cpu_idle(void)
system_state == SYSTEM_BOOTING))
play_dead();
#endif
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c
index 39a2baa..1108260 100644
--- a/arch/powerpc/kernel/idle.c
+++ b/arch/powerpc/kernel/idle.c
@@ -56,7 +56,7 @@ void cpu_idle(void)

set_thread_flag(TIF_POLLING_NRFLAG);
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched() && !cpu_should_die()) {
ppc64_runlatch_off();

@@ -93,7 +93,7 @@ void cpu_idle(void)

HMT_medium();
ppc64_runlatch_on();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
if (cpu_should_die())
cpu_die();
diff --git a/arch/powerpc/platforms/iseries/setup.c b/arch/powerpc/platforms/iseries/setup.c
index c25a081..d40dcd9 100644
--- a/arch/powerpc/platforms/iseries/setup.c
+++ b/arch/powerpc/platforms/iseries/setup.c
@@ -562,7 +562,7 @@ static void yield_shared_processor(void)
static void iseries_shared_idle(void)
{
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched() && !hvlpevent_is_pending()) {
local_irq_disable();
ppc64_runlatch_off();
@@ -576,7 +576,7 @@ static void iseries_shared_idle(void)
}

ppc64_runlatch_on();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();

if (hvlpevent_is_pending())
process_iSeries_events();
@@ -592,7 +592,7 @@ static void iseries_dedicated_idle(void)
set_thread_flag(TIF_POLLING_NRFLAG);

while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
if (!need_resched()) {
while (!need_resched()) {
ppc64_runlatch_off();
@@ -609,7 +609,7 @@ static void iseries_dedicated_idle(void)
}

ppc64_runlatch_on();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 541a750..560cd94 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -90,10 +90,10 @@ static void default_idle(void)
void cpu_idle(void)
{
for (;;) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched())
default_idle();
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/sh/kernel/idle.c b/arch/sh/kernel/idle.c
index 425d604..3957972 100644
--- a/arch/sh/kernel/idle.c
+++ b/arch/sh/kernel/idle.c
@@ -88,7 +88,7 @@ void cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();

while (!need_resched()) {
check_pgt_cache();
diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
index c158a95..5c36632 100644
--- a/arch/sparc/kernel/process_64.c
+++ b/arch/sparc/kernel/process_64.c
@@ -95,12 +95,12 @@ void cpu_idle(void)
set_thread_flag(TIF_POLLING_NRFLAG);

while(1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();

while (!need_resched() && !cpu_is_offline(cpu))
sparc64_yield(cpu);

- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();

preempt_enable_no_resched();

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index 9c45d8b..cc1bd4f 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -85,7 +85,7 @@ void cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched()) {
if (cpu_is_offline(cpu))
BUG(); /* no HOTPLUG_CPU */
@@ -105,7 +105,7 @@ void cpu_idle(void)
local_irq_enable();
current_thread_info()->status |= TS_POLLING;
}
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index fab4371..f1b3864 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -245,10 +245,10 @@ void default_idle(void)
if (need_resched())
schedule();

- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
nsecs = disable_timer();
idle_sleep(nsecs);
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
}
}

diff --git a/arch/unicore32/kernel/process.c b/arch/unicore32/kernel/process.c
index ba401df..e2df91a 100644
--- a/arch/unicore32/kernel/process.c
+++ b/arch/unicore32/kernel/process.c
@@ -55,7 +55,7 @@ void cpu_idle(void)
{
/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched()) {
local_irq_disable();
stop_critical_timings();
@@ -63,7 +63,7 @@ void cpu_idle(void)
local_irq_enable();
start_critical_timings();
}
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 8d12878..41e7d1b 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -97,7 +97,7 @@ void cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched()) {

check_pgt_cache();
@@ -112,7 +112,7 @@ void cpu_idle(void)
pm_idle();
start_critical_timings();
}
- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 6c9dd92..3fe0883 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -120,7 +120,7 @@ void cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick(1);
+ tick_nohz_enter_idle();
while (!need_resched()) {

rmb();
@@ -145,7 +145,7 @@ void cpu_idle(void)
__exit_idle();
}

- tick_nohz_restart_sched_tick();
+ tick_nohz_exit_idle();
preempt_enable_no_resched();
schedule();
preempt_disable();
diff --git a/include/linux/tick.h b/include/linux/tick.h
index b232ccc..ff31a71 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -121,14 +121,18 @@ static inline int tick_oneshot_mode_active(void) { return 0; }
#endif /* !CONFIG_GENERIC_CLOCKEVENTS */

# ifdef CONFIG_NO_HZ
-extern void tick_nohz_stop_sched_tick(int inidle);
+extern bool tick_nohz_stop_sched_tick(int inidle);
extern void tick_nohz_restart_sched_tick(void);
+extern void tick_nohz_enter_idle(void);
+extern void tick_nohz_exit_idle(void);
extern ktime_t tick_nohz_get_sleep_length(void);
extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
# else
-static inline void tick_nohz_stop_sched_tick(int inidle) { }
-static inline void tick_nohz_restart_sched_tick(void) { }
+static inline bool tick_nohz_stop_sched_tick(int inidle) { return false; }
+static inline void tick_nohz_restart_sched_tick(void) { }
+static inline void tick_nohz_enter_idle(void) { }
+static inline void tick_nohz_exit_idle(void) { }
static inline ktime_t tick_nohz_get_sleep_length(void)
{
ktime_t len = { .tv64 = NSEC_PER_SEC/HZ };
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index d5097c4..8e81e9f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -254,12 +254,13 @@ EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
* Called either from the idle loop or from irq_exit() when an idle period was
* just interrupted by an interrupt which did not cause a reschedule.
*/
-void tick_nohz_stop_sched_tick(int inidle)
+bool tick_nohz_stop_sched_tick(int inidle)
{
unsigned long seq, last_jiffies, next_jiffies, delta_jiffies, flags;
struct tick_sched *ts;
ktime_t last_update, expires, now;
struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
+ bool stopped = false;
u64 time_delta;
int cpu;

@@ -409,7 +410,7 @@ void tick_nohz_stop_sched_tick(int inidle)
ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
ts->tick_stopped = 1;
ts->idle_jiffies = last_jiffies;
- rcu_enter_nohz();
+ stopped = true;
}

ts->idle_sleeps++;
@@ -450,6 +451,22 @@ out:
ts->sleep_length = ktime_sub(dev->next_event, now);
end:
local_irq_restore(flags);
+
+ return stopped;
+}
+
+
+/**
+ * tick_nohz_enter_idle - stop the tick and enter extended quiescent state
+ *
+ * Most archs may want to enter the RCU extended quiescent state right after
+ * they switch to nohz mode. Beware though: no RCU read side use can be done
+ * until we call tick_nohz_exit_idle().
+ */
+void tick_nohz_enter_idle(void)
+{
+ if (tick_nohz_stop_sched_tick(1))
+ rcu_enter_nohz();
}

/**
@@ -519,8 +536,6 @@ void tick_nohz_restart_sched_tick(void)

ts->inidle = 0;

- rcu_exit_nohz();
-
/* Update jiffies first */
select_nohz_load_balancer(0);
tick_do_update_jiffies64(now);
@@ -552,6 +567,19 @@ void tick_nohz_restart_sched_tick(void)
local_irq_enable();
}

+/**
+ * tick_nohz_exit_idle - restart the tick and exit extended quiescent state
+ */
+void tick_nohz_exit_idle(void)
+{
+ struct tick_sched *ts = &__raw_get_cpu_var(tick_cpu_sched);
+
+ if (ts->tick_stopped)
+ rcu_exit_nohz();
+
+ tick_nohz_restart_sched_tick();
+}
+
static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now)
{
hrtimer_forward(&ts->sched_timer, now, tick_period);
--
1.7.5.4

2011-06-06 03:44:36

by Milton Miller

[permalink] [raw]
Subject: Re: [1/4] rcu: Detect uses of rcu read side in extended quiescent states

On Mon, 06 Jun 2011 about 03:10:55 -0000, Frederic Weisbecker wrote:
> @@ -3994,8 +3995,13 @@ void lockdep_rcu_dereference(const char *file, const int line)
> printk("\n===================================================\n");
> printk( "[ INFO: suspicious rcu_dereference_check() usage. ]\n");
> printk( "---------------------------------------------------\n");
> - printk("%s:%d invoked rcu_dereference_check() without protection!\n",
> - file, line);
> + printk("%s:%d invoked rcu_dereference_check() ", file, line);
> +
> + if (type == RCU_WARN_UNPROTECTED)
> + printk("without protection!\n");
> + else if (type == RCU_WARN_EXT_QS)
> + printk("while in RCU extended quiescent state!\n");
> +
> printk("\nother info that might help us debug this:\n\n");
> printk("\nrcu_scheduler_active = %d, debug_locks = %d\n", rcu_scheduler_active, debug_locks);
> lockdep_print_held_locks(curr);

Can we keep the above in one printk? That way the printing is
guaranteed to come out on one line. Probably the easiest way would
be to add a char *why = "", then assign a string based on the current
conditions. Do all of that before the first printk, which gets a %s
added.
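
For example, something along these lines (untested sketch, reusing the
RCU_WARN_* constants from your patch):

	const char *why = "";

	if (type == RCU_WARN_UNPROTECTED)
		why = "without protection!";
	else if (type == RCU_WARN_EXT_QS)
		why = "while in RCU extended quiescent state!";

	printk("%s:%d invoked rcu_dereference_check() %s\n",
	       file, line, why);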

Thanks,
milton

2011-06-06 04:01:32

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 2/4] nohz: Split extended quiescent state handling from nohz switch

From: Frederic Weisbecker <[email protected]>
Date: Mon, 6 Jun 2011 05:10:56 +0200

> It is assumed that rcu won't be used once we switch to tickless
> mode and until we restart the tick. However this is not always
> true, as in x86-64 where we dereference the idle notifiers after
> the tick is stopped.
>
> To prepare for fixing this, split the tickless mode switching and
> RCU extended quiescent state logics.
> Make tick_nohz_stop/restart_sched_tick() RCU agnostic but provide
> a new pair of APIs tick_nohz_enter/exit_idle() that keep the
> old behaviour by handling both the nohz mode and RCU extended
> quiescent states, then convert every arch to use these.
>
> Archs that want to switch to the extended QS at some custom points
> can do it later by using tick_nohz_stop_sched_tick() and
> rcu_enter_nohz() separately.
>
> Signed-off-by: Frederic Weisbecker <[email protected]>

Acked-by: David S. Miller <[email protected]>

2011-06-06 15:18:14

by Hans-Christian Egtvedt

[permalink] [raw]
Subject: Re: [PATCH 2/4 v2] nohz: Split extended quiescent state handling from nohz switch

On Mon, 2011-06-06 at 05:20 +0200, Frederic Weisbecker wrote:
> It is assumed that rcu won't be used once we switch to tickless
> mode and until we restart the tick. However this is not always
> true, as in x86-64 where we dereference the idle notifiers after
> the tick is stopped.
>
> To prepare for fixing this, split the tickless mode switching and
> RCU extended quiescent state logics.
> Make tick_nohz_stop/restart_sched_tick() RCU agnostic but provide
> a new pair of APIs tick_nohz_enter/exit_idle() that keep the
> old behaviour by handling both the nohz mode and RCU extended
> quiescent states, then convert every arch to use these.
>
> Archs that want to switch to the extended QS at some custom points
> can do it later by using tick_nohz_stop_sched_tick() and
> rcu_enter_nohz() separately.
>
> v2: Remove rcu_exit_nohz() from tick_nohz_restart_sched_tick()
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Russell King <[email protected]>
> Cc: Hans-Christian Egtvedt <[email protected]>
> Cc: Mike Frysinger <[email protected]>
> Cc: Ralf Baechle <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Heiko Carstens <[email protected]>
> Cc: Paul Mundt <[email protected]>
> Cc: David Miller <[email protected]>
> Cc: Chris Metcalf <[email protected]>
> Cc: Guan Xuetao <[email protected]>

<snipp>

> arch/avr32/kernel/process.c | 4 +-

For the AVR32 related bits

Acked-by: Hans-Christian Egtvedt <[email protected]>

<snipp>

--
Hans-Christian Egtvedt

2011-06-06 15:25:35

by Ralf Baechle

[permalink] [raw]
Subject: Re: [PATCH 2/4 v2] nohz: Split extended quiescent state handling from nohz switch

On Mon, Jun 06, 2011 at 05:20:36AM +0200, Frederic Weisbecker wrote:

Acked-by: Ralf Baechle <[email protected]>

Thanks,

Ralf

2011-06-06 18:10:28

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [1/4] rcu: Detect uses of rcu read side in extended quiescent states

On Sun, Jun 05, 2011 at 10:44:33PM -0500, Milton Miller wrote:
> On Mon, 06 Jun 2011 about 03:10:55 -0000, Frederic Weisbecker wrote:
> > @@ -3994,8 +3995,13 @@ void lockdep_rcu_dereference(const char *file, const int line)
> > printk("\n===================================================\n");
> > printk( "[ INFO: suspicious rcu_dereference_check() usage. ]\n");
> > printk( "---------------------------------------------------\n");
> > - printk("%s:%d invoked rcu_dereference_check() without protection!\n",
> > - file, line);
> > + printk("%s:%d invoked rcu_dereference_check() ", file, line);
> > +
> > + if (type == RCU_WARN_UNPROTECTED)
> > + printk("without protection!\n");
> > + else if (type == RCU_WARN_EXT_QS)
> > + printk("while in RCU extended quiescent state!\n");
> > +
> > printk("\nother info that might help us debug this:\n\n");
> > printk("\nrcu_scheduler_active = %d, debug_locks = %d\n", rcu_scheduler_active, debug_locks);
> > lockdep_print_held_locks(curr);
>
> Can we keep the above in one printk? That way the printing is
> guaranteed to come out on one line. Probably the easiest way would
> be to add a char *why = "", then assign a string based on the current
> conditions. Do all of that before the first printk, which gets a %s
> added.

I have the following queued in the -rcu tree, which does add the string.

Frederic, would it be possible to base on this patch?

Thanx, Paul

------------------------------------------------------------------------

commit c15d76f26712bd5228aa0c6af7a7e7c492a812c9
Author: Paul E. McKenney <[email protected]>
Date: Tue May 24 08:31:09 2011 -0700

rcu: Restore checks for blocking in RCU read-side critical sections

Long ago, using TREE_RCU with PREEMPT would result in "scheduling
while atomic" diagnostics if you blocked in an RCU read-side critical
section. However, PREEMPT now implies TREE_PREEMPT_RCU, which defeats
this diagnostic. This commit therefore adds a replacement diagnostic
based on PROVE_RCU.

Because rcu_lockdep_assert() and lockdep_rcu_dereference() are now being
used for things that have nothing to do with rcu_dereference(), rename
lockdep_rcu_dereference() to lockdep_rcu_suspicious() and add a third
argument that is a string indicating what is suspicious. This third
argument is passed in from a new third argument to rcu_lockdep_assert().
Update all calls to rcu_lockdep_assert() to add an informative third
argument.

Finally, add a pair of rcu_lockdep_assert() calls from within
rcu_note_context_switch(), one complaining if a context switch occurs
in an RCU-bh read-side critical section and another complaining if a
context switch occurs in an RCU-sched read-side critical section.
These are present only if the PROVE_RCU kernel parameter is enabled.

Again, you must enable PROVE_RCU to see these new diagnostics. But you
are enabling PROVE_RCU to check out new RCU uses in any case, aren't you?

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 4aef1dd..bfa66e7 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -545,7 +545,7 @@ do { \
#endif

#ifdef CONFIG_PROVE_RCU
-extern void lockdep_rcu_dereference(const char *file, const int line);
+void lockdep_rcu_suspicious(const char *file, const int line, const char *s);
#endif

#endif /* __LINUX_LOCKDEP_H */
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 58b13f1..fb2933d 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -297,19 +297,20 @@ extern int rcu_my_thread_group_empty(void);
/**
* rcu_lockdep_assert - emit lockdep splat if specified condition not met
* @c: condition to check
+ * @s: informative message
*/
-#define rcu_lockdep_assert(c) \
+#define rcu_lockdep_assert(c, s) \
do { \
static bool __warned; \
if (debug_lockdep_rcu_enabled() && !__warned && !(c)) { \
__warned = true; \
- lockdep_rcu_dereference(__FILE__, __LINE__); \
+ lockdep_rcu_suspicious(__FILE__, __LINE__, s); \
} \
} while (0)

#else /* #ifdef CONFIG_PROVE_RCU */

-#define rcu_lockdep_assert(c) do { } while (0)
+#define rcu_lockdep_assert(c, s) do { } while (0)

#endif /* #else #ifdef CONFIG_PROVE_RCU */

@@ -338,14 +339,16 @@ extern int rcu_my_thread_group_empty(void);
#define __rcu_dereference_check(p, c, space) \
({ \
typeof(*p) *_________p1 = (typeof(*p)*__force )ACCESS_ONCE(p); \
- rcu_lockdep_assert(c); \
+ rcu_lockdep_assert(c, "suspicious rcu_dereference_check()" \
+ " usage"); \
rcu_dereference_sparse(p, space); \
smp_read_barrier_depends(); \
((typeof(*p) __force __kernel *)(_________p1)); \
})
#define __rcu_dereference_protected(p, c, space) \
({ \
- rcu_lockdep_assert(c); \
+ rcu_lockdep_assert(c, "suspicious rcu_dereference_protected()" \
+ " usage"); \
rcu_dereference_sparse(p, space); \
((typeof(*p) __force __kernel *)(p)); \
})
@@ -359,7 +362,9 @@ extern int rcu_my_thread_group_empty(void);
#define __rcu_dereference_index_check(p, c) \
({ \
typeof(p) _________p1 = ACCESS_ONCE(p); \
- rcu_lockdep_assert(c); \
+ rcu_lockdep_assert(c, \
+ "suspicious rcu_dereference_index_check()" \
+ " usage"); \
smp_read_barrier_depends(); \
(_________p1); \
})
diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 53a6895..9789028 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -3798,7 +3798,7 @@ void lockdep_sys_exit(void)
}
}

-void lockdep_rcu_dereference(const char *file, const int line)
+void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
{
struct task_struct *curr = current;

@@ -3807,9 +3807,10 @@ void lockdep_rcu_dereference(const char *file, const int line)
return;
#endif /* #ifdef CONFIG_PROVE_RCU_REPEATEDLY */
/* Note: the following can be executed concurrently, so be careful. */
- printk("\n===================================================\n");
- printk( "[ INFO: suspicious rcu_dereference_check() usage. ]\n");
- printk( "---------------------------------------------------\n");
+ printk("\n===============================\n");
+ printk( "[ INFO: suspicious RCU usage. ]\n");
+ printk( "-------------------------------\n");
+ printk( "%s\n", s);
printk("%s:%d invoked rcu_dereference_check() without protection!\n",
file, line);
printk("\nother info that might help us debug this:\n\n");
@@ -3818,4 +3819,4 @@ void lockdep_rcu_dereference(const char *file, const int line)
printk("\nstack backtrace:\n");
dump_stack();
}
-EXPORT_SYMBOL_GPL(lockdep_rcu_dereference);
+EXPORT_SYMBOL_GPL(lockdep_rcu_suspicious);
diff --git a/kernel/pid.c b/kernel/pid.c
index 57a8346..a7577b3 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -419,7 +419,9 @@ EXPORT_SYMBOL(pid_task);
*/
struct task_struct *find_task_by_pid_ns(pid_t nr, struct pid_namespace *ns)
{
- rcu_lockdep_assert(rcu_read_lock_held());
+ rcu_lockdep_assert(rcu_read_lock_held(),
+ "find_task_by_pid_ns() needs rcu_read_lock()"
+ " protection");
return pid_task(find_pid_ns(nr, ns), PIDTYPE_PID);
}

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 88547c8..8b4b3da 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -153,6 +153,12 @@ void rcu_bh_qs(int cpu)
*/
void rcu_note_context_switch(int cpu)
{
+ rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map),
+ "Illegal context switch in RCU-bh"
+ " read-side critical section");
+ rcu_lockdep_assert(!lock_is_held(&rcu_sched_lock_map),
+ "Illegal context switch in RCU-sched"
+ " read-side critical section");
rcu_sched_qs(cpu);
rcu_preempt_note_context_switch(cpu);
}

2011-06-06 18:12:52

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 0/4] rcu: Detect rcu uses under extended quiescent state, and fix some

On Mon, Jun 06, 2011 at 05:10:53AM +0200, Frederic Weisbecker wrote:
> I found two examples with the last two patches but I suspect there are more bugs
> waiting to be detected under other configs and/or archs.

Good catch!!!

Thanx, Paul

> Frederic Weisbecker (4):
> rcu: Detect uses of rcu read side in extended quiescent states
> nohz: Split extended quiescent state handling from nohz switch
> x86: Don't call idle notifier inside rcu extended QS
> x86: Call idle_exit() after irq_enter()
>
> arch/arm/kernel/process.c | 4 +-
> arch/avr32/kernel/process.c | 4 +-
> arch/blackfin/kernel/process.c | 4 +-
> arch/microblaze/kernel/process.c | 4 +-
> arch/mips/kernel/process.c | 4 +-
> arch/powerpc/kernel/idle.c | 4 +-
> arch/powerpc/platforms/iseries/setup.c | 8 +++---
> arch/s390/kernel/process.c | 4 +-
> arch/sh/kernel/idle.c | 2 +-
> arch/sparc/kernel/process_64.c | 4 +-
> arch/tile/kernel/process.c | 4 +-
> arch/um/kernel/process.c | 4 +-
> arch/unicore32/kernel/process.c | 4 +-
> arch/x86/kernel/apic/apic.c | 6 ++--
> arch/x86/kernel/apic/io_apic.c | 3 +-
> arch/x86/kernel/cpu/mcheck/mce.c | 2 +-
> arch/x86/kernel/cpu/mcheck/therm_throt.c | 2 +-
> arch/x86/kernel/cpu/mcheck/threshold.c | 2 +-
> arch/x86/kernel/irq.c | 5 +--
> arch/x86/kernel/process_32.c | 4 +-
> arch/x86/kernel/process_64.c | 7 +++++-
> include/linux/lockdep.h | 8 ++++++-
> include/linux/rcupdate.h | 20 ++++++++++++-----
> include/linux/tick.h | 10 ++++++--
> kernel/lockdep.c | 12 ++++++++--
> kernel/pid.c | 2 +-
> kernel/rcutree.c | 14 ++++++++++++
> kernel/time/tick-sched.c | 34 ++++++++++++++++++++++++++++-
> 28 files changed, 129 insertions(+), 56 deletions(-)
>
> --
> 1.7.5.2
>

2011-06-06 18:21:00

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [1/4] rcu: Detect uses of rcu read side in extended quiescent states

On Mon, Jun 06, 2011 at 11:10:21AM -0700, Paul E. McKenney wrote:
> On Sun, Jun 05, 2011 at 10:44:33PM -0500, Milton Miller wrote:
> > On Mon, 06 Jun 2011 about 03:10:55 -0000, Frederic Weisbecker wrote:
> > > @@ -3994,8 +3995,13 @@ void lockdep_rcu_dereference(const char *file, const int line)
> > > printk("\n===================================================\n");
> > > printk( "[ INFO: suspicious rcu_dereference_check() usage. ]\n");
> > > printk( "---------------------------------------------------\n");
> > > - printk("%s:%d invoked rcu_dereference_check() without protection!\n",
> > > - file, line);
> > > + printk("%s:%d invoked rcu_dereference_check() ", file, line);
> > > +
> > > + if (type == RCU_WARN_UNPROTECTED)
> > > + printk("without protection!\n");
> > > + else if (type == RCU_WARN_EXT_QS)
> > > + printk("while in RCU extended quiescent state!\n");
> > > +
> > > printk("\nother info that might help us debug this:\n\n");
> > > printk("\nrcu_scheduler_active = %d, debug_locks = %d\n", rcu_scheduler_active, debug_locks);
> > > lockdep_print_held_locks(curr);
> >
> > Can we keep the above in one printk? That way the printing is
> > guaranteed to come out on one line. Probably the easiest way would
> > be to add a char *why = "", then assign a string based on the current
> > conditions. Do all of that before the first printk, which gets a %s
> > added.
>
> I have the following queued in the -rcu tree, which does add the string.
>
> Frederic, would it be possible to base on this patch?
>
> Thanx, Paul

No problem. Will respin soon.

What's the branch in your tree? rcu/next?

Thanks.

2011-06-06 18:37:14

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [1/4] rcu: Detect uses of rcu read side in extended quiescent states

On Mon, Jun 06, 2011 at 08:20:54PM +0200, Frederic Weisbecker wrote:
> On Mon, Jun 06, 2011 at 11:10:21AM -0700, Paul E. McKenney wrote:
> > On Sun, Jun 05, 2011 at 10:44:33PM -0500, Milton Miller wrote:
> > > On Mon, 06 Jun 2011 about 03:10:55 -0000, Frederic Weisbecker wrote:
> > > > @@ -3994,8 +3995,13 @@ void lockdep_rcu_dereference(const char *file, const int line)
> > > > printk("\n===================================================\n");
> > > > printk( "[ INFO: suspicious rcu_dereference_check() usage. ]\n");
> > > > printk( "---------------------------------------------------\n");
> > > > - printk("%s:%d invoked rcu_dereference_check() without protection!\n",
> > > > - file, line);
> > > > + printk("%s:%d invoked rcu_dereference_check() ", file, line);
> > > > +
> > > > + if (type == RCU_WARN_UNPROTECTED)
> > > > + printk("without protection!\n");
> > > > + else if (type == RCU_WARN_EXT_QS)
> > > > + printk("while in RCU extended quiescent state!\n");
> > > > +
> > > > printk("\nother info that might help us debug this:\n\n");
> > > > printk("\nrcu_scheduler_active = %d, debug_locks = %d\n", rcu_scheduler_active, debug_locks);
> > > > lockdep_print_held_locks(curr);
> > >
> > > Can we keep the above in one printk? That way the printing is
> > > guaranteed to come out on one line. Probably the easiest way would
> > > be to add a char *why = "", then assign a string based on the current
> > > conditions. Do all of that before the first printk, which gets a %s
> > > added.
> >
> > I have the following queued in the -rcu tree, which does add the string.
> >
> > Frederic, would it be possible to base on this patch?
> >
> > Thanx, Paul
>
> No problem. Will respin soon.

Thank you!

> What's the branch in your tree? rcu/next ?

That is the one!

Thanx, Paul

2011-06-06 18:44:00

by Mike Frysinger

[permalink] [raw]
Subject: Re: [PATCH 2/4 v2] nohz: Split extended quiescent state handling from nohz switch

On Sun, Jun 5, 2011 at 23:20, Frederic Weisbecker wrote:
>  arch/blackfin/kernel/process.c         |    4 +-

Acked-by: Mike Frysinger <[email protected]>
-mike

2011-06-06 20:30:04

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 2/4 v2] nohz: Split extended quiescent state handling from nohz switch

On 6/5/2011 11:20 PM, Frederic Weisbecker wrote:
> It is assumed that rcu won't be used once we switch to tickless
> mode and until we restart the tick. However this is not always
> true, as in x86-64 where we dereference the idle notifiers after
> the tick is stopped.
>
> To prepare for fixing this, split the tickless mode switching and
> RCU extended quiescent state logics.
> Make tick_nohz_stop/restart_sched_tick() RCU agnostic but provide
> a new pair of APIs tick_nohz_enter/exit_idle() that keep the
> old behaviour by handling both the nohz mode and RCU extended
> quiescent states, then convert every arch to use these.
>
> Archs that want to switch to the extended QS at some custom points
> can do it later by using tick_nohz_stop_sched_tick() and
> rcu_enter_nohz() separately.
>
> v2: Remove rcu_exit_nohz() from tick_nohz_restart_sched_tick()
>
> [...]
> arch/tile/kernel/process.c | 4 +-

Acked-by: Chris Metcalf <[email protected]>

--
Chris Metcalf, Tilera Corp.
http://www.tilera.com

2011-06-07 00:19:15

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [1/4] rcu: Detect uses of rcu read side in extended quiescent states

On Mon, Jun 06, 2011 at 11:10:21AM -0700, Paul E. McKenney wrote:
> commit c15d76f26712bd5228aa0c6af7a7e7c492a812c9
> Author: Paul E. McKenney <[email protected]>
> Date: Tue May 24 08:31:09 2011 -0700
>
> rcu: Restore checks for blocking in RCU read-side critical sections
>
> Long ago, using TREE_RCU with PREEMPT would result in "scheduling
> while atomic" diagnostics if you blocked in an RCU read-side critical
> section. However, PREEMPT now implies TREE_PREEMPT_RCU, which defeats
> this diagnostic. This commit therefore adds a replacement diagnostic
> based on PROVE_RCU.
>
> Because rcu_lockdep_assert() and lockdep_rcu_dereference() are now being
> used for things that have nothing to do with rcu_dereference(), rename
> lockdep_rcu_dereference() to lockdep_rcu_suspicious() and add a third
> argument that is a string indicating what is suspicious. This third
> argument is passed in from a new third argument to rcu_lockdep_assert().
> Update all calls to rcu_lockdep_assert() to add an informative third
> argument.
>
> Finally, add a pair of rcu_lockdep_assert() calls from within
> rcu_note_context_switch(), one complaining if a context switch occurs
> in an RCU-bh read-side critical section and another complaining if a
> context switch occurs in an RCU-sched read-side critical section.
> These are present only if the PROVE_RCU kernel parameter is enabled.
>
> Again, you must enable PROVE_RCU to see these new diagnostics. But you
> are enabling PROVE_RCU to check out new RCU uses in any case, aren't you?
>
> Signed-off-by: Paul E. McKenney <[email protected]>

A little comment about this patch:

<snip>
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 88547c8..8b4b3da 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -153,6 +153,12 @@ void rcu_bh_qs(int cpu)
> */
> void rcu_note_context_switch(int cpu)
> {
> + rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map),
> + "Illegal context switch in RCU-bh"
> + " read-side critical section");
> + rcu_lockdep_assert(!lock_is_held(&rcu_sched_lock_map),
> + "Illegal context switch in RCU-sched"
> + " read-side critical section");

This looks more like a check to make inside might_sleep().
That's better because might_sleep() triggers the check even if
we don't actually go to sleep.

In fact I believe might_sleep() already does the job fine:

If !PREEMPT, might_sleep() detects that preemption is disabled
by rcu_read_lock().

If PREEMPT, might_sleep() checks rcu_preempt_depth().
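
For reference, from memory the test in __might_sleep() boils down to
something like this (sketch, not the exact kernel code):

	/* sleeping is fine only when we are truly preemptible */
	if (preempt_count() + rcu_preempt_depth() == 0 && !irqs_disabled())
		return;
	/* otherwise emit the "BUG: sleeping function called..." splat */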

2011-06-07 00:42:54

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [1/4] rcu: Detect uses of rcu read side in extended quiescent states

On Tue, Jun 07, 2011 at 02:19:07AM +0200, Frederic Weisbecker wrote:
> On Mon, Jun 06, 2011 at 11:10:21AM -0700, Paul E. McKenney wrote:
> > commit c15d76f26712bd5228aa0c6af7a7e7c492a812c9
> > Author: Paul E. McKenney <[email protected]>
> > Date: Tue May 24 08:31:09 2011 -0700
> >
> > rcu: Restore checks for blocking in RCU read-side critical sections
> >
> > Long ago, using TREE_RCU with PREEMPT would result in "scheduling
> > while atomic" diagnostics if you blocked in an RCU read-side critical
> > section. However, PREEMPT now implies TREE_PREEMPT_RCU, which defeats
> > this diagnostic. This commit therefore adds a replacement diagnostic
> > based on PROVE_RCU.
> >
> > Because rcu_lockdep_assert() and lockdep_rcu_dereference() are now being
> > used for things that have nothing to do with rcu_dereference(), rename
> > lockdep_rcu_dereference() to lockdep_rcu_suspicious() and add a third
> > argument that is a string indicating what is suspicious. This third
> > argument is passed in from a new third argument to rcu_lockdep_assert().
> > Update all calls to rcu_lockdep_assert() to add an informative third
> > argument.
> >
> > Finally, add a pair of rcu_lockdep_assert() calls from within
> > rcu_note_context_switch(), one complaining if a context switch occurs
> > in an RCU-bh read-side critical section and another complaining if a
> > context switch occurs in an RCU-sched read-side critical section.
> > These are present only if the PROVE_RCU kernel parameter is enabled.
> >
> > Again, you must enable PROVE_RCU to see these new diagnostics. But you
> > are enabling PROVE_RCU to check out new RCU uses in any case, aren't you?
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
>
> A little comment about this patch:
>
> <snip>
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index 88547c8..8b4b3da 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -153,6 +153,12 @@ void rcu_bh_qs(int cpu)
> > */
> > void rcu_note_context_switch(int cpu)
> > {
> > + rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map),
> > + "Illegal context switch in RCU-bh"
> > + " read-side critical section");
> > + rcu_lockdep_assert(!lock_is_held(&rcu_sched_lock_map),
> > + "Illegal context switch in RCU-sched"
> > + " read-side critical section");
>
> This looks more like a check to make inside might_sleep().
> That's better because might_sleep() triggers the check even if
> we don't actually go to sleep.

This does make quite a bit of sense.

> In fact I believe might_sleep() already does the job fine:
>
> If !PREEMPT, might_sleep() detects that preemption is disabled
> by rcu_read_lock().

If !PREEMPT, isn't the preempt_disable() called by rcu_read_lock()
implemented as follows?

#define preempt_disable() do { } while (0)

Unless I am missing something, __might_sleep() won't detect that.

> If PREEMPT, might_sleep() checks rcu_preempt_depth().

Agreed, for CONFIG_TREE_PREEMPT_RCU and CONFIG_TINY_PREEMPT_RCU,
the existing might_sleep() checks do cover it.

So I could export an rcu_might_sleep() or some such that contained
the above two rcu_lockdep_assert()s, but only if !PREEMPT_RCU.
If PREEMPT_RCU, rcu_might_sleep() would be a no-op.
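
Something like the following, say (untested sketch):

#ifndef CONFIG_PREEMPT_RCU
static inline void rcu_might_sleep(void)
{
	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map),
			   "Illegal rcu_might_sleep() in RCU-bh"
			   " read-side critical section");
	rcu_lockdep_assert(!lock_is_held(&rcu_sched_lock_map),
			   "Illegal rcu_might_sleep() in RCU-sched"
			   " read-side critical section");
}
#else /* #ifndef CONFIG_PREEMPT_RCU */
static inline void rcu_might_sleep(void)
{
}
#endif /* #else #ifndef CONFIG_PREEMPT_RCU */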

Seem reasonable, or am I missing something?

Thanx, Paul

2011-06-07 01:36:38

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [1/4] rcu: Detect uses of rcu read side in extended quiescent states

On Mon, Jun 06, 2011 at 05:42:50PM -0700, Paul E. McKenney wrote:
> On Tue, Jun 07, 2011 at 02:19:07AM +0200, Frederic Weisbecker wrote:
> > On Mon, Jun 06, 2011 at 11:10:21AM -0700, Paul E. McKenney wrote:
> > > commit c15d76f26712bd5228aa0c6af7a7e7c492a812c9
> > > Author: Paul E. McKenney <[email protected]>
> > > Date: Tue May 24 08:31:09 2011 -0700
> > >
> > > rcu: Restore checks for blocking in RCU read-side critical sections
> > >
> > > Long ago, using TREE_RCU with PREEMPT would result in "scheduling
> > > while atomic" diagnostics if you blocked in an RCU read-side critical
> > > section. However, PREEMPT now implies TREE_PREEMPT_RCU, which defeats
> > > this diagnostic. This commit therefore adds a replacement diagnostic
> > > based on PROVE_RCU.
> > >
> > > Because rcu_lockdep_assert() and lockdep_rcu_dereference() are now being
> > > used for things that have nothing to do with rcu_dereference(), rename
> > > lockdep_rcu_dereference() to lockdep_rcu_suspicious() and add a third
> > > argument that is a string indicating what is suspicious. This third
> > > argument is passed in from a new third argument to rcu_lockdep_assert().
> > > Update all calls to rcu_lockdep_assert() to add an informative third
> > > argument.
> > >
> > > Finally, add a pair of rcu_lockdep_assert() calls from within
> > > rcu_note_context_switch(), one complaining if a context switch occurs
> > > in an RCU-bh read-side critical section and another complaining if a
> > > context switch occurs in an RCU-sched read-side critical section.
> > > These are present only if the PROVE_RCU kernel parameter is enabled.
> > >
> > > Again, you must enable PROVE_RCU to see these new diagnostics. But you
> > > are enabling PROVE_RCU to check out new RCU uses in any case, aren't you?
> > >
> > > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > A little comment about this patch:
> >
> > <snip>
> > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > index 88547c8..8b4b3da 100644
> > > --- a/kernel/rcutree.c
> > > +++ b/kernel/rcutree.c
> > > @@ -153,6 +153,12 @@ void rcu_bh_qs(int cpu)
> > > */
> > > void rcu_note_context_switch(int cpu)
> > > {
> > > + rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map),
> > > + "Illegal context switch in RCU-bh"
> > > + " read-side critical section");
> > > + rcu_lockdep_assert(!lock_is_held(&rcu_sched_lock_map),
> > > + "Illegal context switch in RCU-sched"
> > > + " read-side critical section");
> >
> > This looks more like a check to make inside might_sleep().
> > That's better because might_sleep() triggers the check even if
> > we don't actually go to sleep.
>
> This does make quite a bit of sense.
>
> > In fact I believe might_sleep() already does the job fine:
> >
> > If !PREEMPT, might_sleep() detects that preemption is disabled
> > by rcu_read_lock().
>
> If !PREEMPT, isn't the preempt_disable() called by rcu_read_lock()
> implemented as follows?
>
> #define preempt_disable() do { } while (0)
>
> Unless I am missing something, __might_sleep() won't detect that.

Ah, right.

> > If PREEMPT, might_sleep() checks rcu_preempt_depth().
>
> Agreed, for CONFIG_TREE_PREEMPT_RCU and CONFIG_TINY_PREEMPT_RCU,
> the existing might_sleep() checks do cover it.
>
> So I could export an rcu_might_sleep() or some such that contained
> the above two rcu_lockdep_assert()s, but only if !PREEMPT_RCU.
> If PREEMPT_RCU, rcu_might_sleep() would be a no-op.
>
> Seem reasonable, or am I missing something?

Ok, but that only improves the rcu debugging. What about instead improving
might_sleep() to also work in !PREEMPT, so that it benefits any detection
of forbidden sleeping (sleep inside spinlock, preempt_disable, might_fault, etc.)?

We could define a new config:

config PREEMPT_COUNT
default PREEMPT || DEBUG_SPINLOCK_SLEEP

and build preempt_disable()/preempt_enable() on top of that instead
of using CONFIG_PREEMPT directly.
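
Roughly, preempt_disable() would then key off the new symbol instead,
something like this (sketch, reusing the existing inc_preempt_count()
helper):

#ifdef CONFIG_PREEMPT_COUNT
#define preempt_disable() \
do { \
	inc_preempt_count(); \
	barrier(); \
} while (0)
#else
#define preempt_disable()	do { } while (0)
#endif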

Does that look sane?

2011-06-07 04:40:20

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [1/4] rcu: Detect uses of rcu read side in extended quiescent states

On Tue, Jun 07, 2011 at 03:36:32AM +0200, Frederic Weisbecker wrote:
> On Mon, Jun 06, 2011 at 05:42:50PM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 07, 2011 at 02:19:07AM +0200, Frederic Weisbecker wrote:
> > > On Mon, Jun 06, 2011 at 11:10:21AM -0700, Paul E. McKenney wrote:
> > > > commit c15d76f26712bd5228aa0c6af7a7e7c492a812c9
> > > > Author: Paul E. McKenney <[email protected]>
> > > > Date: Tue May 24 08:31:09 2011 -0700
> > > >
> > > > rcu: Restore checks for blocking in RCU read-side critical sections
> > > >
> > > > Long ago, using TREE_RCU with PREEMPT would result in "scheduling
> > > > while atomic" diagnostics if you blocked in an RCU read-side critical
> > > > section. However, PREEMPT now implies TREE_PREEMPT_RCU, which defeats
> > > > this diagnostic. This commit therefore adds a replacement diagnostic
> > > > based on PROVE_RCU.
> > > >
> > > > Because rcu_lockdep_assert() and lockdep_rcu_dereference() are now being
> > > > used for things that have nothing to do with rcu_dereference(), rename
> > > > lockdep_rcu_dereference() to lockdep_rcu_suspicious() and add a third
> > > > argument that is a string indicating what is suspicious. This third
> > > > argument is passed in from a new third argument to rcu_lockdep_assert().
> > > > Update all calls to rcu_lockdep_assert() to add an informative third
> > > > argument.
> > > >
> > > > Finally, add a pair of rcu_lockdep_assert() calls from within
> > > > rcu_note_context_switch(), one complaining if a context switch occurs
> > > > in an RCU-bh read-side critical section and another complaining if a
> > > > context switch occurs in an RCU-sched read-side critical section.
> > > > These are present only if the PROVE_RCU kernel parameter is enabled.
> > > >
> > > > Again, you must enable PROVE_RCU to see these new diagnostics. But you
> > > > are enabling PROVE_RCU to check out new RCU uses in any case, aren't you?
> > > >
> > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > >
> > > A little comment about this patch:
> > >
> > > <snip>
> > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > > index 88547c8..8b4b3da 100644
> > > > --- a/kernel/rcutree.c
> > > > +++ b/kernel/rcutree.c
> > > > @@ -153,6 +153,12 @@ void rcu_bh_qs(int cpu)
> > > > */
> > > > void rcu_note_context_switch(int cpu)
> > > > {
> > > > + rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map),
> > > > + "Illegal context switch in RCU-bh"
> > > > + " read-side critical section");
> > > > + rcu_lockdep_assert(!lock_is_held(&rcu_sched_lock_map),
> > > > + "Illegal context switch in RCU-sched"
> > > > + " read-side critical section");
> > >
> > > This looks more like a check to make inside might_sleep().
> > > That's better because might_sleep() triggers the check even if
> > > we don't actually go to sleep.
> >
> > This does make quite a bit of sense.
> >
> > > In fact I believe might_sleep() already does the job fine:
> > >
> > > If !PREEMPT, might_sleep() detects that preemption is disabled
> > > by rcu_read_lock().
> >
> > If !PREEMPT, isn't the preempt_disable() called by rcu_read_lock()
> > implemented as follows?
> >
> > #define preempt_disable() do { } while (0)
> >
> > Unless I am missing something, __might_sleep() won't detect that.
>
> Ah, right.
>
> > > If PREEMPT, might_sleep() checks rcu_preempt_depth().
> >
> > Agreed, for CONFIG_TREE_PREEMPT_RCU and CONFIG_TINY_PREEMPT_RCU,
> > the existing might_sleep() checks do cover it.
> >
> > So I could export an rcu_might_sleep() or some such that contained
> > the above two rcu_lockdep_assert()s, but only if !PREEMPT_RCU.
> > If PREEMPT_RCU, rcu_might_sleep() would be a no-op.
> >
> > Seem reasonable, or am I missing something?
>
> Ok, but that only improves the rcu debugging. What about instead improving
> might_sleep() to also work in !PREEMPT, so that it benefits any detection
> of forbidden sleeping (sleep inside spinlock, preempt_disable, might_fault, etc.)?
>
> We could define a new config:
>
> config PREEMPT_COUNT
> default PREEMPT || DEBUG_SPINLOCK_SLEEP
>
> and build preempt_disable()/preempt_enable() on top of that instead
> of using CONFIG_PREEMPT directly.
>
> Does that look sane?

The bit I am missing is how to distinguish between spinlocks (where
sleeping is illegal) and mutexes (where sleeping is perfectly fine).
We could teach lockdep the difference, I suppose, but it is not clear
to me that it is worth it.

In contrast, with RCU, this is straightforward -- check for rcu_sched
and rcu_bh, but not SRCU.

Thanx, Paul

2011-06-07 12:58:19

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [1/4] rcu: Detect uses of rcu read side in extended quiescent states

On Mon, Jun 06, 2011 at 09:40:05PM -0700, Paul E. McKenney wrote:
> The bit I am missing is how to distinguish between spinlocks (where
> sleeping is illegal) and mutexes (where sleeping is perfectly fine).
> We could teach lockdep the difference, I suppose, but it is not clear
> to me that it is worth it.

Ah, in fact it doesn't pass through any lockdep check.

It's only a function called might_sleep() that is placed in functions
that can sleep. And inside might_sleep() it checks whether it is in a preemptible
area. So it's actually locking-agnostic, it only relies on the preempt_count
and some more for the preempt rcu cases.

I think it is called CONFIG_DEBUG_SPINLOCK_SLEEP because it was first used
for spinlock debugging purposes. But then it has a broader use now: sleep
inside preemptible section, sleep inside interrupts, sleep inside rcu.

It certainly deserves a rename, like CONFIG_DEBUG_ILLEGAL_SLEEP.

>
> In contrast, with RCU, this is straightforward -- check for rcu_sched
> and rcu_bh, but not SRCU.
>
> Thanx, Paul
>

2011-06-07 18:34:21

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [1/4] rcu: Detect uses of rcu read side in extended quiescent states

On Tue, Jun 07, 2011 at 02:58:13PM +0200, Frederic Weisbecker wrote:
> On Mon, Jun 06, 2011 at 09:40:05PM -0700, Paul E. McKenney wrote:
> > The bit I am missing is how to distinguish between spinlocks (where
> > sleeping is illegal) and mutexes (where sleeping is perfectly fine).
> > We could teach lockdep the difference, I suppose, but it is not clear
> > to me that it is worth it.
>
> Ah, in fact it doesn't pass through any lockdep check.
>
> It's only a function called might_sleep() that is placed in functions
> that can sleep. And inside might_sleep() it checks whether it is in a preemptible
> area. So it's actually locking-agnostic, it only relies on the preempt_count
> and some more for the preempt rcu cases.
>
> I think it is called CONFIG_DEBUG_SPINLOCK_SLEEP because it was first used
> for spinlock debugging purposes. But then it has a broader use now: sleep
> inside preemptible section, sleep inside interrupts, sleep inside rcu.

But the __might_sleep() function can only differentiate between
spinlocks and sleeplocks if CONFIG_PREEMPT=y.

> It certainly deserves a rename, like CONFIG_DEBUG_ILLEGAL_SLEEP.

Hmmm... It already checks for sleeping in the middle of a
preempt_disable() as well as in a spinlock critical section.
So the need for a rename is independent of any RCU checking.

> > In contrast, with RCU, this is straightforward -- check for rcu_sched
> > and rcu_bh, but not SRCU.

Actually it makes sense to keep the checks in rcu_note_context_switch(),
as there are places that call schedule() directly without a might_sleep().
Perhaps having checks in both places is the correct approach?

Thanx, Paul

2011-06-07 18:49:09

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [1/4] rcu: Detect uses of rcu read side in extended quiescent states

On Tue, Jun 07, 2011 at 11:34:14AM -0700, Paul E. McKenney wrote:
> On Tue, Jun 07, 2011 at 02:58:13PM +0200, Frederic Weisbecker wrote:
> > On Mon, Jun 06, 2011 at 09:40:05PM -0700, Paul E. McKenney wrote:
> > > The bit I am missing is how to distinguish between spinlocks (where
> > > sleeping is illegal) and mutexes (where sleeping is perfectly fine).
> > > We could teach lockdep the difference, I suppose, but it is not clear
> > > to me that it is worth it.
> >
> > Ah, in fact it doesn't pass through any lockdep check.
> >
> > It's only a function called might_sleep() that is placed in functions
> > that can sleep. And inside might_sleep() it checks whether it is in a preemptible
> > area. So it's actually locking-agnostic, it only relies on the preempt_count
> > and some more for the preempt rcu cases.
> >
> > I think it is called CONFIG_DEBUG_SPINLOCK_SLEEP because it was first used
> > for spinlock debugging purposes. But then it has a broader use now: sleep
> > inside preemptible section, sleep inside interrupts, sleep inside rcu.
>
> But the __might_sleep() function can only differentiate between
> spinlocks and sleeplocks if CONFIG_PREEMPT=y.

It doesn't differentiate between locks but checks on the lowest level
by looking at the preempt count. But yeah it only works if CONFIG_PREEMPT,
which is why I proposed to inc/dec the preempt count also when we have
that DEBUG_SPINLOCK_SLEEP.

>
> > It certainly deserves a rename, like CONFIG_DEBUG_ILLEGAL_SLEEP.
>
> Hmmm... It already checks for sleeping in the middle of a
> preempt_disable() as well as in a spinlock critical section.
> So the need for a rename is independent of any RCU checking.

Sure, rcu just adds itself to the pile of users of might_sleep(), thus
it would be a nice cleanup to rename the option to something more
generic. But that rename is not necessary to improve RCU checking.

>
> > > In contrast, with RCU, this is straightforward -- check for rcu_sched
> > > and rcu_bh, but not SRCU.
>
> Actually it makes sense to keep the checks in rcu_note_context_switch(),
> as there are places that call schedule() directly without a might_sleep().
> Perhaps having checks in both places is the correct approach?

In this case it makes more sense to add your checks in schedule_debug(),
so that we don't wait for a context switch to detect the bug.
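
Roughly (sketch only, just moving your two asserts there):

static inline void schedule_debug(struct task_struct *prev)
{
	/*
	 * Catch a schedule() from an RCU read side even when no
	 * context switch ends up happening.
	 */
	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map),
			   "Illegal schedule() in RCU-bh"
			   " read-side critical section");
	rcu_lockdep_assert(!lock_is_held(&rcu_sched_lock_map),
			   "Illegal schedule() in RCU-sched"
			   " read-side critical section");
	/* ... existing atomicity checks ... */
}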

2011-06-07 19:23:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [1/4] rcu: Detect uses of rcu read side in extended quiescent states

On Tue, Jun 07, 2011 at 08:49:01PM +0200, Frederic Weisbecker wrote:
> On Tue, Jun 07, 2011 at 11:34:14AM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 07, 2011 at 02:58:13PM +0200, Frederic Weisbecker wrote:
> > > On Mon, Jun 06, 2011 at 09:40:05PM -0700, Paul E. McKenney wrote:
> > > > The bit I am missing is how to distinguish between spinlocks (where
> > > > sleeping is illegal) and mutexes (where sleeping is perfectly fine).
> > > > We could teach lockdep the difference, I suppose, but it is not clear
> > > > to me that it is worth it.
> > >
> > > Ah, in fact it doesn't pass through any lockdep check.
> > >
> > > It's only a function called might_sleep() that is placed in functions
> > > that can sleep. And inside might_sleep() it checks whether it is in a preemptible
> > > area. So it's actually locking-agnostic, it only relies on the preempt_count
> > > and some more for the preempt rcu cases.
> > >
> > > I think it is called CONFIG_DEBUG_SPINLOCK_SLEEP because it was first used
> > > for spinlock debugging purposes. But then it has a broader use now: sleep
> > > inside preemptible section, sleep inside interrupts, sleep inside rcu.
> >
> > But the __might_sleep() function can only differentiate between
> > spinlocks and sleeplocks if CONFIG_PREEMPT=y.
>
> It doesn't differentiate between locks but checks on the lowest level
> by looking at the preempt count. But yeah it only works if CONFIG_PREEMPT,
> which is why I proposed to inc/dec the preempt count also when we have
> that DEBUG_SPINLOCK_SLEEP.

Ah, I missed your proposal to inc/dec preempt_count for PREEMPT=n
and DEBUG_SPINLOCK_SLEEP=y.

> > > It certainly deserves a rename, like CONFIG_DEBUG_ILLEGAL_SLEEP.
> >
> > Hmmm... It already checks for sleeping in the middle of a
> > preempt_disable() as well as in a spinlock critical section.
> > So the need for a rename is independent of any RCU checking.
>
> Sure, rcu just adds itself to the pile of users of might_sleep(), thus
> it would be a nice cleanup to rename the option to something more
> generic. But that rename is not necessary to improve RCU checking.

Agreed!

> > > > In contrast, with RCU, this is straightforward -- check for rcu_sched
> > > > and rcu_bh, but not SRCU.
> >
> > Actually it makes sense to keep the checks in rcu_note_context_switch(),
> > as there are places that call schedule() directly without a might_sleep().
> > Perhaps having checks in both places is the correct approach?
>
> In this case it makes more sense to add your checks in schedule_debug(),
> so that we don't wait for a context switch to detect the bug.

You might well be right -- looking at it.

Thanx, Paul

2011-06-08 01:20:08

by Xuetao Guan

[permalink] [raw]
Subject: Re: [PATCH 2/4] nohz: Split extended quiescent state handling from nohz switch

For UniCore32 related code,

Acked-by: Guan Xuetao <[email protected]>

Thanks,

Guan Xuetao

2011-06-09 23:08:12

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 2/4] nohz: Split extended quiescent state handling from nohz switch

Thanks a lot for all your reviews and acks. I'm resending
the v2 soon with your acks included.

2011-06-10 08:58:46

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [1/4] rcu: Detect uses of rcu read side in extended quiescent states

On Mon, Jun 6, 2011 at 6:36 PM, Frederic Weisbecker <[email protected]> wrote:
> Ok, but that only improves the rcu debugging. What about instead improving
> might_sleep() to also work in !PREEMPT, so that it benefits any detection
> of forbidden sleeping (sleep inside spinlock, preempt_disable, might_fault, etc.)?
>
> We could define a new config:
>
> config PREEMPT_COUNT
>       default PREEMPT || DEBUG_SPINLOCK_SLEEP
>
> and build preempt_disable()/preempt_enable() on top of that instead
> of using CONFIG_PREEMPT directly.
>
> Does that look sane?

Yes, I think this would be helpful.

I actually sent out a patch for that about a year ago - message title
was "Stronger CONFIG_DEBUG_SPINLOCK_SLEEP without CONFIG_PREEMPT".
However that code has rotted since, and my attempt at a trivial port
ended up spewing complaints in dmesg at boot time...

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.