2009-01-04 19:41:25

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH] Make treercu safe for suspend and resume

Hello!

Kudos to both Dhaval Giani and Jens Axboe for finding a bug in treercu
that causes warnings after suspend-resume cycles in Dhaval's case and
during stress tests in Jens's case. It would also probably cause failures
if heavily stressed. The solution, ironically enough, is to revert to
rcupreempt's code for initializing the dynticks state. And the patch
even results in smaller code -- so what was I thinking???

This is 2.6.29 material, given that people really do suspend and resume
Linux these days. ;-)

Located-by: Dhaval Giani <[email protected]>
Located-by: Jens Axboe <[email protected]>
Tested-by: Dhaval Giani <[email protected]>
Tested-by: Jens Axboe <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---

rcutree.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index a342b03..e0a347f 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -79,7 +79,10 @@ struct rcu_state rcu_bh_state = RCU_STATE_INITIALIZER(rcu_bh_state);
DEFINE_PER_CPU(struct rcu_data, rcu_bh_data);

#ifdef CONFIG_NO_HZ
-DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks);
+DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
+ .dynticks_nesting = 1,
+ .dynticks = 1,
+};
#endif /* #ifdef CONFIG_NO_HZ */

static int blimit = 10; /* Maximum callbacks per softirq. */
@@ -1379,13 +1382,6 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp)

static void __cpuinit rcu_online_cpu(int cpu)
{
-#ifdef CONFIG_NO_HZ
- struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
-
- rdtp->dynticks_nesting = 1;
- rdtp->dynticks |= 1; /* need consecutive #s even for hotplug. */
- rdtp->dynticks_nmi = (rdtp->dynticks_nmi + 1) & ~0x1;
-#endif /* #ifdef CONFIG_NO_HZ */
rcu_init_percpu_data(cpu, &rcu_state);
rcu_init_percpu_data(cpu, &rcu_bh_state);
open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);


2009-01-04 20:41:29

by Eric Sesterhenn

[permalink] [raw]
Subject: Re: [PATCH] Make treercu safe for suspend and resume

* Paul E. McKenney ([email protected]) wrote:
> Hello!
>
> Kudos to both Dhaval Giani and Jens Axboe for finding a bug in treercu
> that causes warnings after suspend-resume cycles in Dhaval's case and
> during stress tests in Jens's case. It would also probably cause failures
> if heavily stressed. The solution, ironically enough, is to revert to
> rcupreempt's code for initializing the dynticks state. And the patch
> even results in smaller code -- so what was I thinking???
>
> This is 2.6.29 material, given that people really do suspend and resume
> Linux these days. ;-)

sadly even with this patch i still get this oops when doing
modprobe rcutorture; sleep 2s; rmmod rcutorture

[ 74.413097] BUG: unable to handle kernel NULL pointer dereference at
(null)
[ 74.413424] IP: [<(null)>] (null)
[ 74.413651] Oops: 0000 [#1] PREEMPT DEBUG_PAGEALLOC
[ 74.413956] last sysfs file: /sys/block/ram9/range
[ 74.414039] Modules linked in: [last unloaded: rcutorture]
[ 74.414039]
[ 74.414039] Pid: 4997, comm: rcu_torture_wri Tainted: G W
(2.6.28-05692-g7d3b56b-dirty #167) System Name
[ 74.414039] EIP: 0060:[<00000000>] EFLAGS: 00010246 CPU: 0
[ 74.414039] EIP is at 0x0
[ 74.414039] EAX: d0afd130 EBX: 00000000 ECX: c01612a6 EDX: 00000006
[ 74.414039] ESI: d0afd130 EDI: 0000001c EBP: c0b03fe0 ESP: c0b03fd4
[ 74.414039] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[ 74.414039] Process rcu_torture_wri (pid: 4997, ti=c0b03000
task=c98bce00 task.ti=c988b000)
[ 74.414039] Stack:
[ 74.414039] c01612ad 00000200 00000001 c0b03ff8 c012aa97 0000000a
c988beac 00000046
[ 74.414039] c012aa28 c988bebc c01042c2
[ 74.414039] Call Trace:
[ 74.414039] [<c01612ad>] ? rcu_process_callbacks+0x65/0x79
[ 74.414039] [<c012aa97>] ? __do_softirq+0x6f/0xf6
[ 74.414039] [<c012aa28>] ? __do_softirq+0x0/0xf6
[ 74.414039] <IRQ> <0> [<c012a9a5>] ? irq_exit+0x40/0x7c
[ 74.414039] [<c0110ce1>] ? smp_apic_timer_interrupt+0x68/0x73
[ 74.414039] [<c0103521>] ? apic_timer_interrupt+0x2d/0x34
[ 74.414039] [<c01219f7>] ? finish_task_switch+0x4d/0x8b
[ 74.414039] [<c014007b>] ? tick_check_oneshot_change+0xb1/0xf9
[ 74.414039] [<c07a091f>] ? _spin_unlock_irq+0x2d/0x47
[ 74.414039] [<c01219f7>] ? finish_task_switch+0x4d/0x8b
[ 74.414039] [<c01219aa>] ? finish_task_switch+0x0/0x8b
[ 74.414039] [<c079e366>] ? schedule+0x404/0x450
[ 74.414039] [<c079e582>] ? schedule_timeout+0x70/0x95
[ 74.414039] [<c012e13a>] ? process_timeout+0x0/0xf
[ 74.414039] [<c079e57d>] ? schedule_timeout+0x6b/0x95
[ 74.414039] [<c079e5c0>] ?
schedule_timeout_uninterruptible+0x19/0x1b
[ 74.414039] [<c0136bcc>] ? kthread+0x3e/0x66
[ 74.414039] [<c0136b8e>] ? kthread+0x0/0x66
[ 74.414039] [<c0103643>] ? kernel_thread_helper+0x7/0x10
[ 74.414039] Code: Bad EIP value.
[ 74.414039] EIP: [<00000000>] 0x0 SS:ESP 0068:c0b03fd4
[ 74.422275] ---[ end trace 4eaa2a86a8e2da22 ]---
[ 74.422406] Kernel panic - not syncing: Fatal exception in interrupt

Greetings Eric

2009-01-04 21:14:50

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] Make treercu safe for suspend and resume

On Sun, Jan 04, 2009 at 09:41:08PM +0100, Eric Sesterhenn wrote:
> * Paul E. McKenney ([email protected]) wrote:
> > Hello!
> >
> > Kudos to both Dhaval Giani and Jens Axboe for finding a bug in treercu
> > that causes warnings after suspend-resume cycles in Dhaval's case and
> > during stress tests in Jens's case. It would also probably cause failures
> > if heavily stressed. The solution, ironically enough, is to revert to
> > rcupreempt's code for initializing the dynticks state. And the patch
> > even results in smaller code -- so what was I thinking???
> >
> > This is 2.6.29 material, given that people really do suspend and resume
> > Linux these days. ;-)
>
> sadly even with this patch i still get this oops when doing
> modprobe rcutorture; sleep 2s; rmmod rcutorture

I would have been extremely surprised had that patch fixed this problem,
but thank you very much for trying it out! What can I say, I worked on
the easy ones first. ;-)

Thanx, Paul

> [ 74.413097] BUG: unable to handle kernel NULL pointer dereference at
> (null)
> [ 74.413424] IP: [<(null)>] (null)
> [ 74.413651] Oops: 0000 [#1] PREEMPT DEBUG_PAGEALLOC
> [ 74.413956] last sysfs file: /sys/block/ram9/range
> [ 74.414039] Modules linked in: [last unloaded: rcutorture]
> [ 74.414039]
> [ 74.414039] Pid: 4997, comm: rcu_torture_wri Tainted: G W
> (2.6.28-05692-g7d3b56b-dirty #167) System Name
> [ 74.414039] EIP: 0060:[<00000000>] EFLAGS: 00010246 CPU: 0
> [ 74.414039] EIP is at 0x0
> [ 74.414039] EAX: d0afd130 EBX: 00000000 ECX: c01612a6 EDX: 00000006
> [ 74.414039] ESI: d0afd130 EDI: 0000001c EBP: c0b03fe0 ESP: c0b03fd4
> [ 74.414039] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
> [ 74.414039] Process rcu_torture_wri (pid: 4997, ti=c0b03000
> task=c98bce00 task.ti=c988b000)
> [ 74.414039] Stack:
> [ 74.414039] c01612ad 00000200 00000001 c0b03ff8 c012aa97 0000000a
> c988beac 00000046
> [ 74.414039] c012aa28 c988bebc c01042c2
> [ 74.414039] Call Trace:
> [ 74.414039] [<c01612ad>] ? rcu_process_callbacks+0x65/0x79
> [ 74.414039] [<c012aa97>] ? __do_softirq+0x6f/0xf6
> [ 74.414039] [<c012aa28>] ? __do_softirq+0x0/0xf6
> [ 74.414039] <IRQ> <0> [<c012a9a5>] ? irq_exit+0x40/0x7c
> [ 74.414039] [<c0110ce1>] ? smp_apic_timer_interrupt+0x68/0x73
> [ 74.414039] [<c0103521>] ? apic_timer_interrupt+0x2d/0x34
> [ 74.414039] [<c01219f7>] ? finish_task_switch+0x4d/0x8b
> [ 74.414039] [<c014007b>] ? tick_check_oneshot_change+0xb1/0xf9
> [ 74.414039] [<c07a091f>] ? _spin_unlock_irq+0x2d/0x47
> [ 74.414039] [<c01219f7>] ? finish_task_switch+0x4d/0x8b
> [ 74.414039] [<c01219aa>] ? finish_task_switch+0x0/0x8b
> [ 74.414039] [<c079e366>] ? schedule+0x404/0x450
> [ 74.414039] [<c079e582>] ? schedule_timeout+0x70/0x95
> [ 74.414039] [<c012e13a>] ? process_timeout+0x0/0xf
> [ 74.414039] [<c079e57d>] ? schedule_timeout+0x6b/0x95
> [ 74.414039] [<c079e5c0>] ?
> schedule_timeout_uninterruptible+0x19/0x1b
> [ 74.414039] [<c0136bcc>] ? kthread+0x3e/0x66
> [ 74.414039] [<c0136b8e>] ? kthread+0x0/0x66
> [ 74.414039] [<c0103643>] ? kernel_thread_helper+0x7/0x10
> [ 74.414039] Code: Bad EIP value.
> [ 74.414039] EIP: [<00000000>] 0x0 SS:ESP 0068:c0b03fd4
> [ 74.422275] ---[ end trace 4eaa2a86a8e2da22 ]---
> [ 74.422406] Kernel panic - not syncing: Fatal exception in interrupt
>
> Greetings Eric

2009-01-05 09:13:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Make treercu safe for suspend and resume


* Paul E. McKenney <[email protected]> wrote:

> Hello!
>
> Kudos to both Dhaval Giani and Jens Axboe for finding a bug in treercu
> that causes warnings after suspend-resume cycles in Dhaval's case and
> during stress tests in Jens's case. It would also probably cause failures
> if heavily stressed. The solution, ironically enough, is to revert to
> rcupreempt's code for initializing the dynticks state. And the patch
> even results in smaller code -- so what was I thinking???
>
> This is 2.6.29 material, given that people really do suspend and resume
> Linux these days. ;-)
>
> Located-by: Dhaval Giani <[email protected]>
> Located-by: Jens Axboe <[email protected]>
> Tested-by: Dhaval Giani <[email protected]>
> Tested-by: Jens Axboe <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
>
> rcutree.c | 12 ++++--------
> 1 file changed, 4 insertions(+), 8 deletions(-)

applied to tip/core/urgent, thanks guys!

Ingo