2016-11-24 17:33:55

by Vince Weaver

[permalink] [raw]
Subject: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start


This is on a skylake machine, linus git as of yesterday after the various
kasan-related fixes went in. Not sure if there were any that hadn't hit
upstream yet.

Anyway I can't tell from this one what the actual trigger is. After this
mess the fuzzer process was locked, udev started complaining, and it
eventually died completely after a few hours of repeated messages like
this.

[38898.373183] INFO: rcu_sched self-detected stall on CPU
[38898.378452] 7-...: (5249 ticks this GP) idle=727/140000000000001/0 softirq=3141908/3141908 fqs=2625
[38898.381211] INFO: rcu_sched detected stalls on CPUs/tasks:
[38898.381214] 0-...: (1 GPs behind) idle=05f/140000000000001/2 softirq=3285458/3285459 fqs=2625
[38898.381217] 7-...: (5249 ticks this GP) idle=727/140000000000001/0 softirq=3141908/3141908 fqs=2625
[38898.381218] (detected by 1, t=5252 jiffies, g=3685053, c=3685052, q=32)
[38898.381244] ==================================================================
[38898.381247] BUG: KASAN: stack-out-of-bounds in __unwind_start+0x1a2/0x1c0 at addr ffff8801e9727c28
[38898.381248] Read of size 8 by task swapper/1/0
[38898.381250] page:ffffea0007a5c9c0 count:0 mapcount:0 mapping: (null) index:0x0
[38898.381251] flags: 0x2ffff8000000000()
[38898.381251] page dumped because: kasan: bad access detected
[38898.381328] Memory state around the buggy address:
[38898.381330] ffff8801e9727b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[38898.381331] ffff8801e9727b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[38898.381332] >ffff8801e9727c00: 00 00 00 00 f1 f1 f1 f1 00 00 00 00 f3 f3 f3 f3
[38898.381333] ^
[38898.381334] ffff8801e9727c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1
[38898.381335] ffff8801e9727d00: f1 f1 f1 00 00 00 f4 f2 f2 f2 f2 00 00 00 00 f3
[38898.381335] ==================================================================
[38898.510702] (t=5284 jiffies g=3685053 c=3685052 q=32)

(That's all, the report above repeats but no useful things like a
backtrace are ever printed)


2016-11-28 21:54:20

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Thu, Nov 24, 2016 at 12:33:48PM -0500, Vince Weaver wrote:
>
> This is on a skylake machine, linus git as of yesterday after the various
> kasan-related fixes went in. Not sure if there were any that hadn't hit
> upstream yet.
>
> Anyway I can't tell from this one what the actual trigger is. After this
> mess the fuzzer process was locked, udev started complaining, and it
> eventually died completely after a few hours of repeated messages like
> this.
>
> [38898.373183] INFO: rcu_sched self-detected stall on CPU
> [38898.378452] 7-...: (5249 ticks this GP) idle=727/140000000000001/0 softirq=3141908/3141908 fqs=2625
> [38898.381211] INFO: rcu_sched detected stalls on CPUs/tasks:
> [38898.381214] 0-...: (1 GPs behind) idle=05f/140000000000001/2 softirq=3285458/3285459 fqs=2625
> [38898.381217] 7-...: (5249 ticks this GP) idle=727/140000000000001/0 softirq=3141908/3141908 fqs=2625
> [38898.381218] (detected by 1, t=5252 jiffies, g=3685053, c=3685052, q=32)
> [38898.381244] ==================================================================
> [38898.381247] BUG: KASAN: stack-out-of-bounds in __unwind_start+0x1a2/0x1c0 at addr ffff8801e9727c28
> [38898.381248] Read of size 8 by task swapper/1/0
> [38898.381250] page:ffffea0007a5c9c0 count:0 mapcount:0 mapping: (null) index:0x0
> [38898.381251] flags: 0x2ffff8000000000()
> [38898.381251] page dumped because: kasan: bad access detected
> [38898.381328] Memory state around the buggy address:
> [38898.381330] ffff8801e9727b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [38898.381331] ffff8801e9727b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [38898.381332] >ffff8801e9727c00: 00 00 00 00 f1 f1 f1 f1 00 00 00 00 f3 f3 f3 f3
> [38898.381333] ^
> [38898.381334] ffff8801e9727c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1
> [38898.381335] ffff8801e9727d00: f1 f1 f1 00 00 00 f4 f2 f2 f2 f2 00 00 00 00 f3
> [38898.381335] ==================================================================
> [38898.510702] (t=5284 jiffies g=3685053 c=3685052 q=32)
>
> (That's all, the report above repeats but no useful things like a
> backtrace are ever printed)

After looking at the RCU stall detection code, I think the KASAN error
and missing stack dump aren't very surprising. RCU calls the scheduler
dump_cpu_task() function, which seems inherently problematic: it tries
to dump the stack of a task while it's running on another CPU.

There are some issues with that:

1) There's no way to find the starting frame of a currently running task
from another CPU.

In fact, I'm wondering how dump_cpu_task() ever worked at all? It
seems like you'd have to get lucky that the sp/bp registers stored by
the last call to schedule() happen to point to a currently valid
stack frame.

2) Even if there were a way to find the starting frame, it's racy
because the target task could be overwriting the stack while we're
reading it.

3) IRQ/exception stack dumps would be missing anyway because the stack
dump code only looks at the current CPU's interrupt stacks.

Maybe dump_cpu_task() should instead run the stack dump directly from
the target CPU, e.g. with trigger_single_cpu_backtrace() or
smp_call_function_single()?

Paul, Peter, Ingo, any thoughts?

--
Josh

2016-11-29 00:40:35

by Paul E. McKenney

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Mon, Nov 28, 2016 at 03:54:11PM -0600, Josh Poimboeuf wrote:
> On Thu, Nov 24, 2016 at 12:33:48PM -0500, Vince Weaver wrote:
> >
> > This is on a skylake machine, linus git as of yesterday after the various
> > kasan-related fixes went in. Not sure if there were any that hadn't hit
> > upstream yet.
> >
> > Anyway I can't tell from this one what the actual trigger is. After this
> > mess the fuzzer process was locked, udev started complaining, and it
> > eventually died completely after a few hours of repeated messages like
> > this.
> >
> > [38898.373183] INFO: rcu_sched self-detected stall on CPU
> > [38898.378452] 7-...: (5249 ticks this GP) idle=727/140000000000001/0 softirq=3141908/3141908 fqs=2625
> > [38898.381211] INFO: rcu_sched detected stalls on CPUs/tasks:
> > [38898.381214] 0-...: (1 GPs behind) idle=05f/140000000000001/2 softirq=3285458/3285459 fqs=2625
> > [38898.381217] 7-...: (5249 ticks this GP) idle=727/140000000000001/0 softirq=3141908/3141908 fqs=2625
> > [38898.381218] (detected by 1, t=5252 jiffies, g=3685053, c=3685052, q=32)
> > [38898.381244] ==================================================================
> > [38898.381247] BUG: KASAN: stack-out-of-bounds in __unwind_start+0x1a2/0x1c0 at addr ffff8801e9727c28
> > [38898.381248] Read of size 8 by task swapper/1/0
> > [38898.381250] page:ffffea0007a5c9c0 count:0 mapcount:0 mapping: (null) index:0x0
> > [38898.381251] flags: 0x2ffff8000000000()
> > [38898.381251] page dumped because: kasan: bad access detected
> > [38898.381328] Memory state around the buggy address:
> > [38898.381330] ffff8801e9727b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > [38898.381331] ffff8801e9727b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > [38898.381332] >ffff8801e9727c00: 00 00 00 00 f1 f1 f1 f1 00 00 00 00 f3 f3 f3 f3
> > [38898.381333] ^
> > [38898.381334] ffff8801e9727c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1
> > [38898.381335] ffff8801e9727d00: f1 f1 f1 00 00 00 f4 f2 f2 f2 f2 00 00 00 00 f3
> > [38898.381335] ==================================================================
> > [38898.510702] (t=5284 jiffies g=3685053 c=3685052 q=32)
> >
> > (That's all, the report above repeats but no useful things like a
> > backtrace are ever printed)
>
> After looking at the RCU stall detection code, I think the KASAN error
> and missing stack dump aren't very surprising. RCU calls the scheduler
> dump_cpu_task() function, which seems inherently problematic: it tries
> to dump the stack of a task while it's running on another CPU.
>
> There are some issues with that:
>
> 1) There's no way to find the starting frame of a currently running task
> from another CPU.
>
> In fact, I'm wondering how dump_cpu_task() ever worked at all? It
> seems like you'd have to get lucky that the sp/bp registers stored by
> the last call to schedule() happen to point to a currently valid
> stack frame.
>
> 2) Even if there were a way to find the starting frame, it's racy
> because the target task could be overwriting the stack while we're
> reading it.
>
> 3) IRQ/exception stack dumps would be missing anyway because the stack
> dump code only looks at the current CPU's interrupt stacks.
>
> Maybe dump_cpu_task() should instead run the stack dump directly from
> the target CPU, e.g. with trigger_single_cpu_backtrace() or
> smp_call_function_single()?
>
> Paul, Peter, Ingo, any thoughts?

We used to do that, but the resulting NMIs were problematic on some
platforms. Perhaps things have gotten better?

Thaxn, Paul

2016-11-29 05:52:51

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Mon, Nov 28, 2016 at 04:40:21PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 28, 2016 at 03:54:11PM -0600, Josh Poimboeuf wrote:
> > On Thu, Nov 24, 2016 at 12:33:48PM -0500, Vince Weaver wrote:
> > >
> > > This is on a skylake machine, linus git as of yesterday after the various
> > > kasan-related fixes went in. Not sure if there were any that hadn't hit
> > > upstream yet.
> > >
> > > Anyway I can't tell from this one what the actual trigger is. After this
> > > mess the fuzzer process was locked, udev started complaining, and it
> > > eventually died completely after a few hours of repeated messages like
> > > this.
> > >
> > > [38898.373183] INFO: rcu_sched self-detected stall on CPU
> > > [38898.378452] 7-...: (5249 ticks this GP) idle=727/140000000000001/0 softirq=3141908/3141908 fqs=2625
> > > [38898.381211] INFO: rcu_sched detected stalls on CPUs/tasks:
> > > [38898.381214] 0-...: (1 GPs behind) idle=05f/140000000000001/2 softirq=3285458/3285459 fqs=2625
> > > [38898.381217] 7-...: (5249 ticks this GP) idle=727/140000000000001/0 softirq=3141908/3141908 fqs=2625
> > > [38898.381218] (detected by 1, t=5252 jiffies, g=3685053, c=3685052, q=32)
> > > [38898.381244] ==================================================================
> > > [38898.381247] BUG: KASAN: stack-out-of-bounds in __unwind_start+0x1a2/0x1c0 at addr ffff8801e9727c28
> > > [38898.381248] Read of size 8 by task swapper/1/0
> > > [38898.381250] page:ffffea0007a5c9c0 count:0 mapcount:0 mapping: (null) index:0x0
> > > [38898.381251] flags: 0x2ffff8000000000()
> > > [38898.381251] page dumped because: kasan: bad access detected
> > > [38898.381328] Memory state around the buggy address:
> > > [38898.381330] ffff8801e9727b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > [38898.381331] ffff8801e9727b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > [38898.381332] >ffff8801e9727c00: 00 00 00 00 f1 f1 f1 f1 00 00 00 00 f3 f3 f3 f3
> > > [38898.381333] ^
> > > [38898.381334] ffff8801e9727c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1
> > > [38898.381335] ffff8801e9727d00: f1 f1 f1 00 00 00 f4 f2 f2 f2 f2 00 00 00 00 f3
> > > [38898.381335] ==================================================================
> > > [38898.510702] (t=5284 jiffies g=3685053 c=3685052 q=32)
> > >
> > > (That's all, the report above repeats but no useful things like a
> > > backtrace are ever printed)
> >
> > After looking at the RCU stall detection code, I think the KASAN error
> > and missing stack dump aren't very surprising. RCU calls the scheduler
> > dump_cpu_task() function, which seems inherently problematic: it tries
> > to dump the stack of a task while it's running on another CPU.
> >
> > There are some issues with that:
> >
> > 1) There's no way to find the starting frame of a currently running task
> > from another CPU.
> >
> > In fact, I'm wondering how dump_cpu_task() ever worked at all? It
> > seems like you'd have to get lucky that the sp/bp registers stored by
> > the last call to schedule() happen to point to a currently valid
> > stack frame.
> >
> > 2) Even if there were a way to find the starting frame, it's racy
> > because the target task could be overwriting the stack while we're
> > reading it.
> >
> > 3) IRQ/exception stack dumps would be missing anyway because the stack
> > dump code only looks at the current CPU's interrupt stacks.
> >
> > Maybe dump_cpu_task() should instead run the stack dump directly from
> > the target CPU, e.g. with trigger_single_cpu_backtrace() or
> > smp_call_function_single()?
> >
> > Paul, Peter, Ingo, any thoughts?
>
> We used to do that, but the resulting NMIs were problematic on some
> platforms. Perhaps things have gotten better?

Did a little digging on git blame and found the following commit (which
seems to be the cause of the KASAN warning and missing stack dump):

bc1dce514e9b ("rcu: Don't use NMIs to dump other CPUs' stacks")

I presume this commit is still needed because of the NMI printk deadlock
issues which were discussed at Kernel Summit. I guess those issues need
to be sorted out before the above commit can be reverted.

--
Josh

2016-11-29 09:17:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Mon, Nov 28, 2016 at 11:52:41PM -0600, Josh Poimboeuf wrote:
> > We used to do that, but the resulting NMIs were problematic on some
> > platforms. Perhaps things have gotten better?
>
> Did a little digging on git blame and found the following commit (which
> seems to be the cause of the KASAN warning and missing stack dump):
>
> bc1dce514e9b ("rcu: Don't use NMIs to dump other CPUs' stacks")
>
> I presume this commit is still needed because of the NMI printk deadlock
> issues which were discussed at Kernel Summit. I guess those issues need
> to be sorted out before the above commit can be reverted.

so printk should more or less work from NMI, esp. after:

42a0bb3f7138 ("printk/nmi: generic solution for safe printk in NMI")


2016-11-29 10:28:36

by Paul E. McKenney

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Mon, Nov 28, 2016 at 11:52:41PM -0600, Josh Poimboeuf wrote:
> On Mon, Nov 28, 2016 at 04:40:21PM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 28, 2016 at 03:54:11PM -0600, Josh Poimboeuf wrote:
> > > On Thu, Nov 24, 2016 at 12:33:48PM -0500, Vince Weaver wrote:
> > > >
> > > > This is on a skylake machine, linus git as of yesterday after the various
> > > > kasan-related fixes went in. Not sure if there were any that hadn't hit
> > > > upstream yet.
> > > >
> > > > Anyway I can't tell from this one what the actual trigger is. After this
> > > > mess the fuzzer process was locked, udev started complaining, and it
> > > > eventually died completely after a few hours of repeated messages like
> > > > this.
> > > >
> > > > [38898.373183] INFO: rcu_sched self-detected stall on CPU
> > > > [38898.378452] 7-...: (5249 ticks this GP) idle=727/140000000000001/0 softirq=3141908/3141908 fqs=2625
> > > > [38898.381211] INFO: rcu_sched detected stalls on CPUs/tasks:
> > > > [38898.381214] 0-...: (1 GPs behind) idle=05f/140000000000001/2 softirq=3285458/3285459 fqs=2625
> > > > [38898.381217] 7-...: (5249 ticks this GP) idle=727/140000000000001/0 softirq=3141908/3141908 fqs=2625
> > > > [38898.381218] (detected by 1, t=5252 jiffies, g=3685053, c=3685052, q=32)
> > > > [38898.381244] ==================================================================
> > > > [38898.381247] BUG: KASAN: stack-out-of-bounds in __unwind_start+0x1a2/0x1c0 at addr ffff8801e9727c28
> > > > [38898.381248] Read of size 8 by task swapper/1/0
> > > > [38898.381250] page:ffffea0007a5c9c0 count:0 mapcount:0 mapping: (null) index:0x0
> > > > [38898.381251] flags: 0x2ffff8000000000()
> > > > [38898.381251] page dumped because: kasan: bad access detected
> > > > [38898.381328] Memory state around the buggy address:
> > > > [38898.381330] ffff8801e9727b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > > [38898.381331] ffff8801e9727b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > > [38898.381332] >ffff8801e9727c00: 00 00 00 00 f1 f1 f1 f1 00 00 00 00 f3 f3 f3 f3
> > > > [38898.381333] ^
> > > > [38898.381334] ffff8801e9727c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1
> > > > [38898.381335] ffff8801e9727d00: f1 f1 f1 00 00 00 f4 f2 f2 f2 f2 00 00 00 00 f3
> > > > [38898.381335] ==================================================================
> > > > [38898.510702] (t=5284 jiffies g=3685053 c=3685052 q=32)
> > > >
> > > > (That's all, the report above repeats but no useful things like a
> > > > backtrace are ever printed)
> > >
> > > After looking at the RCU stall detection code, I think the KASAN error
> > > and missing stack dump aren't very surprising. RCU calls the scheduler
> > > dump_cpu_task() function, which seems inherently problematic: it tries
> > > to dump the stack of a task while it's running on another CPU.
> > >
> > > There are some issues with that:
> > >
> > > 1) There's no way to find the starting frame of a currently running task
> > > from another CPU.
> > >
> > > In fact, I'm wondering how dump_cpu_task() ever worked at all? It
> > > seems like you'd have to get lucky that the sp/bp registers stored by
> > > the last call to schedule() happen to point to a currently valid
> > > stack frame.
> > >
> > > 2) Even if there were a way to find the starting frame, it's racy
> > > because the target task could be overwriting the stack while we're
> > > reading it.
> > >
> > > 3) IRQ/exception stack dumps would be missing anyway because the stack
> > > dump code only looks at the current CPU's interrupt stacks.
> > >
> > > Maybe dump_cpu_task() should instead run the stack dump directly from
> > > the target CPU, e.g. with trigger_single_cpu_backtrace() or
> > > smp_call_function_single()?
> > >
> > > Paul, Peter, Ingo, any thoughts?
> >
> > We used to do that, but the resulting NMIs were problematic on some
> > platforms. Perhaps things have gotten better?
>
> Did a little digging on git blame and found the following commit (which
> seems to be the cause of the KASAN warning and missing stack dump):
>
> bc1dce514e9b ("rcu: Don't use NMIs to dump other CPUs' stacks")
>
> I presume this commit is still needed because of the NMI printk deadlock
> issues which were discussed at Kernel Summit. I guess those issues need
> to be sorted out before the above commit can be reverted.

Agreed -- it would be very good to revert that commit, but not until
it is safe to do so. ;-)

Thanx, Paul

2016-11-29 12:43:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Mon, Nov 28, 2016 at 11:52:41PM -0600, Josh Poimboeuf wrote:

> Did a little digging on git blame and found the following commit (which
> seems to be the cause of the KASAN warning and missing stack dump):
>
> bc1dce514e9b ("rcu: Don't use NMIs to dump other CPUs' stacks")
>
> I presume this commit is still needed because of the NMI printk deadlock
> issues which were discussed at Kernel Summit. I guess those issues need
> to be sorted out before the above commit can be reverted.

Also, I most always run with these here patches applied:

https://lkml.kernel.org/r/[email protected]

People are very busy polishing the turd we call printk, but from where
I'm sitting its terminally and unfixably broken.

I should certainly add a revert of the above commit to the stack of
patches I carry.

2016-11-29 14:07:45

by Paul E. McKenney

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Tue, Nov 29, 2016 at 10:16:50AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 28, 2016 at 11:52:41PM -0600, Josh Poimboeuf wrote:
> > > We used to do that, but the resulting NMIs were problematic on some
> > > platforms. Perhaps things have gotten better?
> >
> > Did a little digging on git blame and found the following commit (which
> > seems to be the cause of the KASAN warning and missing stack dump):
> >
> > bc1dce514e9b ("rcu: Don't use NMIs to dump other CPUs' stacks")
> >
> > I presume this commit is still needed because of the NMI printk deadlock
> > issues which were discussed at Kernel Summit. I guess those issues need
> > to be sorted out before the above commit can be reverted.
>
> so printk should more or less work from NMI, esp. after:
>
> 42a0bb3f7138 ("printk/nmi: generic solution for safe printk in NMI")

And of course bc1dce514e9b doesn't revert cleanly, but see hand reversion
below. Also, 42a0bb3f7138's commit log calls out MN10300 and Xtensa as
needing more work. Has that happened?

But I really like the fact that RCU CPU stall warnings dump only those
stacks that are likely to be involved, and the patch below goes back
to dumping everyone. Shouldn't be that hard to fix, though...

Thanx, Paul

------------------------------------------------------------------------

commit e7c9d76ed508fe978c6657e33f4de1b160ee4efe
Author: Paul E. McKenney <[email protected]>
Date: Tue Nov 29 05:49:06 2016 -0800

rcu: Once again use NMI-based stack traces in stall warnings

This commit is for all intents and purposes a revert of bc1dce514e9b
("rcu: Don't use NMIs to dump other CPUs' stacks"). The reason to
suppose that this can now safely be reverted is the presence of
42a0bb3f7138 ("printk/nmi: generic solution for safe printk in NMI"),
which is said to have made NMI-based stack dumps safe.

Not-yet-signed-off-by: Paul E. McKenney <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Peter Zijlstra <[email protected]>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 91a68e4e6671..d73ccd4bed86 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1396,7 +1396,10 @@ static void rcu_check_gp_kthread_starvation(struct rcu_state *rsp)
}

/*
- * Dump stacks of all tasks running on stalled CPUs.
+ * Dump stacks of all tasks running on stalled CPUs. First try using
+ * NMIs, but fall back to manual remote stack tracing on architectures
+ * that don't support NMI-based stack dumps. The NMI-triggered stack
+ * traces are more accurate because they are printed by the target CPU.
*/
static void rcu_dump_cpu_stacks(struct rcu_state *rsp)
{
@@ -1404,6 +1407,8 @@ static void rcu_dump_cpu_stacks(struct rcu_state *rsp)
unsigned long flags;
struct rcu_node *rnp;

+ if (trigger_all_cpu_backtrace())
+ return;
rcu_for_each_leaf_node(rsp, rnp) {
raw_spin_lock_irqsave_rcu_node(rnp, flags);
if (rnp->qsmask != 0) {

2016-11-29 15:09:51

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Tue, Nov 29, 2016 at 06:07:34AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 29, 2016 at 10:16:50AM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 28, 2016 at 11:52:41PM -0600, Josh Poimboeuf wrote:
> > > > We used to do that, but the resulting NMIs were problematic on some
> > > > platforms. Perhaps things have gotten better?
> > >
> > > Did a little digging on git blame and found the following commit (which
> > > seems to be the cause of the KASAN warning and missing stack dump):
> > >
> > > bc1dce514e9b ("rcu: Don't use NMIs to dump other CPUs' stacks")
> > >
> > > I presume this commit is still needed because of the NMI printk deadlock
> > > issues which were discussed at Kernel Summit. I guess those issues need
> > > to be sorted out before the above commit can be reverted.
> >
> > so printk should more or less work from NMI, esp. after:
> >
> > 42a0bb3f7138 ("printk/nmi: generic solution for safe printk in NMI")
>
> And of course bc1dce514e9b doesn't revert cleanly, but see hand reversion
> below. Also, 42a0bb3f7138's commit log calls out MN10300 and Xtensa as
> needing more work. Has that happened?

Petr M, any idea?

> But I really like the fact that RCU CPU stall warnings dump only those
> stacks that are likely to be involved, and the patch below goes back
> to dumping everyone. Shouldn't be that hard to fix, though...

There's a new trigger_single_cpu_backtrace() function which can be used
for that.

> ------------------------------------------------------------------------
>
> commit e7c9d76ed508fe978c6657e33f4de1b160ee4efe
> Author: Paul E. McKenney <[email protected]>
> Date: Tue Nov 29 05:49:06 2016 -0800
>
> rcu: Once again use NMI-based stack traces in stall warnings
>
> This commit is for all intents and purposes a revert of bc1dce514e9b
> ("rcu: Don't use NMIs to dump other CPUs' stacks"). The reason to
> suppose that this can now safely be reverted is the presence of
> 42a0bb3f7138 ("printk/nmi: generic solution for safe printk in NMI"),
> which is said to have made NMI-based stack dumps safe.
>
> Not-yet-signed-off-by: Paul E. McKenney <[email protected]>
> Cc: Petr Mladek <[email protected]>
> Cc: Josh Poimboeuf <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 91a68e4e6671..d73ccd4bed86 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1396,7 +1396,10 @@ static void rcu_check_gp_kthread_starvation(struct rcu_state *rsp)
> }
>
> /*
> - * Dump stacks of all tasks running on stalled CPUs.
> + * Dump stacks of all tasks running on stalled CPUs. First try using
> + * NMIs, but fall back to manual remote stack tracing on architectures
> + * that don't support NMI-based stack dumps. The NMI-triggered stack
> + * traces are more accurate because they are printed by the target CPU.
> */
> static void rcu_dump_cpu_stacks(struct rcu_state *rsp)
> {
> @@ -1404,6 +1407,8 @@ static void rcu_dump_cpu_stacks(struct rcu_state *rsp)
> unsigned long flags;
> struct rcu_node *rnp;
>
> + if (trigger_all_cpu_backtrace())
> + return;
> rcu_for_each_leaf_node(rsp, rnp) {
> raw_spin_lock_irqsave_rcu_node(rnp, flags);
> if (rnp->qsmask != 0) {
>

--
Josh

2016-11-29 15:10:27

by Paul E. McKenney

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Tue, Nov 29, 2016 at 01:43:23PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 28, 2016 at 11:52:41PM -0600, Josh Poimboeuf wrote:
>
> > Did a little digging on git blame and found the following commit (which
> > seems to be the cause of the KASAN warning and missing stack dump):
> >
> > bc1dce514e9b ("rcu: Don't use NMIs to dump other CPUs' stacks")
> >
> > I presume this commit is still needed because of the NMI printk deadlock
> > issues which were discussed at Kernel Summit. I guess those issues need
> > to be sorted out before the above commit can be reverted.
>
> Also, I most always run with these here patches applied:
>
> https://lkml.kernel.org/r/[email protected]
>
> People are very busy polishing the turd we call printk, but from where
> I'm sitting its terminally and unfixably broken.
>
> I should certainly add a revert of the above commit to the stack of
> patches I carry.

This isn't making me feel particularly confident about switching RCU
CPU stall warnings back to NMIs... ;-)

Thanx, Paul

2016-11-29 16:12:56

by Petr Mladek

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Tue 2016-11-29 09:09:17, Josh Poimboeuf wrote:
> On Tue, Nov 29, 2016 at 06:07:34AM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 29, 2016 at 10:16:50AM +0100, Peter Zijlstra wrote:
> > > On Mon, Nov 28, 2016 at 11:52:41PM -0600, Josh Poimboeuf wrote:
> > > > > We used to do that, but the resulting NMIs were problematic on some
> > > > > platforms. Perhaps things have gotten better?
> > > >
> > > > Did a little digging on git blame and found the following commit (which
> > > > seems to be the cause of the KASAN warning and missing stack dump):
> > > >
> > > > bc1dce514e9b ("rcu: Don't use NMIs to dump other CPUs' stacks")
> > > >
> > > > I presume this commit is still needed because of the NMI printk deadlock
> > > > issues which were discussed at Kernel Summit. I guess those issues need
> > > > to be sorted out before the above commit can be reverted.
> > >
> > > so printk should more or less work from NMI, esp. after:
> > >
> > > 42a0bb3f7138 ("printk/nmi: generic solution for safe printk in NMI")
> >
> > And of course bc1dce514e9b doesn't revert cleanly, but see hand reversion
> > below. Also, 42a0bb3f7138's commit log calls out MN10300 and Xtensa as
> > needing more work. Has that happened?
>
> Petr M, any idea?

These two architectures do not support the safe printk in NMI. But
these architectures also do not implement trigger_all_cpu_backtrace()
and other trigger_*_backtrace() functions. Therefore these functions
return false there.

In fact, only very few architectures implement trigger_*_backtrace().
And only few of them use NMI (x86, arm, tile). I have just double
checked that these all use the safe printk in NMI.

By other words, if trigger_all_cpu_backtrace() or
trigger_single_cpu_backtrace() returns true, it should be NMI safe
and you could use it here.


> > But I really like the fact that RCU CPU stall warnings dump only those
> > stacks that are likely to be involved, and the patch below goes back
> > to dumping everyone. Shouldn't be that hard to fix, though...
>
> There's a new trigger_single_cpu_backtrace() function which can be used
> for that.

There is newly also trigger_cpumask_backtrace(struct cpumask *mask)
where you could select more CPUs using the mask. If this is of any help.

Best Regards,
Petr

2016-11-29 16:29:30

by Petr Mladek

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Tue 2016-11-29 07:10:04, Paul E. McKenney wrote:
> On Tue, Nov 29, 2016 at 01:43:23PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 28, 2016 at 11:52:41PM -0600, Josh Poimboeuf wrote:
> >
> > > Did a little digging on git blame and found the following commit (which
> > > seems to be the cause of the KASAN warning and missing stack dump):
> > >
> > > bc1dce514e9b ("rcu: Don't use NMIs to dump other CPUs' stacks")
> > >
> > > I presume this commit is still needed because of the NMI printk deadlock
> > > issues which were discussed at Kernel Summit. I guess those issues need
> > > to be sorted out before the above commit can be reverted.
> >
> > Also, I most always run with these here patches applied:
> >
> > https://lkml.kernel.org/r/[email protected]
> >
> > People are very busy polishing the turd we call printk, but from where
> > I'm sitting its terminally and unfixably broken.

I still hope that we could do better :-)


> > I should certainly add a revert of the above commit to the stack of
> > patches I carry.
>
> This isn't making me feel particularly confident about switching RCU
> CPU stall warnings back to NMIs... ;-)

IMHO, trigger_single_cpu_backtrace() is pretty safe at the moment.
It uses per-CPU buffers a lockless way in NMI context. It even makes
sure that the buffers are flushed to the main log buffer and console
once it is back from NMI.

By other words, the deadlocks in NMI context should be gone. The
NMI buffers are flushed using the classic printk(). Therefore
the risk is the same as when you use printk() directly
in rcu_dump_cpu_stacks() now.

Best Regards,
Petr

2016-11-29 16:52:42

by Paul E. McKenney

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Tue, Nov 29, 2016 at 09:09:17AM -0600, Josh Poimboeuf wrote:
> On Tue, Nov 29, 2016 at 06:07:34AM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 29, 2016 at 10:16:50AM +0100, Peter Zijlstra wrote:
> > > On Mon, Nov 28, 2016 at 11:52:41PM -0600, Josh Poimboeuf wrote:
> > > > > We used to do that, but the resulting NMIs were problematic on some
> > > > > platforms. Perhaps things have gotten better?
> > > >
> > > > Did a little digging on git blame and found the following commit (which
> > > > seems to be the cause of the KASAN warning and missing stack dump):
> > > >
> > > > bc1dce514e9b ("rcu: Don't use NMIs to dump other CPUs' stacks")
> > > >
> > > > I presume this commit is still needed because of the NMI printk deadlock
> > > > issues which were discussed at Kernel Summit. I guess those issues need
> > > > to be sorted out before the above commit can be reverted.
> > >
> > > so printk should more or less work from NMI, esp. after:
> > >
> > > 42a0bb3f7138 ("printk/nmi: generic solution for safe printk in NMI")
> >
> > And of course bc1dce514e9b doesn't revert cleanly, but see hand reversion
> > below. Also, 42a0bb3f7138's commit log calls out MN10300 and Xtensa as
> > needing more work. Has that happened?
>
> Petr M, any idea?

My Not-yet-signed-off-by is due to this concern, FWIW.

> > But I really like the fact that RCU CPU stall warnings dump only those
> > stacks that are likely to be involved, and the patch below goes back
> > to dumping everyone. Shouldn't be that hard to fix, though...
>
> There's a new trigger_single_cpu_backtrace() function which can be used
> for that.

Even better, thank you! Killed an hour or so of coding, but I must
confess that it was a mercy killing. ;-)

Much nicer (but completely untested) patch below.

Thanx, Paul

------------------------------------------------------------------------

commit d3515ee46e0cff880170e48a05e8f2791b507758
Author: Paul E. McKenney <[email protected]>
Date: Tue Nov 29 05:49:06 2016 -0800

rcu: Once again use NMI-based stack traces in stall warnings

This commit is for all intents and purposes a revert of bc1dce514e9b
("rcu: Don't use NMIs to dump other CPUs' stacks"). The reason to suppose
that this can now safely be reverted is the presence of 42a0bb3f7138
("printk/nmi: generic solution for safe printk in NMI"), which is said
to have made NMI-based stack dumps safe.

However, this reversion keeps one nice property of bc1dce514e9b
("rcu: Don't use NMIs to dump other CPUs' stacks"), namely that
only those CPUs blocking the grace period are dumped. The new
trigger_single_cpu_backtrace() is used to make this happen, as
suggested by Josh Poimboeuf.

Not-yet-signed-off-by: Paul E. McKenney <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Peter Zijlstra <[email protected]>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 91a68e4e6671..ba0e4825be9d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1396,7 +1396,10 @@ static void rcu_check_gp_kthread_starvation(struct rcu_state *rsp)
}

/*
- * Dump stacks of all tasks running on stalled CPUs.
+ * Dump stacks of all tasks running on stalled CPUs. First try using
+ * NMIs, but fall back to manual remote stack tracing on architectures
+ * that don't support NMI-based stack dumps. The NMI-triggered stack
+ * traces are more accurate because they are printed by the target CPU.
*/
static void rcu_dump_cpu_stacks(struct rcu_state *rsp)
{
@@ -1406,11 +1409,10 @@ static void rcu_dump_cpu_stacks(struct rcu_state *rsp)

rcu_for_each_leaf_node(rsp, rnp) {
raw_spin_lock_irqsave_rcu_node(rnp, flags);
- if (rnp->qsmask != 0) {
- for_each_leaf_node_possible_cpu(rnp, cpu)
- if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu))
+ for_each_leaf_node_possible_cpu(rnp, cpu)
+ if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu))
+ if (!trigger_single_cpu_backtrace(cpu))
dump_cpu_task(cpu);
- }
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
}
}
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 7dcdd59d894c..c0a4bf8f1ed0 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -691,18 +691,6 @@ static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll)
#endif /* #ifdef CONFIG_RCU_TRACE */

/*
- * Place this after a lock-acquisition primitive to guarantee that
- * an UNLOCK+LOCK pair act as a full barrier. This guarantee applies
- * if the UNLOCK and LOCK are executed by the same CPU or if the
- * UNLOCK and LOCK operate on the same lock variable.
- */
-#ifdef CONFIG_PPC
-#define smp_mb__after_unlock_lock() smp_mb() /* Full ordering for lock. */
-#else /* #ifdef CONFIG_PPC */
-#define smp_mb__after_unlock_lock() do { } while (0)
-#endif /* #else #ifdef CONFIG_PPC */
-
-/*
* Wrappers for the rcu_node::lock acquire and release.
*
* Because the rcu_nodes form a tree, the tree traversal locking will observe

2016-11-29 17:11:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Tue, Nov 29, 2016 at 05:29:20PM +0100, Petr Mladek wrote:

> > > People are very busy polishing the turd we call printk, but from where
> > > I'm sitting its terminally and unfixably broken.
>
> I still hope that we could do better :-)

How? The console drivers are a complete trainwreck, you simply cannot
build anything sensible ontop of a trainwreck.

And from what I understood from talking to someone (I again forgot who)
at LPC, the whole reason people were poking at this is that the block
layer (or something thereabouts) prints a gazillion lines of crap when
you attach a stupid amount of devices (through FC or other SAN like
things).

The way we've 'fixed' that in the scheduler (a fairly long time ago)
when SGI complained about our printks taking too long (because they had
4096 CPUs), is to simply remove the printks (they're now hidden behind
the sched_debug boot param).


In any case, as long as printk has a globally serialized 'log', it, per
design, will be worse than the console drivers its build upon. And them
being shit precludes the entire stack from being useful.

It mostly works, most of the time, and that seems to be what Linus
wants, since its really the best we can have given the constraints. But
for debugging, when you have a UART, it totally blows.


2016-11-29 17:18:01

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Tue, Nov 29, 2016 at 08:51:52AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 29, 2016 at 09:09:17AM -0600, Josh Poimboeuf wrote:
> > On Tue, Nov 29, 2016 at 06:07:34AM -0800, Paul E. McKenney wrote:
> > > On Tue, Nov 29, 2016 at 10:16:50AM +0100, Peter Zijlstra wrote:
> > > > On Mon, Nov 28, 2016 at 11:52:41PM -0600, Josh Poimboeuf wrote:
> > > > > > We used to do that, but the resulting NMIs were problematic on some
> > > > > > platforms. Perhaps things have gotten better?
> > > > >
> > > > > Did a little digging on git blame and found the following commit (which
> > > > > seems to be the cause of the KASAN warning and missing stack dump):
> > > > >
> > > > > bc1dce514e9b ("rcu: Don't use NMIs to dump other CPUs' stacks")
> > > > >
> > > > > I presume this commit is still needed because of the NMI printk deadlock
> > > > > issues which were discussed at Kernel Summit. I guess those issues need
> > > > > to be sorted out before the above commit can be reverted.
> > > >
> > > > so printk should more or less work from NMI, esp. after:
> > > >
> > > > 42a0bb3f7138 ("printk/nmi: generic solution for safe printk in NMI")
> > >
> > > And of course bc1dce514e9b doesn't revert cleanly, but see hand reversion
> > > below. Also, 42a0bb3f7138's commit log calls out MN10300 and Xtensa as
> > > needing more work. Has that happened?
> >
> > Petr M, any idea?
>
> My Not-yet-signed-off-by is due to this concern, FWIW.

I think Petr's replies have addressed that now.

> > > But I really like the fact that RCU CPU stall warnings dump only those
> > > stacks that are likely to be involved, and the patch below goes back
> > > to dumping everyone. Shouldn't be that hard to fix, though...
> >
> > There's a new trigger_single_cpu_backtrace() function which can be used
> > for that.
>
> Even better, thank you! Killed an hour or so of coding, but I must
> confess that it was a mercy killing. ;-)

Ha :-)

> Much nicer (but completely untested) patch below.

The kernel/rcu/tree.h changes seem intended for another patch?

Otherwise:

Reviewed-by: Josh Poimboeuf <[email protected]>

Also I think this will fix the KASAN warnings reported by Vince, so you
might add:

Reported-by: Vince Weaver <[email protected]>

>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit d3515ee46e0cff880170e48a05e8f2791b507758
> Author: Paul E. McKenney <[email protected]>
> Date: Tue Nov 29 05:49:06 2016 -0800
>
> rcu: Once again use NMI-based stack traces in stall warnings
>
> This commit is for all intents and purposes a revert of bc1dce514e9b
> ("rcu: Don't use NMIs to dump other CPUs' stacks"). The reason to suppose
> that this can now safely be reverted is the presence of 42a0bb3f7138
> ("printk/nmi: generic solution for safe printk in NMI"), which is said
> to have made NMI-based stack dumps safe.
>
> However, this reversion keeps one nice property of bc1dce514e9b
> ("rcu: Don't use NMIs to dump other CPUs' stacks"), namely that
> only those CPUs blocking the grace period are dumped. The new
> trigger_single_cpu_backtrace() is used to make this happen, as
> suggested by Josh Poimboeuf.
>
> Not-yet-signed-off-by: Paul E. McKenney <[email protected]>
> Cc: Petr Mladek <[email protected]>
> Cc: Josh Poimboeuf <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 91a68e4e6671..ba0e4825be9d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1396,7 +1396,10 @@ static void rcu_check_gp_kthread_starvation(struct rcu_state *rsp)
> }
>
> /*
> - * Dump stacks of all tasks running on stalled CPUs.
> + * Dump stacks of all tasks running on stalled CPUs. First try using
> + * NMIs, but fall back to manual remote stack tracing on architectures
> + * that don't support NMI-based stack dumps. The NMI-triggered stack
> + * traces are more accurate because they are printed by the target CPU.
> */
> static void rcu_dump_cpu_stacks(struct rcu_state *rsp)
> {
> @@ -1406,11 +1409,10 @@ static void rcu_dump_cpu_stacks(struct rcu_state *rsp)
>
> rcu_for_each_leaf_node(rsp, rnp) {
> raw_spin_lock_irqsave_rcu_node(rnp, flags);
> - if (rnp->qsmask != 0) {
> - for_each_leaf_node_possible_cpu(rnp, cpu)
> - if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu))
> + for_each_leaf_node_possible_cpu(rnp, cpu)
> + if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu))
> + if (!trigger_single_cpu_backtrace(cpu))
> dump_cpu_task(cpu);
> - }
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> }
> }
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 7dcdd59d894c..c0a4bf8f1ed0 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -691,18 +691,6 @@ static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll)
> #endif /* #ifdef CONFIG_RCU_TRACE */
>
> /*
> - * Place this after a lock-acquisition primitive to guarantee that
> - * an UNLOCK+LOCK pair act as a full barrier. This guarantee applies
> - * if the UNLOCK and LOCK are executed by the same CPU or if the
> - * UNLOCK and LOCK operate on the same lock variable.
> - */
> -#ifdef CONFIG_PPC
> -#define smp_mb__after_unlock_lock() smp_mb() /* Full ordering for lock. */
> -#else /* #ifdef CONFIG_PPC */
> -#define smp_mb__after_unlock_lock() do { } while (0)
> -#endif /* #else #ifdef CONFIG_PPC */
> -
> -/*
> * Wrappers for the rcu_node::lock acquire and release.
> *
> * Because the rcu_nodes form a tree, the tree traversal locking will observe
>

2016-11-29 17:36:12

by Paul E. McKenney

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Tue, Nov 29, 2016 at 11:17:25AM -0600, Josh Poimboeuf wrote:
> On Tue, Nov 29, 2016 at 08:51:52AM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 29, 2016 at 09:09:17AM -0600, Josh Poimboeuf wrote:
> > > On Tue, Nov 29, 2016 at 06:07:34AM -0800, Paul E. McKenney wrote:
> > > > On Tue, Nov 29, 2016 at 10:16:50AM +0100, Peter Zijlstra wrote:
> > > > > On Mon, Nov 28, 2016 at 11:52:41PM -0600, Josh Poimboeuf wrote:
> > > > > > > We used to do that, but the resulting NMIs were problematic on some
> > > > > > > platforms. Perhaps things have gotten better?
> > > > > >
> > > > > > Did a little digging on git blame and found the following commit (which
> > > > > > seems to be the cause of the KASAN warning and missing stack dump):
> > > > > >
> > > > > > bc1dce514e9b ("rcu: Don't use NMIs to dump other CPUs' stacks")
> > > > > >
> > > > > > I presume this commit is still needed because of the NMI printk deadlock
> > > > > > issues which were discussed at Kernel Summit. I guess those issues need
> > > > > > to be sorted out before the above commit can be reverted.
> > > > >
> > > > > so printk should more or less work from NMI, esp. after:
> > > > >
> > > > > 42a0bb3f7138 ("printk/nmi: generic solution for safe printk in NMI")
> > > >
> > > > And of course bc1dce514e9b doesn't revert cleanly, but see hand reversion
> > > > below. Also, 42a0bb3f7138's commit log calls out MN10300 and Xtensa as
> > > > needing more work. Has that happened?
> > >
> > > Petr M, any idea?
> >
> > My Not-yet-signed-off-by is due to this concern, FWIW.
>
> I think Petr's replies have addressed that now.
>
> > > > But I really like the fact that RCU CPU stall warnings dump only those
> > > > stacks that are likely to be involved, and the patch below goes back
> > > > to dumping everyone. Shouldn't be that hard to fix, though...
> > >
> > > There's a new trigger_single_cpu_backtrace() function which can be used
> > > for that.
> >
> > Even better, thank you! Killed an hour or so of coding, but I must
> > confess that it was a mercy killing. ;-)
>
> Ha :-)
>
> > Much nicer (but completely untested) patch below.
>
> The kernel/rcu/tree.h changes seem intended for another patch?

Indeed it was, thank you for catching this, fixed.

> Otherwise:
>
> Reviewed-by: Josh Poimboeuf <[email protected]>
>
> Also I think this will fix the KASAN warnings reported by Vince, so you
> might add:
>
> Reported-by: Vince Weaver <[email protected]>

Added both of these, thank you!

Updated (but still untested) commit below.


Thanx, Paul

------------------------------------------------------------------------

commit d3df9bc5fb5d838b049f32a476721eadbc349553
Author: Paul E. McKenney <[email protected]>
Date: Tue Nov 29 05:49:06 2016 -0800

rcu: Once again use NMI-based stack traces in stall warnings

This commit is for all intents and purposes a revert of bc1dce514e9b
("rcu: Don't use NMIs to dump other CPUs' stacks"). The reason to suppose
that this can now safely be reverted is the presence of 42a0bb3f7138
("printk/nmi: generic solution for safe printk in NMI"), which is said
to have made NMI-based stack dumps safe.

However, this reversion keeps one nice property of bc1dce514e9b
("rcu: Don't use NMIs to dump other CPUs' stacks"), namely that
only those CPUs blocking the grace period are dumped. The new
trigger_single_cpu_backtrace() is used to make this happen, as
suggested by Josh Poimboeuf.

Reported-by: Vince Weaver <[email protected]>
Not-yet-signed-off-by: Paul E. McKenney <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Reviewed-by: Josh Poimboeuf <[email protected]>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 91a68e4e6671..ba0e4825be9d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1396,7 +1396,10 @@ static void rcu_check_gp_kthread_starvation(struct rcu_state *rsp)
}

/*
- * Dump stacks of all tasks running on stalled CPUs.
+ * Dump stacks of all tasks running on stalled CPUs. First try using
+ * NMIs, but fall back to manual remote stack tracing on architectures
+ * that don't support NMI-based stack dumps. The NMI-triggered stack
+ * traces are more accurate because they are printed by the target CPU.
*/
static void rcu_dump_cpu_stacks(struct rcu_state *rsp)
{
@@ -1406,11 +1409,10 @@ static void rcu_dump_cpu_stacks(struct rcu_state *rsp)

rcu_for_each_leaf_node(rsp, rnp) {
raw_spin_lock_irqsave_rcu_node(rnp, flags);
- if (rnp->qsmask != 0) {
- for_each_leaf_node_possible_cpu(rnp, cpu)
- if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu))
+ for_each_leaf_node_possible_cpu(rnp, cpu)
+ if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu))
+ if (!trigger_single_cpu_backtrace(cpu))
dump_cpu_task(cpu);
- }
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
}
}

2016-11-29 18:02:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Tue, Nov 29, 2016 at 05:12:46PM +0100, Petr Mladek wrote:
> On Tue 2016-11-29 09:09:17, Josh Poimboeuf wrote:
> > On Tue, Nov 29, 2016 at 06:07:34AM -0800, Paul E. McKenney wrote:
> > > On Tue, Nov 29, 2016 at 10:16:50AM +0100, Peter Zijlstra wrote:
> > > > On Mon, Nov 28, 2016 at 11:52:41PM -0600, Josh Poimboeuf wrote:
> > > > > > We used to do that, but the resulting NMIs were problematic on some
> > > > > > platforms. Perhaps things have gotten better?
> > > > >
> > > > > Did a little digging on git blame and found the following commit (which
> > > > > seems to be the cause of the KASAN warning and missing stack dump):
> > > > >
> > > > > bc1dce514e9b ("rcu: Don't use NMIs to dump other CPUs' stacks")
> > > > >
> > > > > I presume this commit is still needed because of the NMI printk deadlock
> > > > > issues which were discussed at Kernel Summit. I guess those issues need
> > > > > to be sorted out before the above commit can be reverted.
> > > >
> > > > so printk should more or less work from NMI, esp. after:
> > > >
> > > > 42a0bb3f7138 ("printk/nmi: generic solution for safe printk in NMI")
> > >
> > > And of course bc1dce514e9b doesn't revert cleanly, but see hand reversion
> > > below. Also, 42a0bb3f7138's commit log calls out MN10300 and Xtensa as
> > > needing more work. Has that happened?
> >
> > Petr M, any idea?
>
> These two architectures do not support the safe printk in NMI. But
> these architectures also do not implement trigger_all_cpu_backtrace()
> and other trigger_*_backtrace() functions. Therefore these functions
> return false there.
>
> In fact, only very few architectures implement trigger_*_backtrace().
> And only few of them use NMI (x86, arm, tile). I have just double
> checked that these all use the safe printk in NMI.
>
> By other words, if trigger_all_cpu_backtrace() or
> trigger_single_cpu_backtrace() returns true, it should be NMI safe
> and you could use it here.

Good, I will upgrade my commit to Signed-off-by, then.

> > > But I really like the fact that RCU CPU stall warnings dump only those
> > > stacks that are likely to be involved, and the patch below goes back
> > > to dumping everyone. Shouldn't be that hard to fix, though...
> >
> > There's a new trigger_single_cpu_backtrace() function which can be used
> > for that.
>
> There is newly also trigger_cpumask_backtrace(struct cpumask *mask)
> where you could select more CPUs using the mask. If this is of any help.

In my experience, there is almost never a large number of CPUs stalling
a given RCU grace period. But thank you for letting me know about
trigger_cpumask_backtrace(), as it might be useful in the future.

Thanx, Paul

2016-11-29 19:40:23

by Paul E. McKenney

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Tue, Nov 29, 2016 at 06:10:38PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 29, 2016 at 05:29:20PM +0100, Petr Mladek wrote:
>
> > > > People are very busy polishing the turd we call printk, but from where
> > > > I'm sitting its terminally and unfixably broken.
> >
> > I still hope that we could do better :-)
>
> How? The console drivers are a complete trainwreck, you simply cannot
> build anything sensible ontop of a trainwreck.
>
> And from what I understood from talking to someone (I again forgot who)
> at LPC, the whole reason people were poking at this is that the block
> layer (or something thereabouts) prints a gazillion lines of crap when
> you attach a stupid amount of devices (through FC or other SAN like
> things).
>
> The way we've 'fixed' that in the scheduler (a fairly long time ago)
> when SGI complained about our printks taking too long (because they had
> 4096 CPUs), is to simply remove the printks (they're now hidden behind
> the sched_debug boot param).
>
>
> In any case, as long as printk has a globally serialized 'log', it, per
> design, will be worse than the console drivers its build upon. And them
> being shit precludes the entire stack from being useful.
>
> It mostly works, most of the time, and that seems to be what Linus
> wants, since its really the best we can have given the constraints. But
> for debugging, when you have a UART, it totally blows.

UART??? They still make those things??? ;-)

Thanx, Paul

2016-11-29 19:52:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Tue, Nov 29, 2016 at 11:39:35AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 29, 2016 at 06:10:38PM +0100, Peter Zijlstra wrote:

> > It mostly works, most of the time, and that seems to be what Linus
> > wants, since its really the best we can have given the constraints. But
> > for debugging, when you have a UART, it totally blows.
>
> UART??? They still make those things??? ;-)

Yes, most computer like devices actually have them, trouble is, most
consumer devices don't have the pins exposed. Luckily most server class
hardware still does.

And they're absolutely _awesome_ for debugging; getting data out is a
matter of trivial MMIO poll loops. Rock solid stuff.

2016-11-29 20:07:25

by Paul E. McKenney

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Tue, Nov 29, 2016 at 08:52:04PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 29, 2016 at 11:39:35AM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 29, 2016 at 06:10:38PM +0100, Peter Zijlstra wrote:
>
> > > It mostly works, most of the time, and that seems to be what Linus
> > > wants, since its really the best we can have given the constraints. But
> > > for debugging, when you have a UART, it totally blows.
> >
> > UART??? They still make those things??? ;-)
>
> Yes, most computer like devices actually have them, trouble is, most
> consumer devices don't have the pins exposed. Luckily most server class
> hardware still does.
>
> And they're absolutely _awesome_ for debugging; getting data out is a
> matter of trivial MMIO poll loops. Rock solid stuff.

They very clearly need to bring the baud rate into the current millenium,
many tens of Mbaud at the -very- least.

Thanx, Paul

2016-11-29 20:33:12

by Paul E. McKenney

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Tue, Nov 29, 2016 at 12:07:11PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 29, 2016 at 08:52:04PM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 29, 2016 at 11:39:35AM -0800, Paul E. McKenney wrote:
> > > On Tue, Nov 29, 2016 at 06:10:38PM +0100, Peter Zijlstra wrote:
> >
> > > > It mostly works, most of the time, and that seems to be what Linus
> > > > wants, since its really the best we can have given the constraints. But
> > > > for debugging, when you have a UART, it totally blows.
> > >
> > > UART??? They still make those things??? ;-)
> >
> > Yes, most computer like devices actually have them, trouble is, most
> > consumer devices don't have the pins exposed. Luckily most server class
> > hardware still does.
> >
> > And they're absolutely _awesome_ for debugging; getting data out is a
> > matter of trivial MMIO poll loops. Rock solid stuff.
>
> They very clearly need to bring the baud rate into the current millenium,
> many tens of Mbaud at the -very- least.

On a more practical note...

Currently, cond_resched_rcu_qs() is not permitted to be invoked until
after the scheduler has started. However, it appears that there is some
kernel code that can loop for quite some time at runtime, but which also
executes during early boot. So it would be good to make it so that
cond_resched_rcu_qs() could be called at boot.

One approach would be to check rcu_scheduler_active, but this isn't
defined in normal Tiny RCU builds. I can expand Tiny RCU, or I can
kludge the non-CONFIG_DEBUG_LOCK_ALLOC value of rcu_scheduler_active
to false (with this latter being the current state). But it occurred
to me that I could also condition on !is_idle_task(), given that idle
tasks shouldn't ever be invoking the scheduler anyway.

So is the following a sensible approach, or should I look elsewhere?

#define cond_resched_rcu_qs() \
do { \
if (!is_idle_task(current) && !cond_resched()) \
rcu_note_voluntary_context_switch(current); \
} while (0)

Thanx, Paul

2016-11-30 09:59:35

by Petr Mladek

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Tue 2016-11-29 09:36:00, Paul E. McKenney wrote:
> Updated (but still untested) commit below.
>
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit d3df9bc5fb5d838b049f32a476721eadbc349553
> Author: Paul E. McKenney <[email protected]>
> Date: Tue Nov 29 05:49:06 2016 -0800
>
> rcu: Once again use NMI-based stack traces in stall warnings
>
> This commit is for all intents and purposes a revert of bc1dce514e9b
> ("rcu: Don't use NMIs to dump other CPUs' stacks"). The reason to suppose
> that this can now safely be reverted is the presence of 42a0bb3f7138
> ("printk/nmi: generic solution for safe printk in NMI"), which is said
> to have made NMI-based stack dumps safe.
>
> However, this reversion keeps one nice property of bc1dce514e9b
> ("rcu: Don't use NMIs to dump other CPUs' stacks"), namely that
> only those CPUs blocking the grace period are dumped. The new
> trigger_single_cpu_backtrace() is used to make this happen, as
> suggested by Josh Poimboeuf.
>
> Reported-by: Vince Weaver <[email protected]>
> Not-yet-signed-off-by: Paul E. McKenney <[email protected]>
> Cc: Petr Mladek <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Reviewed-by: Josh Poimboeuf <[email protected]>

Looks fine to me.

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr

> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 91a68e4e6671..ba0e4825be9d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1396,7 +1396,10 @@ static void rcu_check_gp_kthread_starvation(struct rcu_state *rsp)
> }
>
> /*
> - * Dump stacks of all tasks running on stalled CPUs.
> + * Dump stacks of all tasks running on stalled CPUs. First try using
> + * NMIs, but fall back to manual remote stack tracing on architectures
> + * that don't support NMI-based stack dumps. The NMI-triggered stack
> + * traces are more accurate because they are printed by the target CPU.
> */
> static void rcu_dump_cpu_stacks(struct rcu_state *rsp)
> {
> @@ -1406,11 +1409,10 @@ static void rcu_dump_cpu_stacks(struct rcu_state *rsp)
>
> rcu_for_each_leaf_node(rsp, rnp) {
> raw_spin_lock_irqsave_rcu_node(rnp, flags);
> - if (rnp->qsmask != 0) {
> - for_each_leaf_node_possible_cpu(rnp, cpu)
> - if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu))
> + for_each_leaf_node_possible_cpu(rnp, cpu)
> + if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu))
> + if (!trigger_single_cpu_backtrace(cpu))
> dump_cpu_task(cpu);
> - }
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> }
> }
>

2016-11-30 10:04:59

by Petr Mladek

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Tue 2016-11-29 18:10:38, Peter Zijlstra wrote:
> On Tue, Nov 29, 2016 at 05:29:20PM +0100, Petr Mladek wrote:
>
> > > > People are very busy polishing the turd we call printk, but from where
> > > > I'm sitting its terminally and unfixably broken.
> >
> > I still hope that we could do better :-)
>
> How? The console drivers are a complete trainwreck, you simply cannot
> build anything sensible ontop of a trainwreck.

I am afraid that I will not persuade you but...


> And from what I understood from talking to someone (I again forgot who)
> at LPC, the whole reason people were poking at this is that the block
> layer (or something thereabouts) prints a gazillion lines of crap when
> you attach a stupid amount of devices (through FC or other SAN like
> things).

This is crazy indeed if it happens on a production system.


> The way we've 'fixed' that in the scheduler (a fairly long time ago)
> when SGI complained about our printks taking too long (because they had
> 4096 CPUs), is to simply remove the printks (they're now hidden behind
> the sched_debug boot param).

This is a solution. But what if you want to enable debugging and the
system does not boot because the printing takes too long.


> In any case, as long as printk has a globally serialized 'log', it, per
> design, will be worse than the console drivers its build upon. And them
> being shit precludes the entire stack from being useful.

I probably still do not understand all the problems with console
drivers. My understanding is that the problem is that they have
its own locking and are slow. It means that they are prone to
a deadlock and they might block for a long time.

In compare, the serialized log buffer has one lock and writing
is fast. It means that it suffers "only" from the deadlocks.
And we try to address the deadlocks by using the temporary
per-CPU buffers in critical situations (NMI, locked sections).

Of course, it is useless if you have the messages in a buffer
and can't reach them. But we do the best effort to push them
to consoles and crash dump. Also it might be very useful to
have the log buffer on persistent memory.


> It mostly works, most of the time, and that seems to be what Linus
> wants, since its really the best we can have given the constraints. But
> for debugging, when you have a UART, it totally blows.

I believe that the early console is the last resort for debugging
some type of bugs. But many other bugs can be debugged with the
classic printk(). And there are (production) systems where you
cannot (easily) or do not want to use early printk all the time.

Another question is the complexity of the printk() code. Especially,
the big effort to get "perfect" (non-mixed) output is questionable.

Best Regards,
Petr

2016-11-30 11:06:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Wed, Nov 30, 2016 at 11:01:29AM +0100, Petr Mladek wrote:
> On Tue 2016-11-29 18:10:38, Peter Zijlstra wrote:

> > In any case, as long as printk has a globally serialized 'log', it, per
> > design, will be worse than the console drivers its build upon. And them
> > being shit precludes the entire stack from being useful.
>
> I probably still do not understand all the problems with console
> drivers. My understanding is that the problem is that they have
> its own locking and are slow. It means that they are prone to
> a deadlock and they might block for a long time.

Slow isn't a problem; just limit the crap you want to push down them.

Them taking locks, them using the scheduler and them depending on entire
subsystem state to be 'sane' are the problems.

Take for instance the usb-serial console driver (everybody agrees its
crap, and gregkh did it as a lark, but still it exists), that takes
locks, relies on the scheduler, depends on the USB subsystem, which in
turn depends on the PCI subsystem.

Now imagine trying to use that for something halfway sensible.

Even the DRM based consoles suffer much the same problems.

Heck, even the 'normal' UART drivers do this :-(

> In compare, the serialized log buffer has one lock and writing
> is fast. It means that it suffers "only" from the deadlocks.
> And we try to address the deadlocks by using the temporary
> per-CPU buffers in critical situations (NMI, locked sections).

The temporary buffers are crap when you never get around to flushing
them. You need a fully lockless and wait free buffer or you're screwed.

> Of course, it is useless if you have the messages in a buffer
> and can't reach them. But we do the best effort to push them
> to consoles and crash dump. Also it might be very useful to
> have the log buffer on persistent memory.

Nothing will crash if you do while (1); in NMI context.

Been there, done that (of course it wasn't _that_ blatant, but the
effect was much the same).

I also have WARN()s in scheduler code, now most of those will not
indicate fatal conditions, but given the above state of console drivers
they have a very real chance of deadlocking the system.

And no, we're not going to do WARN_DEFERRED and similar crap. That just
proliferates the utter fail of printk() down the stack and creates more
mess.

> > It mostly works, most of the time, and that seems to be what Linus
> > wants, since its really the best we can have given the constraints. But
> > for debugging, when you have a UART, it totally blows.
>
> I believe that the early console is the last resort for debugging
> some type of bugs. But many other bugs can be debugged with the
> classic printk(). And there are (production) systems where you
> cannot (easily) or do not want to use early printk all the time.
>
> Another question is the complexity of the printk() code. Especially,
> the big effort to get "perfect" (non-mixed) output is questionable.

I'm saying its an entirely wasted effort, because no matter how much
complexity you pile on, you can never get it into a useful state because
its all build on shit.

So yes, I much prefer to not add more and more complexity into printk.
We should strip it down, not pile more junk on.

2016-11-30 19:13:15

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Tue, Nov 29, 2016 at 12:32:59PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 29, 2016 at 12:07:11PM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 29, 2016 at 08:52:04PM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 29, 2016 at 11:39:35AM -0800, Paul E. McKenney wrote:
> > > > On Tue, Nov 29, 2016 at 06:10:38PM +0100, Peter Zijlstra wrote:
> > >
> > > > > It mostly works, most of the time, and that seems to be what Linus
> > > > > wants, since its really the best we can have given the constraints. But
> > > > > for debugging, when you have a UART, it totally blows.
> > > >
> > > > UART??? They still make those things??? ;-)
> > >
> > > Yes, most computer like devices actually have them, trouble is, most
> > > consumer devices don't have the pins exposed. Luckily most server class
> > > hardware still does.
> > >
> > > And they're absolutely _awesome_ for debugging; getting data out is a
> > > matter of trivial MMIO poll loops. Rock solid stuff.
> >
> > They very clearly need to bring the baud rate into the current millenium,
> > many tens of Mbaud at the -very- least.
>
> On a more practical note...
>
> Currently, cond_resched_rcu_qs() is not permitted to be invoked until
> after the scheduler has started. However, it appears that there is some
> kernel code that can loop for quite some time at runtime, but which also
> executes during early boot. So it would be good to make it so that
> cond_resched_rcu_qs() could be called at boot.
>
> One approach would be to check rcu_scheduler_active, but this isn't
> defined in normal Tiny RCU builds. I can expand Tiny RCU, or I can
> kludge the non-CONFIG_DEBUG_LOCK_ALLOC value of rcu_scheduler_active
> to false (with this latter being the current state). But it occurred
> to me that I could also condition on !is_idle_task(), given that idle
> tasks shouldn't ever be invoking the scheduler anyway.

This question was probably intended for other folks, but I should point
out that idle tasks *do* invoke the scheduler. cpu_idle_loop() calls
schedule_preempt_disabled().

>
> So is the following a sensible approach, or should I look elsewhere?
>
> #define cond_resched_rcu_qs() \
> do { \
> if (!is_idle_task(current) && !cond_resched()) \
> rcu_note_voluntary_context_switch(current); \
> } while (0)
>
> Thanx, Paul
>

--
Josh

2016-11-30 19:50:04

by Paul E. McKenney

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Wed, Nov 30, 2016 at 01:13:03PM -0600, Josh Poimboeuf wrote:
> On Tue, Nov 29, 2016 at 12:32:59PM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 29, 2016 at 12:07:11PM -0800, Paul E. McKenney wrote:
> > > On Tue, Nov 29, 2016 at 08:52:04PM +0100, Peter Zijlstra wrote:
> > > > On Tue, Nov 29, 2016 at 11:39:35AM -0800, Paul E. McKenney wrote:
> > > > > On Tue, Nov 29, 2016 at 06:10:38PM +0100, Peter Zijlstra wrote:
> > > >
> > > > > > It mostly works, most of the time, and that seems to be what Linus
> > > > > > wants, since its really the best we can have given the constraints. But
> > > > > > for debugging, when you have a UART, it totally blows.
> > > > >
> > > > > UART??? They still make those things??? ;-)
> > > >
> > > > Yes, most computer like devices actually have them, trouble is, most
> > > > consumer devices don't have the pins exposed. Luckily most server class
> > > > hardware still does.
> > > >
> > > > And they're absolutely _awesome_ for debugging; getting data out is a
> > > > matter of trivial MMIO poll loops. Rock solid stuff.
> > >
> > > They very clearly need to bring the baud rate into the current millenium,
> > > many tens of Mbaud at the -very- least.
> >
> > On a more practical note...
> >
> > Currently, cond_resched_rcu_qs() is not permitted to be invoked until
> > after the scheduler has started. However, it appears that there is some
> > kernel code that can loop for quite some time at runtime, but which also
> > executes during early boot. So it would be good to make it so that
> > cond_resched_rcu_qs() could be called at boot.
> >
> > One approach would be to check rcu_scheduler_active, but this isn't
> > defined in normal Tiny RCU builds. I can expand Tiny RCU, or I can
> > kludge the non-CONFIG_DEBUG_LOCK_ALLOC value of rcu_scheduler_active
> > to false (with this latter being the current state). But it occurred
> > to me that I could also condition on !is_idle_task(), given that idle
> > tasks shouldn't ever be invoking the scheduler anyway.
>
> This question was probably intended for other folks, but I should point
> out that idle tasks *do* invoke the scheduler. cpu_idle_loop() calls
> schedule_preempt_disabled().

Good point. My next fallback is that idle loops should not be running
for long periods of time within RCU_NONIDLE(). Does that work?

Thanx, Paul

> > So is the following a sensible approach, or should I look elsewhere?
> >
> > #define cond_resched_rcu_qs() \
> > do { \
> > if (!is_idle_task(current) && !cond_resched()) \
> > rcu_note_voluntary_context_switch(current); \
> > } while (0)
> >
> > Thanx, Paul
> >
>
> --
> Josh
>

2016-12-01 05:52:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Wed, Nov 30, 2016 at 01:13:03PM -0600, Josh Poimboeuf wrote:
> This question was probably intended for other folks, but I should point
> out that idle tasks *do* invoke the scheduler. cpu_idle_loop() calls
> schedule_preempt_disabled().

Right, but that doesn't matter I think. The below will simply not call
rcu_note_voluntary_context_switch() from the idle task, which would be
fine I think.

> > So is the following a sensible approach, or should I look elsewhere?
> >
> > #define cond_resched_rcu_qs() \
> > do { \
> > if (!is_idle_task(current) && !cond_resched()) \
> > rcu_note_voluntary_context_switch(current); \

You should reverse your conditions though:

if (!cond_resched() && !is_idle_task(current))
rcu_note_voluntary_context_switch(current);

That way we'll still do cond_resched() and you only gate the RCU call.

2016-12-01 12:33:24

by Paul E. McKenney

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Thu, Dec 01, 2016 at 06:52:35AM +0100, Peter Zijlstra wrote:
> On Wed, Nov 30, 2016 at 01:13:03PM -0600, Josh Poimboeuf wrote:
> > This question was probably intended for other folks, but I should point
> > out that idle tasks *do* invoke the scheduler. cpu_idle_loop() calls
> > schedule_preempt_disabled().
>
> Right, but that doesn't matter I think. The below will simply not call
> rcu_note_voluntary_context_switch() from the idle task, which would be
> fine I think.
>
> > > So is the following a sensible approach, or should I look elsewhere?
> > >
> > > #define cond_resched_rcu_qs() \
> > > do { \
> > > if (!is_idle_task(current) && !cond_resched()) \
> > > rcu_note_voluntary_context_switch(current); \
>
> You should reverse your conditions though:
>
> if (!cond_resched() && !is_idle_task(current))
> rcu_note_voluntary_context_switch(current);
>
> That way we'll still do cond_resched() and you only gate the RCU call.

This makes it illegal at early boot. This is not a problem with the
surviving cond_resched_rcu_qs(), but one of the candidates really was
called at boot time. If I reverse the order as you say, I can just as
well leave of the "!is_idle_task(current)".

So I will just drop this patch until such time as someone actually needs
to add a cond_resched_rcu_qs() that sometimes gets invoked at boot time.

Thanx, Paul

2016-12-01 16:41:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Thu, Dec 01, 2016 at 04:33:16AM -0800, Paul E. McKenney wrote:
> On Thu, Dec 01, 2016 at 06:52:35AM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 30, 2016 at 01:13:03PM -0600, Josh Poimboeuf wrote:
> > > This question was probably intended for other folks, but I should point
> > > out that idle tasks *do* invoke the scheduler. cpu_idle_loop() calls
> > > schedule_preempt_disabled().
> >
> > Right, but that doesn't matter I think. The below will simply not call
> > rcu_note_voluntary_context_switch() from the idle task, which would be
> > fine I think.
> >
> > > > So is the following a sensible approach, or should I look elsewhere?
> > > >
> > > > #define cond_resched_rcu_qs() \
> > > > do { \
> > > > if (!is_idle_task(current) && !cond_resched()) \
> > > > rcu_note_voluntary_context_switch(current); \
> >
> > You should reverse your conditions though:
> >
> > if (!cond_resched() && !is_idle_task(current))
> > rcu_note_voluntary_context_switch(current);
> >
> > That way we'll still do cond_resched() and you only gate the RCU call.
>
> This makes it illegal at early boot.

Humm, how early are we talking?

2016-12-01 17:00:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start

On Thu, Dec 01, 2016 at 05:41:07PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 01, 2016 at 04:33:16AM -0800, Paul E. McKenney wrote:
> > On Thu, Dec 01, 2016 at 06:52:35AM +0100, Peter Zijlstra wrote:
> > > On Wed, Nov 30, 2016 at 01:13:03PM -0600, Josh Poimboeuf wrote:
> > > > This question was probably intended for other folks, but I should point
> > > > out that idle tasks *do* invoke the scheduler. cpu_idle_loop() calls
> > > > schedule_preempt_disabled().
> > >
> > > Right, but that doesn't matter I think. The below will simply not call
> > > rcu_note_voluntary_context_switch() from the idle task, which would be
> > > fine I think.
> > >
> > > > > So is the following a sensible approach, or should I look elsewhere?
> > > > >
> > > > > #define cond_resched_rcu_qs() \
> > > > > do { \
> > > > > if (!is_idle_task(current) && !cond_resched()) \
> > > > > rcu_note_voluntary_context_switch(current); \
> > >
> > > You should reverse your conditions though:
> > >
> > > if (!cond_resched() && !is_idle_task(current))
> > > rcu_note_voluntary_context_switch(current);
> > >
> > > That way we'll still do cond_resched() and you only gate the RCU call.
> >
> > This makes it illegal at early boot.
>
> Humm, how early are we talking?

The case I saw was during start_kernel(), IIRC. But again, it turned
out that the patch putting cond_resched_rcu_qs() in that early was
(1) broken and (2) unnecessary.

Thanx, Paul