2020-06-02 02:52:36

by Matt Turner

[permalink] [raw]
Subject: Regression bisected to f2f84b05e02b (bug: consolidate warn_slowpath_fmt() usage)

I bisected a regression on alpha to f2f84b05e02b (bug: consolidate
warn_slowpath_fmt() usage) which looks totally innocuous.

Reverting it on master confirms that it somehow is the trigger. At or a
little after starting userspace, I'll see an oops like this:

Unable to handle kernel paging request at virtual address 0000000000000000
CPU 0
kworker/u2:5(98): Oops -1
pc = [<0000000000000000>] ra = [<0000000000000000>] ps = 0000 Not tainted
pc is at 0x0
ra is at 0x0
v0 = 0000000000000007 t0 = 0000000000000001 t1 = 0000000000000001
t2 = 0000000000000000 t3 = fffffc00bfe68780 t4 = 0000000000000001
t5 = fffffc00bf8cc780 t6 = 00000000026f8000 t7 = fffffc00bfe70000
s0 = fffffc000250d310 s1 = fffffc000250d310 s2 = fffffc000250d310
s3 = fffffc000250ca40 s4 = fffffc000250caa0 s5 = 0000000000000000
s6 = fffffc000250ca40
a0 = fffffc00024f0488 a1 = fffffc00bfe73d98 a2 = fffffc00bfe68800
a3 = fffffc00bf881400 a4 = 0001000000000000 a5 = 0000000000000002
t8 = 0000000000000000 t9 = 0000000000000000 t10= 0000000001321800
t11= 000000000000ba4e pv = fffffc000189ca00 at = 0000000000000000
gp = fffffc000253e430 sp = 0000000043a83c2e
Disabling lock debugging due to kernel taint
Trace:
[<fffffc000105c8ac>] process_one_work+0x25c/0x5a0
[<fffffc000105cc4c>] worker_thread+0x5c/0x7d0
[<fffffc0001066c88>] kthread+0x188/0x1f0
[<fffffc0001011b48>] ret_from_kernel_thread+0x18/0x20
[<fffffc0001066b00>] kthread+0x0/0x1f0
[<fffffc000105cbf0>] worker_thread+0x0/0x7d0

Code:
00000000
00000000
00063301
000012e2
00001111
0005ffde

It seems to cause a hard lock on an SMP system, but not on a system with
a single CPU. Similarly, if I boot the SMP system (2 CPUs) with
maxcpus=1 the oops doesn't happen. Until I tested on a non-SMP system
today I suspected that it was unaffected, but I saw the oops there too.
With the revert applied, I don't see a warning or an oops.

Any clues how this patch could have triggered the oops?

Here's the revert, with a trivial conflict resolved, that I've used in
testing:

From fdbdd0f606f0f412ee06c1152e33a22ca17102bc Mon Sep 17 00:00:00 2001
From: Matt Turner <[email protected]>
Date: Sun, 24 May 2020 20:46:00 -0700
Subject: [PATCH] Revert "bug: consolidate warn_slowpath_fmt() usage"

This reverts commit f2f84b05e02b7710a201f0017b3272ad7ef703d1.
---
include/asm-generic/bug.h | 3 ++-
kernel/panic.c | 15 +++++++--------
2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index 384b5c835ced..a4a311d4b4b0 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -82,7 +82,8 @@ struct bug_entry {
extern __printf(4, 5)
void warn_slowpath_fmt(const char *file, const int line, unsigned taint,
const char *fmt, ...);
-#define __WARN() __WARN_printf(TAINT_WARN, NULL)
+extern void warn_slowpath_null(const char *file, const int line);
+#define __WARN() warn_slowpath_null(__FILE__, __LINE__)
#define __WARN_printf(taint, arg...) \
warn_slowpath_fmt(__FILE__, __LINE__, taint, arg)
#else
diff --git a/kernel/panic.c b/kernel/panic.c
index b69ee9e76cb2..c8ed8046b484 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -603,20 +603,19 @@ void warn_slowpath_fmt(const char *file, int line, unsigned taint,
{
struct warn_args args;

- pr_warn(CUT_HERE);
-
- if (!fmt) {
- __warn(file, line, __builtin_return_address(0), taint,
- NULL, NULL);
- return;
- }
-
args.fmt = fmt;
va_start(args.args, fmt);
__warn(file, line, __builtin_return_address(0), taint, NULL, &args);
va_end(args.args);
}
EXPORT_SYMBOL(warn_slowpath_fmt);
+
+void warn_slowpath_null(const char *file, int line)
+{
+ pr_warn(CUT_HERE);
+ __warn(file, line, __builtin_return_address(0), TAINT_WARN, NULL, NULL);
+}
+EXPORT_SYMBOL(warn_slowpath_null);
#else
void __warn_printk(const char *fmt, ...)
{
--
2.26.2


Attachments:
(No filename) (3.92 kB)
signature.asc (386.00 B)
Download all attachments

2020-06-02 18:07:57

by Kees Cook

[permalink] [raw]
Subject: Re: Regression bisected to f2f84b05e02b (bug: consolidate warn_slowpath_fmt() usage)

On Mon, Jun 01, 2020 at 07:48:04PM -0700, Matt Turner wrote:
> I bisected a regression on alpha to f2f84b05e02b (bug: consolidate
> warn_slowpath_fmt() usage) which looks totally innocuous.
>
> Reverting it on master confirms that it somehow is the trigger. At or a
> little after starting userspace, I'll see an oops like this:
>
> Unable to handle kernel paging request at virtual address 0000000000000000
> CPU 0
> kworker/u2:5(98): Oops -1
> pc = [<0000000000000000>] ra = [<0000000000000000>] ps = 0000 Not tainted
> pc is at 0x0

^^^^ so, the instruction pointer is NULL. The only way I can imagine
that happening would be from this line:

worker->current_func(work);

> ra is at 0x0
> v0 = 0000000000000007 t0 = 0000000000000001 t1 = 0000000000000001
> t2 = 0000000000000000 t3 = fffffc00bfe68780 t4 = 0000000000000001
> t5 = fffffc00bf8cc780 t6 = 00000000026f8000 t7 = fffffc00bfe70000
> s0 = fffffc000250d310 s1 = fffffc000250d310 s2 = fffffc000250d310
> s3 = fffffc000250ca40 s4 = fffffc000250caa0 s5 = 0000000000000000
> s6 = fffffc000250ca40
> a0 = fffffc00024f0488 a1 = fffffc00bfe73d98 a2 = fffffc00bfe68800
> a3 = fffffc00bf881400 a4 = 0001000000000000 a5 = 0000000000000002
> t8 = 0000000000000000 t9 = 0000000000000000 t10= 0000000001321800
> t11= 000000000000ba4e pv = fffffc000189ca00 at = 0000000000000000
> gp = fffffc000253e430 sp = 0000000043a83c2e
> Disabling lock debugging due to kernel taint
> Trace:
> [<fffffc000105c8ac>] process_one_work+0x25c/0x5a0

Can you verify where this ^^^^^^^^^^^^^^ is?

> [<fffffc000105cc4c>] worker_thread+0x5c/0x7d0
> [<fffffc0001066c88>] kthread+0x188/0x1f0
> [<fffffc0001011b48>] ret_from_kernel_thread+0x18/0x20
> [<fffffc0001066b00>] kthread+0x0/0x1f0
> [<fffffc000105cbf0>] worker_thread+0x0/0x7d0
>
> Code:
> 00000000
> 00000000
> 00063301
> 000012e2
> 00001111
> 0005ffde
>
> It seems to cause a hard lock on an SMP system, but not on a system with
> a single CPU. Similarly, if I boot the SMP system (2 CPUs) with
> maxcpus=1 the oops doesn't happen. Until I tested on a non-SMP system
> today I suspected that it was unaffected, but I saw the oops there too.
> With the revert applied, I don't see a warning or an oops.
>
> Any clues how this patch could have triggered the oops?

I cannot begin to imagine. :P Compared to other things I've seen like
this in the past maybe it's some kind of effect from the code size
changing the location/alignment or timing of something else?

Various questions ranging in degrees of sanity:

Does alpha use work queues for WARN?

Which work queue is getting a NULL function? (And then things like "if
WARN was much slower or much faster, is there a race to something
setting itself to NULL?")

Was there a WARN before the above Oops?

Does WARN have side-effects on alpha?

Does __WARN_printf() do something bad that warn_slowpath_null() doesn't?

Does making incremental changes narrow anything down? (e.g. instead of
this revert, remove the __warn() call in warn_slowpath_fmt() that was
added? (I mean, that'll be quite broken for WARN, but will it not oops?)

Does alpha have hardware breakpoints? When I had to track down a
corruption in the io scheduler, I ended up setting breakpoints on the
thing that went crazy (in this case, I assume the work queue function
pointer) to figure out what touched it.

... I can't think of anything else.

-Kees

>
> Here's the revert, with a trivial conflict resolved, that I've used in
> testing:
>
> From fdbdd0f606f0f412ee06c1152e33a22ca17102bc Mon Sep 17 00:00:00 2001
> From: Matt Turner <[email protected]>
> Date: Sun, 24 May 2020 20:46:00 -0700
> Subject: [PATCH] Revert "bug: consolidate warn_slowpath_fmt() usage"
>
> This reverts commit f2f84b05e02b7710a201f0017b3272ad7ef703d1.
> ---
> include/asm-generic/bug.h | 3 ++-
> kernel/panic.c | 15 +++++++--------
> 2 files changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
> index 384b5c835ced..a4a311d4b4b0 100644
> --- a/include/asm-generic/bug.h
> +++ b/include/asm-generic/bug.h
> @@ -82,7 +82,8 @@ struct bug_entry {
> extern __printf(4, 5)
> void warn_slowpath_fmt(const char *file, const int line, unsigned taint,
> const char *fmt, ...);
> -#define __WARN() __WARN_printf(TAINT_WARN, NULL)
> +extern void warn_slowpath_null(const char *file, const int line);
> +#define __WARN() warn_slowpath_null(__FILE__, __LINE__)
> #define __WARN_printf(taint, arg...) \
> warn_slowpath_fmt(__FILE__, __LINE__, taint, arg)
> #else
> diff --git a/kernel/panic.c b/kernel/panic.c
> index b69ee9e76cb2..c8ed8046b484 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -603,20 +603,19 @@ void warn_slowpath_fmt(const char *file, int line, unsigned taint,
> {
> struct warn_args args;
> - pr_warn(CUT_HERE);
> -
> - if (!fmt) {
> - __warn(file, line, __builtin_return_address(0), taint,
> - NULL, NULL);
> - return;
> - }
> -
> args.fmt = fmt;
> va_start(args.args, fmt);
> __warn(file, line, __builtin_return_address(0), taint, NULL, &args);
> va_end(args.args);
> }
> EXPORT_SYMBOL(warn_slowpath_fmt);
> +
> +void warn_slowpath_null(const char *file, int line)
> +{
> + pr_warn(CUT_HERE);
> + __warn(file, line, __builtin_return_address(0), TAINT_WARN, NULL, NULL);
> +}
> +EXPORT_SYMBOL(warn_slowpath_null);
> #else
> void __warn_printk(const char *fmt, ...)
> {
> --
> 2.26.2



--
Kees Cook

2020-06-12 04:26:18

by Matt Turner

[permalink] [raw]
Subject: Re: Regression bisected to f2f84b05e02b (bug: consolidate warn_slowpath_fmt() usage)

On Tue, Jun 2, 2020 at 11:03 AM Kees Cook <[email protected]> wrote:
>
> On Mon, Jun 01, 2020 at 07:48:04PM -0700, Matt Turner wrote:
> > I bisected a regression on alpha to f2f84b05e02b (bug: consolidate
> > warn_slowpath_fmt() usage) which looks totally innocuous.
> >
> > Reverting it on master confirms that it somehow is the trigger. At or a
> > little after starting userspace, I'll see an oops like this:
> >
> > Unable to handle kernel paging request at virtual address 0000000000000000
> > CPU 0
> > kworker/u2:5(98): Oops -1
> > pc = [<0000000000000000>] ra = [<0000000000000000>] ps = 0000 Not tainted
> > pc is at 0x0
>
> ^^^^ so, the instruction pointer is NULL. The only way I can imagine
> that happening would be from this line:
>
> worker->current_func(work);
>
> > ra is at 0x0
> > v0 = 0000000000000007 t0 = 0000000000000001 t1 = 0000000000000001
> > t2 = 0000000000000000 t3 = fffffc00bfe68780 t4 = 0000000000000001
> > t5 = fffffc00bf8cc780 t6 = 00000000026f8000 t7 = fffffc00bfe70000
> > s0 = fffffc000250d310 s1 = fffffc000250d310 s2 = fffffc000250d310
> > s3 = fffffc000250ca40 s4 = fffffc000250caa0 s5 = 0000000000000000
> > s6 = fffffc000250ca40
> > a0 = fffffc00024f0488 a1 = fffffc00bfe73d98 a2 = fffffc00bfe68800
> > a3 = fffffc00bf881400 a4 = 0001000000000000 a5 = 0000000000000002
> > t8 = 0000000000000000 t9 = 0000000000000000 t10= 0000000001321800
> > t11= 000000000000ba4e pv = fffffc000189ca00 at = 0000000000000000
> > gp = fffffc000253e430 sp = 0000000043a83c2e
> > Disabling lock debugging due to kernel taint
> > Trace:
> > [<fffffc000105c8ac>] process_one_work+0x25c/0x5a0
>
> Can you verify where this ^^^^^^^^^^^^^^ is?

It is kernel/workqueue.c:2268, which contains

worker->current_func(work);

as you predicted.

> > [<fffffc000105cc4c>] worker_thread+0x5c/0x7d0
> > [<fffffc0001066c88>] kthread+0x188/0x1f0
> > [<fffffc0001011b48>] ret_from_kernel_thread+0x18/0x20
> > [<fffffc0001066b00>] kthread+0x0/0x1f0
> > [<fffffc000105cbf0>] worker_thread+0x0/0x7d0
> >
> > Code:
> > 00000000
> > 00000000
> > 00063301
> > 000012e2
> > 00001111
> > 0005ffde
> >
> > It seems to cause a hard lock on an SMP system, but not on a system with
> > a single CPU. Similarly, if I boot the SMP system (2 CPUs) with
> > maxcpus=1 the oops doesn't happen. Until I tested on a non-SMP system
> > today I suspected that it was unaffected, but I saw the oops there too.
> > With the revert applied, I don't see a warning or an oops.
> >
> > Any clues how this patch could have triggered the oops?
>
> I cannot begin to imagine. :P Compared to other things I've seen like
> this in the past maybe it's some kind of effect from the code size
> changing the location/alignment or timing of something else?
>
> Various questions ranging in degrees of sanity:
>
> Does alpha use work queues for WARN?

I do not know. I don't see much in a few greps of arch/alpha that
would indicate that it uses work queues.

> Which work queue is getting a NULL function? (And then things like "if
> WARN was much slower or much faster, is there a race to something
> setting itself to NULL?")
>
> Was there a WARN before the above Oops?

No, which I suspect means that your much scarier suggestion that this
is somehow due to code size or alignment is increasingly plausible.

> Does WARN have side-effects on alpha?

alpha just uses the asm-generic implementation of WARN as far as I can
tell, so I think not.

> Does __WARN_printf() do something bad that warn_slowpath_null() doesn't?
>
> Does making incremental changes narrow anything down? (e.g. instead of
> this revert, remove the __warn() call in warn_slowpath_fmt() that was
> added? (I mean, that'll be quite broken for WARN, but will it not oops?)

Commenting out the added __warn does not work around the problem.

Readding warn_slowpath_null and the EXPORT_SYMBOL (but not calling it
from WARN) does not work around the problem.

Calling warn_slowpath_fmt() with fmt=" " instead of fmt=NULL does not
work around the problem.

I also tried GCC-10.1 as a stab in the dark, and that doesn't work
around the problem.

So I'm thinking it's something about code size or alignment. I would
be worried it's to do with memory ordering (since this is on Alpha)
but I'm seeing the problem on a single CPU system, so that should be
ruled out, I think?

Using CONFIG_CC_OPTIMIZE_FOR_SIZE=y doesn't work around the problem.
So that hurts the theory of code size being the trigger.

Since I noticed earlier that using maxcpus=1 on a 2-CPU system
prevented the system from hanging, I tried disabling CONFIG_SMP on my
1-CPU system as well. In doing so, I discovered that the RCU torture
module (RCU_TORTURE_TEST) triggers some null pointer dereferences on
Alpha when CONFIG_SMP is set, but works successfully when CONFIG_SMP
is unset.

That seems likely to be a symptom of the same underlying problem that
started this thread, don't you think? If so, I'll focus my attention
on that.

> Does alpha have hardware breakpoints? When I had to track down a
> corruption in the io scheduler, I ended up setting breakpoints on the
> thing that went crazy (in this case, I assume the work queue function
> pointer) to figure out what touched it.

As far as I know we don't have anything implemented in the kernel, but
they could be implemented by faulting on read/write.

> ... I can't think of anything else.

Thanks for your time and suggestions!

Matt

2020-06-12 05:09:21

by Kees Cook

[permalink] [raw]
Subject: Re: Regression bisected to f2f84b05e02b (bug: consolidate warn_slowpath_fmt() usage)

On Fri, Jun 12, 2020 at 04:47:57PM +1200, Michael Cree wrote:
> On Thu, Jun 11, 2020 at 09:23:52PM -0700, Matt Turner wrote:
> > Since I noticed earlier that using maxcpus=1 on a 2-CPU system
> > prevented the system from hanging, I tried disabling CONFIG_SMP on my
> > 1-CPU system as well. In doing so, I discovered that the RCU torture
> > module (RCU_TORTURE_TEST) triggers some null pointer dereferences on
> > Alpha when CONFIG_SMP is set, but works successfully when CONFIG_SMP
> > is unset.
> >
> > That seems likely to be a symptom of the same underlying problem that
> > started this thread, don't you think? If so, I'll focus my attention
> > on that.
>
> I wonder if that is related to user space segfaults we are now seeing
> on SMP systems but not UP systems while building Alpha debian-ports.
> It's happening in the test-suites of builds of certain software
> (such as autogen and guile) but they always build successfully with
> the test suite passing on a UP system.
>
> When investigating I seem to recall it was a NULL (or near NULL)
> pointer dereference but couldn't make any sense of how it might
> have got into such an obviously wrong state.

By some miracle, I have avoided any experience with RCU bugs. ;) If
the RCU_TORTURE_TEST Oopses or the segfaults are repeatable and don't
go away with the WARN patch reverted, then perhaps it might be used to
bisect to something closer to the root cause?

Given the similarity to the SMP vs UP stuff and the RCU tests, I'd agree
that does seem like the best path to investigate.

--
Kees Cook

2020-06-12 05:12:55

by Michael Cree

[permalink] [raw]
Subject: Re: Regression bisected to f2f84b05e02b (bug: consolidate warn_slowpath_fmt() usage)

On Thu, Jun 11, 2020 at 09:23:52PM -0700, Matt Turner wrote:
> Since I noticed earlier that using maxcpus=1 on a 2-CPU system
> prevented the system from hanging, I tried disabling CONFIG_SMP on my
> 1-CPU system as well. In doing so, I discovered that the RCU torture
> module (RCU_TORTURE_TEST) triggers some null pointer dereferences on
> Alpha when CONFIG_SMP is set, but works successfully when CONFIG_SMP
> is unset.
>
> That seems likely to be a symptom of the same underlying problem that
> started this thread, don't you think? If so, I'll focus my attention
> on that.

I wonder if that is related to user space segfaults we are now seeing
on SMP systems but not UP systems while building Alpha debian-ports.
It's happening in the test-suites of builds of certain software
(such as autogen and guile) but they always build successfully with
the test suite passing on a UP system.

When investigating I seem to recall it was a NULL (or near NULL)
pointer dereference but couldn't make any sense of how it might
have got into such an obviously wrong state.

Cheers,
Michael.