2014-07-30 06:53:32

by Jakub Jelinek

[permalink] [raw]
Subject: Re: [PATCH 3.15 33/37] Fix gcc-4.9.0 miscompilation of load_balance() in scheduler

On Tue, Jul 29, 2014 at 06:49:09PM -0700, Greg Kroah-Hartman wrote:
> 3.15-stable review patch. If anyone has any objections, please let me know.

IMNSHO this is a too big hammer approach. The bug happened on a single
file only (right?), so if anything, IMHO it could be disabled for that
single file only, and better do it only for compilers with the bug.
If there are wrong code bugs with -O2 (we've fixed many in the past), you
also don't turn off -O2 everywhere, similarly for -Os or any other options.
Disabling it just in case the same bug happens elsewhere when it actually
took 5 years before the bug caused miscompilation of something is too
defensive. We had at least 15 other wrong-code bugfixes just in between
4.9.0 and 4.9.1. -fvar-tracking-assignments doesn't make a small difference
in debug info, but significant for optimized code.
If you build the kernel without and with -fno-var-tracking-assignments,
you can use e.g. the dwlocstat tool to see what kind of difference it makes
for the kernel in particular in variable debug info coverage.
>
> --- a/Makefile
> +++ b/Makefile
> @@ -669,6 +669,8 @@ KBUILD_CFLAGS += -fomit-frame-pointer
> endif
> endif
>
> +KBUILD_CFLAGS += $(call cc-option, -fno-var-tracking-assignments)
> +
> ifdef CONFIG_DEBUG_INFO
> KBUILD_CFLAGS += -g
> KBUILD_AFLAGS += -Wa,--gdwarf-2
>

Jakub


2014-07-30 07:13:14

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH 3.15 33/37] Fix gcc-4.9.0 miscompilation of load_balance() in scheduler

On 2014.07.30 at 08:53 +0200, Jakub Jelinek wrote:
> On Tue, Jul 29, 2014 at 06:49:09PM -0700, Greg Kroah-Hartman wrote:
> > 3.15-stable review patch. If anyone has any objections, please let me know.
>
> IMNSHO this is a too big hammer approach. The bug happened on a single
> file only (right?), so if anything, IMHO it could be disabled for that
> single file only, and better do it only for compilers with the bug.

No. There are many more files affected. It just happens that Linus
analyzed the assembly of this single file (fair.c) and found a bug.
Just build your redhat distro kernel with GCC_COMPARE_DEBUG=1 and you'll
see. So unless someone analyzes the assembly output of all other
affected files by hand and finds no issues, one has to assume the worst.

> If there are wrong code bugs with -O2 (we've fixed many in the past), you
> also don't turn off -O2 everywhere, similarly for -Os or any other options.
> Disabling it just in case the same bug happens elsewhere when it actually
> took 5 years before the bug caused miscompilation of something is too
> defensive. We had at least 15 other wrong-code bugfixes just in between
> 4.9.0 and 4.9.1. -fvar-tracking-assignments doesn't make a small difference
> in debug info, but significant for optimized code.
> If you build the kernel without and with -fno-var-tracking-assignments,
> you can use e.g. the dwlocstat tool to see what kind of difference it makes
> for the kernel in particular in variable debug info coverage.

I'm sure it would be possible to backport a proper check based on the
gcc testcase to the stable kernels, once it gets implemented.

--
Markus

2014-07-30 07:21:37

by Jakub Jelinek

[permalink] [raw]
Subject: Re: [PATCH 3.15 33/37] Fix gcc-4.9.0 miscompilation of load_balance() in scheduler

On Wed, Jul 30, 2014 at 09:13:08AM +0200, Markus Trippelsdorf wrote:
> On 2014.07.30 at 08:53 +0200, Jakub Jelinek wrote:
> > On Tue, Jul 29, 2014 at 06:49:09PM -0700, Greg Kroah-Hartman wrote:
> > > 3.15-stable review patch. If anyone has any objections, please let me know.
> >
> > IMNSHO this is a too big hammer approach. The bug happened on a single
> > file only (right?), so if anything, IMHO it could be disabled for that
> > single file only, and better do it only for compilers with the bug.
>
> No. There are many more files affected. It just happens that Linus
> analyzed the assembly of this single file (fair.c) and found a bug.
> Just build your redhat distro kernel with GCC_COMPARE_DEBUG=1 and you'll
> see. So unless someone analyzes the assembly output of all other
> affected files by hand and finds no issues, one has to assume the worst.

I'm talking about wrong-code issues. For -fcompare-debug, we indeed check
it primarily during gcc bootstrap (through bootstrap-debug) and some
testcases, and we'll certainly try to build some more code with
-fcompare-debug and fix the issues.

Jakub

2014-07-30 07:27:50

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH 3.15 33/37] Fix gcc-4.9.0 miscompilation of load_balance() in scheduler

On 2014.07.30 at 09:21 +0200, Jakub Jelinek wrote:
> On Wed, Jul 30, 2014 at 09:13:08AM +0200, Markus Trippelsdorf wrote:
> > On 2014.07.30 at 08:53 +0200, Jakub Jelinek wrote:
> > > On Tue, Jul 29, 2014 at 06:49:09PM -0700, Greg Kroah-Hartman wrote:
> > > > 3.15-stable review patch. If anyone has any objections, please let me know.
> > >
> > > IMNSHO this is a too big hammer approach. The bug happened on a single
> > > file only (right?), so if anything, IMHO it could be disabled for that
> > > single file only, and better do it only for compilers with the bug.
> >
> > No. There are many more files affected. It just happens that Linus
> > analyzed the assembly of this single file (fair.c) and found a bug.
> > Just build your redhat distro kernel with GCC_COMPARE_DEBUG=1 and you'll
> > see. So unless someone analyzes the assembly output of all other
> > affected files by hand and finds no issues, one has to assume the worst.
>
> I'm talking about wrong-code issues. For -fcompare-debug, we indeed check
> it primarily during gcc bootstrap (through bootstrap-debug) and some
> testcases, and we'll certainly try to build some more code with
> -fcompare-debug and fix the issues.

Yes, I'm talking about wrong-code issues, too. For example the pr61801.c
testcase was reduced from kernel/exit.c.

--
Markus

2014-07-30 15:47:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3.15 33/37] Fix gcc-4.9.0 miscompilation of load_balance() in scheduler

On Tue, Jul 29, 2014 at 11:53 PM, Jakub Jelinek <[email protected]> wrote:
>
> IMNSHO this is a too big hammer approach. The bug happened on a single
> file only (right?)

Very dubious. We happened to see it in a single case, and _maybe_ that
was the only one in the whole kernel. But it's much more likely that
it wasn't - it's not like the code in question was even all that
unusual (just a percpu access triggering an asm - but we have tons of
asms in the kernel).

I'd argue that we were very lucky to get the problem happening
reliably enough for a couple of people who then cared enoiugh to do
good bug reports (considering that it needed an interrupt in *just*
the right place) that we could debug it at all. In some code that gets
run much less than the scheduler, it could easily have been one of
those "people report it once in a blue moon, looks like memory
corruption".

Now, it would be interesting to hear if there is something very
special that made that instruction scheduling bug trigger just for
4.9.x, or if there is something else that made it very particular to
that code sequence. But in the absence of good reasoning to the
contrary, I'd much rather say "let's just avoid the bug entirely".

And that's partly because we really don't care that much about the
debug info. Yes, it gets used, but it's not *that* common, and the
last time the issue of debug info sucking up tons of resources came
up, the biggest users were people who just wanted line information for
oopses. Yes, there are people running kgdb etc, but on the whole it's
rare, and quite frankly, from everything I have _ever_ seen, that's
not how the real kernel bugs are ever really discovered. So the kind
of debug information that the variable tracking logic adds just isn't
all that important for the kernel.

Linus