2020-05-09 12:08:35

by Arnd Bergmann

[permalink] [raw]
Subject: [PATCH net-next 2/2] ath10k: fix ath10k_pci struct layout

gcc-10 correctly points out a bug with a zero-length array in
struct ath10k_pci:

drivers/net/wireless/ath/ath10k/ahb.c: In function 'ath10k_ahb_remove':
drivers/net/wireless/ath/ath10k/ahb.c:30:9: error: array subscript 0 is outside the bounds of an interior zero-length array 'struct ath10k_ahb[0]' [-Werror=zero-length-bounds]
30 | return &((struct ath10k_pci *)ar->drv_priv)->ahb[0];
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from drivers/net/wireless/ath/ath10k/ahb.c:13:
drivers/net/wireless/ath/ath10k/pci.h:185:20: note: while referencing 'ahb'
185 | struct ath10k_ahb ahb[0];
| ^~~

The last addition to the struct ignored the comments and added
new members behind the array that must remain last.

Change it to a flexible-array member and move it last again to
make it work correctly, prevent the same thing from happening
again (all compilers warn about flexible-array members in the
middle of a struct) and get it to build without warnings.

Fixes: 521fc37be3d8 ("ath10k: Avoid override CE5 configuration for QCA99X0 chipsets")
Signed-off-by: Arnd Bergmann <[email protected]>
---
drivers/net/wireless/ath/ath10k/pci.h | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/pci.h b/drivers/net/wireless/ath/ath10k/pci.h
index e3cbd259a2dc..862d0901c5b8 100644
--- a/drivers/net/wireless/ath/ath10k/pci.h
+++ b/drivers/net/wireless/ath/ath10k/pci.h
@@ -178,15 +178,16 @@ struct ath10k_pci {
*/
u32 (*targ_cpu_to_ce_addr)(struct ath10k *ar, u32 addr);

+ struct ce_attr *attr;
+ struct ce_pipe_config *pipe_config;
+ struct ce_service_to_pipe *serv_to_pipe;
+
/* Keep this entry in the last, memory for struct ath10k_ahb is
* allocated (ahb support enabled case) in the continuation of
* this struct.
*/
- struct ath10k_ahb ahb[0];
+ struct ath10k_ahb ahb[];

- struct ce_attr *attr;
- struct ce_pipe_config *pipe_config;
- struct ce_service_to_pipe *serv_to_pipe;
};

static inline struct ath10k_pci *ath10k_pci_priv(struct ath10k *ar)
--
2.26.0


2020-05-11 12:05:55

by Kalle Valo

[permalink] [raw]
Subject: Re: [PATCH net-next 2/2] ath10k: fix ath10k_pci struct layout

Arnd Bergmann <[email protected]> writes:

> gcc-10 correctly points out a bug with a zero-length array in
> struct ath10k_pci:
>
> drivers/net/wireless/ath/ath10k/ahb.c: In function 'ath10k_ahb_remove':
> drivers/net/wireless/ath/ath10k/ahb.c:30:9: error: array subscript 0
> is outside the bounds of an interior zero-length array 'struct
> ath10k_ahb[0]' [-Werror=zero-length-bounds]
> 30 | return &((struct ath10k_pci *)ar->drv_priv)->ahb[0];
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> In file included from drivers/net/wireless/ath/ath10k/ahb.c:13:
> drivers/net/wireless/ath/ath10k/pci.h:185:20: note: while referencing 'ahb'
> 185 | struct ath10k_ahb ahb[0];
> | ^~~
>
> The last addition to the struct ignored the comments and added
> new members behind the array that must remain last.
>
> Change it to a flexible-array member and move it last again to
> make it work correctly, prevent the same thing from happening
> again (all compilers warn about flexible-array members in the
> middle of a struct) and get it to build without warnings.

Very good find, thanks! This bug would cause all sort of strange memory
corruption issues.

--
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

2020-05-11 12:19:22

by Kalle Valo

[permalink] [raw]
Subject: Re: [PATCH net-next 2/2] ath10k: fix ath10k_pci struct layout

Kalle Valo <[email protected]> writes:

> Arnd Bergmann <[email protected]> writes:
>
>> gcc-10 correctly points out a bug with a zero-length array in
>> struct ath10k_pci:
>>
>> drivers/net/wireless/ath/ath10k/ahb.c: In function 'ath10k_ahb_remove':
>> drivers/net/wireless/ath/ath10k/ahb.c:30:9: error: array subscript 0
>> is outside the bounds of an interior zero-length array 'struct
>> ath10k_ahb[0]' [-Werror=zero-length-bounds]
>> 30 | return &((struct ath10k_pci *)ar->drv_priv)->ahb[0];
>> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> In file included from drivers/net/wireless/ath/ath10k/ahb.c:13:
>> drivers/net/wireless/ath/ath10k/pci.h:185:20: note: while referencing 'ahb'
>> 185 | struct ath10k_ahb ahb[0];
>> | ^~~
>>
>> The last addition to the struct ignored the comments and added
>> new members behind the array that must remain last.
>>
>> Change it to a flexible-array member and move it last again to
>> make it work correctly, prevent the same thing from happening
>> again (all compilers warn about flexible-array members in the
>> middle of a struct) and get it to build without warnings.
>
> Very good find, thanks! This bug would cause all sort of strange memory
> corruption issues.

This motivated me to switch to using GCC 10.x and I noticed that you had
already upgraded crosstool so it was a trivial thing to do, awesome :)

https://mirrors.edge.kernel.org/pub/tools/crosstool/

I use crosstool like this using GNUmakefile:

CROSS_COMPILE=/opt/cross/gcc-10.1.0-nolibc/x86_64-linux/bin/x86_64-linux-
include Makefile

I think it's handy trick and would be good to mention that in the
crosstool main page. That way I could just point people to the crosstool
main page when they are using ancient compilers and would need to
upgrade.

--
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

2020-05-11 12:40:36

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH net-next 2/2] ath10k: fix ath10k_pci struct layout

On Mon, May 11, 2020 at 2:17 PM Kalle Valo <[email protected]> wrote:
>
> Kalle Valo <[email protected]> writes:
> >>
> >> Change it to a flexible-array member and move it last again to
> >> make it work correctly, prevent the same thing from happening
> >> again (all compilers warn about flexible-array members in the
> >> middle of a struct) and get it to build without warnings.
> >
> > Very good find, thanks! This bug would cause all sort of strange memory
> > corruption issues.
>
> This motivated me to switch to using GCC 10.x and I noticed that you had
> already upgraded crosstool so it was a trivial thing to do, awesome :)
>
> https://mirrors.edge.kernel.org/pub/tools/crosstool/
>
> I use crosstool like this using GNUmakefile:
>
> CROSS_COMPILE=/opt/cross/gcc-10.1.0-nolibc/x86_64-linux/bin/x86_64-linux-
> include Makefile

Right, I have something similar (with many more additional things)
in a local makefile here. I mainly use that to pick the correct cross
toolchain based on ${ARCH}, and to build multiple randconfig kernels
in parallel with 'make -j${NR_CPUS}' for better CPU utilization.

> I think it's handy trick and would be good to mention that in the
> crosstool main page. That way I could just point people to the crosstool
> main page when they are using ancient compilers and would need to
> upgrade.

I actually started working on a script that I'd like to include the kernel
sources to list the installed compilers, automatically pick on that
works for the current architecture, or download one for local installation.

Arnd

2020-05-13 06:51:11

by Kalle Valo

[permalink] [raw]
Subject: gcc-10: kernel stack is corrupted and fails to boot

(trimming CC, changing title)

Kalle Valo <[email protected]> writes:

> Kalle Valo <[email protected]> writes:
>
>> Arnd Bergmann <[email protected]> writes:
>>
>>> gcc-10 correctly points out a bug with a zero-length array in
>>> struct ath10k_pci:
>>>
>>> drivers/net/wireless/ath/ath10k/ahb.c: In function 'ath10k_ahb_remove':
>>> drivers/net/wireless/ath/ath10k/ahb.c:30:9: error: array subscript 0
>>> is outside the bounds of an interior zero-length array 'struct
>>> ath10k_ahb[0]' [-Werror=zero-length-bounds]
>>> 30 | return &((struct ath10k_pci *)ar->drv_priv)->ahb[0];
>>> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> In file included from drivers/net/wireless/ath/ath10k/ahb.c:13:
>>> drivers/net/wireless/ath/ath10k/pci.h:185:20: note: while referencing 'ahb'
>>> 185 | struct ath10k_ahb ahb[0];
>>> | ^~~
>>>
>>> The last addition to the struct ignored the comments and added
>>> new members behind the array that must remain last.
>>>
>>> Change it to a flexible-array member and move it last again to
>>> make it work correctly, prevent the same thing from happening
>>> again (all compilers warn about flexible-array members in the
>>> middle of a struct) and get it to build without warnings.
>>
>> Very good find, thanks! This bug would cause all sort of strange memory
>> corruption issues.
>
> This motivated me to switch to using GCC 10.x and I noticed that you had
> already upgraded crosstool so it was a trivial thing to do, awesome :)
>
> https://mirrors.edge.kernel.org/pub/tools/crosstool/

And now I have a problem :) I first noticed that my x86 testbox is not
booting when I compile the kernel with GCC 10.1.0 from crosstool. I
didn't get any error messages so I just downgraded the compiler and the
kernel was booting fine again. Next I decided to try GCC 10.1 with my
x86 laptop and it also failed to boot, but this time I got kernel logs
and saw this:

Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secodary+0x178/0x180

Call Trace:
dump_stack
panic
? _raw_spin_unlock_irqrestore
? start_secondary
__stack_chk_fail
start_secondary
secondary_startup

(I wrote the above messages manually from a picture so expect typos)

Then also on my x86 laptop I downgraded the compiler to GCC 8.1.0 (from
crosstool), rebuilt the exactly same kernel version and the kernel
booted without issues.

I'm using 5.7.0-rc4-wt-ath+ which is basically v5.7-rc4 plus latest
wireless patches, and I doubt the wireless patches are making any
difference this early in the boot. All compilers I use are prebuilt
binaries from kernel.org crosstool repo[1] with addition of ccache
v3.4.1 to speed up my builds.

Any ideas? How should I debug this further?

[1] https://mirrors.edge.kernel.org/pub/tools/crosstool/

--
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

2020-05-13 08:53:50

by Arnd Bergmann

[permalink] [raw]
Subject: Re: gcc-10: kernel stack is corrupted and fails to boot

On Wed, May 13, 2020 at 8:50 AM Kalle Valo <[email protected]> wrote:
>
> Kalle Valo <[email protected]> writes:
>
> > This motivated me to switch to using GCC 10.x and I noticed that you had
> > already upgraded crosstool so it was a trivial thing to do, awesome :)
> >
> > https://mirrors.edge.kernel.org/pub/tools/crosstool/
>
> And now I have a problem :) I first noticed that my x86 testbox is not
> booting when I compile the kernel with GCC 10.1.0 from crosstool. I
> didn't get any error messages so I just downgraded the compiler and the
> kernel was booting fine again. Next I decided to try GCC 10.1 with my
> x86 laptop and it also failed to boot, but this time I got kernel logs
> and saw this:
>
> Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secodary+0x178/0x180
>
> Call Trace:
> dump_stack
> panic
> ? _raw_spin_unlock_irqrestore
> ? start_secondary
> __stack_chk_fail
> start_secondary
> secondary_startup
>
> (I wrote the above messages manually from a picture so expect typos)
>
> Then also on my x86 laptop I downgraded the compiler to GCC 8.1.0 (from
> crosstool), rebuilt the exactly same kernel version and the kernel
> booted without issues.
>
> I'm using 5.7.0-rc4-wt-ath+ which is basically v5.7-rc4 plus latest
> wireless patches, and I doubt the wireless patches are making any
> difference this early in the boot. All compilers I use are prebuilt
> binaries from kernel.org crosstool repo[1] with addition of ccache
> v3.4.1 to speed up my builds.
>
> Any ideas? How should I debug this further?

At least if it fails reproducibly, it's probably not too hard to drill
down further. Some ideas:

* I'd first try to reproduce it in qemu. Since you don't even need
any user space or modules, I would simply try
$ qemu-system-x86_64 -nographic -monitor none -append
"console=ttyS0" -serial stdio -smp 4 -kernel arch/x86/boot/bzImage
I tried it here with an x86 defconfig linux-next kernel but did not
run into the problem you described. If you share your .config,
I can try reproducing with that as well. Once there is a reproducer
in qemu, it should be trivial to step through it using gdb.

* There are still two prerelease compiler versions on kernel.org,
from February and from April. You can try each one to see
if this was a recent regression. It's also possible that there is
a problem with my specific builds of gcc-10.1, and that the
compiler is actually fine for others.The gcc-10 packages in
Fedora/Debian/Ubuntu are probably better tested.

Arnd

2020-05-13 12:49:43

by Kalle Valo

[permalink] [raw]
Subject: Re: gcc-10: kernel stack is corrupted and fails to boot

Arnd Bergmann <[email protected]> writes:

> On Wed, May 13, 2020 at 8:50 AM Kalle Valo <[email protected]> wrote:
>>
>> Kalle Valo <[email protected]> writes:
>>
>> > This motivated me to switch to using GCC 10.x and I noticed that you had
>> > already upgraded crosstool so it was a trivial thing to do, awesome :)
>> >
>> > https://mirrors.edge.kernel.org/pub/tools/crosstool/
>>
>> And now I have a problem :) I first noticed that my x86 testbox is not
>> booting when I compile the kernel with GCC 10.1.0 from crosstool. I
>> didn't get any error messages so I just downgraded the compiler and the
>> kernel was booting fine again. Next I decided to try GCC 10.1 with my
>> x86 laptop and it also failed to boot, but this time I got kernel logs
>> and saw this:
>>
>> Kernel panic - not syncing: stack-protector: Kernel stack is
>> corrupted in: start_secodary+0x178/0x180
>>
>> Call Trace:
>> dump_stack
>> panic
>> ? _raw_spin_unlock_irqrestore
>> ? start_secondary
>> __stack_chk_fail
>> start_secondary
>> secondary_startup
>>
>> (I wrote the above messages manually from a picture so expect typos)
>>
>> Then also on my x86 laptop I downgraded the compiler to GCC 8.1.0 (from
>> crosstool), rebuilt the exactly same kernel version and the kernel
>> booted without issues.
>>
>> I'm using 5.7.0-rc4-wt-ath+ which is basically v5.7-rc4 plus latest
>> wireless patches, and I doubt the wireless patches are making any
>> difference this early in the boot. All compilers I use are prebuilt
>> binaries from kernel.org crosstool repo[1] with addition of ccache
>> v3.4.1 to speed up my builds.
>>
>> Any ideas? How should I debug this further?
>
> At least if it fails reproducibly, it's probably not too hard to drill
> down further. Some ideas:
>
> * I'd first try to reproduce it in qemu. Since you don't even need
> any user space or modules, I would simply try
> $ qemu-system-x86_64 -nographic -monitor none -append
> "console=ttyS0" -serial stdio -smp 4 -kernel arch/x86/boot/bzImage
> I tried it here with an x86 defconfig linux-next kernel but did not
> run into the problem you described.

Thanks, I'll try that but I expect it will take few days before I can do
it.

> If you share your .config, I can try reproducing with that as well.
> Once there is a reproducer in qemu, it should be trivial to step
> through it using gdb.

I have attached the .config I used with GCC 10.1. If you are able to
test it please do let me know how it went.

> * There are still two prerelease compiler versions on kernel.org,
> from February and from April. You can try each one to see
> if this was a recent regression. It's also possible that there is
> a problem with my specific builds of gcc-10.1, and that the
> compiler is actually fine for others.The gcc-10 packages in
> Fedora/Debian/Ubuntu are probably better tested.

I'm still using Ubuntu 16.04 so not sure how easy it is to find a
package for that, but maybe this is a good reason to finally my upgrade
my laptop :)

--
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches


Attachments:
.config.old (146.32 kB)

2020-05-13 13:47:53

by Arnd Bergmann

[permalink] [raw]
Subject: Re: gcc-10: kernel stack is corrupted and fails to boot

On Wed, May 13, 2020 at 2:57 PM Kalle Valo <[email protected]> wrote:
>
> Arnd Bergmann <[email protected]> writes:
>
> > On Wed, May 13, 2020 at 8:50 AM Kalle Valo <[email protected]> wrote:
> >>
> >> Kalle Valo <[email protected]> writes:
> >
> > At least if it fails reproducibly, it's probably not too hard to drill
> > down further. Some ideas:
> >
> > * I'd first try to reproduce it in qemu. Since you don't even need
> > any user space or modules, I would simply try
> > $ qemu-system-x86_64 -nographic -monitor none -append
> > "console=ttyS0" -serial stdio -smp 4 -kernel arch/x86/boot/bzImage
> > I tried it here with an x86 defconfig linux-next kernel but did not
> > run into the problem you described.
>
> Thanks, I'll try that but I expect it will take few days before I can do
> it.
>
> > If you share your .config, I can try reproducing with that as well.
> > Once there is a reproducer in qemu, it should be trivial to step
> > through it using gdb.
>
> I have attached the .config I used with GCC 10.1. If you are able to
> test it please do let me know how it went.

Yes, I see the same problem now, but have not investigated
any further.

> > * There are still two prerelease compiler versions on kernel.org,
> > from February and from April. You can try each one to see
> > if this was a recent regression. It's also possible that there is
> > a problem with my specific builds of gcc-10.1, and that the
> > compiler is actually fine for others.The gcc-10 packages in
> > Fedora/Debian/Ubuntu are probably better tested.
>
> I'm still using Ubuntu 16.04 so not sure how easy it is to find a
> package for that, but maybe this is a good reason to finally my upgrade
> my laptop :)

I checked with the gcc-10 package from Ubuntu 20.04, same
result as with my version, at least that indicates it's not my fault ;-)

Arnd

2020-05-13 15:32:13

by Kalle Valo

[permalink] [raw]
Subject: Re: gcc-10: kernel stack is corrupted and fails to boot

Arnd Bergmann <[email protected]> writes:

> On Wed, May 13, 2020 at 2:57 PM Kalle Valo <[email protected]> wrote:
>>
>> Arnd Bergmann <[email protected]> writes:
>>
>> > On Wed, May 13, 2020 at 8:50 AM Kalle Valo <[email protected]> wrote:
>> >>
>> >> Kalle Valo <[email protected]> writes:
>> >
>> > At least if it fails reproducibly, it's probably not too hard to drill
>> > down further. Some ideas:
>> >
>> > * I'd first try to reproduce it in qemu. Since you don't even need
>> > any user space or modules, I would simply try
>> > $ qemu-system-x86_64 -nographic -monitor none -append
>> > "console=ttyS0" -serial stdio -smp 4 -kernel arch/x86/boot/bzImage
>> > I tried it here with an x86 defconfig linux-next kernel but did not
>> > run into the problem you described.
>>
>> Thanks, I'll try that but I expect it will take few days before I can do
>> it.
>>
>> > If you share your .config, I can try reproducing with that as well.
>> > Once there is a reproducer in qemu, it should be trivial to step
>> > through it using gdb.
>>
>> I have attached the .config I used with GCC 10.1. If you are able to
>> test it please do let me know how it went.
>
> Yes, I see the same problem now, but have not investigated
> any further.

Great, so it's not a problem due to my setup.

>> > * There are still two prerelease compiler versions on kernel.org,
>> > from February and from April. You can try each one to see
>> > if this was a recent regression. It's also possible that there is
>> > a problem with my specific builds of gcc-10.1, and that the
>> > compiler is actually fine for others.The gcc-10 packages in
>> > Fedora/Debian/Ubuntu are probably better tested.
>>
>> I'm still using Ubuntu 16.04 so not sure how easy it is to find a
>> package for that, but maybe this is a good reason to finally my upgrade
>> my laptop :)
>
> I checked with the gcc-10 package from Ubuntu 20.04, same
> result as with my version, at least that indicates it's not my fault ;-)

That's good to know as well. And now I can delay my laptop upgrade even
more ;)

--
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

2020-05-13 16:01:33

by Arnd Bergmann

[permalink] [raw]
Subject: Re: gcc-10: kernel stack is corrupted and fails to boot

On Wed, May 13, 2020 at 5:31 PM Kalle Valo <[email protected]> wrote:
> Arnd Bergmann <[email protected]> writes:
> > On Wed, May 13, 2020 at 2:57 PM Kalle Valo <[email protected]> wrote:
> >>
> >> Arnd Bergmann <[email protected]> writes:
> >>
> >> > If you share your .config, I can try reproducing with that as well.
> >> > Once there is a reproducer in qemu, it should be trivial to step
> >> > through it using gdb.
> >>
> >> I have attached the .config I used with GCC 10.1. If you are able to
> >> test it please do let me know how it went.
> >
> > Yes, I see the same problem now, but have not investigated
> > any further.
>
> Great, so it's not a problem due to my setup.

I investigated a little more: This does happen with 'defconfig'
after all, in my first try I must have missed the '-smp 2' argument
to qemu, and it ended up working correctly with just one CPU
but fails now.

Stepping through the boot process, I see where it crashes
in start_secondary:

| /* to prevent fake stack check failure in clock setup */
| boot_init_stack_canary();
|
| x86_cpuinit.setup_percpu_clockev();
|
| wmb();
| cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);

The call to cpu_startup_entry() does not succeed, instead
it jumps to __stack_chk_fail() from there.

Arnd

2020-05-13 16:08:36

by David Laight

[permalink] [raw]
Subject: RE: gcc-10: kernel stack is corrupted and fails to boot

From: Arnd Bergmann
> Sent: 13 May 2020 17:00
> On Wed, May 13, 2020 at 5:31 PM Kalle Valo <[email protected]> wrote:
...
> I investigated a little more: This does happen with 'defconfig'
> after all, in my first try I must have missed the '-smp 2' argument
> to qemu, and it ended up working correctly with just one CPU
> but fails now.
>
> Stepping through the boot process, I see where it crashes
> in start_secondary:
>
> | /* to prevent fake stack check failure in clock setup */
> | boot_init_stack_canary();
> |
> | x86_cpuinit.setup_percpu_clockev();
> |
> | wmb();
> | cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
>
> The call to cpu_startup_entry() does not succeed, instead
> it jumps to __stack_chk_fail() from there.

Hasn't this already been fixed?
Add:
asm("");
after cpu_startup_entry() to stop it being tail-called.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2020-05-13 21:29:13

by Arnd Bergmann

[permalink] [raw]
Subject: Re: gcc-10: kernel stack is corrupted and fails to boot

On Wed, May 13, 2020 at 5:48 PM Arvind Sankar <[email protected]> wrote:
>
> On Wed, May 13, 2020 at 09:50:03AM +0300, Kalle Valo wrote:

> > And now I have a problem :) I first noticed that my x86 testbox is not
> > booting when I compile the kernel with GCC 10.1.0 from crosstool. I
> > didn't get any error messages so I just downgraded the compiler and the
> > kernel was booting fine again. Next I decided to try GCC 10.1 with my
> > x86 laptop and it also failed to boot, but this time I got kernel logs
> > and saw this:
> >
> > Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secodary+0x178/0x180
> >
>
> See https://lore.kernel.org/lkml/[email protected]/

Thanks!

I see the patch in linux-next but not in mainline. I suppose we want it in v5.7
and backported to stable kernels so they can boot when built with gcc-10?

I suppose the only reason that the other architectures don't run into the
problem is that they don't call boot_init_stack_canary() in start_secondary()
though they probably should?

Arnd

2020-05-13 21:42:32

by Borislav Petkov

[permalink] [raw]
Subject: Re: gcc-10: kernel stack is corrupted and fails to boot

On Wed, May 13, 2020 at 11:28:09PM +0200, Arnd Bergmann wrote:
> I see the patch in linux-next but not in mainline. I suppose we want
> it in v5.7 and backported to stable kernels so they can boot when
> built with gcc-10?

It is queued for 5.8. For a good reason, if you read the whole thread
Arvind pointed you to.

Lemme guess: gcc10 got released in the meantime (hohumm, website says
so) and so we probably should expedite this and send it to Linus now...?

--
Regards/Gruss,
Boris.

SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG Nürnberg

2020-05-13 21:51:09

by Arnd Bergmann

[permalink] [raw]
Subject: Re: gcc-10: kernel stack is corrupted and fails to boot

On Wed, May 13, 2020 at 11:41 PM Borislav Petkov <[email protected]> wrote:
>
> On Wed, May 13, 2020 at 11:28:09PM +0200, Arnd Bergmann wrote:
> > I see the patch in linux-next but not in mainline. I suppose we want
> > it in v5.7 and backported to stable kernels so they can boot when
> > built with gcc-10?
>
> It is queued for 5.8. For a good reason, if you read the whole thread
> Arvind pointed you to.
>
> Lemme guess: gcc10 got released in the meantime (hohumm, website says
> so) and so we probably should expedite this and send it to Linus now...?

Right, in particular since Linus started building with gcc-10 already and
would likely soon run into that problem if he hasn't already ;-)

Arnd

2020-05-13 22:52:20

by Arvind Sankar

[permalink] [raw]
Subject: Re: gcc-10: kernel stack is corrupted and fails to boot

On Thu, May 14, 2020 at 12:20:38AM +0200, Borislav Petkov wrote:
> On Wed, May 13, 2020 at 11:49:49PM +0200, Arnd Bergmann wrote:
> > Right, in particular since Linus started building with gcc-10 already and
> > would likely soon run into that problem if he hasn't already ;-)
>
> Oh noo, we don't want Linus' kernel broken. ;-)
>
> We will send him the fix this weekend.
>
> Looking at that branch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=x86/build
>
> I think we can send the whole thing even.
>
> Linus, shout if you'd prefer only the last three commits there:
>
> 950a37078aa0 x86/build: Use $(CONFIG_SHELL)
> f670269a42bf x86: Fix early boot crash on gcc-10, next try
> 73da86741e7f x86/build: Check whether the compiler is sane
>
> but the other three from Masahiro are simple cleanups:
>
> 675a59b7dec6 x86/boot/build: Add phony targets in arch/x86/boot/Makefile to PHONY
> 30ce434e44d7 x86/boot/build: Make 'make bzlilo' not depend on vmlinux or $(obj)/bzImage
> e3c7c1052271 x86/boot/build: Add cpustr.h to targets and remove clean-files
>
> which should be ok to take now too AFAICT.
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG Nürnberg

I notice that doesn't add it to start_kernel (init/main.c). That's not
currently affected, but I think we should it add for consistency and
future-proofing, at least for 5.8?

2020-05-13 23:08:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: gcc-10: kernel stack is corrupted and fails to boot

On Wed, May 13, 2020 at 2:50 PM Arnd Bergmann <[email protected]> wrote:
>
> Right, in particular since Linus started building with gcc-10 already and
> would likely soon run into that problem if he hasn't already ;-)

I don't happen to have stack canaries on the configs I actually boot,
so I didn't notice.

But yes, we should get this before 5.7 and mark it for stable, since
F32 is now gcc-10, so people will start hitting it if they enable
stack canaries.

Linus

2020-05-13 23:15:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: gcc-10: kernel stack is corrupted and fails to boot

On Wed, May 13, 2020 at 3:20 PM Borislav Petkov <[email protected]> wrote:
>
> Linus, shout if you'd prefer only the last three commits there:
>
> 950a37078aa0 x86/build: Use $(CONFIG_SHELL)
> f670269a42bf x86: Fix early boot crash on gcc-10, next try
> 73da86741e7f x86/build: Check whether the compiler is sane

Do we really need that sanity check?

Are there known compilers that fail that check? Because honestly, that
sounds unlikely to me to begin with, but if it does happen then that
just means that the prevent_tail_call_optimization() thing is broken.

The check itself doesn't seem worth it. If your worry is that an empty
asm() can be optimized away, then don't use an empty asm!

In other words, the only reason for that check seems to be a worry
that simply isn't worth having.

In fact, I think the check is wrong anyway, since the main thing I can
see that would do a tailcall despite the empty asm is link-time
optimizations that that check doesn't even check for!

So everything I see there just screams "the check is bogus" to me. The
check doesn't work, and if it were to work it only means that the
prevent_tail_call_optimization() thing is too fragile.

Just put a full memory barrier in there, with an actual "mfence"
instruction or whatever, so that you know that the check is pointless,
and so that you know that a link-time optimizer can't turn the
call+return into a tailcall.

Don't send me the broken check.

Linus

2020-05-13 23:40:15

by Borislav Petkov

[permalink] [raw]
Subject: Re: gcc-10: kernel stack is corrupted and fails to boot

On Wed, May 13, 2020 at 04:13:53PM -0700, Linus Torvalds wrote:
> The check itself doesn't seem worth it. If your worry is that an empty
> asm() can be optimized away, then don't use an empty asm!

gcc guys said we should use that since the first attempt using

__attribute__((optimize("-fno-stack-protector")))

didn't work because, well, that attribute turned out to be "not suitable in
production code". :)

Full thread here:

https://lore.kernel.org/lkml/[email protected]/

> In other words, the only reason for that check seems to be a worry
> that simply isn't worth having.

Yes, that was me asking for a way to check whether any future gccs would
violate that. But if they'd do that, they would break a lot of code
depending on it.

> In fact, I think the check is wrong anyway, since the main thing I can
> see that would do a tailcall despite the empty asm is link-time
> optimizations that that check doesn't even check for!
>
> So everything I see there just screams "the check is bogus" to me. The
> check doesn't work, and if it were to work it only means that the
> prevent_tail_call_optimization() thing is too fragile.

So I did test it trivially by removing the asm("") and then it would
tailcall optimize. But we didn't think about LTO so hm, that would
probably break it.

> Just put a full memory barrier in there, with an actual "mfence"
> instruction or whatever, so that you know that the check is pointless,
> and so that you know that a link-time optimizer can't turn the
> call+return into a tailcall.

Right, the intention here was to have it arch-agnostic in
include/linux/compiler.h because powerpc might need it too soon:

arch/powerpc/kernel/smp.c:1296: boot_init_stack_canary();

Looking at them, they do have an mb() too so how about this then
instead?

#define prevent_tail_call_optimization() mb()

Thx.

--
Regards/Gruss,
Boris.

SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG Nürnberg

2020-05-14 00:13:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: gcc-10: kernel stack is corrupted and fails to boot

On Wed, May 13, 2020 at 4:36 PM Borislav Petkov <[email protected]> wrote:
>
>
> Looking at them, they do have an mb() too so how about this then
> instead?
>
> #define prevent_tail_call_optimization() mb()

Yeah, I think a full mb() is likely safe, because that's pretty much
always going to be a real instruction with real semantics, and no
amount of link-time optimizations can move it around a call
instruction.

I could imagine some completely UP in-order CPU that doesn't need to
serialize with anything at all, and even "mb()" might be empty. I
think you can compile old ARM kernels for that. But realistically I
think we can ignore them at least for now - I'm not sure the link-time
optimization will even do things like that tailcall conversion, and
I'm not convinced that old pre-ARMv7 systems will be relevant by the
time (if) it ever does.

Linus

2020-05-14 00:52:08

by Nick Desaulniers

[permalink] [raw]
Subject: Re: gcc-10: kernel stack is corrupted and fails to boot

On Wed, May 13, 2020 at 5:11 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, May 13, 2020 at 4:36 PM Borislav Petkov <[email protected]> wrote:
> >
> >
> > Looking at them, they do have an mb() too so how about this then
> > instead?
> >
> > #define prevent_tail_call_optimization() mb()
>
> Yeah, I think a full mb() is likely safe, because that's pretty much
> always going to be a real instruction with real semantics, and no
> amount of link-time optimizations can move it around a call
> instruction.

Are you sure LTO treats empty asm statements differently than full
memory barriers in regards to preventing tail calls? (I'll take your
word for it, I don't actually know, but seeing an example of real code
run through a production compiler is much much more convincing).

The TL;DR of the very long thread is that
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94722 is a proper fix, on
the GCC side. Adding arbitrary empty asm statements to work around
it? Hacks. Full memory barriers? Hacks.

I'm happy that GCC does an optimization that Clang does not. At the
same time, it sucks to pay a penalty for a bug we don't trigger. This
is the same reason why `asm_volatile_goto` expands differently between
GCC and Clang (and why I tried to undo that like a year ago).

If Clang realizes the same optimization GCC is doing here (related to
tailcalls) tomorrow, well we already support
__attribute__((no_stack_protector)) which can be added to the callees
we don't want tail called in this case (i.e. allowing tail calls). I
should send a patch adding that to include/linux/compiler_attributes.h
and annotate the callees in question, before we forget about this
issue.

Sprinkling empty asm statements or full memory barriers should be
treated with the same hesitancy as adding sleep()s to "work around"
concurrency bugs. Red flag.

And LTO is fun; we've been shipping it in Android for years (and need
to attempt upstreaming again). Just today we found an ODR violation
in one of the most important symbols in the kernel. Will be sending a
patch for that tomorrow.

>
> I could imagine some completely UP in-order CPU that doesn't need to
> serialize with anything at all, and even "mb()" might be empty. I
> think you can compile old ARM kernels for that. But realistically I
> think we can ignore them at least for now - I'm not sure the link-time
> optimization will even do things like that tailcall conversion, and
> I'm not convinced that old pre-ARMv7 systems will be relevant by the
> time (if) it ever does.
>
> Linus



--
Thanks,
~Nick Desaulniers

2020-05-14 02:21:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: gcc-10: kernel stack is corrupted and fails to boot

On Wed, May 13, 2020 at 5:51 PM Nick Desaulniers
<[email protected]> wrote:
>
> Are you sure LTO treats empty asm statements differently than full
> memory barriers in regards to preventing tail calls?

It had better.

At link-time, there is nothing left of an empty asm statement. So by
the time the linker runs, it only sees

call xyz
ret

in the object code. At that point, it's somewhat reasonable for any
link-time optimizer (or an optimizing assembler, for that matter) to
say "I'll just turn that sequence into a simple 'jmp xyz' instead".

Now, don't get me wrong - I'm not convinced any existing LTO does
that. But I'd also not be shocked by something like that.

In contrast, if it's a real mb(), the linker won't see just a
'call+ret" sequence. It will see something like

call xyz
mfence
ret

(ok, the mfence may actually be something else, and we'll have a label
on it and an alternatives table pointing to it, but the point is,
unlike an empty asm, there's something _there_).

Now, if the linker turns that 'call' into a 'jmp', the linker is just
plain buggy.

See the difference?

> The TL;DR of the very long thread is that
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94722 is a proper fix, on
> the GCC side. Adding arbitrary empty asm statements to work around
> it? Hacks. Full memory barriers? Hacks.

BS.

A compiler person might call it a "hack". But said compiler person
will be _wrong_.

An intelligent developer knows that it will take years for compilers
to give us all the infrastructure we need, and even then the compiler
won't actually give us everything - and people will be using the old
compilers for years anyway.

That's why inline asm's exist. They are the escape from the excessive
confines of "let's wait for the compiler person to solve this for us"
- which they'll never do completely anyway.

It's a bit like unsafe C type casts and allowing people to write
"non-portable code". Some compiler people will say that it's bad, and
unsafe. Sure, it can be unsafe, but the point is that it allows you to
do things that aren't necessarily _possible_ to do in an overly
restrictive language.

Sometimes you need to break the rules.

There's a reason everybody writes library routines in "unsafe"
languages like C. Because you need those kinds of escapes in order to
actually do something like a memory allocator etc.

And that's also why we have inline asm - because the compiler will
never know everything or be able to generate code for everything
people wants to do.

And anybody that _thinks_ that the compiler will always know better
and should be in complete control shouldn't be working on compilers.
They should probably be repeating kindergarten, sitting in a corner
eating paste with their friends.

So no. Using inline asms to work around the compiler not understanding
what is going on isn't a "hack". It's the _point_ of inline asm.

Is it perfect? No. But it's not like there are many alternatives.

Linus

2020-05-14 08:13:05

by David Laight

[permalink] [raw]
Subject: RE: gcc-10: kernel stack is corrupted and fails to boot

From: Linus Torvalds
> Sent: 14 May 2020 03:20
> On Wed, May 13, 2020 at 5:51 PM Nick Desaulniers
> <[email protected]> wrote:
> >
> > Are you sure LTO treats empty asm statements differently than full
> > memory barriers in regards to preventing tail calls?
>
> It had better.
>
> At link-time, there is nothing left of an empty asm statement. So by
> the time the linker runs, it only sees
>
> call xyz
> ret
>
> in the object code. At that point, it's somewhat reasonable for any
> link-time optimizer (or an optimizing assembler, for that matter) to
> say "I'll just turn that sequence into a simple 'jmp xyz' instead".

Except is sees:
call xyz
canary_check_code
ret

There's also almost certainly some stack frame tidyup.
Which it would have to detect and convert.
And, in principle, the function is allowed to access the
stack space than contains the canary.

> Now, don't get me wrong - I'm not convinced any existing LTO does
> that. But I'd also not be shocked by something like that.
>
> In contrast, if it's a real mb(), the linker won't see just a
> 'call+ret" sequence. It will see something like
>
> call xyz
> mfence
> ret
>
> (ok, the mfence may actually be something else, and we'll have a label
> on it and an alternatives table pointing to it, but the point is,
> unlike an empty asm, there's something _there_).

Not if you've an architecture that doesn't have any memory barriers.
In that case mb() may not even be asm("").
(although it might have to be asm ("":::memory)).

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2020-05-14 09:16:32

by Harald Arnesen

[permalink] [raw]
Subject: Re: gcc-10: kernel stack is corrupted and fails to boot

Kalle Valo [13.05.2020 17:31]:

> Great, so it's not a problem due to my setup.

I see the same thing on two machines, using a self-compiled gcc 10.1.0.
Glad to hear it's not just me. Switched back to 9.3.0 for the time being.
--
Hilsen Harald

2020-05-14 15:51:34

by Arvind Sankar

[permalink] [raw]
Subject: Re: gcc-10: kernel stack is corrupted and fails to boot

On Thu, May 14, 2020 at 10:40:44AM +0200, Arnd Bergmann wrote:
> On Thu, May 14, 2020 at 7:22 AM Arvind Sankar <[email protected]> wrote:
> > On Wed, May 13, 2020 at 09:52:07PM -0700, Linus Torvalds wrote:
> > > On Wed, May 13, 2020, 20:50 Andy Lutomirski <[email protected]> wrote:
> > The gcc docs [1,2] at least don't inspire much confidence that this will
> > continue working with plain asm("") though:
> >
> > "Note that GCC’s optimizers can move asm statements relative to other
> > code, including across jumps."
> > ...
> > "Note that the compiler can move even volatile asm instructions relative
> > to other code, including across jump instructions."
> >
> > Even if we don't include an instruction in it I think it should at least
> > have a memory clobber, to stop the compiler from deciding that it can be
> > moved before the call so it can do the tail-call optimization.
>
> I think LTO would still be able to notice that cpu_startup_entry() can
> be annotated __attribute__((noreturn)) and optimize the callers
> accordingly, which in turn would allow a tail call again after dead code
> elimination.
>
> Arnd

Yes, with LTO the only solution is to actually compile the caller
without stack checking I think. Although at present gcc actually
doesn't tail-call optimize calls to noreturn functions that could easily
change.