2023-08-10 02:01:49

by Nathan Chancellor

Subject: Hang when booting guest kernels compiled with clang after SRSO mitigations

Hi Boris,

I updated my AMD 3990X workstation to a version of mainline that
contains the SRSO mitigations and I am now seeing a hang when booting
guest kernels built with clang in QEMU/KVM with an '-smp' value greater
than one (I am just testing 'ARCH=x86_64 defconfig', nothing fancy). The
host's kernel is built with GCC 13.2.0, in case that is relevant. The
issue happens with all versions of clang that the kernel supports
(11.x+). I do not see the issue with guest kernels built with GCC nor do
I see the issue with '-smp 1', so it could be something that clang has
done to the guest kernel that causes this but I figured I would report
it early anyways.

With '-smp 4' (for example), I see

[ 0.102817] smpboot: CPU0: AMD Ryzen Threadripper 3990X 64-Core Processor (family: 0x17, model: 0x31, stepping: 0x0)
...
[ 0.109778] smp: Bringing up secondary CPUs ...
[ 0.110559] smpboot: x86: Booting SMP configuration:

then nothing after that, until timeout kills QEMU.

With '-smp 2', I can get all the way to userspace but it hangs when
shutting down and I see what appears to be basically the same stack
trace three times (I just included the last one):

Sent SIGKILL to all processes
Requesting system poweroff
[ 2.499704] ACPI: PM: Preparing to enter system sleep state S5
[ 2.500470] reboot: Power down
...
[ 152.698101] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 152.698813] rcu: 0-...0: (9 ticks this GP) idle=556c/1/0x4000000000000000 softirq=333/335 fqs=29392
[ 152.699718] rcu: (detected by 1, t=147019 jiffies, g=-1003, q=2 ncpus=2)
[ 152.700368] Sending NMI from CPU 1 to CPUs 0:
[ 152.700795] NMI backtrace for cpu 0
[ 152.700799] CPU: 0 PID: 117 Comm: init Not tainted 6.5.0-rc5 #1
[ 152.700800] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[ 152.700802] RIP: 0010:default_send_IPI_allbutself+0x2a/0x60
[ 152.700806] Code: 83 ff 02 74 35 66 66 2e 0f 1f 84 00 00 00 00 00 f7 04 25 00 c3 5f ff 00 10 00 00 74 0f f3 90 f7 04 25 00 c3 5f ff 00 10 00 00 <75> f1 81 cf 00 00 0c 00 89 3c 25 00 c3 5f ff c3 48 8b 05 d7 2c 61
[ 152.700807] RSP: 0018:ffff99e58027fdb0 EFLAGS: 00000082
[ 152.700808] RAX: ffffffff8426ab50 RBX: 0000000000000000 RCX: 0000000000000001
[ 152.700809] RDX: ffffffffff5fb000 RSI: 00000000000000f8 RDI: 00000000000000f8
[ 152.700810] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffff84482120
[ 152.700810] R10: 0000000000000000 R11: ffffffff82c57f50 R12: 000000004321fedc
[ 152.700811] R13: fffffffffee1dead R14: 0000000000000000 R15: 0000000028121969
[ 152.700813] FS: 00007f1eb36696a0(0000) GS:ffff978b9f000000(0000) knlGS:0000000000000000
[ 152.700814] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 152.700815] CR2: 00007f09d4a8a360 CR3: 00000000024c4000 CR4: 0000000000350ef0
[ 152.700817] Call Trace:
[ 152.700819] <NMI>
[ 152.700820] ? nmi_cpu_backtrace+0xde/0x110
[ 152.700822] ? nmi_cpu_backtrace_handler+0x8/0x10
[ 152.700823] ? nmi_handle+0x69/0x150
[ 152.700824] ? default_do_nmi+0x43/0x1d0
[ 152.700826] ? exc_nmi+0xbc/0x130
[ 152.700827] ? end_repeat_nmi+0x16/0x67
[ 152.700836] ? default_send_IPI_single+0x30/0x30
[ 152.700838] ? default_send_IPI_allbutself+0x2a/0x60
[ 152.700839] ? default_send_IPI_allbutself+0x2a/0x60
[ 152.700840] ? default_send_IPI_allbutself+0x2a/0x60
[ 152.700840] </NMI>
[ 152.700841] <TASK>
[ 152.700841] native_stop_other_cpus+0x7d/0x1f0
[ 152.700843] native_machine_shutdown+0x17/0x40
[ 152.700844] native_machine_power_off+0x24/0x30
[ 152.700846] __se_sys_reboot+0x221/0x230
[ 152.700848] do_syscall_64+0x31/0x50
[ 152.700849] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 152.700851] RIP: 0033:0x7f1eb35ceffb
[ 152.700853] Code: ff 76 10 48 8b 15 95 ee 07 00 f7 d8 64 89 02 48 83 c8 ff c3 48 63 d7 be 69 19 12 28 b8 a9 00 00 00 48 c7 c7 ad de e1 fe 0f 05 <48> 3d 00 f0 ff ff 76 10 48 8b 15 66 ee 07 00 f7 d8 64 89 02 48 83
[ 152.700854] RSP: 002b:00007fff81094238 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
[ 152.700855] RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f1eb35ceffb
[ 152.700855] RDX: 000000004321fedc RSI: 0000000028121969 RDI: fffffffffee1dead
[ 152.700856] RBP: 000000004321fedc R08: 8080808080808080 R09: 6b6470ff65656e71
[ 152.700856] R10: 0000000000000008 R11: 0000000000000246 R12: 00007fff81095fc2
[ 152.700857] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 152.700858] </TASK>

I tested the other mitigation options for SRSO and it seems the only one
that has problems is safe-ret (I don't think there is a microcode update
for this machine yet, so I did not bother trying microcode).

spec_rstack_overflow=off: No issue
spec_rstack_overflow=safe-ret: Has issue
spec_rstack_overflow=ibpb: No issue
spec_rstack_overflow=ibpb-vmexit: No issue
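
For anyone reproducing the table above, the mitigation the host actually selected can be read back from sysfs instead of being inferred from the boot parameter. This is a minimal sketch, assuming the host kernel carries the SRSO patches (which add a spec_rstack_overflow file under the CPU vulnerabilities directory). The helper name srso_status and the optional path argument are made up for illustration.

```shell
#!/bin/sh
# Print the SRSO mitigation state reported by the running (host) kernel.
# $1 is an optional path override (hypothetical, handy for testing);
# the default is the sysfs file exposed by kernels with the SRSO patches.
srso_status() {
    file="${1:-/sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow}"
    if [ -r "$file" ]; then
        cat "$file"
    else
        # Older kernels do not have this file at all.
        echo "unknown (cannot read $file)"
    fi
}
```

Calling srso_status with no argument on the host after each reboot confirms which of the four settings was in effect for a given guest boot.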

My QEMU command is below, in case it is relevant. The rootfs is
available at
https://github.com/ClangBuiltLinux/boot-utils/releases/tag/20230707-182910.

$ qemu-system-x86_64 \
-display none \
-nodefaults \
-d unimp,guest_errors \
-append 'console=ttyS0 earlycon=uart8250,io,0x3f8' \
-kernel arch/x86/boot/bzImage \
-initrd rootfs.cpio \
-cpu host \
-enable-kvm \
-m 512m \
-smp 4 \
-serial mon:stdio
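
Since the symptom depends on the '-smp' value, the command above can be wrapped in a small loop that boots the guest once per vCPU count and flags the runs that had to be killed. This is only a sketch: the function names, the log file names and the 60 second limit are assumptions, and it relies on the rootfs powering the guest off by itself after a successful boot.

```shell
#!/bin/sh
# Placeholders; override from the environment as needed.
KERNEL="${KERNEL:-arch/x86/boot/bzImage}"
INITRD="${INITRD:-rootfs.cpio}"
LIMIT="${LIMIT:-60}"

# Boot the guest with "-smp $1", using the same flags as the command above,
# and let timeout(1) kill QEMU if it runs longer than $LIMIT seconds.
boot_guest() {
    timeout "$LIMIT" qemu-system-x86_64 \
        -display none -nodefaults \
        -d unimp,guest_errors \
        -append 'console=ttyS0 earlycon=uart8250,io,0x3f8' \
        -kernel "$KERNEL" -initrd "$INITRD" \
        -cpu host -enable-kvm -m 512m \
        -smp "$1" -serial mon:stdio
}

# Map an exit status to a verdict; GNU timeout exits with 124 when the
# time limit was hit.
classify() {
    if [ "$1" -eq 124 ]; then
        echo "hang"
    else
        echo "exit $1"
    fi
}

# Boot once per vCPU count given as arguments, keeping each serial log.
run_matrix() {
    for n in "$@"; do
        boot_guest "$n" > "smp-$n.log" 2>&1
        rc=$?
        echo "-smp $n: $(classify "$rc")"
    done
}
```

Running `run_matrix 1 2 4` from the kernel build directory would print one verdict per vCPU count and leave the serial output in smp-1.log, smp-2.log and smp-4.log.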

If there is any other information I can provide or patches I can test, I
am more than happy to do so. Should you need access to a clang toolchain
for building the guest kernel (if you did not want to bother installing
one from your distro), I have ones available on kernel.org similar to
the GCC ones that should work fine:

https://mirrors.edge.kernel.org/pub/tools/llvm/

Cheers,
Nathan


2023-08-10 08:18:31

by Borislav Petkov

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Wed, Aug 09, 2023 at 06:33:34PM -0700, Nathan Chancellor wrote:
> Hi Boris,
>
> I updated my AMD 3990X workstation to a version of mainline that
> contains the SRSO mitigations and I am now seeing a hang when booting
> guest kernels built with clang in QEMU/KVM with an '-smp' value greater
> than one (I am just testing 'ARCH=x86_64 defconfig', nothing fancy). The
> host's kernel is built with GCC 13.2.0, in case that is relevant. The
> issue happens with all versions of clang that the kernel supports
> (11.x+). I do not see the issue with guest kernels built with GCC nor do
> I see the issue with '-smp 1', so it could be something that clang has
> done to the guest kernel that causes this but I figured I would report
> it early anyways.
>
> With '-smp 4' (for example), I see
>
> [ 0.102817] smpboot: CPU0: AMD Ryzen Threadripper 3990X 64-Core Processor (family: 0x17, model: 0x31, stepping: 0x0)
> ...
> [ 0.109778] smp: Bringing up secondary CPUs ...
> [ 0.110559] smpboot: x86: Booting SMP configuration:

I can repro this here with Debian clang version 14.0.6-2 even with -smp
2.

Lemme poke at this a bit.

Thx.


--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-08-10 09:51:11

by Borislav Petkov

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Thu, Aug 10, 2023 at 11:08:35AM +0200, Borislav Petkov wrote:
> On Thu, Aug 10, 2023 at 10:10:38AM +0200, Borislav Petkov wrote:
> > I can repro this here with Debian clang version 14.0.6-2 even with -smp
> > 2.
> >
> > Lemme poke at this a bit.
>
> Err, this stops booting even on plain -rc5 which doesn't have the SRSO
> patches.
>
> If so, then you'd need to bisect.

-rc4 doesn't boot either here.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-08-10 10:39:47

by Borislav Petkov

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Thu, Aug 10, 2023 at 10:10:38AM +0200, Borislav Petkov wrote:
> I can repro this here with Debian clang version 14.0.6-2 even with -smp
> 2.
>
> Lemme poke at this a bit.

Err, this stops booting even on plain -rc5 which doesn't have the SRSO
patches.

If so, then you'd need to bisect.


--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-08-10 11:05:12

by Nathan Chancellor

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Thu, Aug 10, 2023 at 11:08:35AM +0200, Borislav Petkov wrote:
> On Thu, Aug 10, 2023 at 10:10:38AM +0200, Borislav Petkov wrote:
> > I can repro this here with Debian clang version 14.0.6-2 even with -smp
> > 2.
> >
> > Lemme poke at this a bit.
>
> Err, this stops booting even on plain -rc5 which doesn't have the SRSO
> patches.

Just to clarify, this is the guest kernel at -rc5 and the host kernel
with the SRSO mitigations applied? If so, that's the problem. The guest
kernel does not have to have the SRSO mitigations applied to see this
problem. Sorry I should have made that more clear! If not though, that's
interesting because I was running -rc5 on the host without issues.

Cheers,
Nathan

2023-08-10 13:26:47

by Borislav Petkov

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Thu, Aug 10, 2023 at 03:16:49AM -0700, Nathan Chancellor wrote:
> Just to clarify, this is the guest kernel at -rc5 and the host kernel
> with the SRSO mitigations applied? If so, that's the problem. The guest
> kernel does not have to have the SRSO mitigations applied to see this
> problem. Sorry I should have made that more clear! If not though, that's
> interesting because I was running -rc5 on the host without issues.

Well, how do you even build CPU_SRSO with clang?

config CPU_SRSO
        bool "Mitigate speculative RAS overflow on AMD"
        depends on CPU_SUP_AMD && X86_64 && RETHUNK
                                            ^^^^^^^

config RETHUNK
        bool "Enable return-thunks"
        depends on RETPOLINE && CC_HAS_RETURN_THUNK
                                ^^^^^^^^^^^^^^^^^^^

config CC_HAS_RETURN_THUNK
        def_bool $(cc-option,-mfunction-return=thunk-extern)

$ clang -mfunction-return=thunk-extern
clang: error: unknown argument: '-mfunction-return=thunk-extern'
clang: error: no input files

$ clang --version
Debian clang version 14.0.6
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

Hmmm.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-08-10 14:03:00

by Borislav Petkov

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Thu, Aug 10, 2023 at 06:27:06AM -0700, Nathan Chancellor wrote:
> But my host kernel was compiled using GCC 13.2.0 from kernel.org for the
> sake of testing to see if the compiler used to build the host kernel had
> an impact on the problem and it did not.

Ok, now I'm confused.

Lemme see if I understand it correctly:

host kernel:
- SRSO enabled
- built with gcc

guest kernel:
- built with clang
- SRSO not necessary

Is that the scenario?

Anything else?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-08-10 14:07:20

by Nathan Chancellor

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Thu, Aug 10, 2023 at 02:51:22PM +0200, Borislav Petkov wrote:
> On Thu, Aug 10, 2023 at 03:16:49AM -0700, Nathan Chancellor wrote:
> > Just to clarify, this is the guest kernel at -rc5 and the host kernel
> > with the SRSO mitigations applied? If so, that's the problem. The guest
> > kernel does not have to have the SRSO mitigations applied to see this
> > problem. Sorry I should have made that more clear! If not though, that's
> > interesting because I was running -rc5 on the host without issues.
>
> Well, how do you even build CPU_SRSO with clang?
>
> config CPU_SRSO
>         bool "Mitigate speculative RAS overflow on AMD"
>         depends on CPU_SUP_AMD && X86_64 && RETHUNK
>                                             ^^^^^^^
>
> config RETHUNK
>         bool "Enable return-thunks"
>         depends on RETPOLINE && CC_HAS_RETURN_THUNK
>                                 ^^^^^^^^^^^^^^^^^^^
>
> config CC_HAS_RETURN_THUNK
>         def_bool $(cc-option,-mfunction-return=thunk-extern)
>
> $ clang -mfunction-return=thunk-extern
> clang: error: unknown argument: '-mfunction-return=thunk-extern'
> clang: error: no input files
>
> $ clang --version
> Debian clang version 14.0.6
> Target: x86_64-pc-linux-gnu
> Thread model: posix
> InstalledDir: /usr/bin
>
> Hmmm.

That option was only backported to LLVM 15.x+ because 14.x and older
were not supported any more when it was added.

$ clang -mfunction-return=thunk-extern -x c -c -o /dev/null /dev/null

$ clang --version
clang version 15.0.7
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
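
The check above can be wrapped in a small helper for comparing several toolchains. This is a rough sketch of what Kconfig's cc-option does, not the macro itself (the real one also passes the kernel's own flags and uses a temporary output file). The name cc_has_flag is made up for illustration.

```shell
#!/bin/sh
# Succeed only if compiler $1 accepts flag $2 when compiling an empty
# C translation unit. -Werror turns "unknown argument" warnings into
# failures, which matters for compilers that only warn about them.
cc_has_flag() {
    "$1" -Werror "$2" -x c -c -o /dev/null /dev/null > /dev/null 2>&1
}
```

For example, `cc_has_flag clang -mfunction-return=thunk-extern && echo yes || echo no` should report whether a given clang can satisfy CC_HAS_RETURN_THUNK.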

But my host kernel was compiled using GCC 13.2.0 from kernel.org for the
sake of testing to see if the compiler used to build the host kernel had
an impact on the problem and it did not.

Cheers,
Nathan

2023-08-10 14:45:18

by Nathan Chancellor

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Thu, Aug 10, 2023 at 03:32:16PM +0200, Borislav Petkov wrote:
> On Thu, Aug 10, 2023 at 06:27:06AM -0700, Nathan Chancellor wrote:
> > But my host kernel was compiled using GCC 13.2.0 from kernel.org for the
> > sake of testing to see if the compiler used to build the host kernel had
> > an impact on the problem and it did not.
>
> Ok, now I'm confused.

Heh, so was I at first when I was doing my regular build and boot tests
of -next :P

> Lemme see if I understand it correctly:
>
> host kernel:
> - SRSO enabled
> - built with gcc
>
> guest kernel:
> - built with clang
> - SRSO not necessary
>
> Is that the scenario?

Yes, that should be correct.

Host kernel string:

6.5.0-rc5-00039-g138bcddb86d8 ([email protected]) (x86_64-linux-gcc (GCC) 13.2.0, GNU ld (GNU Binutils) 2.41) #1 SMP PREEMPT_DYNAMIC Wed Aug 9 17:34:43 MST 2023

Guest kernel string:

6.5.0-rc5 ([email protected]) (ClangBuiltLinux clang version 16.0.6 (https://github.com/llvm/llvm-project 7cbf1a2591520c2491aa35339f227775f4d3adf6), GNU ld (GNU Binutils) 2.41.50.20230809) #1 SMP PREEMPT_DYNAMIC Wed Aug 9 16:54:33 MST 2023

> Anything else?

Shouldn't be. As I noted in the original email, it seems to be something
specific to the safe-ret mitigation, as I don't see the problem with
ibpb. That would be a good canary for making sure that you see the same
behavior.

Cheers,
Nathan

2023-08-10 15:16:03

by Borislav Petkov

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Thu, Aug 10, 2023 at 06:40:56AM -0700, Nathan Chancellor wrote:
> 6.5.0-rc5-00039-g138bcddb86d8 ([email protected]) (x86_64-linux-gcc (GCC) 13.2.0, GNU ld (GNU Binutils) 2.41) #1 SMP PREEMPT_DYNAMIC Wed Aug 9 17:34:43 MST 2023

Mine:

Linux version 6.5.0-rc5+ (root@vh) (gcc (Debian 10.2.1-3) 10.2.1 20201224, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Thu Aug 10 16:13:54 CEST 2023

...

[ 0.083541] Speculative Return Stack Overflow: Mitigation: safe RET

> Guest kernel string:
>
> 6.5.0-rc5 ([email protected]) (ClangBuiltLinux clang version 16.0.6 (https://github.com/llvm/llvm-project 7cbf1a2591520c2491aa35339f227775f4d3adf6), GNU ld (GNU Binutils) 2.41.50.20230809) #1 SMP PREEMPT_DYNAMIC Wed Aug 9 16:54:33 MST 2023
>

Mine:

[ 0.000000] Linux version 6.5.0-rc5 (root@vh) (Debian clang version 14.0.6, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Thu Aug 10 13:22:30 CEST 2023

Guest and host are up and running.

There's something else missing.

Your host gcc is 13, maybe I should update...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-08-10 15:26:39

by Nathan Chancellor

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Thu, Aug 10, 2023 at 04:43:44PM +0200, Borislav Petkov wrote:
> On Thu, Aug 10, 2023 at 06:40:56AM -0700, Nathan Chancellor wrote:
> Linux version 6.5.0-rc5+ (root@vh) (gcc (Debian 10.2.1-3) 10.2.1 20201224, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Thu Aug 10 16:13:54 CEST 2023
>
> ...
>
> [ 0.083541] Speculative Return Stack Overflow: Mitigation: safe RET

I just tried

Linux version 6.5.0-rc5-00039-g138bcddb86d8 ([email protected]) (x86_64-linux-gcc (GCC) 10.4.0, GNU ld (GNU Binutils) 2.39) #1 SMP PREEMPT_DYNAMIC Thu Aug 10 07:48:28 MST 2023

[ 0.000259] Speculative Return Stack Overflow: Mitigation: safe RET

on the host...

> [ 0.000000] Linux version 6.5.0-rc5 (root@vh) (Debian clang version 14.0.6, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Thu Aug 10 13:22:30 CEST 2023

with

[ 0.000000] Linux version 6.5.0-rc5 ([email protected]) (Debian clang version 14.0.6, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Thu Aug 10 07:58:56 MST 2023

in the guest and I see the same problem.

> Guest and host are up and running.
>
> There's something else missing.

Configuration difference? Here is the one from the most recent build:

https://gist.github.com/nathanchance/2d7ad0b9440a6a2ec5ba0b88e3e673a9

Is there any other information that could be relevant here? Here is my
microcode version according to dmesg, in case that matters:

[ 2.408527] microcode: microcode updated early to new patch_level=0x0830107a

Is that machine Zen 2? I see this issue on my Ryzen 3 4300G as well,
which is also Zen 2.

> Your host gcc is 13, maybe I should update...

Seems like I can reproduce it with earlier versions of GCC (and I could
reproduce it with clang) so it does not seem like it is toolchain
related on the host side but might be interesting to test.

I just use https://mirrors.edge.kernel.org/pub/tools/crosstool/ for easy
access to multiple versions.

Cheers,
Nathan

2023-08-10 15:52:50

by Borislav Petkov

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Thu, Aug 10, 2023 at 08:07:06AM -0700, Nathan Chancellor wrote:
> [ 2.408527] microcode: microcode updated early to new patch_level=0x0830107a

Hm, a wild guess: can you boot the *host* with "dis_ucode_ldr" on the
kernel cmdline and see if it still reproduces?
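
When testing boot parameters like this, it is easy to reboot into the wrong entry, so it can help to confirm that the parameter really reached the running kernel. A minimal sketch, assuming a Linux host with /proc/cmdline; the helper name cmdline_has and its optional file argument are hypothetical.

```shell
#!/bin/sh
# Succeed if boot parameter $1 is present on the kernel command line,
# either bare ("dis_ucode_ldr") or with a value ("name=value").
# $2 is an optional file to read instead of /proc/cmdline (for testing).
cmdline_has() {
    file="${2:-/proc/cmdline}"
    # Word-split the command line on whitespace and compare each word.
    for word in $(cat "$file"); do
        case "$word" in
            "$1"|"$1"=*) return 0 ;;
        esac
    done
    return 1
}
```

On the host, `cmdline_has dis_ucode_ldr && echo "microcode loader disabled"` would confirm the setting before starting the guest.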

Also, can you bisect rc5..master to see which exact patch is causing
this?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-08-10 16:01:29

by Nathan Chancellor

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Thu, Aug 10, 2023 at 05:14:10PM +0200, Borislav Petkov wrote:
> On Thu, Aug 10, 2023 at 08:07:06AM -0700, Nathan Chancellor wrote:
> > [ 2.408527] microcode: microcode updated early to new patch_level=0x0830107a
>
> Hm, a wild guess: can you boot the *host* with "dis_ucode_ldr" on the
> kernel cmdline and see if it still reproduces?

It does.

> Also, can you bisect rc5..master to see which exact patch is causing
> this?

Sure thing. I at least isolated it to the SRSO merge, so I will just
bisect that to see the exact commit that causes this.

Cheers,
Nathan

2023-08-10 16:36:24

by Nathan Chancellor

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Thu, Aug 10, 2023 at 08:48:31AM -0700, Nathan Chancellor wrote:
> On Thu, Aug 10, 2023 at 05:14:10PM +0200, Borislav Petkov wrote:
> > Also, can you bisect rc5..master to see which exact patch is causing
> > this?
>
> Sure thing. I at least isolated it to the SRSO merge, so I will just
> bisect that to see the exact commit that causes this.

Heh, figured this would be the case:

# bad: [138bcddb86d8a4f842e4ed6f0585abc9b1a764ff] Merge tag 'x86_bugs_srso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
# good: [14f9643dc90adea074a0ffb7a17d337eafc6a5cc] Merge tag 'wq-for-6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
git bisect start '138bcddb86d8a4f842e4ed6f0585abc9b1a764ff' '14f9643dc90adea074a0ffb7a17d337eafc6a5cc'
# bad: [233d6f68b98d480a7c42ebe78c38f79d44741ca9] x86/srso: Add IBPB
git bisect bad 233d6f68b98d480a7c42ebe78c38f79d44741ca9
# bad: [fb3bd914b3ec28f5fb697ac55c4846ac2d542855] x86/srso: Add a Speculative RAS Overflow mitigation
git bisect bad fb3bd914b3ec28f5fb697ac55c4846ac2d542855
# good: [0e52740ffd10c6c316837c6c128f460f1aaba1ea] x86/bugs: Increase the x86 bugs vector size to two u32s
git bisect good 0e52740ffd10c6c316837c6c128f460f1aaba1ea
# first bad commit: [fb3bd914b3ec28f5fb697ac55c4846ac2d542855] x86/srso: Add a Speculative RAS Overflow mitigation
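
A log in this format can be fed back to git with `git bisect replay <file>` to repeat the same steps on another tree. To pull just the verdict out of a saved log, something like the following sketch works; the function name first_bad_commit is made up for illustration.

```shell
#!/bin/sh
# Read `git bisect log` output on stdin and print the hash from the
# "# first bad commit: [<hash>] <subject>" line, if there is one.
first_bad_commit() {
    sed -n 's/^# first bad commit: \[\([0-9a-f]*\)\].*/\1/p'
}
```

For example, `git bisect log | first_bad_commit` prints only the offending commit hash, which can then be passed to `git show`.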

Not sure how helpful that will be...

Cheers,
Nathan

2023-08-11 11:56:29

by Borislav Petkov

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Thu, Aug 10, 2023 at 09:14:14AM -0700, Nathan Chancellor wrote:
> Not sure how helpful that will be...

Yeah, not really. More wild guesses: if you comment out the UNTRAIN_RET in
__svm_vcpu_run() on the host, does that have any effect? Diff below.

Also, can you send me the host and guest .configs and the compilers
you've used so that I can try to reproduce here exactly what you have?

Thx.

---
diff --git a/arch/x86/kvm/svm/vmenter.S b/arch/x86/kvm/svm/vmenter.S
index 265452fc9ebe..b5871259a973 100644
--- a/arch/x86/kvm/svm/vmenter.S
+++ b/arch/x86/kvm/svm/vmenter.S
@@ -222,7 +222,7 @@ SYM_FUNC_START(__svm_vcpu_run)
* because interrupt handlers won't sanitize 'ret' if the return is
* from the kernel.
*/
- UNTRAIN_RET
+// UNTRAIN_RET

/* SRSO */
ALTERNATIVE "", "call entry_ibpb", X86_FEATURE_IBPB_ON_VMEXIT


--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-08-11 15:04:19

by Nathan Chancellor

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Fri, Aug 11, 2023 at 12:14:56PM +0200, Borislav Petkov wrote:
> On Thu, Aug 10, 2023 at 09:14:14AM -0700, Nathan Chancellor wrote:
> > Not sure how helpful that will be...
>
> Yeah, not really. More wild guesses: if you comment out the UNTRAIN_RET in
> __svm_vcpu_run() on the host, does that have any effect? Diff below.

Unfortunately, that seems to make no difference...

I did have to switch to the Ryzen 3 box for testing, as I am not at home
for a couple of days and I did not want to lose access to my workstation
if I took a bad update since it has no remote management capabilities.
Something I noticed in doing so is that the VM boot on that machine
appears to get farther along than on my Threadripper 3990X, but I still
see a hang with a stack trace similar to the one that I reported in the
initial post with '-smp 2', so I think it is the same problem but
perhaps the more cores the VM has, the more likely it is to appear
totally hung? Might be a red herring but I figured I would mention it in
case it is relevant.

[ 0.000000] Linux version 6.5.0-rc5 ([email protected]) (ClangBuiltLinux clang version 16.0.6 (https://github.com/llvm/llvm-project 7cbf1a2591520c2491aa35339f227775f4d3adf6), GNU ld (GNU Binutils) 2.41.0) #1 SMP PREEMPT_DYNAMIC Fri Aug 11 06:15:25 MST 2023
...
[ 0.141781] smp: Bringing up secondary CPUs ...
[ 0.142524] smpboot: x86: Booting SMP configuration:
[ 0.143450] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
[ 21.145445] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 21.146443] rcu: 1-...!: (20554 ticks this GP) idle=04bc/0/0x1 softirq=1/1 fqs=0
[ 21.146443] rcu: (t=21007 jiffies g=-1187 q=1 ncpus=8)
[ 21.146443] rcu: rcu_preempt kthread starved for 21009 jiffies! g-1187 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[ 21.146443] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 21.146443] rcu: RCU grace-period kthread stack dump:
[ 21.146443] task:rcu_preempt state:R running task stack:15360 pid:14 ppid:2 flags:0x00004000
[ 21.146443] Call Trace:
[ 21.146443] <TASK>
[ 21.146443] __schedule+0x618/0x8a0
[ 21.146443] schedule+0x51/0x90
[ 21.146443] schedule_timeout+0xb5/0x170
[ 21.146443] ? __pfx_process_timeout+0x10/0x10
[ 21.146443] rcu_gp_fqs_loop+0x1a7/0x6b0
[ 21.146443] ? __note_gp_changes+0x39/0x210
[ 21.146443] rcu_gp_kthread+0x1c/0x1e0
[ 21.146443] ? __pfx_rcu_gp_kthread+0x10/0x10
[ 21.146443] kthread+0xe6/0x100
[ 21.146443] ? __pfx_kthread+0x10/0x10
[ 21.146443] ret_from_fork+0x35/0x40
[ 21.146443] ? __pfx_kthread+0x10/0x10
[ 21.146443] ret_from_fork_asm+0x1b/0x30
[ 21.146443] </TASK>
[ 21.146443] rcu: Stack dump where RCU GP kthread last ran:
[ 21.146443] Sending NMI from CPU 1 to CPUs 0:
[ 21.196100] NMI backtrace for cpu 0
[ 21.196103] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.5.0-rc5 #1
[ 21.196105] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.2-1-1 04/01/2014
[ 21.196106] RIP: 0010:default_send_IPI_allbutself+0x23/0x50
[ 21.196111] Code: 90 90 90 90 90 90 90 f3 0f 1e fa 83 ff 02 74 2f f7 04 25 00 c3 5f ff 00 10 00 00 74 0f f3 90 f7 04 25 00 c3 5f ff 00 10 00 00 <75> f1 81 cf 00 00 0c 00 89 3c 25 00 c3 5f ff 2e e9 68 5b ef 00 48
[ 21.196112] RSP: 0018:ffffb268c0013cb0 EFLAGS: 00000282
[ 21.196114] RAX: ffffffff9e993b50 RBX: ffffa2061f02bf90 RCX: 00000000000000ff
[ 21.196115] RDX: 0000000000000000 RSI: ffffa2061f1efda0 RDI: 00000000000000fc
[ 21.196116] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
[ 21.196117] R10: 0000000000000000 R11: ffffffff9d2652a0 R12: 0000000000000000
[ 21.196118] R13: ffffa2061f02bf80 R14: ffffa2061f1efda0 R15: 0000000000000007
[ 21.196120] FS: 0000000000000000(0000) GS:ffffa2061f000000(0000) knlGS:0000000000000000
[ 21.196121] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 21.196122] CR2: ffffa20607a01000 CR3: 0000000006e2a000 CR4: 0000000000350ef0
[ 21.196124] Call Trace:
[ 21.196126] <NMI>
[ 21.196127] ? nmi_cpu_backtrace+0x105/0x130
[ 21.196130] ? nmi_cpu_backtrace_handler+0xc/0x20
[ 21.196132] ? nmi_handle+0x66/0x150
[ 21.196134] ? default_send_IPI_allbutself+0x23/0x50
[ 21.196135] ? default_do_nmi+0x41/0x100
[ 21.196137] ? exc_nmi+0xbb/0x130
[ 21.196138] ? end_repeat_nmi+0x16/0x67
[ 21.196140] ? __pfx_default_send_IPI_allbutself+0x10/0x10
[ 21.196141] ? default_send_IPI_allbutself+0x23/0x50
[ 21.196143] ? default_send_IPI_allbutself+0x23/0x50
[ 21.196144] ? default_send_IPI_allbutself+0x23/0x50
[ 21.196145] </NMI>
[ 21.196145] <TASK>
[ 21.196146] kvm_smp_send_call_func_ipi+0x10/0x60
[ 21.196148] smp_call_function_many_cond+0x2be/0x520
[ 21.196151] ? __pfx_do_sync_core+0x10/0x10
[ 21.196153] on_each_cpu_cond_mask+0x1c/0x40
[ 21.196155] text_poke_bp_batch+0xb3/0x2a0
[ 21.196156] text_poke_finish+0x1a/0x30
[ 21.196157] arch_jump_label_transform_apply+0x15/0x30
[ 21.196159] static_key_enable_cpuslocked+0x48/0x80
[ 21.196161] static_key_enable+0x15/0x20
[ 21.196163] _cpu_up+0x1f7/0x280
[ 21.196165] cpu_up+0x60/0xa0
[ 21.196166] cpuhp_bringup_mask+0x49/0xc0
[ 21.196169] cpuhp_bringup_cpus_parallel+0xba/0xd0
[ 21.196171] bringup_nonboot_cpus+0xc/0x30
[ 21.196172] smp_init+0x25/0x80
[ 21.196174] kernel_init_freeable+0xd3/0x150
[ 21.196177] ? __pfx_kernel_init+0x10/0x10
[ 21.196179] kernel_init+0x15/0x190
[ 21.196180] ret_from_fork+0x35/0x40
[ 21.196182] ? __pfx_kernel_init+0x10/0x10
[ 21.196183] ret_from_fork_asm+0x1b/0x30
[ 21.196186] </TASK>
[ 84.297444] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 84.298443] rcu: 2-....: (82371 ticks this GP) idle=1c5c/0/0x1 softirq=1/1 fqs=0
[ 84.298443] rcu: (t=84155 jiffies g=-1187 q=1 ncpus=8)
[ 84.298443] rcu: rcu_preempt kthread starved for 84156 jiffies! g-1187 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[ 84.298443] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 84.298443] rcu: RCU grace-period kthread stack dump:
[ 84.298443] task:rcu_preempt state:R running task stack:15360 pid:14 ppid:2 flags:0x00004000
[ 84.298443] Call Trace:
[ 84.298443] <TASK>
[ 84.298443] __schedule+0x618/0x8a0
[ 84.298443] schedule+0x51/0x90
[ 84.298443] schedule_timeout+0xb5/0x170
[ 84.298443] ? __pfx_process_timeout+0x10/0x10
[ 84.298443] rcu_gp_fqs_loop+0x1a7/0x6b0
[ 84.298443] ? __note_gp_changes+0x39/0x210
[ 84.298443] rcu_gp_kthread+0x1c/0x1e0
[ 84.298443] ? __pfx_rcu_gp_kthread+0x10/0x10
[ 84.298443] kthread+0xe6/0x100
[ 84.298443] ? __pfx_kthread+0x10/0x10
[ 84.298443] ret_from_fork+0x35/0x40
[ 84.298443] ? __pfx_kthread+0x10/0x10
[ 84.298443] ret_from_fork_asm+0x1b/0x30
[ 84.298443] </TASK>
[ 84.298443] rcu: Stack dump where RCU GP kthread last ran:
[ 84.298443] Sending NMI from CPU 2 to CPUs 0:
[ 84.321804] NMI backtrace for cpu 0
[ 84.321804] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.5.0-rc5 #1
[ 84.321804] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.2-1-1 04/01/2014
[ 84.321804] RIP: 0010:default_send_IPI_allbutself+0x23/0x50
[ 84.321804] Code: 90 90 90 90 90 90 90 f3 0f 1e fa 83 ff 02 74 2f f7 04 25 00 c3 5f ff 00 10 00 00 74 0f f3 90 f7 04 25 00 c3 5f ff 00 10 00 00 <75> f1 81 cf 00 00 0c 00 89 3c 25 00 c3 5f ff 2e e9 68 5b ef 00 48
[ 84.321804] RSP: 0018:ffffb268c0013cb0 EFLAGS: 00000286
[ 84.321804] RAX: ffffffff9e993b50 RBX: ffffa2061f02bf90 RCX: 00000000000000ff
[ 84.321804] RDX: 0000000000000000 RSI: ffffa2061f1efda0 RDI: 00000000000000fc
[ 84.321804] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
[ 84.321804] R10: 0000000000000000 R11: ffffffff9d2652a0 R12: 0000000000000000
[ 84.321804] R13: ffffa2061f02bf80 R14: ffffa2061f1efda0 R15: 0000000000000007
[ 84.321804] FS: 0000000000000000(0000) GS:ffffa2061f000000(0000) knlGS:0000000000000000
[ 84.321804] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 84.321804] CR2: ffffa20607a01000 CR3: 0000000006e2a000 CR4: 0000000000350ef0
[ 84.321804] Call Trace:
[ 84.321804] <NMI>
[ 84.321804] ? nmi_cpu_backtrace+0x105/0x130
[ 84.321804] ? nmi_cpu_backtrace_handler+0xc/0x20
[ 84.321804] ? nmi_handle+0x66/0x150
[ 84.321804] ? default_send_IPI_allbutself+0x23/0x50
[ 84.321804] ? default_send_IPI_allbutself+0x23/0x50
[ 84.321804] ? default_do_nmi+0x41/0x100
[ 84.321804] ? exc_nmi+0xbb/0x130
[ 84.321804] ? end_repeat_nmi+0x16/0x67
[ 84.321804] ? __pfx_default_send_IPI_allbutself+0x10/0x10
[ 84.321804] ? default_send_IPI_allbutself+0x23/0x50
[ 84.321804] ? default_send_IPI_allbutself+0x23/0x50
[ 84.321804] ? default_send_IPI_allbutself+0x23/0x50
[ 84.321804] </NMI>
[ 84.321804] <TASK>
[ 84.321804] kvm_smp_send_call_func_ipi+0x10/0x60
[ 84.321804] smp_call_function_many_cond+0x2be/0x520
[ 84.321859] ? __pfx_do_sync_core+0x10/0x10
[ 84.321859] on_each_cpu_cond_mask+0x1c/0x40
[ 84.321859] text_poke_bp_batch+0xb3/0x2a0
[ 84.321863] text_poke_finish+0x1a/0x30
[ 84.321863] arch_jump_label_transform_apply+0x15/0x30
[ 84.321863] static_key_enable_cpuslocked+0x48/0x80
[ 84.321863] static_key_enable+0x15/0x20
[ 84.321863] _cpu_up+0x1f7/0x280
[ 84.321863] cpu_up+0x60/0xa0
[ 84.321863] cpuhp_bringup_mask+0x49/0xc0
[ 84.321863] cpuhp_bringup_cpus_parallel+0xba/0xd0
[ 84.321863] bringup_nonboot_cpus+0xc/0x30
[ 84.321863] smp_init+0x25/0x80
[ 84.321863] kernel_init_freeable+0xd3/0x150
[ 84.321863] ? __pfx_kernel_init+0x10/0x10
[ 84.321863] kernel_init+0x15/0x190
[ 84.321863] ret_from_fork+0x35/0x40
[ 84.321863] ? __pfx_kernel_init+0x10/0x10
[ 84.321863] ret_from_fork_asm+0x1b/0x30
[ 84.321863] </TASK>
[ 84.298443] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.5.0-rc5 #1
[ 84.298443] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.2-1-1 04/01/2014
[ 84.298443] RIP: 0010:default_idle+0x13/0x20
[ 84.298443] Code: 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d f9 81 2b 00 f3 0f 1e fa fb f4 <fa> 2e e9 c6 b5 00 00 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90
[ 84.298443] RSP: 0018:ffffb268c0093ee8 EFLAGS: 00000206
[ 84.298443] RAX: ffffa2061f0a7e28 RBX: 0000000000000002 RCX: 0000000000141c54
[ 84.298443] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000141c54
[ 84.298443] RBP: ffffb268c0093ef8 R08: ffffffffffff98d0 R09: 000000000000db9d
[ 84.298443] R10: 000000000002271c R11: ffffffff9d34d300 R12: 0000000000000000
[ 84.298443] R13: ffffa206011f8000 R14: 0000000000000000 R15: 0000000000000000
[ 84.298443] FS: 0000000000000000(0000) GS:ffffa2061f080000(0000) knlGS:0000000000000000
[ 84.298443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 84.298443] CR2: 0000000000000000 CR3: 0000000006e2a000 CR4: 0000000000350ee0
[ 84.298443] Call Trace:
[ 84.298443] <IRQ>
[ 84.298443] ? rcu_dump_cpu_stacks+0xd9/0x130
[ 84.298443] ? rcu_sched_clock_irq+0x52e/0xf40
[ 84.298443] ? __pfx_jiffies_read+0x10/0x10
[ 84.298443] ? update_process_times+0x5a/0x80
[ 84.298443] ? tick_periodic+0x60/0x70
[ 84.298443] ? tick_handle_periodic+0x1d/0x90
[ 84.298443] ? __sysvec_apic_timer_interrupt+0x5b/0x190
[ 84.298443] ? sysvec_apic_timer_interrupt+0x67/0x80
[ 84.298443] </IRQ>
[ 84.298443] <TASK>
[ 84.298443] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 84.298443] ? __pfx_jiffies_read+0x10/0x10
[ 84.298443] ? default_idle+0x13/0x20
[ 84.298443] default_idle_call+0x35/0x60
[ 84.298443] do_idle+0xce/0x240
[ 84.298443] cpu_startup_entry+0x18/0x20
[ 84.298443] start_secondary+0x97/0xa0
[ 84.298443] secondary_startup_64_no_verify+0x179/0x17b
[ 84.298443] </TASK>
[ 84.298443] Sending NMI from CPU 2 to CPUs 3:
[ 84.407546] NMI backtrace for cpu 3 skipped: idling at default_idle+0x13/0x20
...
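
For anyone triaging a dump like the one above, here is a minimal Python sketch (not part of the original report) that pulls out, per CPU, the task name and the RIP symbol so it is easier to see which CPUs were merely idling. The function name `summarize` and the "RIP starts with default_idle" heuristic are assumptions made for illustration; the regular expressions only cover the line shapes visible in this log.

```python
import re

# "CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.5.0-rc5 #1"
CPU_RE = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (\S+)")
# "RIP: 0010:default_idle+0x13/0x20"
RIP_RE = re.compile(r"RIP: [0-9a-f]{4}:(\S+)")
# "NMI backtrace for cpu 3 skipped: idling at default_idle+0x13/0x20"
IDLE_RE = re.compile(r"NMI backtrace for cpu (\d+) skipped: idling at (\S+)")


def summarize(log_text):
    """Map cpu number -> {"comm", "rip", "idle"} for a dmesg excerpt.

    Assumption: a CPU counts as idle if its backtrace was skipped as
    "idling" or if its RIP symbol starts with "default_idle".
    """
    cpus = {}
    current = None  # CPU whose register dump is currently being read
    for line in log_text.splitlines():
        m = CPU_RE.search(line)
        if m:
            current = int(m.group(1))
            cpus[current] = {"comm": m.group(3), "rip": None, "idle": False}
            continue
        m = RIP_RE.search(line)
        if m and current is not None and cpus[current]["rip"] is None:
            rip = m.group(1)
            cpus[current]["rip"] = rip
            cpus[current]["idle"] = rip.startswith("default_idle")
            continue
        m = IDLE_RE.search(line)
        if m:
            cpus[int(m.group(1))] = {
                "comm": None,  # the kernel prints no task line for skipped CPUs
                "rip": m.group(2),
                "idle": True,
            }
    return cpus
```

Feeding it the saved console output (for example `summarize(open("guest.log").read())`, where `guest.log` is a hypothetical file name) gives one entry per CPU; any CPU that is not marked idle is the one worth reading the full backtrace for.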

> Also, can you send me the host and guest .configs and the compilers
> you've used so that I can try to reproduce here exactly what you have?

Sure thing!

Host compiler: https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/13.2.0/x86_64-gcc-13.2.0-nolibc-x86_64-linux.tar.xz
Host config: https://gist.github.com/nathanchance/e3b03f955e718fd802229ef04f3a87da/raw/46d1ec9f37506f87f40dc32729019d841ec921c0/srso-host.config

Guest compiler: https://mirrors.edge.kernel.org/pub/tools/llvm/files/llvm-16.0.6-x86_64.tar.xz
Guest config: https://gist.github.com/nathanchance/e3b03f955e718fd802229ef04f3a87da/raw/46d1ec9f37506f87f40dc32729019d841ec921c0/srso-guest.config

Cheers,
Nathan

2023-08-11 16:36:52

by Borislav Petkov

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Fri, Aug 11, 2023 at 09:03:18AM -0700, Sean Christopherson wrote:
> Might be the flags bug that borks KVM's fastop() emulation. If that fixes things,
> my guess is that bringing APs out of WFS somehow triggers emulation.

I was just about to connect you two guys, thanks Sean!

> https://lore.kernel.org/all/[email protected]

Nathan, if you could test, that would be cool.

Also, Nick has another patch for -mno-shared, it probably isn't fixing
yours but it would be good to test it too, just in case:

https://github.com/ClangBuiltLinux/linux/issues/1911#issuecomment-1674993796

Thx!

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-08-11 16:48:48

by Sean Christopherson

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Fri, Aug 11, 2023, Nathan Chancellor wrote:
> On Fri, Aug 11, 2023 at 12:14:56PM +0200, Borislav Petkov wrote:
> > On Thu, Aug 10, 2023 at 09:14:14AM -0700, Nathan Chancellor wrote:
> > > Not sure how helpful that will be...
> >
> > Yeah, not really. More wild guesses: if you uncomment the UNTRAIN_RET in
> > __svm_vcpu_run() on the host, does that have any effect? Diff below.
>
> Unfortunately, that seems to make no difference...
>
> I did have to switch to the Ryzen 3 box for testing, as I am not at home
> for a couple of days and I did not want to lose access to my workstation
> if I took a bad update since it has no remote management capabilities.
> Something I noticed in doing so is that the VM boot on that machine
> appears to get farther along than on my Threadripper 3990X, but I still
> see a hang with a stack trace similar to the one that I reported in the
> initial post with '-smp 2', so I think it is the same problem but
> perhaps the more cores the VM has, the more likely it is to appear
> totally hung? Might be a red herring but I figured I would mention it in
> case it is relevant.

Might be the flags bug that borks KVM's fastop() emulation. If that fixes things,
my guess is that bringing APs out of WFS somehow triggers emulation.

https://lore.kernel.org/all/[email protected]

2023-08-12 02:30:19

by Nick Desaulniers

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Fri, Aug 11, 2023 at 9:12 AM Borislav Petkov <[email protected]> wrote:
>
> On Fri, Aug 11, 2023 at 09:03:18AM -0700, Sean Christopherson wrote:
> > Might be the flags bug that borks KVM's fastop() emulation. If that fixes things,
> > my guess is that bringing APs out of WFS somehow triggers emulation.
>
> I was just about to connect you two guys, thanks Sean!
>
> > https://lore.kernel.org/all/[email protected]
>
> Nathan, if you could test, that would be cool.

Nathan confirmed on IRC (since Sean isn't there; Sean what are you
doing, you know we have corp IRCCloud accounts, yeah?):


<nathanchance>
bpetkov: Seems like Sean bailed you out :P his patch appears to fix
the issue on the Ryzen 3 box, about to test the Threadripper
bpetkov: Will have results shortly assuming the machine boots :^) is
it possible that clang is generating a code sequence that triggers
this issue that gcc does not?

<bpetkov>
yeah, something about rFLAGS gets clobbered in the clang variant while
gcc doesn't
dunno if this is a more serious code generation issue

I'm not familiar enough with the relevant code to make a call there,
but perhaps Sean has more context and can help us deduce if that's the
case?

>
> Also, Nick has another patch for -mno-shared, it probably isn't fixing
> yours but it would be good to test it too, just in case:
>
> https://github.com/ClangBuiltLinux/linux/issues/1911#issuecomment-1674993796

So far it looks like it's working for folks. Fixing that issue is the
lowest priority issue of the three we found; I'll send it formally
next week.

I literally just had my hard drive fail on my main dev box (that's two
machines fail in one week; laptop won't power on anymore; down to one
machine left). Going to see if fsck can help at all; worst case I may
need Nathan to formally send it for me next week. Let's see if I can
recover this machine first...what a way to end the week. SMH

>
> Thx!
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette



--
Thanks,
~Nick Desaulniers

2023-08-12 15:13:51

by Borislav Petkov

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Fri, Aug 11, 2023 at 05:42:04PM -0700, Nick Desaulniers wrote:
> So far it looks like it's working for folks. Fixing that issue is the
> lowest priority issue of the three we found; I'll send it formally
> next week.

Ok. Also, it might not be needed, as some of PeterZ's stuff simplifies
that code further, so I'm thinking of taking it, which would make your
fix unnecessary. But we'll talk.

> I literally just had my hard drive fail on my main dev box (that's two
> machines fail in one week; laptop won't power on anymore; down to one
> machine left).

Sounds like Murphy came to visit. I hate that.

> Going to see if fsck can help at all; worst case I may
> need Nathan to formally send it for me next week. Let's see if I can
> recover this machine first...what a way to end the week. SMH

Yeah, I can scrape it off some GitHub issue page too - that's not
a problem.

Btw, this:

https://github.com/ClangBuiltLinux/linux/commit/150c42407f87463c27a2459e06845965291d9973

Is this fixing a current issue and so it needs to go to Linus now?

If so, I'll expedite it too.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-08-12 15:15:02

by Nathan Chancellor

Subject: Re: Hang when booting guest kernels compiled with clang after SRSO mitigations

On Sat, Aug 12, 2023 at 03:46:42PM +0200, Borislav Petkov wrote:
> Btw, this:
>
> https://github.com/ClangBuiltLinux/linux/commit/150c42407f87463c27a2459e06845965291d9973
>
> Is this fixing a current issue and so it needs to go to Linus now?
>
> If so, I'll expedite it too.

Yes, that fixes an error at link time when building with LTO:

https://github.com/ClangBuiltLinux/linux/issues/1909

That is commit 973ab2d61f33 ("x86/retpoline,kprobes: Fix position of
thunk sections with CONFIG_LTO_CLANG") in x86/core, with the conflicts
against SRSO resolved and the fix that Andrew Cooper pointed out
squashed in, in case my comment in the commit message was not clear
enough :)

https://lore.kernel.org/lkml/[email protected]/

With that change and the other ld.lld change you already picked up in
x86/bugs, our builds should go back to green, then we can decide what to
do about that runtime warning based on Peter's series and Nick's ability
to get back up and running (I have his patch applied locally somewhere
so I don't mind collecting tags and sending it next week).

Thanks again for helping us with this, I know it has been chaotic.

Cheers,
Nathan