2024-05-30 06:58:44

by Jörn Heusipp

Subject: [REGRESSION] commit fbf6449f84bf5e4ad09f2c09ee70ed7d629b5ff6 (Linux 6.7+) crashes during boot


Hello x86 maintainers!


commit fbf6449f84bf5e4ad09f2c09ee70ed7d629b5ff6 ("x86/sev-es: Set
x86_virt_bits to the correct value straight away, instead of a two-phase
approach") crashes during boot for me on this 32bit x86 system.

Updating a Debian testing system resulted in a hang during boot before
printing anything, with any 6.7 or later kernel. With 'earlyprintk=vga',
I managed to capture the crash on video and stitched it together as an
image [1].
Trimmed transcription (might contain typos) of the crash from Debian
kernel 6.7.12-1:
===
BUG: kernel NULL pointer dereference, address: 00000010
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
Oops: 0002 [#1] PREEMPT SMP NOPTI
[...]
EIP: __ring_buffer_alloc+0x32/0x194
[...]
show_regs
__die
page_fault_oops
kernelmode_fixup_or_oops.constprop
__bad_area_nosemaphore.constprop
bad_area_nosemaphore
do_user_addr_fault
prb_read_valid
exc_page_fault
pvclock_clocksource_read_nowd
handle_exception
pvclock_clocksource_read_nowd
__ring_buffer_alloc
pvclock_clocksource_read_nowd
__ring_buffer_alloc
early_trace_init
start_kernel
i386_start_kernel
startup_32_smp
[...]
===
I could transcribe all of it, or capture it again from latest git and
decode the symbols, if really needed, but I figured the type of crash
and the trace itself might be sufficient. The crash looks identical to
me on all later kernel versions that I tried.
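
For readers decoding the oops header quoted above: the value in
"error_code(0x0002)" is the architectural x86 page-fault error code, and
its low bits account for the "supervisor write access" and "not-present
page" wording. The following is only a minimal sketch for illustration;
the constant and function names are made up for this example and are not
taken from the kernel sources.

```python
# Architectural x86 page-fault error code bits (low five bits only).
PF_PROT = 1 << 0   # 0: page not present, 1: protection violation
PF_WRITE = 1 << 1  # 0: read access, 1: write access
PF_USER = 1 << 2   # 0: access made in supervisor mode, 1: user mode
PF_RSVD = 1 << 3   # 1: reserved bit set in a paging-structure entry
PF_INSTR = 1 << 4  # 1: fault was caused by an instruction fetch


def decode_pf_error_code(code: int) -> dict:
    """Return a human-readable breakdown of an x86 #PF error code.

    Hypothetical helper for illustration; names are assumptions.
    """
    return {
        "cause": "protection violation" if code & PF_PROT else "not-present page",
        "access": "write" if code & PF_WRITE else "read",
        "mode": "user" if code & PF_USER else "supervisor",
        "reserved_bit": bool(code & PF_RSVD),
        "instruction_fetch": bool(code & PF_INSTR),
    }


if __name__ == "__main__":
    # Error code from the transcribed oops.
    for key, value in decode_pf_error_code(0x0002).items():
        print(f"{key}: {value}")
```

Only bit 1 is set in 0x0002, so the fault was a write to a not-present
page made in supervisor mode, which fits a NULL pointer dereference at a
small offset (address 00000010).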

I bisected this down to commit fbf6449f84bf5e4ad09f2c09ee70ed7d629b5ff6.

The kernel config [2] I used is 'make olddefconfig' based on Debian's
config-6.8.11-686-pae [3].

I also tested 6.9.2 and 6.10-rc1; both still crash in the same way.

cpuinfo:
===
manx@caesar:~$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 8
model name      : AMD Duron(tm)
stepping        : 1
cpu MHz         : 1798.331
cache size      : 64 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fdiv_bug        : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow cpuid
3dnowprefetch vmmcall
bugs            : fxsave_leak sysret_ss_attrs spectre_v1 spectre_v2
spec_store_bypass
bogomips        : 3596.66
clflush size    : 32
cache_alignment : 32
address sizes   : 34 bits physical, 32 bits virtual
power management: ts
===

dmesg from a successful boot (Debian kernel 6.6.15-2) is here [4].

This particular system has been running all Debian testing kernels since
at least the 2.6.32 days and is currently running 6.6.15-2 completely
fine, thus this is an obvious regression.

The original Debian bug is #1071378 [5].


#regzbot introduced: fbf6449f84bf5e4ad09f2c09ee70ed7d629b5ff6

[1] https://manx.datengang.de/temp/linux-6.7-crash/6.7.12-1-crash.png
[2] https://manx.datengang.de/temp/linux-6.7-crash/config
[3] https://manx.datengang.de/temp/linux-6.7-crash/config-6.8.11-686-pae
[4] https://manx.datengang.de/temp/linux-6.7-crash/dmesg-6.6.15-2.txt
[5] https://bugs.debian.org/1071378


Best regards,
Jörn


2024-05-30 07:28:57

by Thorsten Leemhuis

Subject: Re: [REGRESSION] commit fbf6449f84bf5e4ad09f2c09ee70ed7d629b5ff6 (Linux 6.7+) crashes during boot

On 30.05.24 08:55, Jörn Heusipp wrote:
>
> Hello x86 maintainers!
>
> commit fbf6449f84bf5e4ad09f2c09ee70ed7d629b5ff6 ("x86/sev-es: Set
> x86_virt_bits to the correct value straight away, instead of a two-phase
> approach") crashes during boot for me on this 32bit x86 system.

FWIW, not my area of expertise, but there is a patch from Dave with a
Fixes: tag for your culprit up for review:
https://lore.kernel.org/all/[email protected]/

Ciao, Thorsten

> Updating a Debian testing system resulted in a hang during boot before
> printing anything, with any 6.7 or later kernel. With 'earlyprintk=vga',
> I managed to capture the crash on video and stitched it together as an
> image [1].
> Trimmed transcription (might contain typos) of the crash from Debian
> kernel 6.7.12-1:
> ===
> BUG: kernel NULL pointer dereference, address: 00000010
> #PF: supervisor write access in kernel mode
> #PF: error_code(0x0002) - not-present page
> Oops: 0002 [#1] PREEMPT SMP NOPTI
> [...]
> EIP: __ring_buffer_alloc+0x32/0x194
> [...]
> show_regs
> __die
> page_fault_oops
> kernelmode_fixup_or_oops.constprop
> __bad_area_nosemaphore.constprop
> bad_area_nosemaphore
> do_user_addr_fault
> prb_read_valid
> exc_page_fault
> pvclock_clocksource_read_nowd
> handle_exception
> pvclock_clocksource_read_nowd
> __ring_buffer_alloc
> pvclock_clocksource_read_nowd
> __ring_buffer_alloc
> early_trace_init
> start_kernel
> i386_start_kernel
> startup_32_smp
> [...]
> ===
> I could transcribe all of it, or capture it again from latest git and
> decode the symbols, if really needed, but I figured the type of crash
> and the trace itself might be sufficient. The crash looks identical to
> me on all later kernel versions that I tried.
>
> I bisected this down to commit fbf6449f84bf5e4ad09f2c09ee70ed7d629b5ff6.
>
> The kernel config [2] I used is 'make olddefconfig' based on Debian's
> config-6.8.11-686-pae [3].
>
> I also tested 6.9.2 and 6.10-rc1; both still crash in the same way.
>
> cpuinfo:
> ===
> manx@caesar:~$ cat /proc/cpuinfo
> processor       : 0
> vendor_id       : AuthenticAMD
> cpu family      : 6
> model           : 8
> model name      : AMD Duron(tm)
> stepping        : 1
> cpu MHz         : 1798.331
> cache size      : 64 KB
> physical id     : 0
> siblings        : 1
> core id         : 0
> cpu cores       : 1
> apicid          : 0
> initial apicid  : 0
> fdiv_bug        : no
> f00f_bug        : no
> coma_bug        : no
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 1
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow cpuid
> 3dnowprefetch vmmcall
> bugs            : fxsave_leak sysret_ss_attrs spectre_v1 spectre_v2
> spec_store_bypass
> bogomips        : 3596.66
> clflush size    : 32
> cache_alignment : 32
> address sizes   : 34 bits physical, 32 bits virtual
> power management: ts
> ===
>
> dmesg from a successful boot (Debian kernel 6.6.15-2) is here [4].
>
> This particular system has been running all Debian testing kernels since
> at least the 2.6.32 days and is currently running 6.6.15-2 completely
> fine, thus this is an obvious regression.
>
> The original Debian bug is #1071378 [5].
>
>
> #regzbot introduced: fbf6449f84bf5e4ad09f2c09ee70ed7d629b5ff6
>
> [1] https://manx.datengang.de/temp/linux-6.7-crash/6.7.12-1-crash.png
> [2] https://manx.datengang.de/temp/linux-6.7-crash/config
> [3] https://manx.datengang.de/temp/linux-6.7-crash/config-6.8.11-686-pae
> [4] https://manx.datengang.de/temp/linux-6.7-crash/dmesg-6.6.15-2.txt
> [5] https://bugs.debian.org/1071378
>
>
> Best regards,
> Jörn
>
>

2024-05-30 08:55:14

by Jörn Heusipp

Subject: Re: [REGRESSION] commit fbf6449f84bf5e4ad09f2c09ee70ed7d629b5ff6 (Linux 6.7+) crashes during boot


Hello!

On 30/05/2024 09:27, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 30.05.24 08:55, Jörn Heusipp wrote:

>> commit fbf6449f84bf5e4ad09f2c09ee70ed7d629b5ff6 ("x86/sev-es: Set
>> x86_virt_bits to the correct value straight away, instead of a two-phase
>> approach") crashes during boot for me on this 32bit x86 system.
>
> FWIW, not my area of expertise, but there is a patch from Dave with a
> Fixes: tag for your culprit up for review:
> https://lore.kernel.org/all/[email protected]/

That did not apply cleanly to 6.10-rc1, but I resolved the conflicts
manually. I can confirm that it fixes the issue.

Best regards,
Jörn

2024-05-30 09:14:14

by Thorsten Leemhuis

Subject: Re: [REGRESSION] commit fbf6449f84bf5e4ad09f2c09ee70ed7d629b5ff6 (Linux 6.7+) crashes during boot

On 30.05.24 10:45, Jörn Heusipp wrote:
>
> On 30/05/2024 09:27, Linux regression tracking (Thorsten Leemhuis) wrote:
>> On 30.05.24 08:55, Jörn Heusipp wrote:
>>> commit fbf6449f84bf5e4ad09f2c09ee70ed7d629b5ff6 ("x86/sev-es: Set
>>> x86_virt_bits to the correct value straight away, instead of a two-phase
>>> approach") crashes during boot for me on this 32bit x86 system.
>>
>> FWIW, not my area of expertise, but there is a patch from Dave with a
>> Fixes: tag for your culprit up for review:
>> https://lore.kernel.org/all/[email protected]/
>
> That did not apply cleanly to 6.10-rc1,

Maybe something changed since then.

> but I figured it out manually. I
> can confirm that it fixes the issue.

Cool. In that case, I guess Dave might be happy about a "Tested-by" tag
from you:
https://www.kernel.org/doc/html/latest/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot dup:
https://lore.kernel.org/all/[email protected]/
#regzbot fix: x86/cpu: Provide default cache line size if not enumerated
#regzbot related:
https://lore.kernel.org/all/[email protected]/