2023-09-05 16:12:32

by Christian Lamparter

Subject: Missing L3 linesize on AMD Ryzen 7940HS chip causes crash in amd_cpuid4.

Greetings,

a PhD student complained that his VirtualBox-hosted Linux VM wouldn't start
on his brand new laptop... And I took the bait. (Note: I helped him install
Linux Mint 21.1 on the same laptop and it worked without issues.) But as
soon as he tried porting his VirtualBox VMs, he got the following panic during
early boot:

---

| divide error: 0000 [#1] PREEMPT SMP NOPTI
| CPU: 0 PID: 19 Comm: cpuhp/0 Not tainted 5.19.0-46-generic #47~22.04.1-Ubuntu
| Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
| RIP: 0010:amd_cpuid4+0x195/0x2f0
| Code: c1 e0 0a 81 e3 ff 03 00 00 81 e2 ff 0f 00 00 48 8b 7d b0 c1 e3 0c 09 d3 89 f2 81 e6 ff 03 00 00 c1 e2 16 83 c6 01 09 d3 31 d2 <f7> f1 41 89 1f 31 d2 f7 f6 83 e8 01 89 07 48 8b 45 d0 65 48 2b 04
| RSP: 0018:ffffbb78800a3ce8 EFLAGS: 00010246
| RAX: 0000000000000000 RBX: 00000000ffffffff RCX: 0000000000000000
| RDX: 0000000000000000 RSI: 0000000000000400 RDI: ffffbb78800a3d60
| RBP: ffffbb78800a3d48 R08: 0000000000000000 R09: 0000000000000000
| R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
| R13: ffffbb78800a3d08 R14: ffffbb78800a3d58 R15: ffffbb78800a3d5c
| FS: 0000000000000000(0000) GS:ffffa05759a00000(0000) knlGS:0000000000000000
| CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
| CR2: 00007f2946fc1e24 CR3: 0000000108010000 CR4: 00000000000506f0
| Call Trace:
|  <TASK>
| cpuid4_cache_lookup_regs+0x14d/0x160
| populate_cache_leaves+0x180/0x200
| cacheinfo_cpu_online+0xc1/0x1c0
| cache_add_dev+0x420/0x420
| [...]

---

Looking at the amd_cpuid4() function in arch/x86/kernel/cpu/cacheinfo.c, I see that it reads the CPUID values for 0x80000005 and 0x80000006. I convinced the student to run "cpuid -1 -r" and received these:

| 0x80000005 0x00: eax=0xff48ff40 ebx=0xff48ff40 ecx=0x20080140 edx=0x20080140
| 0x80000006 0x00: eax=0x5c002200 ebx=0x6c004200 ecx=0x04006140 >> edx=0x00009000 <<

(The 0x80000006 edx=0x00009000 is the important bit)

Plugging these values into the amd_cpuid4() function causes a division by zero
in line 297 (v5.19.10):
| ecx->split.number_of_sets = (size_in_kb * 1024) / line_size /
|                             (ebx->split.ways_of_associativity + 1) - 1;

This is because the L3 cache's line_size is "0" (it comes from the low byte of
the 0x80000006 edx value of 0x00009000).
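
To make the arithmetic concrete, here is a minimal userspace sketch (not the
kernel code) that splits the 0x80000006 edx value into its fields using the
bit layout from AMD's CPUID documentation, and refuses to divide when the
line size is zero. The struct, the function names and the "ways" parameter
are hypothetical and only serve the illustration.

```c
#include <stdint.h>

/* Fields of CPUID 0x80000006 EDX (L3 cache), per AMD's documented layout. */
struct l3_info {
	unsigned int line_size;     /* bits  7:0,  in bytes            */
	unsigned int lines_per_tag; /* bits 11:8                       */
	unsigned int assoc;         /* bits 15:12, encoded value       */
	unsigned int size_encoded;  /* bits 31:18, in units of 512 KB  */
};

/* Hypothetical helper: split a raw EDX value into its fields. */
static struct l3_info decode_l3_edx(uint32_t edx)
{
	struct l3_info l3;

	l3.line_size     = edx & 0xff;
	l3.lines_per_tag = (edx >> 8) & 0xf;
	l3.assoc         = (edx >> 12) & 0xf;
	l3.size_encoded  = (edx >> 18) & 0x3fff;
	return l3;
}

/*
 * Hypothetical guarded version of the set calculation. "ways" is passed
 * in by the caller (an assumption for this sketch). Returns 0 and stores
 * the number of sets on success, or -1 if the division would fault.
 */
static int l3_number_of_sets(uint32_t edx, unsigned int ways, uint64_t *sets)
{
	struct l3_info l3 = decode_l3_edx(edx);
	uint64_t size_bytes = (uint64_t)l3.size_encoded * 512 * 1024;

	if (l3.line_size == 0 || ways == 0)
		return -1; /* this is the case that crashes amd_cpuid4() */

	*sets = size_bytes / l3.line_size / ways;
	return 0;
}
```

With edx=0x00009000 the decoder yields line_size 0 and associativity 0x9, so
the guarded helper bails out instead of dividing. With edx=0x02009140 it
yields a 64-byte line size and a size field of 128 (i.e. 64 MB).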

This can't be right... or can it? Well, digging around, I found the following
explanation in AMD's community forum:
<https://community.amd.com/t5/processors/ryzen-7-3700x-cpuid-function-id-0x80000006-returns-wrong-number/td-p/376937>
So the issue with "wonky L3 values" already shows up on the older
AMD 3700X. In this forum post, the author notes that the
"L3 cache associativity (bits 12-15) is 0x9".

And the same is happening with both AMD 7950X and 7940HS.
The kicker is: this value of "9" means:
"Please look at CPUID.8000_001D".

Which I think boils down to implementing X86_FEATURE_TOPOEXT
for virtualbox to get over this issue?
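
For reference, a rough sketch of what decoding one CPUID.8000_001D subleaf
looks like, assuming the register layout from AMD's CPUID documentation (it
mirrors Intel's leaf 4, where each field stores "value minus one"). The
struct and function names are hypothetical, and the caller is assumed to
supply the raw register values.

```c
#include <stdint.h>

/* Decoded view of one CPUID 0x8000001D subleaf (cache properties). */
struct cache_leaf {
	unsigned int type;       /* EAX bits  4:0 (0 = no more caches)    */
	unsigned int level;      /* EAX bits  7:5                         */
	unsigned int line_size;  /* EBX bits 11:0,  stored as value - 1   */
	unsigned int partitions; /* EBX bits 21:12, stored as value - 1   */
	unsigned int ways;       /* EBX bits 31:22, stored as value - 1   */
	unsigned int sets;       /* ECX,            stored as value - 1   */
};

/* Hypothetical helper: undo the "minus one" encoding of each field. */
static struct cache_leaf decode_cache_leaf(uint32_t eax, uint32_t ebx,
					   uint32_t ecx)
{
	struct cache_leaf c;

	c.type       = eax & 0x1f;
	c.level      = (eax >> 5) & 0x7;
	c.line_size  = (ebx & 0xfff) + 1;
	c.partitions = ((ebx >> 12) & 0x3ff) + 1;
	c.ways       = ((ebx >> 22) & 0x3ff) + 1;
	c.sets       = ecx + 1;
	return c;
}

/* Total cache size in bytes: ways * partitions * line size * sets. */
static uint64_t cache_leaf_size_bytes(const struct cache_leaf *c)
{
	return (uint64_t)c->ways * c->partitions * c->line_size * c->sets;
}
```

Because every field is stored minus one, the decoded line size can never be
zero here, which is why this leaf avoids the division problem above.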

Now, is there something I'm missing? I don't know if qemu is affected,
or if there's another way around it.

Regards,
Christian Lamparter

Note:

For now, the problem can be mitigated in VirtualBox by running:
| vboxmanage setextradata $VM VBoxInternal/CPUM/HostCPUID/80000006/edx 0x02009140

(The value 0x02009140 comes from an AMD Ryzen 7950X. While also "wonky", it allowed the VirtualBox VMs to boot.)