2024-03-22 17:52:17

by Laura Nao

[permalink] [raw]
Subject: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

Hello,

KernelCI has identified a mainline boot regression [1] on the following
AMD Stoney Ridge Chromebooks (grunt family), between v6.8 (e8f897) and
v6.8-1185-g855684c7d938 (855684):
- Acer Chromebook Spin 311 R721T (codename kasumi360)
- HP Chromebook 14 (codename careena)
- HP Chromebook 11A G6 EE (codename barla)

The kernel doesn't boot at all and nothing is reported on the serial
console after "Starting kernel ...". The issue is still present on the
latest mainline revision.
The defconfig used by KernelCI for the boot tests can be found in [2].

Sending this report in order to track the regression while a fix is
identified.

Thanks,

Laura

[1] https://linux.kernelci.org/test/case/id/65fca98e3883a392524c4380/
[2]
https://storage.kernelci.org/mainline/master/v6.8-11837-g2ac2b1665d3fb/x86_64/x86_64_defconfig%2Bx86-board/gcc-10/config/kernel.config

#regzbot introduced: v6.8..v6.8-1185-g855684c7d938



2024-03-28 09:52:36

by Laura Nao

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

Hi Thomas,

I ran a manual bisection to track down the root cause for this
regression and landed on the c749ce commit from this series:
https://lore.kernel.org/all/[email protected]/

Do you have any insight on this issue or any suggestion on how to
effectively debug this?

Thank you!

Laura

#regzbot introduced: c749ce393b


2024-03-28 11:50:02

by Laura Nao

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

> Hi Thomas,
>
> I ran a manual bisection to track down the root cause for this
> regression and landed on the c749ce commit from this series:
> https://lore.kernel.org/all/[email protected]/
>
> Do you have any insight on this issue or any suggestion on how to
> effectively debug this?
>
> Thank you!
>
> Laura

Hi again,

I actually forgot to quote the previous message describing the issue:

> KernelCI has identified a mainline boot regression [1] on the following
> AMD Stoney Ridge Chromebooks (grunt family), between v6.8 (e8f897) and
> v6.8-1185-g855684c7d938 (855684):
> - Acer Chromebook Spin 311 R721T (codename kasumi360)
> - HP Chromebook 14 (codename careena)
> - HP Chromebook 11A G6 EE (codename barla)
>
> The kernel doesn't boot at all and nothing is reported on the serial
> console after "Starting kernel ...". The issue is still present on the
> latest mainline revision.
> The defconfig used by KernelCI for the boot tests can be found in [2].
>
> Sending this report in order to track the regression while a fix is
> identified.
>
> Thanks,
>
> Laura
>
> [1] https://linux.kernelci.org/test/case/id/65fca98e3883a392524c4380/
> [2]
> https://storage.kernelci.org/mainline/master/v6.8-11837-g2ac2b1665d3fb/x86_64/x86_64_defconfig%2Bx86-board/gcc-10/config/kernel.config

sorry about that!

Best,

Laura

2024-04-04 08:37:11

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

[/me added x86 team]

On 28.03.24 12:50, Laura Nao wrote:
>>
>> I ran a manual bisection to track down the root cause for this
>> regression and landed on the c749ce
>> commit from this series:
>> https://lore.kernel.org/all/[email protected]/

FWIW, that commit is c749ce393b8fe9 ("x86/cpu: Use common topology code
for AMD") from tglx that was part of the "x86/cpu: Rework topology
evaluation" series.

>> Do you have any insight on this issue or any suggestion on how to
>> effectively debug this?
>>
>> Thank you!

Hmmm, it looks like this did not make any progress. Thomas, did this
fall through the cracks due to Easter, or is this this on your todo list?

Or was there some progress and I just missed it?

Laura Nao, I assume the problem is still happening?

FWIW, this was the initial problem description:

>>> KernelCI has identified a mainline boot regression [1] on the following
>>> AMD Stoney Ridge Chromebooks (grunt family), between v6.8 (e8f897) and
>>> v6.8-1185-g855684c7d938 (855684):
>>> - Acer Chromebook Spin 311 R721T (codename kasumi360)
>>> - HP Chromebook 14 (codename careena)
>>> - HP Chromebook 11A G6 EE (codename barla)
>>>
>>> The kernel doesn't boot at all and nothing is reported on the serial
>>> console after "Starting kernel ...". The issue is still present on the
>>> latest mainline revision.
>>> The defconfig used by KernelCI for the boot tests can be found in [2].
>>>
>>> Sending this report in order to track the regression while a fix is
>>> identified.
>>>
>>> Thanks,
>>>
>>> Laura
>>>
>>> [1] https://linux.kernelci.org/test/case/id/65fca98e3883a392524c4380/
>>> [2]
>>> https://storage.kernelci.org/mainline/master/v6.8-11837-g2ac2b1665d3fb/x86_64/x86_64_defconfig%2Bx86-board/gcc-10/config/kernel.config


Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

2024-04-04 09:30:21

by Laura Nao

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

Hi Thorsten,

On 4/4/24 10:24, Linux regression tracking (Thorsten Leemhuis) wrote:
> [/me added x86 team]
>
> On 28.03.24 12:50, Laura Nao wrote:
>>>
>>> I ran a manual bisection to track down the root cause for this
>>> regression and landed on the c749ce
>>> commit from this series:
>>> https://lore.kernel.org/all/[email protected]/
>
> FWIW, that commit is c749ce393b8fe9 ("x86/cpu: Use common topology code
> for AMD") from tglx that was part of the "x86/cpu: Rework topology
> evaluation" series.
>
>>> Do you have any insight on this issue or any suggestion on how to
>>> effectively debug this?
>>>
>>> Thank you!
>
> Hmmm, it looks like this did not make any progress. Thomas, did this
> fall through the cracks due to Easter, or is this this on your todo list?
>
> Or was there some progress and I just missed it?
>
> Laura Nao, I assume the problem is still happening?

Yes, I confirm the issue is still happening.

See e.g. KernelCI report for next-20240403:

https://linux.kernelci.org/test/plan/id/660dbeed148e1a8a284c4372/

>
> FWIW, this was the initial problem description:
>
>>>> KernelCI has identified a mainline boot regression [1] on the following
>>>> AMD Stoney Ridge Chromebooks (grunt family), between v6.8 (e8f897) and
>>>> v6.8-1185-g855684c7d938 (855684):
>>>> - Acer Chromebook Spin 311 R721T (codename kasumi360)
>>>> - HP Chromebook 14 (codename careena)
>>>> - HP Chromebook 11A G6 EE (codename barla)
>>>>
>>>> The kernel doesn't boot at all and nothing is reported on the serial
>>>> console after "Starting kernel ...". The issue is still present on the
>>>> latest mainline revision.
>>>> The defconfig used by KernelCI for the boot tests can be found in [2].
>>>>
>>>> Sending this report in order to track the regression while a fix is
>>>> identified.
>>>>
>>>> Thanks,
>>>>
>>>> Laura
>>>>
>>>> [1] https://linux.kernelci.org/test/case/id/65fca98e3883a392524c4380/
>>>> [2]
>>>> https://storage.kernelci.org/mainline/master/v6.8-11837-g2ac2b1665d3fb/x86_64/x86_64_defconfig%2Bx86-board/gcc-10/config/kernel.config
>
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> #regzbot poke


2024-04-04 13:01:48

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

On Thu, Apr 04 2024 at 10:24, Linux regression tracking wrote:
> On 28.03.24 12:50, Laura Nao wrote:
>>>
>>> I ran a manual bisection to track down the root cause for this
>>> regression and landed on the c749ce
>>> commit from this series:
>>> https://lore.kernel.org/all/[email protected]/
>
> FWIW, that commit is c749ce393b8fe9 ("x86/cpu: Use common topology code
> for AMD") from tglx that was part of the "x86/cpu: Rework topology
> evaluation" series.
>
>>> Do you have any insight on this issue or any suggestion on how to
>>> effectively debug this?
>>>
>>> Thank you!
>
> Hmmm, it looks like this did not make any progress. Thomas, did this
> fall through the cracks due to Easter, or is this this on your todo
> list?
>
> Or was there some progress and I just missed it?

No. It's in my easter eggs basket, but I need to eat all the chocolate
eggs first to get to this.

I'll have a look later today.

Thanks,

tglx

2024-04-04 13:36:49

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

Laura!

On Thu, Apr 04 2024 at 15:01, Thomas Gleixner wrote:
> On Thu, Apr 04 2024 at 10:24, Linux regression tracking wrote:
>> On 28.03.24 12:50, Laura Nao wrote:
>>>>
>>>> I ran a manual bisection to track down the root cause for this
>>>> regression and landed on the c749ce
>>>> commit from this series:
>>>> https://lore.kernel.org/all/[email protected]/
>>
>> FWIW, that commit is c749ce393b8fe9 ("x86/cpu: Use common topology code
>> for AMD") from tglx that was part of the "x86/cpu: Rework topology
>> evaluation" series.

Can you please provide a boot dmesg from a 6.8 kernel and the output of
'cpuid' ?

Thanks,

tglx

2024-04-04 15:23:57

by Laura Nao

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

Hi Thomas,

On 4/4/24 15:05, Thomas Gleixner wrote:
> Laura!
>
> On Thu, Apr 04 2024 at 15:01, Thomas Gleixner wrote:
>> On Thu, Apr 04 2024 at 10:24, Linux regression tracking wrote:
>>> On 28.03.24 12:50, Laura Nao wrote:
>>>>>
>>>>> I ran a manual bisection to track down the root cause for this
>>>>> regression and landed on the c749ce
>>>>> commit from this series:
>>>>> https://lore.kernel.org/all/[email protected]/
>>>
>>> FWIW, that commit is c749ce393b8fe9 ("x86/cpu: Use common topology code
>>> for AMD") from tglx that was part of the "x86/cpu: Rework topology
>>> evaluation" series.
>
> Can you please provide a boot dmesg from a 6.8 kernel and the output of
> 'cpuid' ?

Sure, here's the output from an Acer Chromebook Spin 311 R721T device:
- kernel log: https://pastebin.com/raw/NNbeJupE
- cpuid output: https://pastebin.com/raw/ThmGqnqG

This is with kernel v6.8 (e8f897f4afef0031fe618a8e94127a0934896aba), let
me know if you need the output from a different revision.

Thank you!

Laura

2024-04-04 16:15:12

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

Laura!

On Thu, Apr 04 2024 at 17:23, Laura Nao wrote:
> On 4/4/24 15:05, Thomas Gleixner wrote:
>> Can you please provide a boot dmesg from a 6.8 kernel and the output of
>> 'cpuid' ?
>
> Sure, here's the output from an Acer Chromebook Spin 311 R721T device:
> - kernel log: https://pastebin.com/raw/NNbeJupE
> - cpuid output: https://pastebin.com/raw/ThmGqnqG

Thanks for the quick response.

Can you please provide 'cpuid -r' output too as 'cpuid' does a few
tweaks to the raw data and it's hard to match it back to the code.

Thanks,

tglx

2024-04-04 18:29:50

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

Laura!

On Thu, Apr 04 2024 at 18:14, Thomas Gleixner wrote:
> Can you please provide 'cpuid -r' output too as 'cpuid' does a few
> tweaks to the raw data and it's hard to match it back to the code.

Don't bother. I think I figured it out.

Can you please test the patch below?

Thanks,

tglx
---
diff --git a/arch/x86/kernel/cpu/topology_amd.c b/arch/x86/kernel/cpu/topology_amd.c
index 1a8b3ad493af..0d91a04b1741 100644
--- a/arch/x86/kernel/cpu/topology_amd.c
+++ b/arch/x86/kernel/cpu/topology_amd.c
@@ -29,7 +29,17 @@ static bool parse_8000_0008(struct topo_scan *tscan)
if (!sft)
sft = get_count_order(ecx.cpu_nthreads + 1);

- topology_set_dom(tscan, TOPO_SMT_DOMAIN, sft, ecx.cpu_nthreads + 1);
+ /*
+ * cpu_nthreads describes the number of threads in the package
+ * sft is the number of APIC ID bits per package
+ *
+ * As the number of actual threads per core is not described in
+ * this leaf, just set the CORE domain shift and let the later
+ * parsers set SMT shift. Assume one thread per core by default
+ * which is correct if there are no other CPUID leafs to parse.
+ */
+ topology_update_dom(tscan, TOPO_SMT_DOMAIN, 0, 1);
+ topology_set_dom(tscan, TOPO_CORE_DOMAIN, sft, ecx.cpu_nthreads + 1);
return true;
}

@@ -73,12 +83,14 @@ static bool parse_8000_001e(struct topo_scan *tscan, bool has_0xb)
tscan->c->topo.initial_apicid = leaf.ext_apic_id;

/*
- * If leaf 0xb is available, then SMT shift is set already. If not
- * take it from ecx.threads_per_core and use topo_update_dom() -
- * topology_set_dom() would propagate and overwrite the already
- * propagated CORE level.
+ * If leaf 0xb is available, then the domain shifts are set
+ * already and nothing to do here.
*/
if (!has_0xb) {
+ /*
+ * Leaf 0x80000008 set the CORE domain shift already.
+ * Update the SMT domain, but do not propagate it.
+ */
unsigned int nthreads = leaf.core_nthreads + 1;

topology_update_dom(tscan, TOPO_SMT_DOMAIN, get_count_order(nthreads), nthreads);

2024-04-04 19:15:01

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

On Thu, Apr 04 2024 at 20:06, Thomas Gleixner wrote:
> On Thu, Apr 04 2024 at 18:14, Thomas Gleixner wrote:
>> Can you please provide 'cpuid -r' output too as 'cpuid' does a few
>> tweaks to the raw data and it's hard to match it back to the code.
>
> Don't bother. I think I figured it out.
>
> Can you please test the patch below?

Bah. Won't help. Let me dig deeper.

2024-04-04 20:05:55

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

On Thu, Apr 04 2024 at 21:14, Thomas Gleixner wrote:

> On Thu, Apr 04 2024 at 20:06, Thomas Gleixner wrote:
>> On Thu, Apr 04 2024 at 18:14, Thomas Gleixner wrote:
>>> Can you please provide 'cpuid -r' output too as 'cpuid' does a few
>>> tweaks to the raw data and it's hard to match it back to the code.
>>
>> Don't bother. I think I figured it out.
>>
>> Can you please test the patch below?
>
> Bah. Won't help. Let me dig deeper.

Can I get the /proc/cpuinfo output please?

2024-04-05 08:14:05

by Laura Nao

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

Hi Thomas,

On 4/4/24 22:05, Thomas Gleixner wrote:
> On Thu, Apr 04 2024 at 21:14, Thomas Gleixner wrote:
>
>> On Thu, Apr 04 2024 at 20:06, Thomas Gleixner wrote:
>>> On Thu, Apr 04 2024 at 18:14, Thomas Gleixner wrote:
>>>> Can you please provide 'cpuid -r' output too as 'cpuid' does a few
>>>> tweaks to the raw data and it's hard to match it back to the code.
>>>
>>> Don't bother. I think I figured it out.
>>>
>>> Can you please test the patch below?
>>
>> Bah. Won't help. Let me dig deeper.
>
> Can I get the /proc/cpuinfo output please?

Here are more logs from the same device running v6.8:
- cpuid -r: https://pastebin.com/raw/YmDrnw06
- cpuinfo: https://pastebin.com/raw/Ki0vwdDE

Thanks for looking into this!

Laura

2024-04-05 08:44:11

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

Laura!

On Fri, Apr 05 2024 at 10:14, Laura Nao wrote:
> On 4/4/24 22:05, Thomas Gleixner wrote:
>> Can I get the /proc/cpuinfo output please?
>
> Here are more logs from the same device running v6.8:
> - cpuid -r: https://pastebin.com/raw/YmDrnw06
> - cpuinfo: https://pastebin.com/raw/Ki0vwdDE

Thanks for the info. I'm still not seing why that wouldn't work.

As the machine seems to have a serial port, can you please replace
'earlycon' with 'earlyprintk=ttyS0,115200' on the command line and see
whether that gives some output. I'm not sure that the current 'earlycon'
has the same effect.

Thanks,

Thomas


2024-04-05 10:39:41

by Laura Nao

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

On 4/5/24 10:42, Thomas Gleixner wrote:
> Laura!
>
> On Fri, Apr 05 2024 at 10:14, Laura Nao wrote:
>> On 4/4/24 22:05, Thomas Gleixner wrote:
>>> Can I get the /proc/cpuinfo output please?
>>
>> Here are more logs from the same device running v6.8:
>> - cpuid -r: https://pastebin.com/raw/YmDrnw06
>> - cpuinfo: https://pastebin.com/raw/Ki0vwdDE
>
> Thanks for the info. I'm still not seing why that wouldn't work.
>
> As the machine seems to have a serial port, can you please replace
> 'earlycon' with 'earlyprintk=ttyS0,115200' on the command line and see
> whether that gives some output. I'm not sure that the current 'earlycon'
> has the same effect.
>

Thanks for the hint - replacing earlycon with earlyprintk=ttyS0,115200
doesn't seem to help unfortunately, I still get no output at all.

Not sure if it's really relevant to mention, but when booting a v6.8
kernel the first stage bootloader (i.e. coreboot in this case) spits out
a few lines before the kernel takes over:

coreboot-v1.9308_26_0.0.22-6034-gd586c7b122 Wed Jul 15 17:51:44 UTC 2020 smm starting...
SMI# #0
Chrome EC: Set SMI mask to 0x0000000000000000
Chrome EC: UHEPI supported
Clearing pending EC events. Error code 1 is expected.
EC returned error result code 9
Chrome EC: Set SCI mask to 0x00000000142609fb

When booting e.g. next-20240404 not even these lines are reported after
'Starting Kernel...'. Not sure if that suggests anything significant
though.

For the sake of completeness, I'm leaving here links to the full test
jobs run in our Collabora LAVA lab, to include everything that's reported
on the serial console after the device is powered on:
- v6.8 test job: https://lava.collabora.dev/scheduler/job/13265735#L279
- next-20240404 test job w/ 'earlyprintk=ttyS0,115200' in cmdline:
https://lava.collabora.dev/scheduler/job/13266992#L279

I also tried applying the patch you attached above on top of
next-20240404 but it doesn't seem to make any difference unfortunately.

Best,

Laura


2024-04-05 13:38:38

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

On Fri, Apr 05 2024 at 12:32, Laura Nao wrote:
> I also tried applying the patch you attached above on top of
> next-20240404 but it doesn't seem to make any difference unfortunately.

As I said it won't help, but it fixes some other issue.

I'm still puzzled and trying to make sense of this.

Can you please boot a working kernel and provide the output of:

# rdmsr 0xc0011005

Thanks,

tglx

2024-04-05 13:58:24

by Laura Nao

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

On 4/5/24 15:38, Thomas Gleixner wrote:
>
> Can you please boot a working kernel and provide the output of:
>
> # rdmsr 0xc0011005
>

Sure:

# rdmsr 0xc0011005
2fabbfff2fd3fbff

Thanks,

Laura

2024-04-05 14:59:20

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

On Fri, Apr 05 2024 at 15:58, Laura Nao wrote:
> On 4/5/24 15:38, Thomas Gleixner wrote:
>>
>> Can you please boot a working kernel and provide the output of:
>>
>> # rdmsr 0xc0011005
>
> # rdmsr 0xc0011005
> 2fabbfff2fd3fbff

Ok. So the CPU does not advertise FEATURE_TOPOEXT which is consistent
with CPUID.

Now I'm even more puzzled because then the patch I gave you should make
a difference. I need to run some errands now, I will provide a debug
patch later this evening.

Thanks,

tglx

2024-04-08 08:20:45

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

Laura!

On Fri, Apr 05 2024 at 16:59, Thomas Gleixner wrote:
> On Fri, Apr 05 2024 at 15:58, Laura Nao wrote:
>> On 4/5/24 15:38, Thomas Gleixner wrote:
>>>
>>> Can you please boot a working kernel and provide the output of:
>>>
>>> # rdmsr 0xc0011005
>>
>> # rdmsr 0xc0011005
>> 2fabbfff2fd3fbff
>
> Ok. So the CPU does not advertise FEATURE_TOPOEXT which is consistent
> with CPUID.
>
> Now I'm even more puzzled because then the patch I gave you should make
> a difference. I need to run some errands now, I will provide a debug
> patch later this evening.

Hmm. No real good idea yet.

Can you please checkout

c749ce393b8f ("x86/cpu: Use common topology code for AMD")

and apply the patch further up in the thread and see whether that
boots. We'll take it from there.

Thanks,

tglx

2024-04-08 11:14:33

by Laura Nao

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

Hi Thomas,

On 4/8/24 10:20, Thomas Gleixner wrote:
> Laura!
>
> On Fri, Apr 05 2024 at 16:59, Thomas Gleixner wrote:
>> On Fri, Apr 05 2024 at 15:58, Laura Nao wrote:
>>> On 4/5/24 15:38, Thomas Gleixner wrote:
>>>>
>>>> Can you please boot a working kernel and provide the output of:
>>>>
>>>> # rdmsr 0xc0011005
>>>
>>> # rdmsr 0xc0011005
>>> 2fabbfff2fd3fbff
>>
>> Ok. So the CPU does not advertise FEATURE_TOPOEXT which is consistent
>> with CPUID.
>>
>> Now I'm even more puzzled because then the patch I gave you should make
>> a difference. I need to run some errands now, I will provide a debug
>> patch later this evening.
>
> Hmm. No real good idea yet.
>
> Can you please checkout
>
> c749ce393b8f ("x86/cpu: Use common topology code for AMD")
>
> and apply the patch further up in the thread and see whether that
> boots. We'll take it from there.
>

Just tried that and it doesn't boot, nothing on the serial console (I
kept earlyprintk=ttyS0,115200 in the cmdline).

Test job for reference: https://lava.collabora.dev/scheduler/job/13297561

Thanks,

Laura

2024-04-08 14:21:21

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

Laura!

On Mon, Apr 08 2024 at 13:06, Laura Nao wrote:
> Just tried that and it doesn't boot, nothing on the serial console (I
> kept earlyprintk=ttyS0,115200 in the cmdline).

Thanks for trying. Now let's take a small step back.

Please reset the tree to:

ace278e7eca6 ("x86/smpboot: Teach it about topo.amd_node_id")

That's the commit right before switching over and according to your
bisect it works. Now apply the patch below, which just runs the new
topology scan function but discards the result.

Thanks,

tglx
---
--- a/arch/x86/kernel/cpu/topology_amd.c
+++ b/arch/x86/kernel/cpu/topology_amd.c
@@ -29,7 +29,17 @@ static bool parse_8000_0008(struct topo_
if (!sft)
sft = get_count_order(ecx.cpu_nthreads + 1);

- topology_set_dom(tscan, TOPO_SMT_DOMAIN, sft, ecx.cpu_nthreads + 1);
+ /*
+ * cpu_nthreads describes the number of threads in the package
+ * sft is the number of APIC ID bits per package
+ *
+ * As the number of actual threads per core is not described in
+ * this leaf, just set the CORE domain shift and let the later
+ * parsers set SMT shift. Assume one thread per core by default
+ * which is correct if there are no other CPUID leafs to parse.
+ */
+ topology_update_dom(tscan, TOPO_SMT_DOMAIN, 0, 1);
+ topology_set_dom(tscan, TOPO_CORE_DOMAIN, sft, ecx.cpu_nthreads + 1);
return true;
}

@@ -73,12 +83,14 @@ static bool parse_8000_001e(struct topo_
tscan->c->topo.initial_apicid = leaf.ext_apic_id;

/*
- * If leaf 0xb is available, then SMT shift is set already. If not
- * take it from ecx.threads_per_core and use topo_update_dom() -
- * topology_set_dom() would propagate and overwrite the already
- * propagated CORE level.
+ * If leaf 0xb is available, then the domain shifts are set
+ * already and nothing to do here.
*/
if (!has_0xb) {
+ /*
+ * Leaf 0x80000008 set the CORE domain shift already.
+ * Update the SMT domain, but do not propagate it.
+ */
unsigned int nthreads = leaf.core_nthreads + 1;

topology_update_dom(tscan, TOPO_SMT_DOMAIN, get_count_order(nthreads), nthreads);
--- a/arch/x86/kernel/cpu/topology_common.c
+++ b/arch/x86/kernel/cpu/topology_common.c
@@ -133,6 +133,10 @@ static void parse_topology(struct topo_s
tscan->ebx1_nproc_shift = get_count_order(ebx.nproc);

switch (c->x86_vendor) {
+ case X86_VENDOR_AMD:
+ if (IS_ENABLED(CONFIG_CPU_SUP_AMD))
+ cpu_parse_topology_amd(tscan);
+ break;
case X86_VENDOR_CENTAUR:
case X86_VENDOR_ZHAOXIN:
parse_legacy(tscan);



2024-04-09 10:08:36

by Laura Nao

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

Hi Thomas,

On 4/8/24 16:19, Thomas Gleixner wrote:
> Laura!
>
> On Mon, Apr 08 2024 at 13:06, Laura Nao wrote:
>> Just tried that and it doesn't boot, nothing on the serial console (I
>> kept earlyprintk=ttyS0,115200 in the cmdline).
>
> Thanks for trying. Now let's take a small step back.
>
> Please reset the tree to:
>
> ace278e7eca6 ("x86/smpboot: Teach it about topo.amd_node_id")
>
> That's the commit right before switching over and according to your
> bisect it works. Now apply the patch below, which just runs the new
> topology scan function but discards the result.
>

The ace278e7eca6 revision with the patch you provided boots correctly.

Reference test job: https://lava.collabora.dev/scheduler/job/13311035#L1894

Thanks,

Laura


2024-04-09 12:25:45

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

Laura!

On Tue, Apr 09 2024 at 12:07, Laura Nao wrote:
> On 4/8/24 16:19, Thomas Gleixner wrote:
>> That's the commit right before switching over and according to your
>> bisect it works. Now apply the patch below, which just runs the new
>> topology scan function but discards the result.
>
> The ace278e7eca6 revision with the patch you provided boots correctly.

Progress :)

Can you please replace that patch with the one below?

Thanks,

tglx
---
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1616,6 +1616,13 @@ void __init early_cpu_init(void)
#endif
}
early_identify_cpu(&boot_cpu_data);
+
+ pr_info("Max cores: %u\n", boot_cpu_data.x86_max_cores);
+ pr_info("pkg %u die %u core %u nid %u\n", boot_cpu_data.topo.pkg_id,
+ boot_cpu_data.topo.die_id, boot_cpu_data.topo.core_id,
+ boot_cpu_data.topo.amd_node_id);
+ pr_info("SNS %u\n", smp_num_siblings);
+ pr_info("NPP %u MDPP %u\n", __amd_nodes_per_pkg, __max_die_per_package);
}

static bool detect_null_seg_behavior(void)
--- a/arch/x86/kernel/cpu/topology_amd.c
+++ b/arch/x86/kernel/cpu/topology_amd.c
@@ -29,7 +29,17 @@ static bool parse_8000_0008(struct topo_
if (!sft)
sft = get_count_order(ecx.cpu_nthreads + 1);

- topology_set_dom(tscan, TOPO_SMT_DOMAIN, sft, ecx.cpu_nthreads + 1);
+ /*
+ * cpu_nthreads describes the number of threads in the package
+ * sft is the number of APIC ID bits per package
+ *
+ * As the number of actual threads per core is not described in
+ * this leaf, just set the CORE domain shift and let the later
+ * parsers set SMT shift. Assume one thread per core by default
+ * which is correct if there are no other CPUID leafs to parse.
+ */
+ topology_update_dom(tscan, TOPO_SMT_DOMAIN, 0, 1);
+ topology_set_dom(tscan, TOPO_CORE_DOMAIN, sft, ecx.cpu_nthreads + 1);
return true;
}

@@ -73,12 +83,14 @@ static bool parse_8000_001e(struct topo_
tscan->c->topo.initial_apicid = leaf.ext_apic_id;

/*
- * If leaf 0xb is available, then SMT shift is set already. If not
- * take it from ecx.threads_per_core and use topo_update_dom() -
- * topology_set_dom() would propagate and overwrite the already
- * propagated CORE level.
+ * If leaf 0xb is available, then the domain shifts are set
+ * already and nothing to do here.
*/
if (!has_0xb) {
+ /*
+ * Leaf 0x80000008 set the CORE domain shift already.
+ * Update the SMT domain, but do not propagate it.
+ */
unsigned int nthreads = leaf.core_nthreads + 1;

topology_update_dom(tscan, TOPO_SMT_DOMAIN, get_count_order(nthreads), nthreads);
--- a/arch/x86/kernel/cpu/topology_common.c
+++ b/arch/x86/kernel/cpu/topology_common.c
@@ -133,6 +133,10 @@ static void parse_topology(struct topo_s
tscan->ebx1_nproc_shift = get_count_order(ebx.nproc);

switch (c->x86_vendor) {
+ case X86_VENDOR_AMD:
+ if (IS_ENABLED(CONFIG_CPU_SUP_AMD))
+ cpu_parse_topology_amd(tscan);
+ break;
case X86_VENDOR_CENTAUR:
case X86_VENDOR_ZHAOXIN:
parse_legacy(tscan);
@@ -162,6 +166,9 @@ static void topo_set_ids(struct topo_sca

if (c->x86_vendor == X86_VENDOR_AMD)
cpu_topology_fixup_amd(tscan);
+
+ pr_info("pkg %u die %u core %u nid %u\n", c->topo.pkg_id, c->topo.die_id,
+ c->topo.core_id, c->topo.amd_node_id);
}

static void topo_set_max_cores(struct topo_scan *tscan)
@@ -175,6 +182,7 @@ static void topo_set_max_cores(struct to
*/
tscan->c->x86_max_cores = tscan->dom_ncpus[TOPO_DIEGRP_DOMAIN] >>
x86_topo_system.dom_shifts[TOPO_SMT_DOMAIN];
+ pr_info("Max cores: %u\n", tscan->c->x86_max_cores);
}

void cpu_parse_topology(struct cpuinfo_x86 *c)
@@ -215,18 +223,19 @@ void __init cpu_init_topology(struct cpu

parse_topology(&tscan, true);

- if (!topo_is_converted(c))
- return;
-
/* Copy the shift values and calculate the unit sizes. */
memcpy(x86_topo_system.dom_shifts, tscan.dom_shifts, sizeof(x86_topo_system.dom_shifts));

dom = TOPO_SMT_DOMAIN;
x86_topo_system.dom_size[dom] = 1U << x86_topo_system.dom_shifts[dom];
+ pr_info("Dom %u Sft: %u Sz: %u\n", dom, x86_topo_system.dom_size[dom],
+ x86_topo_system.dom_shifts[dom]);

for (dom++; dom < TOPO_MAX_DOMAIN; dom++) {
sft = x86_topo_system.dom_shifts[dom] - x86_topo_system.dom_shifts[dom - 1];
x86_topo_system.dom_size[dom] = 1U << sft;
+ pr_info("Dom %u Sft: %u Sz: %u\n", dom, x86_topo_system.dom_size[dom],
+ x86_topo_system.dom_shifts[dom]);
}

topo_set_ids(&tscan);
@@ -238,6 +247,7 @@ void __init cpu_init_topology(struct cpu
* changes further down the road to get it right during early boot.
*/
smp_num_siblings = tscan.dom_ncpus[TOPO_SMT_DOMAIN];
+ pr_info("SNS %u\n", smp_num_siblings);

/*
* Neither it's clear whether there are as many dies as the APIC
@@ -252,4 +262,6 @@ void __init cpu_init_topology(struct cpu
*/
if (c->x86_vendor == X86_VENDOR_AMD || c->x86_vendor == X86_VENDOR_HYGON)
__amd_nodes_per_pkg = __max_die_per_package = tscan.amd_nodes_per_pkg;
+
+ pr_info("NPP %u MDPP %u\n", __amd_nodes_per_pkg, __max_die_per_package);
}

2024-04-10 08:20:54

by Laura Nao

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

Hi Thomas,

On 4/9/24 14:25, Thomas Gleixner wrote:
> Laura!
>
> On Tue, Apr 09 2024 at 12:07, Laura Nao wrote:
>> On 4/8/24 16:19, Thomas Gleixner wrote:
>>> That's the commit right before switching over and according to your
>>> bisect it works. Now apply the patch below, which just runs the new
>>> topology scan function but discards the result.
>>
>> The ace278e7eca6 revision with the patch you provided boots correctly.
>
> Progress :)
>

Indeed :)

> Can you please replace that patch with the one below?

So, with this patch applied on top of ace278e7eca6 the kernel doesn't
boot anymore - reference test job:
https://lava.collabora.dev/scheduler/job/13324010

I see the only change between the second and third patch you provided,
besides the debug prints, is:

- if (!topo_is_converted(c))
- return;
-

Printing the debug information without this probably doesn't really help,
but just in case it's useful: I tried excluding the change above from the
patch while leaving everything else unchanged - reference test job:
https://lava.collabora.dev/scheduler/job/13324298 (also pasted the
kernel log here for easier consultation: https://pastebin.com/raw/TQBDvCah)

Hope this helps,

Laura

>
> Thanks,
>
> tglx
> ---
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1616,6 +1616,13 @@ void __init early_cpu_init(void)
> #endif
> }
> early_identify_cpu(&boot_cpu_data);
> +
> + pr_info("Max cores: %u\n", boot_cpu_data.x86_max_cores);
> + pr_info("pkg %u die %u core %u nid %u\n", boot_cpu_data.topo.pkg_id,
> + boot_cpu_data.topo.die_id, boot_cpu_data.topo.core_id,
> + boot_cpu_data.topo.amd_node_id);
> + pr_info("SNS %u\n", smp_num_siblings);
> + pr_info("NPP %u MDPP %u\n", __amd_nodes_per_pkg, __max_die_per_package);
> }
>
> static bool detect_null_seg_behavior(void)
> --- a/arch/x86/kernel/cpu/topology_amd.c
> +++ b/arch/x86/kernel/cpu/topology_amd.c
> @@ -29,7 +29,17 @@ static bool parse_8000_0008(struct topo_
> if (!sft)
> sft = get_count_order(ecx.cpu_nthreads + 1);
>
> - topology_set_dom(tscan, TOPO_SMT_DOMAIN, sft, ecx.cpu_nthreads + 1);
> + /*
> + * cpu_nthreads describes the number of threads in the package
> + * sft is the number of APIC ID bits per package
> + *
> + * As the number of actual threads per core is not described in
> + * this leaf, just set the CORE domain shift and let the later
> + * parsers set SMT shift. Assume one thread per core by default
> + * which is correct if there are no other CPUID leafs to parse.
> + */
> + topology_update_dom(tscan, TOPO_SMT_DOMAIN, 0, 1);
> + topology_set_dom(tscan, TOPO_CORE_DOMAIN, sft, ecx.cpu_nthreads + 1);
> return true;
> }
>
> @@ -73,12 +83,14 @@ static bool parse_8000_001e(struct topo_
> tscan->c->topo.initial_apicid = leaf.ext_apic_id;
>
> /*
> - * If leaf 0xb is available, then SMT shift is set already. If not
> - * take it from ecx.threads_per_core and use topo_update_dom() -
> - * topology_set_dom() would propagate and overwrite the already
> - * propagated CORE level.
> + * If leaf 0xb is available, then the domain shifts are set
> + * already and nothing to do here.
> */
> if (!has_0xb) {
> + /*
> + * Leaf 0x80000008 set the CORE domain shift already.
> + * Update the SMT domain, but do not propagate it.
> + */
> unsigned int nthreads = leaf.core_nthreads + 1;
>
> topology_update_dom(tscan, TOPO_SMT_DOMAIN, get_count_order(nthreads), nthreads);
> --- a/arch/x86/kernel/cpu/topology_common.c
> +++ b/arch/x86/kernel/cpu/topology_common.c
> @@ -133,6 +133,10 @@ static void parse_topology(struct topo_s
> tscan->ebx1_nproc_shift = get_count_order(ebx.nproc);
>
> switch (c->x86_vendor) {
> + case X86_VENDOR_AMD:
> + if (IS_ENABLED(CONFIG_CPU_SUP_AMD))
> + cpu_parse_topology_amd(tscan);
> + break;
> case X86_VENDOR_CENTAUR:
> case X86_VENDOR_ZHAOXIN:
> parse_legacy(tscan);
> @@ -162,6 +166,9 @@ static void topo_set_ids(struct topo_sca
>
> if (c->x86_vendor == X86_VENDOR_AMD)
> cpu_topology_fixup_amd(tscan);
> +
> + pr_info("pkg %u die %u core %u nid %u\n", c->topo.pkg_id, c->topo.die_id,
> + c->topo.core_id, c->topo.amd_node_id);
> }
>
> static void topo_set_max_cores(struct topo_scan *tscan)
> @@ -175,6 +182,7 @@ static void topo_set_max_cores(struct to
> */
> tscan->c->x86_max_cores = tscan->dom_ncpus[TOPO_DIEGRP_DOMAIN] >>
> x86_topo_system.dom_shifts[TOPO_SMT_DOMAIN];
> + pr_info("Max cores: %u\n", tscan->c->x86_max_cores);
> }
>
> void cpu_parse_topology(struct cpuinfo_x86 *c)
> @@ -215,18 +223,19 @@ void __init cpu_init_topology(struct cpu
>
> parse_topology(&tscan, true);
>
> - if (!topo_is_converted(c))
> - return;
> -
> /* Copy the shift values and calculate the unit sizes. */
> memcpy(x86_topo_system.dom_shifts, tscan.dom_shifts, sizeof(x86_topo_system.dom_shifts));
>
> dom = TOPO_SMT_DOMAIN;
> x86_topo_system.dom_size[dom] = 1U << x86_topo_system.dom_shifts[dom];
> + pr_info("Dom %u Sft: %u Sz: %u\n", dom, x86_topo_system.dom_size[dom],
> + x86_topo_system.dom_shifts[dom]);
>
> for (dom++; dom < TOPO_MAX_DOMAIN; dom++) {
> sft = x86_topo_system.dom_shifts[dom] - x86_topo_system.dom_shifts[dom - 1];
> x86_topo_system.dom_size[dom] = 1U << sft;
> + pr_info("Dom %u Sft: %u Sz: %u\n", dom, x86_topo_system.dom_size[dom],
> + x86_topo_system.dom_shifts[dom]);
> }
>
> topo_set_ids(&tscan);
> @@ -238,6 +247,7 @@ void __init cpu_init_topology(struct cpu
> * changes further down the road to get it right during early boot.
> */
> smp_num_siblings = tscan.dom_ncpus[TOPO_SMT_DOMAIN];
> + pr_info("SNS %u\n", smp_num_siblings);
>
> /*
> * Neither it's clear whether there are as many dies as the APIC
> @@ -252,4 +262,6 @@ void __init cpu_init_topology(struct cpu
> */
> if (c->x86_vendor == X86_VENDOR_AMD || c->x86_vendor == X86_VENDOR_HYGON)
> __amd_nodes_per_pkg = __max_die_per_package = tscan.amd_nodes_per_pkg;
> +
> + pr_info("NPP %u MDPP %u\n", __amd_nodes_per_pkg, __max_die_per_package);
> }

2024-04-10 14:05:24

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

Laura!

On Wed, Apr 10 2024 at 10:15, Laura Nao wrote:
> On 4/9/24 14:25, Thomas Gleixner wrote:
>> Can you please replace that patch with the one below?
>
> So, with this patch applied on top of ace278e7eca6 the kernel doesn't
> boot anymore - reference test job:
> https://lava.collabora.dev/scheduler/job/13324010
>
> I see the only change between the second and third patch you provided,
> besides the debug prints, is:
>
> - if (!topo_is_converted(c))
> - return;
> -

Right. So this limits the area to search significantly.

> Printing the debug information without this probably doesn't really help,
> but just in case it's useful: I tried excluding the change above from the
> patch while leaving everything else unchanged - reference test job:
> https://lava.collabora.dev/scheduler/job/13324298 (also pasted the
> kernel log here for easier consultation:
> https://pastebin.com/raw/TQBDvCah)
>
> Hope this helps,

It does. Good idea!

I just moved the exit check a bit so we should see the scan info. That
should tell me what goes south.

Thanks,

tglx
---
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1616,6 +1616,13 @@ void __init early_cpu_init(void)
#endif
}
early_identify_cpu(&boot_cpu_data);
+
+ pr_info("Max cores: %u\n", boot_cpu_data.x86_max_cores);
+ pr_info("pkg %u die %u core %u nid %u\n", boot_cpu_data.topo.pkg_id,
+ boot_cpu_data.topo.die_id, boot_cpu_data.topo.core_id,
+ boot_cpu_data.topo.amd_node_id);
+ pr_info("SNS %u\n", smp_num_siblings);
+ pr_info("NPP %u MDPP %u\n", __amd_nodes_per_pkg, __max_die_per_package);
}

static bool detect_null_seg_behavior(void)
--- a/arch/x86/kernel/cpu/topology_amd.c
+++ b/arch/x86/kernel/cpu/topology_amd.c
@@ -29,7 +29,17 @@ static bool parse_8000_0008(struct topo_
if (!sft)
sft = get_count_order(ecx.cpu_nthreads + 1);

- topology_set_dom(tscan, TOPO_SMT_DOMAIN, sft, ecx.cpu_nthreads + 1);
+ /*
+ * cpu_nthreads describes the number of threads in the package
+ * sft is the number of APIC ID bits per package
+ *
+ * As the number of actual threads per core is not described in
+ * this leaf, just set the CORE domain shift and let the later
+ * parsers set SMT shift. Assume one thread per core by default
+ * which is correct if there are no other CPUID leafs to parse.
+ */
+ topology_update_dom(tscan, TOPO_SMT_DOMAIN, 0, 1);
+ topology_set_dom(tscan, TOPO_CORE_DOMAIN, sft, ecx.cpu_nthreads + 1);
return true;
}

@@ -73,12 +83,14 @@ static bool parse_8000_001e(struct topo_
tscan->c->topo.initial_apicid = leaf.ext_apic_id;

/*
- * If leaf 0xb is available, then SMT shift is set already. If not
- * take it from ecx.threads_per_core and use topo_update_dom() -
- * topology_set_dom() would propagate and overwrite the already
- * propagated CORE level.
+ * If leaf 0xb is available, then the domain shifts are set
+ * already and nothing to do here.
*/
if (!has_0xb) {
+ /*
+ * Leaf 0x80000008 set the CORE domain shift already.
+ * Update the SMT domain, but do not propagate it.
+ */
unsigned int nthreads = leaf.core_nthreads + 1;

topology_update_dom(tscan, TOPO_SMT_DOMAIN, get_count_order(nthreads), nthreads);
--- a/arch/x86/kernel/cpu/topology_common.c
+++ b/arch/x86/kernel/cpu/topology_common.c
@@ -133,6 +133,10 @@ static void parse_topology(struct topo_s
tscan->ebx1_nproc_shift = get_count_order(ebx.nproc);

switch (c->x86_vendor) {
+ case X86_VENDOR_AMD:
+ if (IS_ENABLED(CONFIG_CPU_SUP_AMD))
+ cpu_parse_topology_amd(tscan);
+ break;
case X86_VENDOR_CENTAUR:
case X86_VENDOR_ZHAOXIN:
parse_legacy(tscan);
@@ -162,6 +166,9 @@ static void topo_set_ids(struct topo_sca

if (c->x86_vendor == X86_VENDOR_AMD)
cpu_topology_fixup_amd(tscan);
+
+ pr_info("pkg %u die %u core %u nid %u\n", c->topo.pkg_id, c->topo.die_id,
+ c->topo.core_id, c->topo.amd_node_id);
}

static void topo_set_max_cores(struct topo_scan *tscan)
@@ -175,6 +182,7 @@ static void topo_set_max_cores(struct to
*/
tscan->c->x86_max_cores = tscan->dom_ncpus[TOPO_DIEGRP_DOMAIN] >>
x86_topo_system.dom_shifts[TOPO_SMT_DOMAIN];
+ pr_info("Max cores: %u\n", tscan->c->x86_max_cores);
}

void cpu_parse_topology(struct cpuinfo_x86 *c)
@@ -215,20 +223,26 @@ void __init cpu_init_topology(struct cpu

parse_topology(&tscan, true);

- if (!topo_is_converted(c))
- return;
-
/* Copy the shift values and calculate the unit sizes. */
memcpy(x86_topo_system.dom_shifts, tscan.dom_shifts, sizeof(x86_topo_system.dom_shifts));

dom = TOPO_SMT_DOMAIN;
x86_topo_system.dom_size[dom] = 1U << x86_topo_system.dom_shifts[dom];
+ pr_info("Dom %u Sft: %u Sz: %u\n", dom, x86_topo_system.dom_size[dom],
+ x86_topo_system.dom_shifts[dom]);

for (dom++; dom < TOPO_MAX_DOMAIN; dom++) {
sft = x86_topo_system.dom_shifts[dom] - x86_topo_system.dom_shifts[dom - 1];
x86_topo_system.dom_size[dom] = 1U << sft;
+ pr_info("Dom %u Sft: %u Sz: %u\n", dom, x86_topo_system.dom_size[dom],
+ x86_topo_system.dom_shifts[dom]);
}

+ pr_info("NPP %u\n", tscan.amd_nodes_per_pkg);
+
+ if (!topo_is_converted(c))
+ return;
+
topo_set_ids(&tscan);
topo_set_max_cores(&tscan);

@@ -238,6 +252,7 @@ void __init cpu_init_topology(struct cpu
* changes further down the road to get it right during early boot.
*/
smp_num_siblings = tscan.dom_ncpus[TOPO_SMT_DOMAIN];
+ pr_info("SNS %u\n", smp_num_siblings);

/*
* Neither it's clear whether there are as many dies as the APIC
@@ -252,4 +267,6 @@ void __init cpu_init_topology(struct cpu
*/
if (c->x86_vendor == X86_VENDOR_AMD || c->x86_vendor == X86_VENDOR_HYGON)
__amd_nodes_per_pkg = __max_die_per_package = tscan.amd_nodes_per_pkg;
+
+ pr_info("NPP %u MDPP %u\n", __amd_nodes_per_pkg, __max_die_per_package);
}

2024-04-10 16:11:24

by Laura Nao

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

Hi Thomas,

On 4/10/24 15:57, Thomas Gleixner wrote:
> Laura!
>
> On Wed, Apr 10 2024 at 10:15, Laura Nao wrote:
>> On 4/9/24 14:25, Thomas Gleixner wrote:
>>> Can you please replace that patch with the one below?
>>
>> So, with this patch applied on top of ace278e7eca6 the kernel doesn't
>> boot anymore - reference test job:
>> https://lava.collabora.dev/scheduler/job/13324010
>>
>> I see the only change between the second and third patch you provided,
>> besides the debug prints, is:
>>
>> - if (!topo_is_converted(c))
>> - return;
>> -
>
> Right. So this limits the area to search significantly.
>
>> Printing the debug information without this probably doesn't really help,
>> but just in case it's useful: I tried excluding the change above from the
>> patch while leaving everything else unchanged - reference test job:
>> https://lava.collabora.dev/scheduler/job/13324298 (also pasted the
>> kernel log here for easier consultation:
>> https://pastebin.com/raw/TQBDvCah)
>>
>> Hope this helps,
>
> It does. Good idea!
>
> I just moved the exit check a bit so we should see the scan info. That
> should tell me what goes south.
>

Here's the full kernel log with the latest patch applied:
https://pastebin.com/raw/r2CkP396

Reference test job: https://lava.collabora.dev/scheduler/job/13328709

Thanks!

Laura

2024-04-10 19:50:06

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [REGRESSION] mainline boot regression on AMD Stoney Ridge Chromebooks

Laura!

On Wed, Apr 10 2024 at 18:11, Laura Nao wrote:
> On 4/10/24 15:57, Thomas Gleixner wrote:
>> I just moved the exit check a bit so we should see the scan info. That
>> should tell me what goes south.
>
> Here's the full kernel log with the latest patch applied:
> https://pastebin.com/raw/r2CkP396

Ok. Now I can see it.

pr_info("NPP %u\n", tscan.amd_nodes_per_pkg);

The output is:

<6>[ 0.000000] NPP 3

The original code then reads out 1, which is what's expected on that
machine.

Now that bogus value '3' is used later in

a %= b / 3

as b == 2, the division result is zero, which makes the %= operation
crash with a division by zero. As there are no handlers yet, nothing to
see.

The problem causing the bogus readout sits between my keyboard and my
chair.

I'll send out the fixes with proper change logs later tonight.

Thanks for all your help!

tglx