2024-04-17 21:21:56

by Lyude Paul

Subject: Early boot regression from f0551af0213 ("x86/topology: Ignore non-present APIC IDs in a present package")

Hi! I just wanted to let you know that one of the desktops I use for
testing no longer seems to boot after this commit (just finished
bisecting and confirming). The machine hangs before it gets to fbcon,
and the error I'm seeing in the early boot console is as such:

Kernel panic - not syncing: timer doesn't work through Interrupt-remapped IO-APIC
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0-rc5Lyude-Test+ #20
Hardware name: MSI MS-7A39/A320M GAMING PRO (MS-7A39), BIOS 1.10 01/22/2019
Call trace:
<TASK>
dump_stack_lvl+0x47/0x60
panic+0x340/0x370
? timer_irq_works+0x67/0x130
panic_if_irq_remap+0x1d/0x20
setup_IO_APIC+0x82d/0x950
? _raw_spin_unlock_irqrestore+0x1d/0x40
? clear_IO_APIC_pin+0x16c/0x260
apic_intr_mode_init+0x5d/0xf0
x86_late_time_init+0x24/0x40
start_kernel+0x673/0xa90
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0x96/0xa0
secondary_startup_64_no_verify+0x180/0x18b
</TASK>
--- [ end Kernel panic - not syncing: timer doesn't work through Interrupt-remapped IO-APIC ]---

Assuming I copied this over by hand to my computer correctly, the
decoded backtrace should be:

Kernel panic - not syncing: timer doesn't work through Interrupt-remapped IO-APIC
Hardware name: MSI MS-7A39/A320M GAMING PRO (MS-7A39), BIOS 1.10 01/22/2019
Call trace:
<TASK>
dump_stack_lvl (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/lib/dump_stack.c:107)
panic (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/panic.c:344)
? timer_irq_works (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./arch/x86/include/asm/msr.h:186 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/apic/io_apic.c:1595 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/apic/io_apic.c:1634)
panic_if_irq_remap (??:?)
setup_IO_APIC (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/apic/io_apic.c:2241 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/apic/io_apic.c:2413)
? _raw_spin_unlock_irqrestore (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./arch/x86/include/asm/preempt.h:94 (discriminator 1) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./include/linux/spinlock_api_smp.h:152 (discriminator 1) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/locking/spinlock.c:194 (discriminator 1))
? clear_IO_APIC_pin (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/apic/io_apic.c:563)
apic_intr_mode_init (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/apic/apic.c:2330 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/apic/apic.c:1374)
x86_late_time_init (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/time.c:101)
start_kernel (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/init/main.c:1035)
x86_64_start_reservations (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/head64.c:543)
x86_64_start_kernel (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/head64.c:485 (discriminator 5))
secondary_startup_64_no_verify (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/head_64.S:459)
</TASK>
--- [ end Kernel panic - not syncing: timer doesn't work through Interrupt-remapped IO-APIC ]---

Happy to provide any more information from this machine if you need it
:). And hopefully I'm not just late to the party and reporting a
regression someone else found already lol

--
Cheers,
Lyude Paul (she/her)
Software Engineer at Red Hat



2024-04-18 08:27:30

by Borislav Petkov

Subject: Re: Early boot regression from f0551af0213 ("x86/topology: Ignore non-present APIC IDs in a present package")

On Wed, Apr 17, 2024 at 05:21:43PM -0400, Lyude Paul wrote:
> Hi! I just wanted to let you know that one of the desktops I use for
> testing no longer seems to boot after this commit (just finished
> bisecting and confirming). The machine hangs before it gets to fbcon,
> and the error I'm seeing in the early boot console is as such:
>
> Kernel panic - not syncing: timer doesn't work through Interrupt-remapped IO-APIC
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0-rc5Lyude-Test+ #20
> Hardware name: MSI MS-7A39/A320M GAMING PRO (MS-7A39), BIOS 1.10 01/22/2019

Looks like an AMD chipset. Thomas did fix some fallout from the topo
rework on AMD, can you test the tip/master branch pls?

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-04-18 17:20:18

by Lyude Paul

Subject: Re: Early boot regression from f0551af0213 ("x86/topology: Ignore non-present APIC IDs in a present package")

Just gave it a try, unfortunately I'm still seeing the same result on
that branch.

One more piece of information I apparently missed when reporting this
yesterday btw: I noticed one more kernel message that comes before the
panic that's probably relevant:

..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1

On Thu, 2024-04-18 at 10:27 +0200, Borislav Petkov wrote:
> On Wed, Apr 17, 2024 at 05:21:43PM -0400, Lyude Paul wrote:
> > Hi! I just wanted to let you know that one of the desktops I use
> > for
> > testing no longer seems to boot after this commit (just finished
> > bisecting and confirming). The machine hangs before it gets to
> > fbcon,
> > and the error I'm seeing in the early boot console is as such:
> >
> >    Kernel panic - not syncing: timer doesn't work through
> > Interrupt-remapped IO-APIC
> >    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0-rc5Lyude-Test+
> > #20
> >    Hardware name: MSI MS-7A39/A320M GAMING PRO (MS-7A39), BIOS 1.10
> > 01/22/2019
>
> Looks like an AMD chipset. Thomas did fix some fallout from the topo
> rework on AMD, can you test the tip/master branch pls?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/
>
> Thx.
>

--
Cheers,
Lyude Paul (she/her)
Software Engineer at Red Hat


2024-04-18 19:13:50

by Thomas Gleixner

Subject: Re: Early boot regression from f0551af0213 ("x86/topology: Ignore non-present APIC IDs in a present package")

On Thu, Apr 18 2024 at 13:20, Lyude Paul wrote:

> Just gave it a try, unfortunately I'm still seeing the same result on
> that branch.
>
> One more piece of information I apparently missed when reporting this
> yesterday btw: I noticed one more kernel message that comes before the
> panic that's probably relevant:
>
> ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1

Can you please apply the debug patch below which should make it boot
again.

Please also provide the output of the files underneath of

/sys/kernel/debug/x86/topo/

Thanks,

tglx
---
 arch/x86/kernel/cpu/topology.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -176,6 +176,8 @@ static __init void topo_register_apic(u3
 {
 	int cpu, dom;
 
+	pr_info("APIC ID %x present %d\n", apic_id, present);
+
 	if (present) {
 		set_bit(apic_id, phys_cpu_present_map);
 
@@ -201,10 +203,7 @@ static __init void topo_register_apic(u3
 		 */
 		if (hypervisor_is_type(X86_HYPER_NATIVE) &&
 		    topo_unit_count(pkgid, TOPO_PKG_DOMAIN, phys_cpu_present_map)) {
-			pr_info_once("Ignoring hot-pluggable APIC ID %x in present package.\n",
-				     apic_id);
-			topo_info.nr_rejected_cpus++;
-			return;
+			pr_info("Hot-pluggable APIC ID %x in present package.\n", apic_id);
 		}
 
 		topo_info.nr_disabled_cpus++;



2024-04-19 05:38:07

by Thomas Gleixner

Subject: Re: Early boot regression from f0551af0213 ("x86/topology: Ignore non-present APIC IDs in a present package")

On Thu, Apr 18 2024 at 21:13, Thomas Gleixner wrote:
> On Thu, Apr 18 2024 at 13:20, Lyude Paul wrote:
>
>> Just gave it a try, unfortunately I'm still seeing the same result on
>> that branch.
>>
>> One more piece of information I apparently missed when reporting this
>> yesterday btw: I noticed one more kernel message that comes before the
>> panic that's probably relevant:
>>
>> ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
>
> Can you please apply the debug patch below which should make it boot
> again.

And provide the resulting dmesg obviously.

> Please also provide the output of the files underneath of
>
> /sys/kernel/debug/x86/topo/
>
> Thanks,
>
> tglx

2024-04-19 17:38:46

by Lyude Paul

Subject: Re: Early boot regression from f0551af0213 ("x86/topology: Ignore non-present APIC IDs in a present package")

Awesome - can confirm the patch does indeed make the machine boot. Full
dmesg from boot attached. And the contents of
/sys/kernel/debug/x86/topo/ is as follows:

domain: Thread shift: 1 dom_size: 2 max_threads: 2
domain: Core shift: 4 dom_size: 8 max_threads: 16
domain: Module shift: 4 dom_size: 1 max_threads: 16
domain: Tile shift: 4 dom_size: 1 max_threads: 16
domain: Die shift: 4 dom_size: 1 max_threads: 16
domain: DieGrp shift: 4 dom_size: 1 max_threads: 16
domain: Package shift: 4 dom_size: 1 max_threads: 16
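
For readers following along: the shift values above describe how an APIC ID is split into topology levels. Below is a minimal userspace sketch in C, not the kernel's code. It assumes each level occupies the APIC ID bits between its own shift and the shift of the level below it; the names apic_field(), thread_of(), core_of() and package_of() are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Shifts copied from the debugfs output above; the names are illustrative. */
enum { THREAD_SHIFT = 1, CORE_SHIFT = 4 };

/* Extract the APIC ID bits in the half-open range [lo_shift, hi_shift). */
static uint32_t apic_field(uint32_t apic_id, unsigned lo_shift, unsigned hi_shift)
{
	assert(lo_shift <= hi_shift && hi_shift < 32);
	return (apic_id >> lo_shift) & ((1u << (hi_shift - lo_shift)) - 1);
}

/* Thread index inside its core: the bits below THREAD_SHIFT. */
static uint32_t thread_of(uint32_t apic_id)
{
	return apic_field(apic_id, 0, THREAD_SHIFT);
}

/* Core index inside its package: the bits between the two shifts. */
static uint32_t core_of(uint32_t apic_id)
{
	return apic_field(apic_id, THREAD_SHIFT, CORE_SHIFT);
}

/* Package number: everything at or above CORE_SHIFT (all higher shifts are 4 too). */
static uint32_t package_of(uint32_t apic_id)
{
	return apic_id >> CORE_SHIFT;
}
```

With these shifts, APIC ID 0xb would decode to thread 1 of core 5 in package 0, and the 4-bit space below the package level leaves room for 16 APIC IDs, which matches max_threads: 16.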

On Thu, 2024-04-18 at 21:13 +0200, Thomas Gleixner wrote:
> On Thu, Apr 18 2024 at 13:20, Lyude Paul wrote:
>
> > Just gave it a try, unfortunately I'm still seeing the same result
> > on
> > that branch.
> >
> > One more piece of information I apparently missed when reporting
> > this
> > yesterday btw: I noticed one more kernel message that comes before
> > the
> > panic that's probably relevant:
> >
> > ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
>
> Can you please apply the debug patch below which should make it boot
> again.
>
> Please also provide the output of the files underneath of
>
>        /sys/kernel/debug/x86/topo/
>
> Thanks,
>
>         tglx
> ---
>  arch/x86/kernel/cpu/topology.c |    7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> --- a/arch/x86/kernel/cpu/topology.c
> +++ b/arch/x86/kernel/cpu/topology.c
> @@ -176,6 +176,8 @@ static __init void topo_register_apic(u3
>  {
>   int cpu, dom;
>  
> + pr_info("APIC ID %x present %d\n", apic_id, present);
> +
>   if (present) {
>   set_bit(apic_id, phys_cpu_present_map);
>  
> @@ -201,10 +203,7 @@ static __init void topo_register_apic(u3
>   */
>   if (hypervisor_is_type(X86_HYPER_NATIVE) &&
>       topo_unit_count(pkgid, TOPO_PKG_DOMAIN,
> phys_cpu_present_map)) {
> - pr_info_once("Ignoring hot-pluggable APIC ID
> %x in present package.\n",
> -      apic_id);
> - topo_info.nr_rejected_cpus++;
> - return;
> + pr_info("Hot-pluggable APIC ID %x in present
> package.\n", apic_id);
>   }
>  
>   topo_info.nr_disabled_cpus++;
>
>

--
Cheers,
Lyude Paul (she/her)
Software Engineer at Red Hat


Attachments:
gamma-apic-debug-patch.dmesg.log (85.73 kB)

2024-04-19 22:15:39

by Thomas Gleixner

Subject: Re: Early boot regression from f0551af0213 ("x86/topology: Ignore non-present APIC IDs in a present package")

Paul!

On Fri, Apr 19 2024 at 13:38, Lyude Paul wrote:
> Awesome - can confirm the patch does indeed make the machine boot. Full
> dmesg from boot attached.

Thanks for providing the data.

[ 0.089286] CPU topo: APIC ID 0 present 1
[ 0.089294] CPU topo: APIC ID 0 present 0
[ 0.089296] CPU topo: Hot-pluggable APIC ID 0 in present package.

ACPI is really a wonderland.

Can you please test the patch below?

Thanks,

tglx
---
Subject: x86/topology: Deal with more broken ACPI tables
From: Thomas Gleixner <[email protected]>
Date: Thu, 18 Apr 2024 21:02:39 +0200

Paul reported a regression which was caused by the handling of non-present
CPUs in a present package. It's caused by the ACPI table on the system
which advertises APICs twice, present and non-present:

CPU topo: APIC ID 0 present 1
CPU topo: APIC ID 0 present 0
CPU topo: Hot-pluggable APIC ID 0 in present package.

Which causes the topology to get confused to the point that it fails to
bring the system up because the target APIC for the IOAPIC is not
available.

Prevent this by checking whether a non-present CPU has been already
registered as present before. If so emit a firmware warning and ignore the
registration request.

Fixes: f0551af0213 ("x86/topology: Ignore non-present APIC IDs in a present package")
Reported-by: Lyude Paul <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
 arch/x86/kernel/cpu/topology.c |    6 ++++++
 1 file changed, 6 insertions(+)

--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -195,6 +195,12 @@ static __init void topo_register_apic(u3
 	} else {
 		u32 pkgid = topo_apicid(apic_id, TOPO_PKG_DOMAIN);
 
+		if (test_bit(apic_id, phys_cpu_present_map)) {
+			pr_warn_once(FW_BUG "Already present APIC ID %x registered again as non-present\n",
+				     apic_id);
+			return;
+		}
+
 		/*
 		 * Check for present APICs in the same package when running
 		 * on bare metal. Allow the bogosity in a guest.
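
As an illustration of the check this patch adds, here is a small userspace model in C. It is only a sketch: the array-based present map, the MAX_APIC_IDS bound, the counters and the register_apic() name are assumptions made for this example, and the real topo_register_apic() does considerably more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_APIC_IDS 256	/* illustrative bound, not the kernel's */

static bool present_map[MAX_APIC_IDS];
static unsigned nr_present, nr_disabled, nr_duplicates;

/* Returns true if the entry was accounted, false if it was ignored. */
static bool register_apic(uint32_t apic_id, bool present)
{
	assert(apic_id < MAX_APIC_IDS);

	if (present) {
		/* Count each present APIC ID only once. */
		if (!present_map[apic_id])
			nr_present++;
		present_map[apic_id] = true;
		return true;
	}

	/* The same ID was already registered as present: ignore the entry. */
	if (present_map[apic_id]) {
		fprintf(stderr, "firmware bug: APIC ID %x registered again as non-present\n",
			(unsigned)apic_id);
		nr_duplicates++;
		return false;
	}

	/* A genuinely non-present ID is kept as a hotplug candidate. */
	nr_disabled++;
	return true;
}
```

Feeding this model the sequence from the dmesg above (APIC ID 0 present, then APIC ID 0 non-present) accounts the first entry and ignores the second.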

2024-04-23 17:10:56

by Thomas Gleixner

Subject: Re: Early boot regression from f0551af0213 ("x86/topology: Ignore non-present APIC IDs in a present package")

On Sat, Apr 20 2024 at 00:15, Thomas Gleixner wrote:
> Paul!
>
> On Fri, Apr 19 2024 at 13:38, Lyude Paul wrote:
>> Awesome - can confirm the patch does indeed make the machine boot. Full
>> dmesg from boot attached.
>
> Thanks for providing the data.
>
> [ 0.089286] CPU topo: APIC ID 0 present 1
> [ 0.089294] CPU topo: APIC ID 0 present 0
> [ 0.089296] CPU topo: Hot-pluggable APIC ID 0 in present package.
>
> ACPI is really a wonderland.

Second thoughts. I just stared at this some more and I really cannot
figure out why any of this (including the debug patch) makes a
difference or even sense at all.

All the commit you bisected to does is to reject the non-present APIC
IDs, but that's just an accounting thing. Instead of having them
accounted as disabled they are accounted as rejected.

So no. None of this makes any sense at all.

2024-04-24 20:57:06

by Lyude Paul

Subject: Re: Early boot regression from f0551af0213 ("x86/topology: Ignore non-present APIC IDs in a present package")



On Sat, 2024-04-20 at 00:15 +0200, Thomas Gleixner wrote:
> Paul!

Lyude is fine BTW :P (I get the confusion though, Paul is usually not a
last name lol)

Anyway - unfortunately it doesn't seem like this patch helps :s, I'm
still not seeing any difference and the backtrace I'm seeing at early
boot looks the same. Any more information I can provide?

>
> On Fri, Apr 19 2024 at 13:38, Lyude Paul wrote:
> > Awesome - can confirm the patch does indeed make the machine boot.
> > Full
> > dmesg from boot attached.
>
> Thanks for providing the data.
>
> [    0.089286] CPU topo: APIC ID 0 present 1
> [    0.089294] CPU topo: APIC ID 0 present 0
> [    0.089296] CPU topo: Hot-pluggable APIC ID 0 in present package.
>
> ACPI is really a wonderland.
>
> Can you please test the patch below?
>
> Thanks,
>
>         tglx
> ---
> Subject: x86/topology: Deal with more broken ACPI tables
> From: Thomas Gleixner <[email protected]>
> Date: Thu, 18 Apr 2024 21:02:39 +0200
>
> Paul reported a regression which was caused by the handling of
> non-present
> CPUs in a present package. It's caused by the ACPI table on the
> system
> which advertises APICs twice, present and non-present:
>
>   CPU topo: APIC ID 0 present 1
>   CPU topo: APIC ID 0 present 0
>   CPU topo: Hot-pluggable APIC ID 0 in present package.
>
> Which causes the topology to get confused to the point that it fails
> to
> bring the system up because the target APIC for the IOAPIC is not
> available.
>
> Prevent this by checking whether a non-present CPU has been already
> registered as present before. If so emit a firmware warning and
> ignore the
> registration request.
>
> Fixes: f0551af0213 ("x86/topology: Ignore non-present APIC IDs in a
> present package")
> Reported-by: Lyude Paul <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
>  arch/x86/kernel/cpu/topology.c |    6 ++++++
>  1 file changed, 6 insertions(+)
>
> --- a/arch/x86/kernel/cpu/topology.c
> +++ b/arch/x86/kernel/cpu/topology.c
> @@ -195,6 +195,12 @@ static __init void topo_register_apic(u3
>   } else {
>   u32 pkgid = topo_apicid(apic_id, TOPO_PKG_DOMAIN);
>  
> + if (test_bit(apic_id, phys_cpu_present_map)) {
> + pr_warn_once(FW_BUG "Already present APIC ID
> %x registered again as non-present\n",
> +      apic_id);
> + return;
> + }
> +
>   /*
>   * Check for present APICs in the same package when
> running
>   * on bare metal. Allow the bogosity in a guest.
>

--
Cheers,
Lyude Paul (she/her)
Software Engineer at Red Hat


2024-04-25 02:11:55

by Thomas Gleixner

Subject: Re: Early boot regression from f0551af0213 ("x86/topology: Ignore non-present APIC IDs in a present package")

Lyude!

On Wed, Apr 24 2024 at 16:56, Lyude Paul wrote:
> On Sat, 2024-04-20 at 00:15 +0200, Thomas Gleixner wrote:
> Lyude is fine BTW :P (I get the confusion though, Paul is usually not a
> last name lol)

:)

> Anyway - unfortunately it doesn't seem like this patch helps :s, I'm
> still not seeing any difference and the backtrace I'm seeing at early
> boot looks the same. Any more information I can provide?

Can you please boot a kernel with the commit in question reverted and
add 'possible_cpus=8' to the kernel command line?

In theory this should fail too.

Thanks,

tglx

2024-04-25 16:21:52

by Lyude Paul

Subject: Re: Early boot regression from f0551af0213 ("x86/topology: Ignore non-present APIC IDs in a present package")

Yep - tried booting a kernel with f0551af0213 reverted and
possible_cpus=8, it definitely looks like that crashes things as well
in the same way. Also - it scrolled off the screen before I had a
chance to write it down, but I'm -fairly- sure I saw some sort of
complaint about "16 [or some double digit number] processors exceeds
max number of 8". Which is quite interesting, as this is definitely
just a quad core ryzen processor with hyperthreading - so there should
only be 8 threads.

On Thu, 2024-04-25 at 04:11 +0200, Thomas Gleixner wrote:
> Lyude!
>
> On Wed, Apr 24 2024 at 16:56, Lyude Paul wrote:
> > On Sat, 2024-04-20 at 00:15 +0200, Thomas Gleixner wrote:
> > Lyude is fine BTW :P (I get the confusion though, Paul is usually
> > not a
> > last name lol)
>
> :)
>
> > Anyway - unfortunately it doesn't seem like this patch helps :s,
> > I'm
> > still not seeing any difference and the backtrace I'm seeing at
> > early
> > boot looks the same. Any more information I can provide?
>
> Can you please boot a kernel with the commit in question reverted and
> add 'possible_cpus=8' to the kernel command line?
>
> In theory this should fail too.
>
> Thanks,
>
>         tglx
>

--
Cheers,
Lyude Paul (she/her)
Software Engineer at Red Hat


2024-04-25 21:46:21

by Thomas Gleixner

Subject: Re: Early boot regression from f0551af0213 ("x86/topology: Ignore non-present APIC IDs in a present package")

Lyude!

On Thu, Apr 25 2024 at 11:56, Lyude Paul wrote:
> On Thu, 2024-04-25 at 04:11 +0200, Thomas Gleixner wrote:
>>
>> Can you please boot a kernel with the commit in question reverted and
>> add 'possible_cpus=8' to the kernel command line?
>>
>> In theory this should fail too.
>
> Yep - tried booting a kernel with f0551af0213 reverted and
> possible_cpus=8, it definitely looks like that crashes things as well
> in the same way.

Good. That means it's a problem which existed before but went unnoticed.

> Also - it scrolled off the screen before I had a chance to write it
> down, but I'm -fairly- sure I saw some sort of complaint about "16 [or
> some double digit number] processors exceeds max number of 8". Which
> is quite interesting, as this is definitely just a quad core ryzen
> processor with hyperthreading - so there should only be 8 threads.

Right, that's what we saw with the debug patch. The ACPI/MADT table
is clearly bonkers. The effect of it is that it pretends that the system
has 16 possible CPUs:

[ 0.089381] CPU topo: Allowing 8 present CPUs plus 8 hotplug CPUs

Which in turn changes the sizing of the per CPU data and affects some
other details which depend on the number of possible CPUs.

But that should not matter at all because the system scaling should be
sufficient with 8 CPUs, but it does not for some completely non-obvious
reasons.
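
To make the accounting concrete, here is a simplified userspace model in C. It assumes the number of possible CPUs is the present count plus the disabled (hotplug) count, optionally capped by a possible_cpus= style limit; the struct and function names are invented for illustration and this is not the kernel's implementation.

```c
#include <assert.h>

struct topo_counts {
	unsigned present;	/* APIC IDs marked present in the MADT */
	unsigned disabled;	/* non-present IDs kept as hotplug candidates */
	unsigned rejected;	/* non-present IDs dropped from the accounting */
};

/* Account 'n' non-present APIC IDs either as rejected or as disabled. */
static void account_non_present(struct topo_counts *t, unsigned n, int reject)
{
	assert(t);
	if (reject)
		t->rejected += n;
	else
		t->disabled += n;
}

/* Possible CPUs: present plus hotplug, optionally capped (0 means no cap). */
static unsigned possible_cpus(const struct topo_counts *t, unsigned limit)
{
	unsigned possible;

	assert(t);
	possible = t->present + t->disabled;
	return (limit && possible > limit) ? limit : possible;
}
```

Under this model, 8 present plus 8 disabled IDs give 16 possible CPUs, while rejecting the 8 non-present IDs, or capping at 8 on the command line, gives 8. That would be consistent with the revert plus possible_cpus=8 failing the same way as the bisected commit.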

Can you please try to increase possible_cpus=N on the command line one
by one and check when it actually starts to "work" again.

One other thing to try is to boot with 'possible_cpus=8' and
'intremap=off' and see whether that makes a difference.

I really have no idea where to look and not having the early boot
messages in case of the fail is not helpful as I can't add meaningful
debug to it.

I just checked: the motherboard has a serial port, so it would be
extremely helpful to hook up a serial cable to this thing and enable
serial console on the kernel command line. That way we might eventually
see information which is emitted before it fails to validate the timer
interrupt.

Thanks,

tglx

2024-05-02 10:33:42

by Limonciello, Mario

Subject: Re: Early boot regression from f0551af0213 ("x86/topology: Ignore non-present APIC IDs in a present package")

On 4/25/2024 16:42, Thomas Gleixner wrote:
> Lyude!
>
> On Thu, Apr 25 2024 at 11:56, Lyude Paul wrote:
>> On Thu, 2024-04-25 at 04:11 +0200, Thomas Gleixner wrote:
>>>
>>> Can you please boot a kernel with the commit in question reverted and
>>> add 'possible_cpus=8' to the kernel command line?
>>>
>>> In theory this should fail too.
>>
>> Yep - tried booting a kernel with f0551af0213 reverted and
>> possible_cpus=8, it definitely looks like that crashes things as well
>> in the same way.
>
> Good. That means it's a problem which existed before but went unnoticed.
>
>> Also - it scrolled off the screen before I had a chance to write it
>> down, but I'm -fairly- sure I saw some sort of complaint about "16 [or
>> some double digit number] processors exceeds max number of 8". Which
>> is quite interesting, as this is definitely just a quad core ryzen
>> processor with hyperthreading - so there should only be 8 threads.
>
> Right, that's what we saw with the debug patch. The ACPI/MADT table
> is clearly bonkers. The effect of it is that it pretends that the system
> has 16 possible CPUs:
>
> [ 0.089381] CPU topo: Allowing 8 present CPUs plus 8 hotplug CPUs
>
> Which in turn changes the sizing of the per CPU data and affects some
> other details which depend on the number of possible CPUs.

I suspect that at least this aspect is caused by commit
fed8d8773b8ea68ad99d9eee8c8343bef9da2c2c.

If you try reverting that, I expect the "hotplug CPUs" to disappear.

>
> But that should not matter at all because the system scaling should be
> sufficient with 8 CPUs, but it does not for some completely non-obvious
> reasons.
>
> Can you please try to increase possible_cpus=N on the command line one
> by one and check when it actually starts to "work" again.
>
> One other thing to try is to boot with 'possible_cpus=8' and
> 'intremap=off' and see whether that makes a difference.
>
> I really have no idea where to look and not having the early boot
> messages in case of the fail is not helpful as I can't add meaningful
> debug to it.
>
> I just checked: the motherboard has a serial port, so it would be
> extremely helpful to hook up a serial cable to this thing and enable
> serial console on the kernel command line. That way we might eventually
> see information which is emitted before it fails to validate the timer
> interrupt.
>
> Thanks,
>
> tglx
>