2021-05-17 08:34:42

by Borislav Petkov

[permalink] [raw]
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single

+ lkml.

On Mon, May 17, 2021 at 02:13:45AM -0600, James Feeney wrote:
> I re-ran my git bisect, this time with a full power-down and cold boot, and more thorough testing, running a web browser. My second bisect went from good to bad.
>
> So now, instead, git bisect ended here:
>
> 4f432e8bb15b352da72525144da025a46695968f is the first bad commit
> commit 4f432e8bb15b352da72525144da025a46695968f
> Author: Borislav Petkov <[email protected]>
> Date: Thu Jan 7 13:23:34 2021 +0100
>
> x86/mce: Get rid of mcheck_intel_therm_init()
>
> Move the APIC_LVTTHMR read which needs to happen on the BSP, to
> intel_init_thermal(). One less boot dependency.
>
> No functional changes.
>
> Signed-off-by: Borislav Petkov <[email protected]>
> Tested-by: Srinivas Pandruvada <[email protected]>
> Link: https://lkml.kernel.org/r/[email protected]
>
> arch/x86/include/asm/mce.h | 6 ------
> arch/x86/kernel/cpu/mce/core.c | 1 -
> arch/x86/kernel/cpu/mce/therm_throt.c | 15 ++++-----------
> 3 files changed, 4 insertions(+), 18 deletions(-)
>
>
> Please let me know if that makes more sense.

Not really - this is the first time I'm seeing this and I highly doubt
your bisection is correct. But we'll see.

>
> Again:
>
> Arch Linux
> linux 5.12.arch1-1

Can you reproduce with the upstream 5.12 kernel to rule out influence by
any distro-specific patches?

> Intel Core2 T7200
> Mobile Intel 945PM Express Chipset
> ICH7-M
> Mobility Radeon X1600

Can you send full dmesg from a working kernel and the .config you're
using with 5.12?

> Generally, on failure, the system will not boot past "Loading initial ramdisk...", or, when it does, the boot process will hang, and the console will eventually show:
>
> watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [systemd-udevd: 241]
> ...
> RIP: 0010:smp_call_function_single+0xf7/0x140
>
> The top of the call trace variously shows either "__flush_tlb_all" or "tlbflush_read_file", with the "soft lockup" repeating indefinitely.
>

I'm presuming there's no way to connect your box over serial cable to
another one so that you can catch the full bad dmesg when it hangs? It
would be good if you could...

Thx.

--
Regards/Gruss,
Boris.

SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG Nürnberg


2021-05-19 19:10:43

by James Feeney

[permalink] [raw]
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single

On 5/17/21 2:32 AM, Borislav Petkov wrote:
> + lkml.
>
> On Mon, May 17, 2021 at 02:13:45AM -0600, James Feeney wrote:
>> I re-ran my git bisect, this time with a full power-down and cold boot, and more thorough testing, running a web browser. My second bisect went from good to bad.
>>
>> So now, instead, git bisect ended here:
>>
>> 4f432e8bb15b352da72525144da025a46695968f is the first bad commit
>> commit 4f432e8bb15b352da72525144da025a46695968f
>> Author: Borislav Petkov <[email protected]>
>> Date: Thu Jan 7 13:23:34 2021 +0100
>>
>> x86/mce: Get rid of mcheck_intel_therm_init()
>>
>> Move the APIC_LVTTHMR read which needs to happen on the BSP, to
>> intel_init_thermal(). One less boot dependency.
>>
>> No functional changes.
>>
>> Signed-off-by: Borislav Petkov <[email protected]>
>> Tested-by: Srinivas Pandruvada <[email protected]>
>> Link: https://lkml.kernel.org/r/[email protected]
>>
>> arch/x86/include/asm/mce.h | 6 ------
>> arch/x86/kernel/cpu/mce/core.c | 1 -
>> arch/x86/kernel/cpu/mce/therm_throt.c | 15 ++++-----------
>> 3 files changed, 4 insertions(+), 18 deletions(-)
>>
>>
>> Please let me know if that makes more sense.
>
> Not really - this is the first time I'm seeing this and I highly doubt
> your bisection is correct. But we'll see.>

I did go back and repeat the git bisect for a third time. This time, I re-booted all of the "good" kernels 10 times, in case there was some random probability that a "good" kernel "just got lucky", and failed to produce an error on that boot. There were *no* boot failures on the "good" kernels, and there was *no change* in the resulting final "bad" commit.

>>
>> Again:
>>
>> Arch Linux
>> linux 5.12.arch1-1
>
> Can you reproduce with the upstream 5.12 kernel to rule out influence by
> any distro-specific patches?
>

Hmm - I am naively supposing that "the bisect is the bisect". No matter what commit initiates a problem, it's still a problem. It would be useful to investigate, and introspect the calling functions in the Call Trace. No?

>> Intel Core2 T7200
>> Mobile Intel 945PM Express Chipset
>> ICH7-M
>> Mobility Radeon X1600
>
> Can you send full dmesg from a working kernel and the .config you're
> using with 5.12?
>

Attached:
dmesglog.7bb39313cd62
bisectconfig

7bb39313cd62 x86/mce: Make mce_timed_out() identify holdout CPUs
4f432e8bb15b x86/mce: Get rid of mcheck_intel_therm_init()

7bb39313cd62 is the immediately previous "good" bisect kernel. The config files for the two kernels is exactly the same.

>> Generally, on failure, the system will not boot past "Loading initial ramdisk...", or, when it does, the boot process will hang, and the console will eventually show:
>>
>> watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [systemd-udevd: 241]
>> ...
>> RIP: 0010:smp_call_function_single+0xf7/0x140
>>
>> The top of the call trace variously shows either "__flush_tlb_all" or "tlbflush_read_file", with the "soft lockup" repeating indefinitely.
>>
>
> I'm presuming there's no way to connect your box over serial cable to
> another one so that you can catch the full bad dmesg when it hangs? It
> would be good if you could...
>

Attached:
bootlog.7bb39313cd62
bootlog.4f432e8bb15b

The later with the "soft lockup" repeating four times. The kernel command line has loglevel=5 and console=ttyS0,115200.

> Thx.
>

Thanks for looking into this. Would some additional printk's be useful?

James


Attachments:
dmesglog.7bb39313cd62 (70.68 kB)
bisectconfig (231.81 kB)
bootlog.7bb39313cd62 (12.48 kB)
bootlog.4f432e8bb15b (19.82 kB)
Download all attachments

2021-05-19 19:28:37

by Borislav Petkov

[permalink] [raw]
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single

On Tue, May 18, 2021 at 09:58:46PM -0600, James Feeney wrote:
> Hmm - I am naively supposing that "the bisect is the bisect". No
> matter what commit initiates a problem, it's still a problem. It would
> be useful to investigate, and introspect the calling functions in the
> Call Trace. No?

I'd like to know that the source you're looking at is the same source
I'm looking at.

And yes, AFAIK, Arch kernels are simply the upstream kernels but
still...

> Attached:
> bootlog.7bb39313cd62
> bootlog.4f432e8bb15b
>
> The later with the "soft lockup" repeating four times. The kernel
> command line has loglevel=5 and console=ttyS0,115200.

Those are not the full boot messages - they should look like
dmesglog.7bb39313cd62 but probably you cannot log into the box after the
softlockup happens to dump them. That's why I meant to try the serial
connection...

Anyway, let's start somewhere.

1. Take a pristine 5.12 upstream kernel from git, build it using your
bisectconfig and try booting it with

debug ignore_loglevel log_buf_len=16M no_console_suspend systemd.log_target=null console=ttyS0,115200 console=tty0

on the kernel command line. Then save a full dmesg, if you can. If you
ocan catch ot ver serial, then that would be awesomer.

2. Use the exact same kernel but this time disable

CONFIG_X86_THERMAL_VECTOR

in its .config and do the same thing.

Send me both dmesg files then.

Thx.

--
Regards/Gruss,
Boris.

SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG Nürnberg

2021-05-19 20:24:19

by James Feeney

[permalink] [raw]
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single

On 5/19/21 5:12 AM, Borislav Petkov wrote:
> On Tue, May 18, 2021 at 09:58:46PM -0600, James Feeney wrote:
>> Hmm - I am naively supposing that "the bisect is the bisect". No
>> matter what commit initiates a problem, it's still a problem. It would
>> be useful to investigate, and introspect the calling functions in the
>> Call Trace. No?
>
> I'd like to know that the source you're looking at is the same source
> I'm looking at.
>
> And yes, AFAIK, Arch kernels are simply the upstream kernels but
> still...
>

I had to ask, and got this answer:

====
The sources contain commits on top of upstream releases. This is why the tags contain -arch1 etc. For example, see https://git.archlinux.org/linux.git/log/?h=v5.11.16-arch1 , which adds 6 commits on top of the upstream "Linux 5.11.16" release, while https://git.archlinux.org/linux.git/log/?h=v5.12-arch1 only contains the long-standing "unprivileged_userns_clone" patch and the version number change, making it essentially vanilla.
====

There are no additional kernel patches in the build.

>> Attached:
>> bootlog.7bb39313cd62
>> bootlog.4f432e8bb15b
>>
>> The later with the "soft lockup" repeating four times. The kernel
>> command line has loglevel=5 and console=ttyS0,115200.
>
> Those are not the full boot messages - they should look like
> dmesglog.7bb39313cd62 but probably you cannot log into the box after the
> softlockup happens to dump them. That's why I meant to try the serial
> connection...
>
> Anyway, let's start somewhere.
>
> 1. Take a pristine 5.12 upstream kernel from git, build it using your
> bisectconfig and try booting it with
>
> debug ignore_loglevel log_buf_len=16M no_console_suspend systemd.log_target=null console=ttyS0,115200 console=tty0
>
> on the kernel command line. Then save a full dmesg, if you can. If you
> ocan catch ot ver serial, then that would be awesomer.
>
> 2. Use the exact same kernel but this time disable
>
> CONFIG_X86_THERMAL_VECTOR
>
> in its .config and do the same thing.
>
> Send me both dmesg files then.
>
> Thx.
>

$ git bisect reset v5.12-arch1
Updating files: 100% (12812/12812), done.
Previous HEAD position was 7bb39313cd62 x86/mce: Make mce_timed_out() identify holdout CPUs
HEAD is now at bee4e691ceea Arch Linux kernel v5.12-arch1

$ grep CONFIG_X86_THERMAL_VECTOR .config
CONFIG_X86_THERMAL_VECTOR=y

Attached:
dmesglog.5.12.therm.1.nostart
hangs after unpack rootfs

dmesglog.5.12.therm.2.softlockup
soft lockup, but stops and does not repeat

dmesglog.5.12.therm.3.fullboot
boots all the way to Xorg and does run a browser and play video

The fourth boot attempt hung again at unpack rootfs. If the machine is let sit in this state, the fan will begin to run full, off and on, suggesting that maybe the processor is still running and running full power.

These boots are consecutive and are all from the same stock 5.12.0 kernel.


> Use the exact same kernel but this time disable CONFIG_X86_THERMAL_VECTOR

$ make menuconfig
...

This config option is not listed and is not changeable:

====
drivers/thermal/intel/Kconfig

config X86_THERMAL_VECTOR
def_bool y
depends on X86 && CPU_SUP_INTEL && X86_LOCAL_APIC
====

The Makefile there has:

obj-$(CONFIG_X86_THERMAL_VECTOR) += therm_throt.o

The files, thermal_interrupt.h and therm_throt.c, by Dmitriy Zavin, are new since 5.11. But, it seems that this therm_throt.c file is one of yours, anyway:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/thermal/intel/therm_throt.c?h=linux-5.12.y&id=9223d0dccb8f8523754122f68316dd1a4f39f7f8

I'm not sure that I can just delete these files, being thermal management and all.

I see some talk in the associated thread about IRQ handler registration. Could there be some connection between this and the soft lockup?
https://lore.kernel.org/linux-pm/[email protected]/

What should we do next?

James


Attachments:
dmesglog.5.12.therm.1.nostart (37.36 kB)
dmesglog.5.12.therm.2.softlockup (69.46 kB)
dmesglog.5.12.therm.3.fullboot (67.73 kB)
Download all attachments

2021-05-19 21:20:51

by Borislav Petkov

[permalink] [raw]
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single

On Wed, May 19, 2021 at 02:03:05PM -0600, James Feeney wrote:
> I had to ask, and got this answer:
>
> ====
> The sources contain commits on top of upstream releases. This is why the tags contain -arch1 etc. For example, see https://git.archlinux.org/linux.git/log/?h=v5.11.16-arch1 , which adds 6 commits on top of the upstream "Linux 5.11.16" release, while https://git.archlinux.org/linux.git/log/?h=v5.12-arch1 only contains the long-standing "unprivileged_userns_clone" patch and the version number change, making it essentially vanilla.
> ====
>
> There are no additional kernel patches in the build.

Yeah, ok, let's say you're running pretty much an upstream kernel.

> These boots are consecutive and are all from the same stock 5.12.0 kernel.

Yeah, that's weird. Box seems to boot fine in some cases. There's

[ 26.864040] ACPI: OSL: Resource conflict; ACPI support missing from driver?
[ 26.874541] ACPI: OSL: Resource conflict: System may be unstable or behave erratically

Dunno how relevant this is - it probably has been the case since forever.

> $ make menuconfig
> ...
>
> This config option is not listed and is not changeable:

Bah, sorry about that. Use this small hunk ontop so that you can disable
it in menuconfig:

---
diff --git a/drivers/thermal/intel/Kconfig b/drivers/thermal/intel/Kconfig
index ce4f59213c7a..8c1376a20b54 100644
--- a/drivers/thermal/intel/Kconfig
+++ b/drivers/thermal/intel/Kconfig
@@ -9,7 +9,8 @@ config INTEL_POWERCLAMP
user interface is exposed via generic thermal framework.

config X86_THERMAL_VECTOR
- def_bool y
+ tristate "X86 thermal vector"
+ default y
depends on X86 && CPU_SUP_INTEL && X86_LOCAL_APIC

config X86_PKG_TEMP_THERMAL
---

and with it ontop, boot it the aforementioned way.

And then pls build 5.11 which doesn't have the therm_throt patch - I'd
like to see if that therm_throt thing is even loading. Because I don't
see something like:

[ 0.302411] mce: CPU0: Thermal monitoring enabled (TM2)

in your dmesg. Could be some detection failure due to the change when
those happen. Althouth they shouldn't but who knows.

I have a similar box as yours:

[ 0.305858] smpboot: CPU0: Intel(R) Pentium(R) Dual CPU E2180 @ 2.00GHz (family: 0x6, model: 0xf, stepping: 0xd)

a bit newer stepping and AFAIR, it boots fine but I'll try your .config
there to check just in case.

Thx.


--
Regards/Gruss,
Boris.

SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG Nürnberg

2021-05-20 03:14:14

by James Feeney

[permalink] [raw]
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single

On 5/19/21 3:18 PM, Borislav Petkov wrote:
> On Wed, May 19, 2021 at 02:03:05PM -0600, James Feeney wrote:
>> I had to ask, and got this answer:
>>
>> ====
>> The sources contain commits on top of upstream releases. This is why the tags contain -arch1 etc. For example, see https://git.archlinux.org/linux.git/log/?h=v5.11.16-arch1 , which adds 6 commits on top of the upstream "Linux 5.11.16" release, while https://git.archlinux.org/linux.git/log/?h=v5.12-arch1 only contains the long-standing "unprivileged_userns_clone" patch and the version number change, making it essentially vanilla.
>> ====
>>
>> There are no additional kernel patches in the build.
>
> Yeah, ok, let's say you're running pretty much an upstream kernel.
>
>> These boots are consecutive and are all from the same stock 5.12.0 kernel.
>
> Yeah, that's weird. Box seems to boot fine in some cases. There's
>
> [ 26.864040] ACPI: OSL: Resource conflict; ACPI support missing from driver?
> [ 26.874541] ACPI: OSL: Resource conflict: System may be unstable or behave erratically
>
> Dunno how relevant this is - it probably has been the case since forever.
>

That's, in part, from the kernel command line "acpi_enforce_resources=lax". Without that, the RAM SPD EEPOMs cannot be accessed. And, the eeprom module is deprecated, being replaced by the at24 module.

>> $ make menuconfig
>> ...
>>
>> This config option is not listed and is not changeable:
>
> Bah, sorry about that. Use this small hunk ontop so that you can disable
> it in menuconfig:
>
> ---
> diff --git a/drivers/thermal/intel/Kconfig b/drivers/thermal/intel/Kconfig
> index ce4f59213c7a..8c1376a20b54 100644
> --- a/drivers/thermal/intel/Kconfig
> +++ b/drivers/thermal/intel/Kconfig
> @@ -9,7 +9,8 @@ config INTEL_POWERCLAMP
> user interface is exposed via generic thermal framework.
>
> config X86_THERMAL_VECTOR
> - def_bool y
> + tristate "X86 thermal vector"
> + default y
> depends on X86 && CPU_SUP_INTEL && X86_LOCAL_APIC
>
> config X86_PKG_TEMP_THERMAL
> ---
>
> and with it ontop, boot it the aforementioned way.
>

Ok - I didn't want to do that without some idea that it would not disable the whole thermal management process.

$ make menuconfig
...

$ diff .config .config.old
4983c4983,4984
< # CONFIG_X86_THERMAL_VECTOR is not set
---
> CONFIG_X86_THERMAL_VECTOR=y
> CONFIG_X86_PKG_TEMP_THERMAL=m

No joy. Still have the same soft lockups and full boots - the full boots interrupted by some mystery delay.

Attached:
dmesglog.5.12.NoTherm.1.softlockup

Out of curiosity, I also tried removing the "acpi_enforce_resources=lax" configuration, but that makes no difference, other than removing the "System may be unstable or behave erratically" warnings.

Attached:
dmesglog.5.12.NoTherm.2.notlax.fullboot
dmesglog.5.12.NoTherm.3.notlax.softlockup


I don't know about these patches, modifying and moving the location of therm_throt.c, so I'm not in a position to draw any conclusion from these results.


> And then pls build 5.11 which doesn't have the therm_throt patch - I'd
> like to see if that therm_throt thing is even loading. Because I don't
> see something like:
>
> [ 0.302411] mce: CPU0: Thermal monitoring enabled (TM2)
>
> in your dmesg. Could be some detection failure due to the change when
> those happen. Althouth they shouldn't but who knows.
>

build 5.11? There are lots of 5.11 kernels from the Arch distribution that I have run. Are you looking for a dmesg log from 5.11?

I'm confused. There is the similar arch/x86/kernel/cpu/mce/therm_throt.c in the 5.11 kernel. What is going to be different? Even back to the 5.10 kernels, in the dmesg log, all the references to the thermal governors and the thermal zones are the same.


> I have a similar box as yours:
>
> [ 0.305858] smpboot: CPU0: Intel(R) Pentium(R) Dual CPU E2180 @ 2.00GHz (family: 0x6, model: 0xf, stepping: 0xd)
>
> a bit newer stepping and AFAIR, it boots fine but I'll try your .config
> there to check just in case.
>
> Thx.
>
>

So far, something looks quirky - somewhere. Timing related failures can be a pain. Is there no useful information being provided by the Call Trace in the dmesg log?


James


Attachments:
dmesglog.5.12.NoTherm.1.softlockup (79.31 kB)
dmesglog.5.12.NoTherm.2.notlax.fullboot (67.60 kB)
dmesglog.5.12.NoTherm.3.notlax.softlockup (93.96 kB)
Download all attachments

2021-05-20 14:54:33

by Borislav Petkov

[permalink] [raw]
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single

On Wed, May 19, 2021 at 09:12:04PM -0600, James Feeney wrote:

> $ diff .config .config.old
> 4983c4983,4984
> < # CONFIG_X86_THERMAL_VECTOR is not set
> ---
> > CONFIG_X86_THERMAL_VECTOR=y
> > CONFIG_X86_PKG_TEMP_THERMAL=m
>
> No joy. Still have the same soft lockups and full boots - the full
> boots interrupted by some mystery delay.

Which means, even with therm_throt disabled, it still locks up. Which
can't be caused by my patch.

> I don't know about these patches, modifying and moving the location of
> therm_throt.c, so I'm not in a position to draw any conclusion from
> these results.

They're just moving the thermal interrupt functionality from the
MCE code where they don't belong to the thermal code where they do.
Otherwise there should be no change.

> build 5.11? There are lots of 5.11 kernels from the Arch distribution
> that I have run. Are you looking for a dmesg log from 5.11?

Take the .config you're normally using, make sure it has

CONFIG_X86_THERMAL_VECTOR=y

and build with it plain 5.11 kernel. No patches ontop, no nothing.

Then add

debug ignore_loglevel log_buf_len=16M no_console_suspend systemd.log_target=null console=ttyS0,115200 console=tty0

to its kernel command line and send me full dmesg again pls.

Looking how it sometimes boots and sometimes it locks up, try that a
couple of times.

> So far, something looks quirky - somewhere. Timing related failures
> can be a pain. Is there no useful information being provided by the
> Call Trace in the dmesg log?

What I'm seeing is that *sometimes* - not always - your CPU0 is not
responding to the TLB flush IPI. Which is really weird. Have you had
those always or did they start appearing with 5.12?

That's why I'm still scratching my head over how my patch would cause
CPU0 not responding to IPIs.

Well, *maybe* there's a little difference which my patch did: it does
that APIC_LVTTHMR only on the BSP. And *maybe* there's a problem there,
who knows with those old CPUs.

So here's two more things to try:

1. On plain 5.12, with the same kernel cmdline params add also

"idle=nomwait"

to the kernel command line and boot with it a couple of times to see
whether it still locks up.

2. On plain 5.12, with the same kernel cmdline params apply this hunk
ontop:

---
diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
index f8e882592ba5..42db48cd4666 100644
--- a/drivers/thermal/intel/therm_throt.c
+++ b/drivers/thermal/intel/therm_throt.c
@@ -630,9 +630,8 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
if (!intel_thermal_supported(c))
return;

- /* On the BSP? */
- if (c == &boot_cpu_data)
- lvtthmr_init = apic_read(APIC_LVTTHMR);
+ lvtthmr_init = apic_read(APIC_LVTTHMR);
+ pr_info("%s: CPU%d, lvtthmr_init: 0x%x\n", __func__, cpu, lvtthmr_init);

/*
* First check if its enabled already, in which case there might
---

That'll tell us the thermal sensor LVT on both CPUs.

Also do that a couple of times - it'll be interesting to see what those
values are *when* the box locks up.

As always, catch full dmesg each time pls.

Thx.

--
Regards/Gruss,
Boris.

SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG Nürnberg

2021-05-21 22:13:00

by James Feeney

[permalink] [raw]
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single

On 5/20/21 3:21 AM, Borislav Petkov wrote:
> On Wed, May 19, 2021 at 09:12:04PM -0600, James Feeney wrote:
>
>> $ diff .config .config.old
>> 4983c4983,4984
>> < # CONFIG_X86_THERMAL_VECTOR is not set
>> ---
>>> CONFIG_X86_THERMAL_VECTOR=y
>>> CONFIG_X86_PKG_TEMP_THERMAL=m
>>
>> No joy. Still have the same soft lockups and full boots - the full
>> boots interrupted by some mystery delay.
>
> Which means, even with therm_throt disabled, it still locks up. Which
> can't be caused by my patch.
>
>> I don't know about these patches, modifying and moving the location of
>> therm_throt.c, so I'm not in a position to draw any conclusion from
>> these results.
>
> They're just moving the thermal interrupt functionality from the
> MCE code where they don't belong to the thermal code where they do.
> Otherwise there should be no change.
>
>> build 5.11? There are lots of 5.11 kernels from the Arch distribution
>> that I have run. Are you looking for a dmesg log from 5.11?
>
> Take the .config you're normally using, make sure it has
>
> CONFIG_X86_THERMAL_VECTOR=y
>
> and build with it plain 5.11 kernel. No patches ontop, no nothing.
>
> Then add
>
> debug ignore_loglevel log_buf_len=16M no_console_suspend systemd.log_target=null console=ttyS0,115200 console=tty0
>
> to its kernel command line and send me full dmesg again pls.
>

Hmm - for comparison, I'm attaching the full dmesg log for the 7bb39313cd62 build, the one in the git bisect immediately prior to your patch.

dmesglog.7bb39313cd62.full


> Looking how it sometimes boots and sometimes it locks up, try that a
> couple of times.
>

There are *no* 5.11 builds that lock-up.

>> So far, something looks quirky - somewhere. Timing related failures
>> can be a pain. Is there no useful information being provided by the
>> Call Trace in the dmesg log?
>
> What I'm seeing is that *sometimes* - not always - your CPU0 is not
> responding to the TLB flush IPI. Which is really weird. Have you had
> those always or did they start appearing with 5.12?
>

This all started with the 5.12 kernels, on this machine - never before.

> That's why I'm still scratching my head over how my patch would cause
> CPU0 not responding to IPIs.
>
> Well, *maybe* there's a little difference which my patch did: it does
> that APIC_LVTTHMR only on the BSP. And *maybe* there's a problem there,
> who knows with those old CPUs.
>
> So here's two more things to try:
>
> 1. On plain 5.12, with the same kernel cmdline params add also
>
> "idle=nomwait"
>
> to the kernel command line and boot with it a couple of times to see
> whether it still locks up.
>

On the Arch 5.12.0 kernel with "CONFIG_X86_THERMAL_VECTOR is not set", on the first boot, as before, it just hangs at "Trying to unpack rootfs image as initramfs...".

On the Arch 5.12.0 kernel with "CONFIG_X86_THERMAL_VECTOR=y", I got a couple of full boots, and then repeating "soft lockup", same as before. All in all, "idle=nomwait" makes no difference.

It seems significant that, *if* the system boots normally, then, from then on, it seems to run normally. Maybe the issue is something at-play only during the early boot process. Certainly, the issue seems to be at-play even before unpacking the rootfs, or maybe just after unpacking the rootfs. This happens before whatever is able to generate the "watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [systemd-udevd:242]".


> 2. On plain 5.12, with the same kernel cmdline params apply this hunk
> ontop:
>
> ---
> diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
> index f8e882592ba5..42db48cd4666 100644
> --- a/drivers/thermal/intel/therm_throt.c
> +++ b/drivers/thermal/intel/therm_throt.c
> @@ -630,9 +630,8 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
> if (!intel_thermal_supported(c))
> return;
>
> - /* On the BSP? */
> - if (c == &boot_cpu_data)
> - lvtthmr_init = apic_read(APIC_LVTTHMR);
> + lvtthmr_init = apic_read(APIC_LVTTHMR);
> + pr_info("%s: CPU%d, lvtthmr_init: 0x%x\n", __func__, cpu, lvtthmr_init);
>
> /*
> * First check if its enabled already, in which case there might
> ---
>
> That'll tell us the thermal sensor LVT on both CPUs.
>
> Also do that a couple of times - it'll be interesting to see what those
> values are *when* the box locks up.
>
> As always, catch full dmesg each time pls.
>
> Thx.
>

$ diff .config .config.old
4983,4984c4983
< CONFIG_X86_THERMAL_VECTOR=y
< CONFIG_X86_PKG_TEMP_THERMAL=m
---
> # CONFIG_X86_THERMAL_VECTOR is not set

It seems odd, that making one small change to the .config, results in the entire kernel being recompiled. Is there a way to avoid that?

Hmm - this version may be interesting. There are inconsistencies, in the sense of various different boot results, in 10 sequential boot events. The dmesg logs are attached.


$ tail -n3 dmesglog.5.12.lvtthmr.*
==> dmesglog.5.12.lvtthmr.1 <==
[ 3.755697] pci 0000:01:00.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[ 3.764074] PCI: CLS 64 bytes, default 64
[ 3.768186] Trying to unpack rootfs image as initramfs...

==> dmesglog.5.12.lvtthmr.10 <==
[ 147.270555] ACPI: Preparing to enter system sleep state S5
[ 147.278663] reboot: Power down
[ 147.284389] acpi_power_off called

==> dmesglog.5.12.lvtthmr.2 <==
[ 3.758020] pci 0000:01:00.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[ 3.766394] PCI: CLS 64 bytes, default 64
[ 3.770507] Trying to unpack rootfs image as initramfs...

==> dmesglog.5.12.lvtthmr.3 <==
[ 1.622582] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
[ 1.629249] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
[ 1.645854] ACPI: 6 ACPI AML tables successfully acquired and loaded

==> dmesglog.5.12.lvtthmr.4 <==
[ 501.491752] ACPI: Preparing to enter system sleep state S5
[ 501.499967] reboot: Power down
[ 501.505786] acpi_power_off called

==> dmesglog.5.12.lvtthmr.5 <==
[ 225.055556] ACPI: Preparing to enter system sleep state S5
[ 225.063645] reboot: Power down
[ 225.069318] acpi_power_off called

==> dmesglog.5.12.lvtthmr.6 <==
[ 150.364732] ACPI: Preparing to enter system sleep state S5
[ 150.373028] reboot: Power down
[ 150.378965] acpi_power_off called

==> dmesglog.5.12.lvtthmr.7 <==
[ 140.049778] ACPI: Preparing to enter system sleep state S5
[ 140.057886] reboot: Power down
[ 140.063536] acpi_power_off called

==> dmesglog.5.12.lvtthmr.8 <==
[ 1.623091] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
[ 1.629757] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
[ 1.646421] ACPI: 6 ACPI AML tables successfully acquired and loaded

==> dmesglog.5.12.lvtthmr.9 <==
[ 124.813366] ACPI: Preparing to enter system sleep state S5
[ 124.821411] reboot: Power down
[ 124.827059] acpi_power_off called

6 of the 10 are full boot, ending with acpi_power_off called. 2 of 10 end at "Trying to unpack rootfs image as initramfs...", and - this is new - boot 3 and 8 do not make it that far, ending after "ACPI: 6 ACPI AML tables successfully acquired and loaded". "ACPI: Dynamic OEM Table Load:" would have been next, after "ACPI: 6 ACPI AML tables successfully acquired and loaded".

Of note, *none* of these 10 boot events generated the "soft lockup" Stack Trace events. Also of note, even though boot number 4 is a full boot, there is a 73 second mystery delay. The delay occurs right after adding swap, and just before "Bluetooth: BNEP". Boot 4 has other unique characteristics, mentioned below.


$ grep lvtth dmesglog.5.12.lvtthmr.*
dmesglog.5.12.lvtthmr.1:[ 1.246282] intel_init_thermal: CPU0, lvtthmr_init: 0x10200
dmesglog.5.12.lvtthmr.1:[ 0.933178] intel_init_thermal: CPU1, lvtthmr_init: 0x10000
dmesglog.5.12.lvtthmr.10:[ 1.245288] intel_init_thermal: CPU0, lvtthmr_init: 0x10200
dmesglog.5.12.lvtthmr.10:[ 0.933148] intel_init_thermal: CPU1, lvtthmr_init: 0x10000
dmesglog.5.12.lvtthmr.2:[ 1.245260] intel_init_thermal: CPU0, lvtthmr_init: 0x10200
dmesglog.5.12.lvtthmr.2:[ 0.933118] intel_init_thermal: CPU1, lvtthmr_init: 0x10000
dmesglog.5.12.lvtthmr.3:[ 1.245958] intel_init_thermal: CPU0, lvtthmr_init: 0x10200
dmesglog.5.12.lvtthmr.3:[ 0.933159] intel_init_thermal: CPU1, lvtthmr_init: 0x10000
dmesglog.5.12.lvtthmr.4:[ 1.247200] intel_init_thermal: CPU0, lvtthmr_init: 0x10200
dmesglog.5.12.lvtthmr.5:[ 1.246849] intel_init_thermal: CPU0, lvtthmr_init: 0x10200
dmesglog.5.12.lvtthmr.5:[ 0.933187] intel_init_thermal: CPU1, lvtthmr_init: 0x10000
dmesglog.5.12.lvtthmr.6:[ 1.246143] intel_init_thermal: CPU0, lvtthmr_init: 0x10200
dmesglog.5.12.lvtthmr.6:[ 0.933146] intel_init_thermal: CPU1, lvtthmr_init: 0x10000
dmesglog.5.12.lvtthmr.7:[ 1.246826] intel_init_thermal: CPU0, lvtthmr_init: 0x10200
dmesglog.5.12.lvtthmr.7:[ 0.932840] intel_init_thermal: CPU1, lvtthmr_init: 0x10000
dmesglog.5.12.lvtthmr.8:[ 1.246467] intel_init_thermal: CPU0, lvtthmr_init: 0x10200
dmesglog.5.12.lvtthmr.8:[ 0.933189] intel_init_thermal: CPU1, lvtthmr_init: 0x10000
dmesglog.5.12.lvtthmr.9:[ 1.246288] intel_init_thermal: CPU0, lvtthmr_init: 0x10200
dmesglog.5.12.lvtthmr.9:[ 0.933210] intel_init_thermal: CPU1, lvtthmr_init: 0x10000

Note that boot number 4 is unique, in that "intel_init_thermal: CPU1, lvtthmr_init: 0x10000" is never seen.


$ grep -i monitor dmesglog.5.12.lvtthmr.*
dmesglog.5.12.lvtthmr.1:[ 0.933178] CPU0: Thermal monitoring enabled (TM2)
dmesglog.5.12.lvtthmr.10:[ 0.933148] CPU0: Thermal monitoring enabled (TM2)
dmesglog.5.12.lvtthmr.10:[ 4.152122] Monitor-Mwait will be used to enter C-1 state
dmesglog.5.12.lvtthmr.10:[ 4.158004] Monitor-Mwait will be used to enter C-2 state
dmesglog.5.12.lvtthmr.10:[ 8.106471] drop_monitor: Initializing network drop monitor service
dmesglog.5.12.lvtthmr.2:[ 0.933118] CPU0: Thermal monitoring enabled (TM2)
dmesglog.5.12.lvtthmr.3:[ 0.933159] CPU0: Thermal monitoring enabled (TM2)
dmesglog.5.12.lvtthmr.4:[ 13.613968] Monitor-Mwait will be used to enter C-1 state
dmesglog.5.12.lvtthmr.4:[ 13.619845] Monitor-Mwait will be used to enter C-2 state
dmesglog.5.12.lvtthmr.4:[ 17.456124] drop_monitor: Initializing network drop monitor service
dmesglog.5.12.lvtthmr.5:[ 0.933187] CPU0: Thermal monitoring enabled (TM2)
dmesglog.5.12.lvtthmr.5:[ 4.157058] Monitor-Mwait will be used to enter C-1 state
dmesglog.5.12.lvtthmr.5:[ 4.162940] Monitor-Mwait will be used to enter C-2 state
dmesglog.5.12.lvtthmr.5:[ 8.170316] drop_monitor: Initializing network drop monitor service
dmesglog.5.12.lvtthmr.6:[ 0.933146] CPU0: Thermal monitoring enabled (TM2)
dmesglog.5.12.lvtthmr.6:[ 4.159566] Monitor-Mwait will be used to enter C-1 state
dmesglog.5.12.lvtthmr.6:[ 4.165450] Monitor-Mwait will be used to enter C-2 state
dmesglog.5.12.lvtthmr.6:[ 8.222319] drop_monitor: Initializing network drop monitor service
dmesglog.5.12.lvtthmr.7:[ 0.932840] CPU0: Thermal monitoring enabled (TM2)
dmesglog.5.12.lvtthmr.7:[ 4.158158] Monitor-Mwait will be used to enter C-1 state
dmesglog.5.12.lvtthmr.7:[ 4.164046] Monitor-Mwait will be used to enter C-2 state
dmesglog.5.12.lvtthmr.7:[ 8.263059] drop_monitor: Initializing network drop monitor service
dmesglog.5.12.lvtthmr.8:[ 0.933189] CPU0: Thermal monitoring enabled (TM2)
dmesglog.5.12.lvtthmr.9:[ 0.933210] CPU0: Thermal monitoring enabled (TM2)
dmesglog.5.12.lvtthmr.9:[ 4.159696] Monitor-Mwait will be used to enter C-1 state
dmesglog.5.12.lvtthmr.9:[ 4.165581] Monitor-Mwait will be used to enter C-2 state
dmesglog.5.12.lvtthmr.9:[ 8.203103] drop_monitor: Initializing network drop monitor service

Again, boot number 4 is unique, in that "CPU0: Thermal monitoring enabled (TM2)" is never seen. On the abbreviated boot 3 and 8, "CPU0: Thermal monitoring enabled (TM2)" is still seen.

These varying results may indicate that some initialization step is being missed, resulting in random variation in the initial state.

James


Attachments:
dmesglog.7bb39313cd62.full (67.49 kB)
dmesglog.5.12.lvtthmr.1 (37.54 kB)
dmesglog.5.12.lvtthmr.2 (37.54 kB)
dmesglog.5.12.lvtthmr.3 (15.17 kB)
dmesglog.5.12.lvtthmr.4 (67.66 kB)
dmesglog.5.12.lvtthmr.5 (68.17 kB)
dmesglog.5.12.lvtthmr.6 (68.14 kB)
dmesglog.5.12.lvtthmr.7 (68.10 kB)
dmesglog.5.12.lvtthmr.8 (15.17 kB)
dmesglog.5.12.lvtthmr.9 (68.04 kB)
dmesglog.5.12.lvtthmr.10 (68.07 kB)
Download all attachments

2021-05-22 09:09:36

by Borislav Petkov

[permalink] [raw]
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single

On Fri, May 21, 2021 at 04:11:22PM -0600, James Feeney wrote:
> Of note, *none* of these 10 boot events generated the "soft lockup"
> Stack Trace events. Also of note, even though boot number 4 is a full
> boot, there is a 73 second mystery delay. The delay occurs right after
> adding swap, and just before "Bluetooth: BNEP". Boot 4 has other unique
> characteristics, mentioned below.
>
> $ grep lvtth dmesglog.5.12.lvtthmr.*
> dmesglog.5.12.lvtthmr.1:[ 1.246282] intel_init_thermal: CPU0, lvtthmr_init: 0x10200
> dmesglog.5.12.lvtthmr.1:[ 0.933178] intel_init_thermal: CPU1, lvtthmr_init: 0x10000

Aaaha, ok, your thermal interrupt is an SMI. No wonder with HP. So this
is becoming weirder by the minute...

Ok, let's look at what it does at init time. Please remove the previous
diff, apply the below one on 5.12, make sure you have

CONFIG_X86_THERMAL_VECTOR=y

in the .config, boot with the aforementioned cmdline params, catch dmesg
and send it again.

Thx.

---
diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
index f8e882592ba5..853d4d4b4270 100644
--- a/drivers/thermal/intel/therm_throt.c
+++ b/drivers/thermal/intel/therm_throt.c
@@ -625,7 +625,7 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
{
unsigned int cpu = smp_processor_id();
int tm2 = 0;
- u32 l, h;
+ u32 l, h, tmp = -1;

if (!intel_thermal_supported(c))
return;
@@ -652,13 +652,17 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
* BIOS has programmed on AP based on BSP's info we saved since BIOS
* is always setting the same value for all threads/cores.
*/
- if ((h & APIC_DM_FIXED_MASK) != APIC_DM_FIXED)
+ if ((h & APIC_DM_FIXED_MASK) != APIC_DM_FIXED) {
apic_write(APIC_LVTTHMR, lvtthmr_init);
+ tmp = apic_read(APIC_LVTTHMR);
+ }

+ pr_info("%s: CPU%d, lvtthmr_init: 0x%x, read: 0x%x, misc_enable (low): 0x%x\n",
+ __func__, cpu, lvtthmr_init, tmp, l);

if ((l & MSR_IA32_MISC_ENABLE_TM1) && (h & APIC_DM_SMI)) {
if (system_state == SYSTEM_BOOTING)
- pr_debug("CPU%d: Thermal monitoring handled by SMI\n", cpu);
+ pr_info("CPU%d: Thermal monitoring handled by SMI\n", cpu);
return;
}


--
Regards/Gruss,
Boris.

SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG Nürnberg

2021-05-22 23:33:54

by James Feeney

[permalink] [raw]
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single

On 5/22/21 3:06 AM, Borislav Petkov wrote:
> On Fri, May 21, 2021 at 04:11:22PM -0600, James Feeney wrote:
>> Of note, *none* of these 10 boot events generated the "soft lockup"
>> Stack Trace events. Also of note, even though boot number 4 is a full
>> boot, there is a 73 second mystery delay. The delay occurs right after
>> adding swap, and just before "Bluetooth: BNEP". Boot 4 has other unique
>> characteristics, mentioned below.
>>
>> $ grep lvtth dmesglog.5.12.lvtthmr.*
>> dmesglog.5.12.lvtthmr.1:[ 1.246282] intel_init_thermal: CPU0, lvtthmr_init: 0x10200
>> dmesglog.5.12.lvtthmr.1:[ 0.933178] intel_init_thermal: CPU1, lvtthmr_init: 0x10000
>
> Aaaha, ok, your thermal interrupt is an SMI. No wonder with HP. So this
> is becoming weirder by the minute...
>
> Ok, let's look at what it does at init time. Please remove the previous
> diff, apply the below one on 5.12, make sure you have
>
> CONFIG_X86_THERMAL_VECTOR=y
>
> in the .config, boot with the aforementioned cmdline params, catch dmesg
> and send it again.
>
> Thx.
>

Hey Boris

As an aside, I ran across this, but see your name is already there:

ACPI / processor: Request native thermal interrupt handling via _OSC
https://github.com/Asus-T100/kernel/commit/0d5406b61afb06ca7c9b76c924ddc044127bb791
https://groups.google.com/g/linux.kernel/c/vln_DslTYxY

====
There are several reports of freeze on enabling HWP (Hardware PStates)
feature on Skylake-based systems by the Intel P-states driver. The root
cause is identified as the HWP interrupts causing BIOS code to freeze.
...
====

Also maybe related, from 2007 December:

AMD Thermal Interrupt Support
https://lwn.net/Articles/262469/


After swapping patches, notably, there is:

$ grep -i 'lvtt\|smi' dmesglog.5.12.apic_read.1
[ 1.245826] intel_init_thermal: CPU0, lvtthmr_init: 0x10200, read: 0x10200, misc_enable (low): 0x64952489
[ 1.249115] CPU0: Thermal monitoring handled by SMI
[ 0.933125] intel_init_thermal: CPU1, lvtthmr_init: 0x10200, read: 0x10200, misc_enable (low): 0x64952489

In the 10 boot events with the previous patch, there was always "CPU1, lvtthmr_init: 0x10000". The only functional difference is the placement of apic_read(APIC_LVTTHMR) after apic_write(APIC_LVTTHMR, lvtthmr_init).

The first boot 1 and 2 run to a normal full boot. Boot 3 ends with "soft lockup", and then "rcu_preempt detected stalls on CPUs/tasks". Boot 4 and 5 end with "soft lockup" and repeats, indefinitely.


I'm tempted to say "If it ain't broke, don't fix it", but then, where's the fun in that? Still, have you spoken with Andi Kleen? Wouldn't hurt? Perhaps there was some arcane reason to have put mcheck_intel_therm_init() with the Machine Check processes. And yet, it does not seem obvious that there is any reason for that to matter. But, what about the timing?

Out of curiosity, I added "pr_info("%s : mcheck_intel_therm_init() use to run here", __func__);" to __init mcheck_init() in arch/x86/kernel/cpu/mce/core.c, just to see *when* it *would have* run, compared to when it is running now.


Boot 1, with the pr_info, hangs at "ACPI: 6 ACPI AML tables successfully acquired and loaded". Boot 2 runs to a full boot, but with a mystery 83 second delay, again, after "adding ... swap" and before "Bluetooth...". Boot 3 is a normal full boot. Boot 4 hangs at "Trying to unpack rootfs image as initramfs...". Boot 5 is another normal full boot.

Consistently, "mcheck_intel_therm_init() use to run here" is always seen immediately following "Booting paravirtualized kernel on bare hardware", rather significantly preceding in sequence where the similar, but renamed, intel_init_thermal() is now running.

All the dmesg logs are attached, but here is a quick look:

$ grep -C 1 'mcheck\|lvtt' dmesglog.5.12.pr_info.*|sort
...
dmesglog.5.12.pr_info.1-[ 0.171697] Booting paravirtualized kernel on bare hardware
dmesglog.5.12.pr_info.1:[ 0.171699] mce: mcheck_init : mcheck_intel_therm_init() use to run here
dmesglog.5.12.pr_info.1-[ 0.171705] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370452778343963 ns
dmesglog.5.12.pr_info.1:[ 0.939884] intel_init_thermal: CPU1, lvtthmr_init: 0x10200, read: 0x10200, misc_enable (low): 0x64952489
dmesglog.5.12.pr_info.1-[ 1.246791] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
dmesglog.5.12.pr_info.1:[ 1.253473] intel_init_thermal: CPU0, lvtthmr_init: 0x10200, read: 0x10200, misc_enable (low): 0x64952489
dmesglog.5.12.pr_info.1-[ 1.256762] CPU0: Thermal monitoring handled by SMI
dmesglog.5.12.pr_info.1-[ 1.450107] .... node #0, CPUs: #1
dmesglog.5.12.pr_info.1-[ 1.468905] smp: Brought up 1 node, 2 CPUs

dmesglog.5.12.pr_info.2-[ 0.171013] Booting paravirtualized kernel on bare hardware
dmesglog.5.12.pr_info.2:[ 0.171016] mce: mcheck_init : mcheck_intel_therm_init() use to run here
dmesglog.5.12.pr_info.2-[ 0.171022] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370452778343963 ns
dmesglog.5.12.pr_info.2:[ 0.939759] intel_init_thermal: CPU1, lvtthmr_init: 0x10200, read: 0x10200, misc_enable (low): 0x64952489
dmesglog.5.12.pr_info.2-[ 1.246440] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
dmesglog.5.12.pr_info.2:[ 1.253121] intel_init_thermal: CPU0, lvtthmr_init: 0x10200, read: 0x10200, misc_enable (low): 0x64952489
dmesglog.5.12.pr_info.2-[ 1.256410] CPU0: Thermal monitoring handled by SMI
dmesglog.5.12.pr_info.2-[ 1.449755] .... node #0, CPUs: #1
dmesglog.5.12.pr_info.2-[ 1.468545] smp: Brought up 1 node, 2 CPUs

dmesglog.5.12.pr_info.3-[ 0.171884] Booting paravirtualized kernel on bare hardware
dmesglog.5.12.pr_info.3:[ 0.171887] mce: mcheck_init : mcheck_intel_therm_init() use to run here
dmesglog.5.12.pr_info.3-[ 0.171893] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370452778343963 ns
dmesglog.5.12.pr_info.3:[ 0.939468] intel_init_thermal: CPU1, lvtthmr_init: 0x10200, read: 0x10200, misc_enable (low): 0x64952489
dmesglog.5.12.pr_info.3-[ 1.247112] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
dmesglog.5.12.pr_info.3:[ 1.254192] intel_init_thermal: CPU0, lvtthmr_init: 0x10200, read: 0x10200, misc_enable (low): 0x64952489
dmesglog.5.12.pr_info.3-[ 1.257086] CPU0: Thermal monitoring handled by SMI
dmesglog.5.12.pr_info.3-[ 1.450428] .... node #0, CPUs: #1
dmesglog.5.12.pr_info.3-[ 1.469234] smp: Brought up 1 node, 2 CPUs

dmesglog.5.12.pr_info.4-[ 0.171707] Booting paravirtualized kernel on bare hardware
dmesglog.5.12.pr_info.4:[ 0.171709] mce: mcheck_init : mcheck_intel_therm_init() use to run here
dmesglog.5.12.pr_info.4-[ 0.171715] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370452778343963 ns
dmesglog.5.12.pr_info.4:[ 0.939815] intel_init_thermal: CPU1, lvtthmr_init: 0x10200, read: 0x10200, misc_enable (low): 0x64952489
dmesglog.5.12.pr_info.4-[ 1.246509] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
dmesglog.5.12.pr_info.4:[ 1.253191] intel_init_thermal: CPU0, lvtthmr_init: 0x10200, read: 0x10200, misc_enable (low): 0x64952489
dmesglog.5.12.pr_info.4-[ 1.256480] CPU0: Thermal monitoring handled by SMI
dmesglog.5.12.pr_info.4-[ 1.449825] .... node #0, CPUs: #1
dmesglog.5.12.pr_info.4-[ 1.468633] smp: Brought up 1 node, 2 CPUs

dmesglog.5.12.pr_info.5-[ 0.171201] Booting paravirtualized kernel on bare hardware
dmesglog.5.12.pr_info.5:[ 0.171203] mce: mcheck_init : mcheck_intel_therm_init() use to run here
dmesglog.5.12.pr_info.5-[ 0.171209] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370452778343963 ns
dmesglog.5.12.pr_info.5:[ 0.939935] intel_init_thermal: CPU1, lvtthmr_init: 0x10200, read: 0x10200, misc_enable (low): 0x64952489
dmesglog.5.12.pr_info.5-[ 1.247408] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
dmesglog.5.12.pr_info.5:[ 1.254090] intel_init_thermal: CPU0, lvtthmr_init: 0x10200, read: 0x10200, misc_enable (low): 0x64952489
dmesglog.5.12.pr_info.5-[ 1.257378] CPU0: Thermal monitoring handled by SMI
dmesglog.5.12.pr_info.5-[ 1.450723] .... node #0, CPUs: #1
dmesglog.5.12.pr_info.5-[ 1.469533] smp: Brought up 1 node, 2 CPUs

James

> ---
> diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
> index f8e882592ba5..853d4d4b4270 100644
> --- a/drivers/thermal/intel/therm_throt.c
> +++ b/drivers/thermal/intel/therm_throt.c
> @@ -625,7 +625,7 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
> {
> unsigned int cpu = smp_processor_id();
> int tm2 = 0;
> - u32 l, h;
> + u32 l, h, tmp = -1;
>
> if (!intel_thermal_supported(c))
> return;
> @@ -652,13 +652,17 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
> * BIOS has programmed on AP based on BSP's info we saved since BIOS
> * is always setting the same value for all threads/cores.
> */
> - if ((h & APIC_DM_FIXED_MASK) != APIC_DM_FIXED)
> + if ((h & APIC_DM_FIXED_MASK) != APIC_DM_FIXED) {
> apic_write(APIC_LVTTHMR, lvtthmr_init);
> + tmp = apic_read(APIC_LVTTHMR);
> + }
>
> + pr_info("%s: CPU%d, lvtthmr_init: 0x%x, read: 0x%x, misc_enable (low): 0x%x\n",
> + __func__, cpu, lvtthmr_init, tmp, l);
>
> if ((l & MSR_IA32_MISC_ENABLE_TM1) && (h & APIC_DM_SMI)) {
> if (system_state == SYSTEM_BOOTING)
> - pr_debug("CPU%d: Thermal monitoring handled by SMI\n", cpu);
> + pr_info("CPU%d: Thermal monitoring handled by SMI\n", cpu);
> return;
> }
>
>


Attachments:
dmesglog.5.12.apic_read.1 (68.15 kB)
dmesglog.5.12.apic_read.2 (68.58 kB)
dmesglog.5.12.apic_read.3 (70.13 kB)
dmesglog.5.12.apic_read.4 (83.92 kB)
dmesglog.5.12.apic_read.5 (81.07 kB)
dmesglog.5.12.pr_info.1 (15.33 kB)
dmesglog.5.12.pr_info.2 (68.37 kB)
dmesglog.5.12.pr_info.3 (68.18 kB)
dmesglog.5.12.pr_info.4 (37.70 kB)
dmesglog.5.12.pr_info.5 (68.37 kB)
Download all attachments

2021-05-23 17:07:17

by Borislav Petkov

[permalink] [raw]
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single

On Sat, May 22, 2021 at 05:28:42PM -0600, James Feeney wrote:
> Out of curiosity, I added "pr_info("%s : mcheck_intel_therm_init()
> use to run here", __func__);" to __init mcheck_init() in
> arch/x86/kernel/cpu/mce/core.c, just to see *when* it *would have*
> run, compared to when it is running now.

Yes, this is probably the only timing-sensitive difference this patch of
mine introduces. So who knows what's in that LVT APIC register earlier
and *maybe* reading it later might give us different values.

So do the same game again pls, but this time apply the patch below. It
is practically a revert of my patch and with it, your box should *not*
freeze anymore *if* my patch really is the culprit.

Thx.

---
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index ddfb3cad8dff..5ac8b827bc12 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -296,6 +296,12 @@ struct cper_sec_mem_err;
extern void apei_mce_report_mem_error(int corrected,
struct cper_sec_mem_err *mem_err);

+#ifdef CONFIG_X86_THERMAL_VECTOR
+extern void mcheck_intel_therm_init(void);
+#else
+static inline void mcheck_intel_therm_init(void) { }
+#endif
+
/*
* Enumerate new IP types and HWID values in AMD processors which support
* Scalable MCA.
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index bf7fe87a7e88..ded20b8612fe 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2190,6 +2190,7 @@ __setup("mce", mcheck_enable);

int __init mcheck_init(void)
{
+ mcheck_intel_therm_init();
mce_register_decode_chain(&early_nb);
mce_register_decode_chain(&mce_uc_nb);
mce_register_decode_chain(&mce_default_nb);
diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
index f8e882592ba5..0ebd2386839f 100644
--- a/drivers/thermal/intel/therm_throt.c
+++ b/drivers/thermal/intel/therm_throt.c
@@ -621,19 +621,30 @@ bool x86_thermal_enabled(void)
return atomic_read(&therm_throt_en);
}

+void __init mcheck_intel_therm_init(void)
+{
+ /*
+ * This function is only called on boot CPU. Save the init thermal
+ * LVT value on BSP and use that value to restore APs' thermal LVT
+ * entry BIOS programmed later
+ */
+ if (intel_thermal_supported(&boot_cpu_data)) {
+ lvtthmr_init = apic_read(APIC_LVTTHMR);
+ pr_info("%s: lvtthmr_init: 0x%x\n", __func__, lvtthmr_init);
+ } else {
+ pr_info("%s: !intel_thermal_supported\n", __func__);
+ }
+}
+
void intel_init_thermal(struct cpuinfo_x86 *c)
{
unsigned int cpu = smp_processor_id();
int tm2 = 0;
- u32 l, h;
+ u32 l, h, tmp = -1;

if (!intel_thermal_supported(c))
return;

- /* On the BSP? */
- if (c == &boot_cpu_data)
- lvtthmr_init = apic_read(APIC_LVTTHMR);
-
/*
* First check if its enabled already, in which case there might
* be some SMM goo which handles it, so we can't even put a handler
@@ -652,13 +663,17 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
* BIOS has programmed on AP based on BSP's info we saved since BIOS
* is always setting the same value for all threads/cores.
*/
- if ((h & APIC_DM_FIXED_MASK) != APIC_DM_FIXED)
+ if ((h & APIC_DM_FIXED_MASK) != APIC_DM_FIXED) {
apic_write(APIC_LVTTHMR, lvtthmr_init);
+ tmp = apic_read(APIC_LVTTHMR);
+ }

+ pr_info("%s: CPU%d, lvtthmr_init: 0x%x, read: 0x%x, misc_enable (low): 0x%x\n",
+ __func__, cpu, lvtthmr_init, tmp, l);

if ((l & MSR_IA32_MISC_ENABLE_TM1) && (h & APIC_DM_SMI)) {
if (system_state == SYSTEM_BOOTING)
- pr_debug("CPU%d: Thermal monitoring handled by SMI\n", cpu);
+ pr_info("CPU%d: Thermal monitoring handled by SMI\n", cpu);
return;
}


--
Regards/Gruss,
Boris.

SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG Nürnberg

2021-05-23 23:05:37

by James Feeney

[permalink] [raw]
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single

On 5/23/21 11:05 AM, Borislav Petkov wrote:
> On Sat, May 22, 2021 at 05:28:42PM -0600, James Feeney wrote:
>> Out of curiosity, I added "pr_info("%s : mcheck_intel_therm_init()
>> use to run here", __func__);" to __init mcheck_init() in
>> arch/x86/kernel/cpu/mce/core.c, just to see *when* it *would have*
>> run, compared to when it is running now.
>
> Yes, this is probably the only timing-sensitive difference this patch of
> mine introduces. So who knows what's in that LVT APIC register earlier
> and *maybe* reading it later might give us different values.
>
> So do the same game again pls, but this time apply the patch below. It
> is practically a revert of my patch and with it, your box should *not*
> freeze anymore *if* my patch really is the culprit.
>
> Thx.
>

> patch below.

Yes, pretty much what I had in mind.

> *if* my patch really is the culprit.

Ha! Yes, your patch *is* the culprit. You don't trust git bisect? My first git bisect failure was before realizing that the boot failure was intermittent/probabilistic. That's why I ended-up doing the git bisect *three* times, the last with *10* reboots on each "good" patch.

> who knows what's in that LVT APIC register earlier
> and *maybe* reading it later might give us different values.

"lvtthmr_init: 0x200" != "lvtthmr_init: 0x10200" != "lvtthmr_init: 0x10000"

System Management is *hard*, because it must build upon someone else's undocumented buggy software. Thank Intel.

Well now, reverting the timing seems to completely resolve the issue. Take the win - 20 reboots, "cold" and "hot", and *no* boot failures.

A couple of representative dmesg logs are attached. For a representative quick look:

$ grep -C 1 'mcheck\|lvtt' dmesglog.5.12.mcheck.1|sort
--
--
dmesglog.5.12.mcheck.1-[ 0.172695] Booting paravirtualized kernel on bare hardware
dmesglog.5.12.mcheck.1:[ 0.172698] mcheck_intel_therm_init: lvtthmr_init: 0x200
dmesglog.5.12.mcheck.1:[ 0.172701] mce: mcheck_init : mcheck_intel_therm_init() returned
dmesglog.5.12.mcheck.1-[ 0.172706] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370452778343963 ns
dmesglog.5.12.mcheck.1:[ 0.944549] intel_init_thermal: CPU1, lvtthmr_init: 0x200, read: 0x200, misc_enable (low): 0x64952489
dmesglog.5.12.mcheck.1-[ 1.252230] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
dmesglog.5.12.mcheck.1:[ 1.258912] intel_init_thermal: CPU0, lvtthmr_init: 0x200, read: 0x200, misc_enable (low): 0x64952489
dmesglog.5.12.mcheck.1-[ 1.262202] CPU0: Thermal monitoring handled by SMI
dmesglog.5.12.mcheck.1-[ 1.455547] .... node #0, CPUs: #1
dmesglog.5.12.mcheck.1-[ 1.472322] smp: Brought up 1 node, 2 CPUs


I'd suggest, leave in your "pr_info" statements. Also, the "#ifdef CONFIG_X86_THERMAL_VECTOR" may make more sense a little further above,
closer to the "#ifdef CONFIG_X86_MCE_INTEL", and not under the "Used by APEI to report memory error via /dev/mcelog" heading.

And, a couple of big fat verbose comments, about the placement and timing issue, probably in arch/x86/kernel/cpu/mce/core.c, and in drivers/thermal/intel/therm_throt.c, would be in order.

Boris, thanks very much for working this issue with me. Let's get the revisions into the mainstream kernel. The 5.11 and 5.12 initial releases both left this laptop un-bootable. I've been gaining lots of character-building experience with git bisect! Hopefully, going forward, new releases will be more boring.

James


> ---
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index ddfb3cad8dff..5ac8b827bc12 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -296,6 +296,12 @@ struct cper_sec_mem_err;
> extern void apei_mce_report_mem_error(int corrected,
> struct cper_sec_mem_err *mem_err);
>
> +#ifdef CONFIG_X86_THERMAL_VECTOR
> +extern void mcheck_intel_therm_init(void);
> +#else
> +static inline void mcheck_intel_therm_init(void) { }
> +#endif
> +
> /*
> * Enumerate new IP types and HWID values in AMD processors which support
> * Scalable MCA.
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index bf7fe87a7e88..ded20b8612fe 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -2190,6 +2190,7 @@ __setup("mce", mcheck_enable);
>
> int __init mcheck_init(void)
> {
> + mcheck_intel_therm_init();
> mce_register_decode_chain(&early_nb);
> mce_register_decode_chain(&mce_uc_nb);
> mce_register_decode_chain(&mce_default_nb);
> diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
> index f8e882592ba5..0ebd2386839f 100644
> --- a/drivers/thermal/intel/therm_throt.c
> +++ b/drivers/thermal/intel/therm_throt.c
> @@ -621,19 +621,30 @@ bool x86_thermal_enabled(void)
> return atomic_read(&therm_throt_en);
> }
>
> +void __init mcheck_intel_therm_init(void)
> +{
> + /*
> + * This function is only called on boot CPU. Save the init thermal
> + * LVT value on BSP and use that value to restore APs' thermal LVT
> + * entry BIOS programmed later
> + */
> + if (intel_thermal_supported(&boot_cpu_data)) {
> + lvtthmr_init = apic_read(APIC_LVTTHMR);
> + pr_info("%s: lvtthmr_init: 0x%x\n", __func__, lvtthmr_init);
> + } else {
> + pr_info("%s: !intel_thermal_supported\n", __func__);
> + }
> +}
> +
> void intel_init_thermal(struct cpuinfo_x86 *c)
> {
> unsigned int cpu = smp_processor_id();
> int tm2 = 0;
> - u32 l, h;
> + u32 l, h, tmp = -1;
>
> if (!intel_thermal_supported(c))
> return;
>
> - /* On the BSP? */
> - if (c == &boot_cpu_data)
> - lvtthmr_init = apic_read(APIC_LVTTHMR);
> -
> /*
> * First check if its enabled already, in which case there might
> * be some SMM goo which handles it, so we can't even put a handler
> @@ -652,13 +663,17 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
> * BIOS has programmed on AP based on BSP's info we saved since BIOS
> * is always setting the same value for all threads/cores.
> */
> - if ((h & APIC_DM_FIXED_MASK) != APIC_DM_FIXED)
> + if ((h & APIC_DM_FIXED_MASK) != APIC_DM_FIXED) {
> apic_write(APIC_LVTTHMR, lvtthmr_init);
> + tmp = apic_read(APIC_LVTTHMR);
> + }
>
> + pr_info("%s: CPU%d, lvtthmr_init: 0x%x, read: 0x%x, misc_enable (low): 0x%x\n",
> + __func__, cpu, lvtthmr_init, tmp, l);
>
> if ((l & MSR_IA32_MISC_ENABLE_TM1) && (h & APIC_DM_SMI)) {
> if (system_state == SYSTEM_BOOTING)
> - pr_debug("CPU%d: Thermal monitoring handled by SMI\n", cpu);
> + pr_info("CPU%d: Thermal monitoring handled by SMI\n", cpu);
> return;
> }
>
>


Attachments:
dmesglog.5.12.mcheck.1 (68.53 kB)
dmesglog.5.12.mcheck.20.hot (69.33 kB)
Download all attachments

2021-05-24 07:56:14

by Borislav Petkov

[permalink] [raw]
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single

On Sun, May 23, 2021 at 05:02:01PM -0600, James Feeney wrote:
> Ha! Yes, your patch *is* the culprit. You don't trust git bisect?

Well, git-bisect can very easily veer off into the fields if the
bisector makes a mistake. I'm not saying you have made a mistake but I
have done that mistake a bunch of times and have seen others do it too
so it is very easy to get lost.

And with my patch simply moving the LVT THMR read back in the boot
order, I was very sceptical how can that even be?!

But...

> "lvtthmr_init: 0x200" != "lvtthmr_init: 0x10200" != "lvtthmr_init: 0x10000"

... *this* is a good catch, I *think* I know what happens and the next
patch will confirm my theory, see the end of the mail. Remove all diffs
you have ontop of your 5.12 kernel, apply the one below, do the exact
same exercise with it and send me one full dmesg pls.

> System Management is *hard*, because it must build upon someone else's
> undocumented buggy software. Thank Intel.

No, SMM is shit. Complain to the OEMs about it.

However, this time I think it is *we* who are not doing stuff as
correctly as we should but let's see your results first.

Thx.

---

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index ddfb3cad8dff..5ac8b827bc12 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -296,6 +296,12 @@ struct cper_sec_mem_err;
extern void apei_mce_report_mem_error(int corrected,
struct cper_sec_mem_err *mem_err);

+#ifdef CONFIG_X86_THERMAL_VECTOR
+extern void mcheck_intel_therm_init(void);
+#else
+static inline void mcheck_intel_therm_init(void) { }
+#endif
+
/*
* Enumerate new IP types and HWID values in AMD processors which support
* Scalable MCA.
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 4a39fb429f15..f21009786877 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1162,6 +1162,10 @@ void clear_local_APIC(void)
#ifdef CONFIG_X86_THERMAL_VECTOR
if (maxlvt >= 5) {
v = apic_read(APIC_LVTTHMR);
+
+ pr_info("%s: CPU%d, maxlvt: %d, APIC_LVTTHMR: 0x%x, masking LVT\n",
+ __func__, smp_processor_id(), maxlvt, v);
+
apic_write(APIC_LVTTHMR, v | APIC_LVT_MASKED);
}
#endif
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index bf7fe87a7e88..ded20b8612fe 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2190,6 +2190,7 @@ __setup("mce", mcheck_enable);

int __init mcheck_init(void)
{
+ mcheck_intel_therm_init();
mce_register_decode_chain(&early_nb);
mce_register_decode_chain(&mce_uc_nb);
mce_register_decode_chain(&mce_default_nb);
diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
index f8e882592ba5..0ebd2386839f 100644
--- a/drivers/thermal/intel/therm_throt.c
+++ b/drivers/thermal/intel/therm_throt.c
@@ -621,19 +621,30 @@ bool x86_thermal_enabled(void)
return atomic_read(&therm_throt_en);
}

+void __init mcheck_intel_therm_init(void)
+{
+ /*
+ * This function is only called on boot CPU. Save the init thermal
+ * LVT value on BSP and use that value to restore APs' thermal LVT
+ * entry BIOS programmed later
+ */
+ if (intel_thermal_supported(&boot_cpu_data)) {
+ lvtthmr_init = apic_read(APIC_LVTTHMR);
+ pr_info("%s: lvtthmr_init: 0x%x\n", __func__, lvtthmr_init);
+ } else {
+ pr_info("%s: !intel_thermal_supported\n", __func__);
+ }
+}
+
void intel_init_thermal(struct cpuinfo_x86 *c)
{
unsigned int cpu = smp_processor_id();
int tm2 = 0;
- u32 l, h;
+ u32 l, h, tmp = -1;

if (!intel_thermal_supported(c))
return;

- /* On the BSP? */
- if (c == &boot_cpu_data)
- lvtthmr_init = apic_read(APIC_LVTTHMR);
-
/*
* First check if its enabled already, in which case there might
* be some SMM goo which handles it, so we can't even put a handler
@@ -652,13 +663,17 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
* BIOS has programmed on AP based on BSP's info we saved since BIOS
* is always setting the same value for all threads/cores.
*/
- if ((h & APIC_DM_FIXED_MASK) != APIC_DM_FIXED)
+ if ((h & APIC_DM_FIXED_MASK) != APIC_DM_FIXED) {
apic_write(APIC_LVTTHMR, lvtthmr_init);
+ tmp = apic_read(APIC_LVTTHMR);
+ }

+ pr_info("%s: CPU%d, lvtthmr_init: 0x%x, read: 0x%x, misc_enable (low): 0x%x\n",
+ __func__, cpu, lvtthmr_init, tmp, l);

if ((l & MSR_IA32_MISC_ENABLE_TM1) && (h & APIC_DM_SMI)) {
if (system_state == SYSTEM_BOOTING)
- pr_debug("CPU%d: Thermal monitoring handled by SMI\n", cpu);
+ pr_info("CPU%d: Thermal monitoring handled by SMI\n", cpu);
return;
}


--
Regards/Gruss,
Boris.

SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG Nürnberg

2021-05-25 04:07:38

by James Feeney

[permalink] [raw]
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single

On 5/24/21 1:51 AM, Borislav Petkov wrote:
> On Sun, May 23, 2021 at 05:02:01PM -0600, James Feeney wrote:
>> Ha! Yes, your patch *is* the culprit. You don't trust git bisect?
>
> Well, git-bisect can very easily veer off into the fields if the
> bisector makes a mistake. I'm not saying you have made a mistake but I
> have done that mistake a bunch of times and have seen others do it too
> so it is very easy to get lost.
>

Quite so, quite so. But the process, by necessity, is incremental, a step-wise process of increasing knowledge. It is essentially one of the many levels of the same process described in "The Cathedral and the Bazaar". Though, it can be tedious at times.

> And with my patch simply moving the LVT THMR read back in the boot
> order, I was very sceptical how can that even be?!
>
> But...
>
>> "lvtthmr_init: 0x200" != "lvtthmr_init: 0x10200" != "lvtthmr_init: 0x10000"
>
> ... *this* is a good catch, I *think* I know what happens and the next
> patch will confirm my theory, see the end of the mail. Remove all diffs
> you have ontop of your 5.12 kernel, apply the one below, do the exact
> same exercise with it and send me one full dmesg pls.
>

It is not helpful to describe the addition of one pr_info statement in terms of "Remove all diffs you have ontop of your 5.12 kernel". I prefer to apply each of these by hand, so that I can see what is actually happening, and there are no actual other additional changes, other than this pr_info addition to clear_local_APIC() in arch/x86/kernel/apic/apic.c.

>> System Management is *hard*, because it must build upon someone else's
>> undocumented buggy software. Thank Intel.
>
> No, SMM is shit. Complain to the OEMs about it.
>

Hmm - an important ethical question, determining to whom blame should be assigned. I rather take the position, to put this in melodramatic terms, of not allowing little children, crazy people, fanatics, or politicians, to have guns. Intel know - or should know - that hardware OEMs are irresponsible software providers. There is overwhelming evidence, to this. And yet, Intel continue to choose to provide this SMM "gun" to these OEM "crazy people". The outcome, then, simply becomes predictable, and inevitable. I, therefore, prefer to hold Intel, and by extension, AMD, responsible. A "solid ethical foundation" is the most important element in the machine result.

> However, this time I think it is *we* who are not doing stuff as
> correctly as we should but let's see your results first.
>
> Thx.
>

Since I am not familiar with the details, I am interested to know your final analysis. Still, I take note of the present language in the comment, in intel_init_thermal():

====
* The initial value of thermal LVT entries on all APs always reads
* 0x10000 because APs are woken up by BSP issuing INIT-SIPI-SIPI
* sequence to them and LVT registers are reset to 0s except for
* the mask bits which are set to 1s when APs receive INIT IPI.
* If BIOS takes over the thermal interrupt and sets its interrupt
* delivery mode to SMI (not fixed), it restores the value that the
* BIOS has programmed on AP based on BSP's info we saved since BIOS
* is always setting the same value for all threads/cores.
====

But then, it is entirely unclear to me exactly "who is doing what" here, while we see these values, resulting under various circumstances, at different times, as 0x10000, 0x10200, and 0x200.


Addressing the pr_info patch in clear_local_APIC(), first we note the comment "Also used to cleanout any BIOS leftovers during boot", and then discover that clear_local_APIC() is called during power-down, but is *never called* during boot. To be sure that this is true, I've added
pr_info("%s: before : CPU%d, maxlvt: %d\n", __func__, smp_processor_id(), maxlvt);
and
pr_info("%s: after : CPU%d, maxlvt: %d\n", __func__, smp_processor_id(), maxlvt);
around
if (maxlvt >= 5) {...}

since maybe we want to know the actual value of maxlvt during boot, on the possibility that this value might have changed.

With these additional pr_info statements, we see with certainty that clear_local_APIC() is *never* called during boot. Presumably, this implies smp_found_config in init_bsp_APIC(), which otherwise would have called clear_local_APIC(). dmidecode says "SMBIOS 2.4 present".

dmesg log for a power-up to power-down cycle attached. Here is the quick look:

$ grep -C 1 'mcheck\|lvtt\|clear' dmesglog.5.12.info.context.1

[ 0.171645] Booting paravirtualized kernel on bare hardware
[ 0.171648] mcheck_intel_therm_init: lvtthmr_init: 0x200
[ 0.171651] mce: mcheck_init : mcheck_intel_therm_init() returned
[ 0.171657] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370452778343963 ns
--
[ 1.251708] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
[ 1.255457] intel_init_thermal: CPU0, lvtthmr_init: 0x200, read: 0x200, misc_enable (low): 0x64952489
[ 1.258348] CPU0: Thermal monitoring handled by SMI
--
[ 1.451691] .... node #0, CPUs: #1
[ 0.944189] intel_init_thermal: CPU1, lvtthmr_init: 0x200, read: 0x200, misc_enable (low): 0x64952489
[ 1.468467] smp: Brought up 1 node, 2 CPUs
--
[ 153.299300] reboot: Power down
[ 153.304433] clear_local_APIC: before : CPU0, maxlvt: 5
[ 153.304433] clear_local_APIC: before : CPU1, maxlvt: 5
[ 153.304437] clear_local_APIC: CPU1, maxlvt: 5, APIC_LVTTHMR: 0x200, masking LVT
[ 153.311404] clear_local_APIC: CPU0, maxlvt: 5, APIC_LVTTHMR: 0x200, masking LVT
[ 153.316525] clear_local_APIC: after : CPU1, maxlvt: 5
[ 153.325673] clear_local_APIC: after : CPU0, maxlvt: 5
[ 153.346744] clear_local_APIC: before : CPU0, maxlvt: 5
[ 153.353718] clear_local_APIC: CPU0, maxlvt: 5, APIC_LVTTHMR: 0x10200, masking LVT
[ 153.363062] clear_local_APIC: after : CPU0, maxlvt: 5
[ 153.369985] acpi_power_off called


James


> ---
>
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index ddfb3cad8dff..5ac8b827bc12 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -296,6 +296,12 @@ struct cper_sec_mem_err;
> extern void apei_mce_report_mem_error(int corrected,
> struct cper_sec_mem_err *mem_err);
>
> +#ifdef CONFIG_X86_THERMAL_VECTOR
> +extern void mcheck_intel_therm_init(void);
> +#else
> +static inline void mcheck_intel_therm_init(void) { }
> +#endif
> +
> /*
> * Enumerate new IP types and HWID values in AMD processors which support
> * Scalable MCA.
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index 4a39fb429f15..f21009786877 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -1162,6 +1162,10 @@ void clear_local_APIC(void)
> #ifdef CONFIG_X86_THERMAL_VECTOR
> if (maxlvt >= 5) {
> v = apic_read(APIC_LVTTHMR);
> +
> + pr_info("%s: CPU%d, maxlvt: %d, APIC_LVTTHMR: 0x%x, masking LVT\n",
> + __func__, smp_processor_id(), maxlvt, v);
> +
> apic_write(APIC_LVTTHMR, v | APIC_LVT_MASKED);
> }
> #endif
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index bf7fe87a7e88..ded20b8612fe 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -2190,6 +2190,7 @@ __setup("mce", mcheck_enable);
>
> int __init mcheck_init(void)
> {
> + mcheck_intel_therm_init();
> mce_register_decode_chain(&early_nb);
> mce_register_decode_chain(&mce_uc_nb);
> mce_register_decode_chain(&mce_default_nb);
> diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
> index f8e882592ba5..0ebd2386839f 100644
> --- a/drivers/thermal/intel/therm_throt.c
> +++ b/drivers/thermal/intel/therm_throt.c
> @@ -621,19 +621,30 @@ bool x86_thermal_enabled(void)
> return atomic_read(&therm_throt_en);
> }
>
> +void __init mcheck_intel_therm_init(void)
> +{
> + /*
> + * This function is only called on boot CPU. Save the init thermal
> + * LVT value on BSP and use that value to restore APs' thermal LVT
> + * entry BIOS programmed later
> + */
> + if (intel_thermal_supported(&boot_cpu_data)) {
> + lvtthmr_init = apic_read(APIC_LVTTHMR);
> + pr_info("%s: lvtthmr_init: 0x%x\n", __func__, lvtthmr_init);
> + } else {
> + pr_info("%s: !intel_thermal_supported\n", __func__);
> + }
> +}
> +
> void intel_init_thermal(struct cpuinfo_x86 *c)
> {
> unsigned int cpu = smp_processor_id();
> int tm2 = 0;
> - u32 l, h;
> + u32 l, h, tmp = -1;
>
> if (!intel_thermal_supported(c))
> return;
>
> - /* On the BSP? */
> - if (c == &boot_cpu_data)
> - lvtthmr_init = apic_read(APIC_LVTTHMR);
> -
> /*
> * First check if its enabled already, in which case there might
> * be some SMM goo which handles it, so we can't even put a handler
> @@ -652,13 +663,17 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
> * BIOS has programmed on AP based on BSP's info we saved since BIOS
> * is always setting the same value for all threads/cores.
> */
> - if ((h & APIC_DM_FIXED_MASK) != APIC_DM_FIXED)
> + if ((h & APIC_DM_FIXED_MASK) != APIC_DM_FIXED) {
> apic_write(APIC_LVTTHMR, lvtthmr_init);
> + tmp = apic_read(APIC_LVTTHMR);
> + }
>
> + pr_info("%s: CPU%d, lvtthmr_init: 0x%x, read: 0x%x, misc_enable (low): 0x%x\n",
> + __func__, cpu, lvtthmr_init, tmp, l);
>
> if ((l & MSR_IA32_MISC_ENABLE_TM1) && (h & APIC_DM_SMI)) {
> if (system_state == SYSTEM_BOOTING)
> - pr_debug("CPU%d: Thermal monitoring handled by SMI\n", cpu);
> + pr_info("CPU%d: Thermal monitoring handled by SMI\n", cpu);
> return;
> }
>
>


Attachments:
dmesglog.5.12.info.context.1 (68.92 kB)

2021-05-27 10:32:54

by Borislav Petkov

[permalink] [raw]
Subject: [PATCH] x86/thermal: Fix LVT thermal setup for SMI delivery mode

Ok,

it took me a while to find a box like yours to reproduce on. Anyway,
here's what looks like the final fix, you could give it a run.

Thx.

---
From: Borislav Petkov <[email protected]>
Date: Thu, 27 May 2021 11:02:26 +0200

There are machines out there with added value crap^WBIOS which provide an
SMI handler for the local APIC thermal sensor interrupt. Out of reset,
the BSP on those machines has something like 0x200 in that APIC register
(timestamps left in because this whole issue is timing sensitive):

[ 0.033858] read lvtthmr: 0x330, val: 0x200

which means:

- bit 16 - the interrupt mask bit is clear and thus that interrupt is enabled
- bits [10:8] have 010b which means SMI delivery mode.

Now, later during boot, when the kernel programs the local APIC, it
soft-disables it temporarily through the spurious vector register:

setup_local_APIC:

...

/*
* If this comes from kexec/kcrash the APIC might be enabled in
* SPIV. Soft disable it before doing further initialization.
*/
value = apic_read(APIC_SPIV);
value &= ~APIC_SPIV_APIC_ENABLED;
apic_write(APIC_SPIV, value);

which means (from the SDM):

"10.4.7.2 Local APIC State After It Has Been Software Disabled

...

* The mask bits for all the LVT entries are set. Attempts to reset these
bits will be ignored."

And this happens too:

[ 0.124111] APIC: Switch to symmetric I/O mode setup
[ 0.124117] lvtthmr 0x200 before write 0xf to APIC 0xf0
[ 0.124118] lvtthmr 0x10200 after write 0xf to APIC 0xf0

This results in CPU 0 soft lockups depending on the placement in time
when the APIC soft-disable happens. Those soft lockups are not 100%
reproducible and the reason for that can only be speculated as no one
tells you what SMM does. Likely, it confuses the SMM code that the APIC
is disabled and the thermal interrupt doesn't doesn't fire at all,
leading to CPU 0 stuck in SMM forever...

Now, before

4f432e8bb15b ("x86/mce: Get rid of mcheck_intel_therm_init()")

due to how the APIC_LVTTHMR was read before APIC initialization in
mcheck_intel_therm_init(), it would read the value with the mask bit 16
clear and then intel_init_thermal() would replicate it onto the APs and
all would be peachy - the thermal interrupt would remain enabled.

But that commit moved that reading to a later moment in
intel_init_thermal(), resulting in reading APIC_LVTTHMR on the BSP too
late and with its interrupt mask bit set.

Thus, revert back to the old behavior of reading the thermal LVT
register before the APIC gets initialized.

Fixes: 4f432e8bb15b ("x86/mce: Get rid of mcheck_intel_therm_init()")
Reported-by: James Feeney <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/thermal.h | 4 +++-
arch/x86/kernel/setup.c | 9 +++++++++
drivers/thermal/intel/therm_throt.c | 15 +++++++++++----
3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/thermal.h b/arch/x86/include/asm/thermal.h
index ddbdefd5b94f..91a7b6687c3b 100644
--- a/arch/x86/include/asm/thermal.h
+++ b/arch/x86/include/asm/thermal.h
@@ -3,11 +3,13 @@
#define _ASM_X86_THERMAL_H

#ifdef CONFIG_X86_THERMAL_VECTOR
+void therm_lvt_init(void);
void intel_init_thermal(struct cpuinfo_x86 *c);
bool x86_thermal_enabled(void);
void intel_thermal_interrupt(void);
#else
-static inline void intel_init_thermal(struct cpuinfo_x86 *c) { }
+static inline void therm_lvt_init(void) { }
+static inline void intel_init_thermal(struct cpuinfo_x86 *c) { }
#endif

#endif /* _ASM_X86_THERMAL_H */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 72920af0b3c0..ff653d608d5f 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -44,6 +44,7 @@
#include <asm/pci-direct.h>
#include <asm/prom.h>
#include <asm/proto.h>
+#include <asm/thermal.h>
#include <asm/unwind.h>
#include <asm/vsyscall.h>
#include <linux/vmalloc.h>
@@ -1226,6 +1227,14 @@ void __init setup_arch(char **cmdline_p)

x86_init.timers.wallclock_init();

+ /*
+ * This needs to run before setup_local_APIC() which soft-disables the
+ * local APIC temporarily and that masks the thermal LVT interrupt,
+ * leading to softlockups on machines which have configured SMI
+ * interrupt delivery.
+ */
+ therm_lvt_init();
+
mcheck_init();

register_refined_jiffies(CLOCK_TICK_RATE);
diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
index f8e882592ba5..99abdc03c44c 100644
--- a/drivers/thermal/intel/therm_throt.c
+++ b/drivers/thermal/intel/therm_throt.c
@@ -621,6 +621,17 @@ bool x86_thermal_enabled(void)
return atomic_read(&therm_throt_en);
}

+void __init therm_lvt_init(void)
+{
+ /*
+ * This function is only called on boot CPU. Save the init thermal
+ * LVT value on BSP and use that value to restore APs' thermal LVT
+ * entry BIOS programmed later
+ */
+ if (intel_thermal_supported(&boot_cpu_data))
+ lvtthmr_init = apic_read(APIC_LVTTHMR);
+}
+
void intel_init_thermal(struct cpuinfo_x86 *c)
{
unsigned int cpu = smp_processor_id();
@@ -630,10 +641,6 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
if (!intel_thermal_supported(c))
return;

- /* On the BSP? */
- if (c == &boot_cpu_data)
- lvtthmr_init = apic_read(APIC_LVTTHMR);
-
/*
* First check if its enabled already, in which case there might
* be some SMM goo which handles it, so we can't even put a handler
--
2.29.2


--
Regards/Gruss,
Boris.

SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG Nürnberg

2021-05-27 11:52:20

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] x86/thermal: Fix LVT thermal setup for SMI delivery mode

On Thu, May 27 2021 at 12:31, Borislav Petkov wrote:
> @@ -1226,6 +1227,14 @@ void __init setup_arch(char **cmdline_p)
>
> x86_init.timers.wallclock_init();
>
> + /*
> + * This needs to run before setup_local_APIC() which soft-disables the
> + * local APIC temporarily and that masks the thermal LVT interrupt,
> + * leading to softlockups on machines which have configured SMI
> + * interrupt delivery.
> + */
> + therm_lvt_init();

That works, but TBH, it's a hack....

What I really fail to understand is how disabling that LVT entry makes
the machine lock up.

Also if disabling this entry is causing the BIOS/SMM gunk to go south, then
disabling CONFIG_X86_THERMAL_VECTOR should have the same effect.

Which made me look at other places like lapic_suspend/resume which does
the save/restore Kconfig conditional as well.

Thanks,

tglx





2021-05-27 19:05:41

by srinivas pandruvada

[permalink] [raw]
Subject: Re: [PATCH] x86/thermal: Fix LVT thermal setup for SMI delivery mode

On Thu, 2021-05-27 at 12:31 +0200, Borislav Petkov wrote:
> Ok,
>
> it took me a while to find a box like yours to reproduce on. Anyway,
> here's what looks like the final fix, you could give it a run.
>
> Thx.
>
> ---
> From: Borislav Petkov <[email protected]>
> Date: Thu, 27 May 2021 11:02:26 +0200
>
> There are machines out there with added value crap^WBIOS which
> provide an
> SMI handler for the local APIC thermal sensor interrupt. Out of
> reset,
> the BSP on those machines has something like 0x200 in that APIC
> register
> (timestamps left in because this whole issue is timing sensitive):
>
> [ 0.033858] read lvtthmr: 0x330, val: 0x200
>
> which means:
>
> - bit 16 - the interrupt mask bit is clear and thus that interrupt
> is enabled
> - bits [10:8] have 010b which means SMI delivery mode.
>
> Now, later during boot, when the kernel programs the local APIC, it
> soft-disables it temporarily through the spurious vector register:
>
> setup_local_APIC:
>
> ...
>
> /*
> * If this comes from kexec/kcrash the APIC might be enabled in
> * SPIV. Soft disable it before doing further initialization.
> */
> value = apic_read(APIC_SPIV);
> value &= ~APIC_SPIV_APIC_ENABLED;
> apic_write(APIC_SPIV, value);
>
> which means (from the SDM):
>
> "10.4.7.2 Local APIC State After It Has Been Software Disabled
>
> ...
>
> * The mask bits for all the LVT entries are set. Attempts to reset
> these
> bits will be ignored."
>
> And this happens too:
>
> [ 0.124111] APIC: Switch to symmetric I/O mode setup
> [ 0.124117] lvtthmr 0x200 before write 0xf to APIC 0xf0
> [ 0.124118] lvtthmr 0x10200 after write 0xf to APIC 0xf0
>
> This results in CPU 0 soft lockups depending on the placement in time
> when the APIC soft-disable happens. Those soft lockups are not 100%
> reproducible and the reason for that can only be speculated as no one
> tells you what SMM does. Likely, it confuses the SMM code that the
> APIC
> is disabled and the thermal interrupt doesn't doesn't fire at all,

My guess is that system is booting hot sometimes. SMM started fan or
some cooling and set a temperature threshold. It is waiting for thermal
interrupt for temperature threshold, which it never got.


Thanks,
Srinivas


> leading to CPU 0 stuck in SMM forever...
>
> Now, before
>
> 4f432e8bb15b ("x86/mce: Get rid of mcheck_intel_therm_init()")
>
> due to how the APIC_LVTTHMR was read before APIC initialization in
> mcheck_intel_therm_init(), it would read the value with the mask bit
> 16
> clear and then intel_init_thermal() would replicate it onto the APs
> and
> all would be peachy - the thermal interrupt would remain enabled.
>
> But that commit moved that reading to a later moment in
> intel_init_thermal(), resulting in reading APIC_LVTTHMR on the BSP
> too
> late and with its interrupt mask bit set.
>
> Thus, revert back to the old behavior of reading the thermal LVT
> register before the APIC gets initialized.
>
> Fixes: 4f432e8bb15b ("x86/mce: Get rid of mcheck_intel_therm_init()")
> Reported-by: James Feeney <[email protected]>
> Signed-off-by: Borislav Petkov <[email protected]>
> Cc: <[email protected]>
> Cc: Zhang Rui <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Link: https://lkml.kernel.org/r/[email protected]
> ---
> arch/x86/include/asm/thermal.h | 4 +++-
> arch/x86/kernel/setup.c | 9 +++++++++
> drivers/thermal/intel/therm_throt.c | 15 +++++++++++----
> 3 files changed, 23 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/thermal.h
> b/arch/x86/include/asm/thermal.h
> index ddbdefd5b94f..91a7b6687c3b 100644
> --- a/arch/x86/include/asm/thermal.h
> +++ b/arch/x86/include/asm/thermal.h
> @@ -3,11 +3,13 @@
> #define _ASM_X86_THERMAL_H
>
> #ifdef CONFIG_X86_THERMAL_VECTOR
> +void therm_lvt_init(void);
> void intel_init_thermal(struct cpuinfo_x86 *c);
> bool x86_thermal_enabled(void);
> void intel_thermal_interrupt(void);
> #else
> -static inline void intel_init_thermal(struct cpuinfo_x86 *c) { }
> +static inline void therm_lvt_init(void)
> { }
> +static inline void intel_init_thermal(struct cpuinfo_x86 *c) { }
> #endif
>
> #endif /* _ASM_X86_THERMAL_H */
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 72920af0b3c0..ff653d608d5f 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -44,6 +44,7 @@
> #include <asm/pci-direct.h>
> #include <asm/prom.h>
> #include <asm/proto.h>
> +#include <asm/thermal.h>
> #include <asm/unwind.h>
> #include <asm/vsyscall.h>
> #include <linux/vmalloc.h>
> @@ -1226,6 +1227,14 @@ void __init setup_arch(char **cmdline_p)
>
> x86_init.timers.wallclock_init();
>
> + /*
> + * This needs to run before setup_local_APIC() which soft-
> disables the
> + * local APIC temporarily and that masks the thermal LVT
> interrupt,
> + * leading to softlockups on machines which have configured SMI
> + * interrupt delivery.
> + */
> + therm_lvt_init();
> +
> mcheck_init();
>
> register_refined_jiffies(CLOCK_TICK_RATE);
> diff --git a/drivers/thermal/intel/therm_throt.c
> b/drivers/thermal/intel/therm_throt.c
> index f8e882592ba5..99abdc03c44c 100644
> --- a/drivers/thermal/intel/therm_throt.c
> +++ b/drivers/thermal/intel/therm_throt.c
> @@ -621,6 +621,17 @@ bool x86_thermal_enabled(void)
> return atomic_read(&therm_throt_en);
> }
>
> +void __init therm_lvt_init(void)
> +{
> + /*
> + * This function is only called on boot CPU. Save the init
> thermal
> + * LVT value on BSP and use that value to restore APs' thermal
> LVT
> + * entry BIOS programmed later
> + */
> + if (intel_thermal_supported(&boot_cpu_data))
> + lvtthmr_init = apic_read(APIC_LVTTHMR);
> +}
> +
> void intel_init_thermal(struct cpuinfo_x86 *c)
> {
> unsigned int cpu = smp_processor_id();
> @@ -630,10 +641,6 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
> if (!intel_thermal_supported(c))
> return;
>
> - /* On the BSP? */
> - if (c == &boot_cpu_data)
> - lvtthmr_init = apic_read(APIC_LVTTHMR);
> -
> /*
> * First check if its enabled already, in which case there
> might
> * be some SMM goo which handles it, so we can't even put a
> handler
> --
> 2.29.2
>
>

2021-05-27 19:08:01

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/thermal: Fix LVT thermal setup for SMI delivery mode

On Thu, May 27, 2021 at 01:49:42PM +0200, Thomas Gleixner wrote:
> Also if disabling this entry is causing the BIOS/SMM gunk to go south, then
> disabling CONFIG_X86_THERMAL_VECTOR should have the same effect.

Well, I believe the latter was practically impossible:

config X86_THERMAL_VECTOR
def_bool y
depends on X86 && CPU_SUP_INTEL && X86_LOCAL_APIC

so *if* the machine was sporting thermal SMI, then it very much needed
that code in intel_init_thermal() to unstuck the cores. At least it
looks that way...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-05-27 20:30:21

by srinivas pandruvada

[permalink] [raw]
Subject: Re: [PATCH] x86/thermal: Fix LVT thermal setup for SMI delivery mode

On Thu, 2021-05-27 at 21:01 +0200, Borislav Petkov wrote:
> On Thu, May 27, 2021 at 11:09:59AM -0700, Srinivas Pandruvada wrote:
> > My guess is that system is booting hot sometimes. SMM started fan
> > or
> > some cooling and set a temperature threshold. It is waiting for
> > thermal
> > interrupt for temperature threshold, which it never got.
>
> Are you saying that that replication of lvtthmr_init to the APs in
> intel_init_thermal() is absolutely needed on those SMI machines
> running
> hot?

We have seen some SMM uses thermal interrupts. We had one issue in one
Yoga systems several years back where SMM handling of thermal interrupt
related to HWP caused hard hang as it crashed there.
So yes, there may be special thing for cooling also.

>
> That thing:
>
> * If BIOS takes over the thermal interrupt and sets its
> interrupt
> * delivery mode to SMI (not fixed), it restores the value
> that the
> * BIOS has programmed on AP based on BSP's info we saved
> since BIOS
> * is always setting the same value for all threads/cores.
>
> ?
>
> Me moving that lvtthmr_init read later would replicate the wrong
> value
> because we'd soft-disable the APIC and thus the core would lockup
> waiting...
I think so.
I will try to force replicate wrong value in Yoga system which used to
crash in thermal interrupt handling of SMM code and check what happens.
This shouldn't crash as it will not get thermal interrupt. Since the
system is not with me, I can try next week.

>
> The other interesting thing is that the core would always lockup when
> trying to IPI another core to remote-flush the TLBs.
>
Here I think the other core didn't exit SMM mode.

Thanks,
Srinivas


2021-05-27 20:49:14

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/thermal: Fix LVT thermal setup for SMI delivery mode

On Thu, May 27, 2021 at 01:49:42PM +0200, Thomas Gleixner wrote:
> That works, but TBH, it's a hack....

Yeah, it restores to the previous behavior.

> What I really fail to understand is how disabling that LVT entry makes
> the machine lock up.

The observation is is that on James' box, *sometimes* - not always - it
would softlockup on CPU0 without this early LVT thermal read.

I can try to reproduce on the box I have here, lemme see...

> Also if disabling this entry is causing the BIOS/SMM gunk to go south, then
> disabling CONFIG_X86_THERMAL_VECTOR should have the same effect.
>
> Which made me look at other places like lapic_suspend/resume which does
> the save/restore Kconfig conditional as well.

The fact that it doesn't reproduce always shows that there's something
else at play too. As I said in the commit message: "Those soft lockups
are not 100% reproducible and the reason for that can only be speculated
as no one tells you what SMM does."

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-05-27 23:55:52

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/thermal: Fix LVT thermal setup for SMI delivery mode

On Thu, May 27, 2021 at 11:09:59AM -0700, Srinivas Pandruvada wrote:
> My guess is that system is booting hot sometimes. SMM started fan or
> some cooling and set a temperature threshold. It is waiting for thermal
> interrupt for temperature threshold, which it never got.

Are you saying that that replication of lvtthmr_init to the APs in
intel_init_thermal() is absolutely needed on those SMI machines running
hot?

That thing:

* If BIOS takes over the thermal interrupt and sets its interrupt
* delivery mode to SMI (not fixed), it restores the value that the
* BIOS has programmed on AP based on BSP's info we saved since BIOS
* is always setting the same value for all threads/cores.

?

Me moving that lvtthmr_init read later would replicate the wrong value
because we'd soft-disable the APIC and thus the core would lockup
waiting...

The other interesting thing is that the core would always lockup when
trying to IPI another core to remote-flush the TLBs.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-05-28 08:27:31

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] x86/thermal: Fix LVT thermal setup for SMI delivery mode

On Thu, May 27 2021 at 20:54, Borislav Petkov wrote:
> On Thu, May 27, 2021 at 01:49:42PM +0200, Thomas Gleixner wrote:
>> Also if disabling this entry is causing the BIOS/SMM gunk to go south, then
>> disabling CONFIG_X86_THERMAL_VECTOR should have the same effect.
>
> Well, I believe the latter was practically impossible:
>
> config X86_THERMAL_VECTOR
> def_bool y
> depends on X86 && CPU_SUP_INTEL && X86_LOCAL_APIC

So you could disable CPU_SUP_INTEL ... Should still boot with reduced
functionality.

> so *if* the machine was sporting thermal SMI, then it very much needed
> that code in intel_init_thermal() to unstuck the cores. At least it
> looks that way...

What a mess...

2021-05-28 11:14:34

by James Feeney

[permalink] [raw]
Subject: Re: [PATCH] x86/thermal: Fix LVT thermal setup for SMI delivery mode

On 5/27/21 4:31 AM, Borislav Petkov wrote:
> it took me a while to find a box like yours to reproduce on. Anyway,
> here's what looks like the final fix, you could give it a run.

Going through the motions, I can confirm "no surprises". The final patch is working as expected, with both full power-off boots and "warm" reboots.

On 5/27/21 12:09 PM, Srinivas Pandruvada wrote:
> My guess is that system is booting hot sometimes. SMM started fan or
> some cooling and set a temperature threshold. It is waiting for thermal
> interrupt for temperature threshold, which it never got.

I remind, that when the boot would hang at "Trying to unpack rootfs image as initramfs...", then if the machine is let sit in this state, the fan will begin to run full, off and on, suggesting that maybe the processor is still running and running full power. Something - maybe the management engine? - is still running, and able to cycle the fan speed, high and low. There *must*, then, be "high" and "low" temperature thresholds being reached, as the fan speed cycles, and there is *no* continuation of the boot process when either of these thresholds is reached. So that interpretation - waiting for thermal interrupt - does not seem to address all the circumstances.

James

2021-05-28 13:34:46

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/thermal: Fix LVT thermal setup for SMI delivery mode

On Fri, May 28, 2021 at 10:23:42AM +0200, Thomas Gleixner wrote:
> So you could disable CPU_SUP_INTEL ... Should still boot with reduced
> functionality.

*If* you disable it and *if* that intel_init_thermal() is really there
to unstick those cores, then you might not even boot successfully.

> What a mess...

When is it not when SMM is involved?

That stinking goo underneath the OS needs to be open sourced so that we
can replace it with something small and sane. BIOS morons will never
learn not to fiddle with architectural components.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-05-31 18:28:46

by James Feeney

[permalink] [raw]
Subject: Re: [PATCH] x86/thermal: Fix LVT thermal setup for SMI delivery mode

On 5/28/21 5:19 AM, Borislav Petkov wrote:
> On Fri, May 28, 2021 at 10:23:42AM +0200, Thomas Gleixner wrote:
>> So you could disable CPU_SUP_INTEL ... Should still boot with reduced
>> functionality.
>
> *If* you disable it and *if* that intel_init_thermal() is really there
> to unstick those cores, then you might not even boot successfully.
>
>> What a mess...
>
> When is it not when SMM is involved?
>
> That stinking goo underneath the OS needs to be open sourced so that we
> can replace it with something small and sane. BIOS morons will never
> learn not to fiddle with architectural components.
>

Has the patch that resolves - or at least pacifies - this issue been forwarded downstream to GKH for mainline yet? I'd like to be able to upgrade with stock kernels again.

James

2021-05-31 21:48:11

by tip-bot2 for Jacob Pan

[permalink] [raw]
Subject: [tip: x86/urgent] x86/thermal: Fix LVT thermal setup for SMI delivery mode

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 9a90ed065a155d13db0d0ffeaad5cc54e51c90c6
Gitweb: https://git.kernel.org/tip/9a90ed065a155d13db0d0ffeaad5cc54e51c90c6
Author: Borislav Petkov <[email protected]>
AuthorDate: Thu, 27 May 2021 11:02:26 +02:00
Committer: Borislav Petkov <[email protected]>
CommitterDate: Mon, 31 May 2021 22:32:26 +02:00

x86/thermal: Fix LVT thermal setup for SMI delivery mode

There are machines out there with added value crap^WBIOS which provide an
SMI handler for the local APIC thermal sensor interrupt. Out of reset,
the BSP on those machines has something like 0x200 in that APIC register
(timestamps left in because this whole issue is timing sensitive):

[ 0.033858] read lvtthmr: 0x330, val: 0x200

which means:

- bit 16 - the interrupt mask bit is clear and thus that interrupt is enabled
- bits [10:8] have 010b which means SMI delivery mode.

Now, later during boot, when the kernel programs the local APIC, it
soft-disables it temporarily through the spurious vector register:

setup_local_APIC:

...

/*
* If this comes from kexec/kcrash the APIC might be enabled in
* SPIV. Soft disable it before doing further initialization.
*/
value = apic_read(APIC_SPIV);
value &= ~APIC_SPIV_APIC_ENABLED;
apic_write(APIC_SPIV, value);

which means (from the SDM):

"10.4.7.2 Local APIC State After It Has Been Software Disabled

...

* The mask bits for all the LVT entries are set. Attempts to reset these
bits will be ignored."

And this happens too:

[ 0.124111] APIC: Switch to symmetric I/O mode setup
[ 0.124117] lvtthmr 0x200 before write 0xf to APIC 0xf0
[ 0.124118] lvtthmr 0x10200 after write 0xf to APIC 0xf0

This results in CPU 0 soft lockups depending on the placement in time
when the APIC soft-disable happens. Those soft lockups are not 100%
reproducible and the reason for that can only be speculated as no one
tells you what SMM does. Likely, it confuses the SMM code that the APIC
is disabled and the thermal interrupt doesn't doesn't fire at all,
leading to CPU 0 stuck in SMM forever...

Now, before

4f432e8bb15b ("x86/mce: Get rid of mcheck_intel_therm_init()")

due to how the APIC_LVTTHMR was read before APIC initialization in
mcheck_intel_therm_init(), it would read the value with the mask bit 16
clear and then intel_init_thermal() would replicate it onto the APs and
all would be peachy - the thermal interrupt would remain enabled.

But that commit moved that reading to a later moment in
intel_init_thermal(), resulting in reading APIC_LVTTHMR on the BSP too
late and with its interrupt mask bit set.

Thus, revert back to the old behavior of reading the thermal LVT
register before the APIC gets initialized.

Fixes: 4f432e8bb15b ("x86/mce: Get rid of mcheck_intel_therm_init()")
Reported-by: James Feeney <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/thermal.h | 4 +++-
arch/x86/kernel/setup.c | 9 +++++++++
drivers/thermal/intel/therm_throt.c | 15 +++++++++++----
3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/thermal.h b/arch/x86/include/asm/thermal.h
index ddbdefd..91a7b66 100644
--- a/arch/x86/include/asm/thermal.h
+++ b/arch/x86/include/asm/thermal.h
@@ -3,11 +3,13 @@
#define _ASM_X86_THERMAL_H

#ifdef CONFIG_X86_THERMAL_VECTOR
+void therm_lvt_init(void);
void intel_init_thermal(struct cpuinfo_x86 *c);
bool x86_thermal_enabled(void);
void intel_thermal_interrupt(void);
#else
-static inline void intel_init_thermal(struct cpuinfo_x86 *c) { }
+static inline void therm_lvt_init(void) { }
+static inline void intel_init_thermal(struct cpuinfo_x86 *c) { }
#endif

#endif /* _ASM_X86_THERMAL_H */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 72920af..ff653d6 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -44,6 +44,7 @@
#include <asm/pci-direct.h>
#include <asm/prom.h>
#include <asm/proto.h>
+#include <asm/thermal.h>
#include <asm/unwind.h>
#include <asm/vsyscall.h>
#include <linux/vmalloc.h>
@@ -1226,6 +1227,14 @@ void __init setup_arch(char **cmdline_p)

x86_init.timers.wallclock_init();

+ /*
+ * This needs to run before setup_local_APIC() which soft-disables the
+ * local APIC temporarily and that masks the thermal LVT interrupt,
+ * leading to softlockups on machines which have configured SMI
+ * interrupt delivery.
+ */
+ therm_lvt_init();
+
mcheck_init();

register_refined_jiffies(CLOCK_TICK_RATE);
diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
index f8e8825..99abdc0 100644
--- a/drivers/thermal/intel/therm_throt.c
+++ b/drivers/thermal/intel/therm_throt.c
@@ -621,6 +621,17 @@ bool x86_thermal_enabled(void)
return atomic_read(&therm_throt_en);
}

+void __init therm_lvt_init(void)
+{
+ /*
+ * This function is only called on boot CPU. Save the init thermal
+ * LVT value on BSP and use that value to restore APs' thermal LVT
+ * entry BIOS programmed later
+ */
+ if (intel_thermal_supported(&boot_cpu_data))
+ lvtthmr_init = apic_read(APIC_LVTTHMR);
+}
+
void intel_init_thermal(struct cpuinfo_x86 *c)
{
unsigned int cpu = smp_processor_id();
@@ -630,10 +641,6 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
if (!intel_thermal_supported(c))
return;

- /* On the BSP? */
- if (c == &boot_cpu_data)
- lvtthmr_init = apic_read(APIC_LVTTHMR);
-
/*
* First check if its enabled already, in which case there might
* be some SMM goo which handles it, so we can't even put a handler