2023-10-12 09:38:03

by Bagas Sanjaya

Subject: Fwd: Kernel 6.5 hangs on shutdown

Hi,

I notice a regression report on Bugzilla [1]. Quoting from it:

> I use Dell OptiPlex 7050, and kernel hangs when shutting down the computer.
> Similar symptom has been reported on some forums, and all of them are using
> Dell computers:
> https://bbs.archlinux.org/viewtopic.php?pid=2124429
> https://www.reddit.com/r/openSUSE/comments/16qq99b/tumbleweed_shutdown_did_not_finish_completely/
> https://forum.artixlinux.org/index.php/topic,5997.0.html
>
> Tested with various kernel and this bug seems to be caused by commit: 88afbb21d4b36fee6acaa167641f9f0fc122f01b.

See Bugzilla for the full thread.

Anyway, I'm adding this regression to be tracked by regzbot:

#regzbot introduced: 88afbb21d4b36f https://bugzilla.kernel.org/show_bug.cgi?id=217995
#regzbot title: x86 core fix pull causes shutdown hang on Dell OptiPlex 7050
#regzbot link: https://bbs.archlinux.org/viewtopic.php?pid=2124429
#regzbot link: https://www.reddit.com/r/openSUSE/comments/16qq99b/tumbleweed_shutdown_did_not_finish_completely/
#regzbot link: https://forum.artixlinux.org/index.php/topic,5997.0.html
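
The tracking commands above follow a simple "#regzbot <command>: <argument>" shape. As a rough illustration only, the sketch below pulls such lines out of a mail body. The regular expression and the function name parse_regzbot_commands are assumptions made for this example; this is not regzbot's own parser or grammar.

```python
import re

# Matches a line such as "#regzbot introduced: 88afbb21d4b36f <url>".
# Hypothetical pattern for illustration; not regzbot's own grammar.
REGZBOT_LINE = re.compile(r"^#regzbot\s+([\w-]+)(?::\s*(.*))?$")


def parse_regzbot_commands(mail_body):
    """Return (command, argument) pairs for every unquoted #regzbot line."""
    commands = []
    for line in mail_body.splitlines():
        match = REGZBOT_LINE.match(line.strip())
        if match:
            # Commands without an argument get an empty string.
            commands.append((match.group(1), (match.group(2) or "").strip()))
    return commands
```

Lines that start with a quote marker (">") are skipped in this sketch, so commands quoted in a reply are not counted a second time.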

Thanks.

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=217995

--
An old man doll... just what I always wanted! - Clara


2023-10-13 12:05:54

by Thorsten Leemhuis

Subject: [regression] some Dell systems hang at shutdown due to "x86/smp: Put CPUs into INIT on shutdown if possible" (was Fwd: Kernel 6.5 hangs on shutdown)

[CCing x86 maintainers]

Hi Thomas!

On 12.10.23 11:37, Bagas Sanjaya wrote:
>
> I notice a regression report on Bugzilla [1]. Quoting from it:
>> I use Dell OptiPlex 7050, and kernel hangs when shutting down the computer.
>> Similar symptom has been reported on some forums, and all of them are using
>> Dell computers:
>> https://bbs.archlinux.org/viewtopic.php?pid=2124429
>> https://www.reddit.com/r/openSUSE/comments/16qq99b/tumbleweed_shutdown_did_not_finish_completely/
>> https://forum.artixlinux.org/index.php/topic,5997.0.html

Another report: https://bugzilla.redhat.com/show_bug.cgi?id=2241279

From all those links it seems quite a lot of users with Dell machines
are affected by this problem.

>> Tested with various kernel and this bug seems to be caused by commit: 88afbb21d4b36fee6acaa167641f9f0fc122f01b.

Thomas, turns out that bisection result was slightly wrong: a recheck
confirmed that the regression is actually caused by 45e34c8af58f23
("x86/smp: Put CPUs into INIT on shutdown if possible") [v6.5-rc1] of
yours. See https://bugzilla.kernel.org/show_bug.cgi?id=217995 for details.

Ciao, Thorsten

> Anyway, I'm adding this regression to be tracked by regzbot:
> [...]

#regzbot introduced: 45e34c8af58f
#regzbot link: https://bugzilla.redhat.com/show_bug.cgi?id=2241279

2023-10-13 17:48:51

by Linus Torvalds

Subject: Re: [regression] some Dell systems hang at shutdown due to "x86/smp: Put CPUs into INIT on shutdown if possible" (was Fwd: Kernel 6.5 hangs on shutdown)

On Fri, 13 Oct 2023 at 05:05, Linux regression tracking (Thorsten
Leemhuis) <[email protected]> wrote:
>
> Thomas, turns out that bisection result was slightly wrong: a recheck
> confirmed that the regression is actually caused by 45e34c8af58f23
> ("x86/smp: Put CPUs into INIT on shutdown if possible") [v6.5-rc1] of
> yours. See https://bugzilla.kernel.org/show_bug.cgi?id=217995 for details.

That commit does look pretty dangerous.

If *anything* is done through SMI after the code does that
smp_park_other_cpus_in_init() sequence, I wouldn't be surprised in the
least if the machine is hung.

That's made worse since it looks like the shutdown sequence isn't
necessarily run on the boot CPU, so the boot CPU itself may be in
INIT, and any SMI quite possibly ends up treating that CPU specially.

Who knows what SMI does, but the fact that the affected machines seem
to be mainly from one particular manufacturer does tend to imply it's
something like that.

And the code does do a fair amount *after* shutting down cpu's. Not
just things like calling x86_platform.iommu_shutdown(), but also
things like possibly the tboot shutdown sequence (which almost
*certainly* is some SMI thing).

I dunno. Thomas - I think the argument for that commit was fairly
theoretical, and reverting it seems the obvious thing, unless you have
some idea of what might be wrong.

Linus

2023-10-13 18:28:57

by Ashok Raj

Subject: Re: [regression] some Dell systems hang at shutdown due to "x86/smp: Put CPUs into INIT on shutdown if possible" (was Fwd: Kernel 6.5 hangs on shutdown)

Hi

On Fri, Oct 13, 2023 at 10:48:19AM -0700, Linus Torvalds wrote:
> On Fri, 13 Oct 2023 at 05:05, Linux regression tracking (Thorsten
> Leemhuis) <[email protected]> wrote:
> >
> > Thomas, turns out that bisection result was slightly wrong: a recheck
> > confirmed that the regression is actually caused by 45e34c8af58f23
> > ("x86/smp: Put CPUs into INIT on shutdown if possible") [v6.5-rc1] of
> > yours. See https://bugzilla.kernel.org/show_bug.cgi?id=217995 for details.
>
> That commit does look pretty dangerous.
>
> If *anything* is done through SMI after the code does that
> smp_park_other_cpus_in_init() sequence, I wouldn't be surprised in the
> least if the machine is hung.
>
> That's made worse since it looks like the shutdown sequence isn't
> necessarily run on the boot CPU, so the boot CPU itself may be in
> INIT, and any SMI quite possibly ends up treating that CPU specially.

Sending INIT to the processor marked as BSP will tank the system.

>
> Who knows what SMI does, but the fact that the affected machines seem
> to be mainly from one particular manufacturer does tend to imply it's
> something like that.

There was a report (probably this same one), and it turns out it was a
bug in the BIOS SMI handler.

The client BIOSes were waiting for the CPU with the lowest APIC ID to be
the SMI rendezvous master. If this is MeteorLake, the BSP wasn't the one
with the lowest APIC ID and it tripped here.

The BIOS change is also being pushed to others for assimilation :)

Server BIOSes have had this correct for a while now.
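
To make the failure mode described above concrete, here is a small toy model. Everything in it is an assumption for illustration: the Cpu class, the APIC ID values, the idea that a CPU parked in INIT never joins the rendezvous, and the "BSP leads" alternative, which is just one possible policy and not necessarily what the BIOS change does.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    apic_id: int
    is_bsp: bool = False
    parked_in_init: bool = False  # True once the kernel has sent it INIT


def pick_master_lowest_apic_id(cpus):
    """Policy described as buggy: the CPU with the lowest APIC ID leads."""
    return min(cpus, key=lambda cpu: cpu.apic_id)


def pick_master_bsp(cpus):
    """Alternative policy (an assumption): the BSP leads the rendezvous."""
    return next(cpu for cpu in cpus if cpu.is_bsp)


def rendezvous_completes(cpus, pick_master):
    """Assumption: a master that is parked in INIT never shows up."""
    return not pick_master(cpus).parked_in_init


def park_all_but_bsp(cpus):
    """Model of the shutdown path: every CPU except the BSP receives INIT."""
    return [Cpu(c.apic_id, c.is_bsp, parked_in_init=not c.is_bsp) for c in cpus]
```

With a made-up topology such as [Cpu(0), Cpu(16, is_bsp=True)], parking all non-BSP CPUs leaves the lowest-APIC-ID CPU in INIT, so the first policy never completes in this model while the second one does.
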
>
> And the code does do a fair amount *after* shutting down cpu's. Not
> just things like calling x86_platform.iommu_shutdown(), but also
> things like possibly the tboot shutdown sequence (which almost
> *certainly* is some SMI thing).
>
> I dunno. Thomas - I think the argument for that commit was fairly
> theoretical, and reverting it seems the obvious thing, unless you have
> some idea of what might be wrong.
>
> Linus

--
Cheers,
Ashok

2023-10-13 19:40:19

by Thomas Gleixner

Subject: Re: [regression] some Dell systems hang at shutdown due to "x86/smp: Put CPUs into INIT on shutdown if possible" (was Fwd: Kernel 6.5 hangs on shutdown)

On Fri, Oct 13 2023 at 10:48, Linus Torvalds wrote:
> On Fri, 13 Oct 2023 at 05:05, Linux regression tracking (Thorsten
> Leemhuis) <[email protected]> wrote:
>>
>> Thomas, turns out that bisection result was slightly wrong: a recheck
>> confirmed that the regression is actually caused by 45e34c8af58f23
>> ("x86/smp: Put CPUs into INIT on shutdown if possible") [v6.5-rc1] of
>> yours. See https://bugzilla.kernel.org/show_bug.cgi?id=217995 for details.
>
> That commit does look pretty dangerous.
>
> If *anything* is done through SMI after the code does that
> smp_park_other_cpus_in_init() sequence, I wouldn't be surprised in the
> least if the machine is hung.
>
> That's made worse since it looks like the shutdown sequence isn't
> necessarily run on the boot CPU, so the boot CPU itself may be in
> INIT, and any SMI quite possibly ends up treating that CPU specially.

smp_park_other_cpus_in_init() bails out early when it's not invoked on
the boot CPU because sending INIT to the BSP results in a full machine
reset. So that's definitely not the problem.
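
The early bail-out described above can be modelled in a few lines. This is a simplified sketch and not the kernel source: the parameter names and the send_init callback are assumptions introduced for the example.

```python
def park_other_cpus_in_init(this_cpu, boot_cpu, present_cpus, send_init):
    """Model of the early bail-out; returns True only if INIT was used."""
    if this_cpu != boot_cpu:
        # INIT to the boot CPU would reset the machine, so give up and
        # let the caller fall back to another way of stopping CPUs.
        return False
    for cpu in present_cpus:
        if cpu != this_cpu:
            send_init(cpu)
    return True
```

Called from the boot CPU, the model sends INIT to every other present CPU. Called from any other CPU, it sends nothing and reports failure, so the boot CPU itself is never put into INIT.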

> Who knows what SMI does, but the fact that the affected machines seem
> to be mainly from one particular manufacturer does tend to imply it's
> something like that.

It's mostly DELL machines. The rest seems to be Lenovo and Sony with
Alderlake/Raptorlake CPUs - at least that's what I could figure out from
the various bug reports. I don't know which CPUs the DELL machines have,
so I can't say it's a pattern.

Bagas, can you please provide the output of /proc/cpuinfo ?
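
For reporters who only want to share the CPU model rather than the whole file, a minimal sketch along these lines would do. It assumes the usual "key : value" layout of /proc/cpuinfo on x86, and the helper name cpu_model_names is made up for this example.

```python
def cpu_model_names(cpuinfo_text):
    """Return the distinct 'model name' values, in first-seen order."""
    names = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            name = value.strip()
            if name not in names:
                names.append(name)
    return names
```

On an affected machine it could be fed with open("/proc/cpuinfo").read(). The full output is still more useful for debugging, since it also carries family, model and stepping.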

> And the code does do a fair amount *after* shutting down cpu's. Not
> just things like calling x86_platform.iommu_shutdown(), but also
> things like possibly the tboot shutdown sequence (which almost
> *certainly* is some SMI thing).

That should not matter, but who the heck knows.

> I dunno. Thomas - I think the argument for that commit was fairly
> theoretical, and reverting it seems the obvious thing, unless you have
> some idea of what might be wrong.

I agree with the revert for now.

The problem is not entirely theoretical in the kexec() case, but yes for
shutdown/reboot it's irrelevant.

The reason why I ended up with this is the initial problem of soft
offlined CPUs sitting in MWAIT. The kexec() kernel can end up writing to
the monitor cache line reliably after it overwrote the original kernel
mappings, which results in completely undebuggable chaos or triple
faults.

The MWAIT issue is mitigated by writing to the monitor cache lines and
forcing the CPUs into HLT.

Extensive testing revealed that HLT is not entirely safe either, so we
ended up with the INIT trick, which turned out to be very reliable in
testing. Though it's obviously making some BIOSes very unhappy. Sigh...

Did I mention before that I hate computers with a passion?

Thanks,

tglx

2023-10-16 08:46:52

by Thorsten Leemhuis

Subject: Re: Fwd: Kernel 6.5 hangs on shutdown

[TLDR: This mail is primarily relevant for Linux kernel regression
tracking. See link in footer if these mails annoy you.]

On 12.10.23 11:37, Bagas Sanjaya wrote:
>
> I notice a regression report on Bugzilla [1]. Quoting from it:
>
>> I use Dell OptiPlex 7050, and kernel hangs when shutting down the computer.
>> Similar symptom has been reported on some forums, and all of them are using
>> Dell computers:
>> https://bbs.archlinux.org/viewtopic.php?pid=2124429
>> https://www.reddit.com/r/openSUSE/comments/16qq99b/tumbleweed_shutdown_did_not_finish_completely/
>> https://forum.artixlinux.org/index.php/topic,5997.0.html
>>
>> Tested with various kernel and this bug seems to be caused by commit: 88afbb21d4b36fee6acaa167641f9f0fc122f01b.
>
> See Bugzilla for the full thread.
>
> Anyway, I'm adding this regression to be tracked by regzbot:
>
> #regzbot introduced: 88afbb21d4b36f https://bugzilla.kernel.org/show_bug.cgi?id=217995
> #regzbot title: x86 core fix pull causes shutdown hang on Dell OptiPlex 7050
> #regzbot link: https://bbs.archlinux.org/viewtopic.php?pid=2124429
> #regzbot link: https://www.reddit.com/r/openSUSE/comments/16qq99b/tumbleweed_shutdown_did_not_finish_completely/
> #regzbot link: https://forum.artixlinux.org/index.php/topic,5997.0.html

#regzbot fix: fbe1bf1e5ff1e3b298420d7a8434983ef8d72bd1
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.