2019-03-14 15:30:45

by Thomas Müller

Subject: disabling secondary CPU hangs / system fails to suspend with kernel 4.19+

Hi,

starting with kernel 4.19 my Lenovo ThinkPad X1 Carbon 5th no longer properly suspends.

This is 100% reproducible and git bisect points to the following commit:
> [be45bf5395e0886a93fc816bbe41a008ec2e42e2] watchdog/softlockup: Fix cpu_stop_queue_work() double-queue bug
> be45bf5395e0886a93fc816bbe41a008ec2e42e2 is the first bad commit
> commit be45bf5395e0886a93fc816bbe41a008ec2e42e2
> Author: Peter Zijlstra <[email protected]>
> Date: Fri Jul 13 12:42:08 2018 +0200
>
> watchdog/softlockup: Fix cpu_stop_queue_work() double-queue bug
>
> When scheduling is delayed for longer than the softlockup interrupt
> period it is possible to double-queue the cpu_stop_work, causing list
> corruption.
>
> Cure this by adding a completion to track the cpu_stop_work's
> progress.
>
> Reported-by: kernel test robot <[email protected]>
> Tested-by: Rong Chen <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Fixes: 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Ingo Molnar <[email protected]>
>
> :040000 040000 6aca2dbb84bc33fe442b18b3d0a135c27adff7b9 2710af12d32e4b98df07768716689b213bce45fc M kernel

The bugzilla reports have some additional details:
* https://bugzilla.redhat.com/show_bug.cgi?id=1671504
* https://bugzilla.kernel.org/show_bug.cgi?id=202679
* https://bugzilla.kernel.org/show_bug.cgi?id=202137

I'm happy to provide additional information or test a patch or two (as long as it doesn't
eat up my notebook ;))


Best regards
Thomas


2019-03-15 09:10:44

by Peter Zijlstra

Subject: Re: disabling secondary CPU hangs / system fails to suspend with kernel 4.19+

On Thu, Mar 14, 2019 at 04:17:28PM +0100, Thomas Müller wrote:
> Hi,
>
> starting with kernel 4.19 my Lenovo ThinkPad X1 Carbon 5th no longer properly suspends.
>
> This is 100% reproducible and git bisect points to the following commit:
> > [be45bf5395e0886a93fc816bbe41a008ec2e42e2] watchdog/softlockup: Fix cpu_stop_queue_work() double-queue bug
> > be45bf5395e0886a93fc816bbe41a008ec2e42e2 is the first bad commit
> > commit be45bf5395e0886a93fc816bbe41a008ec2e42e2
> > Author: Peter Zijlstra <[email protected]>
> > Date: Fri Jul 13 12:42:08 2018 +0200
> >
> > watchdog/softlockup: Fix cpu_stop_queue_work() double-queue bug
> >
> > When scheduling is delayed for longer than the softlockup interrupt
> > period it is possible to double-queue the cpu_stop_work, causing list
> > corruption.
> >
> > Cure this by adding a completion to track the cpu_stop_work's
> > progress.
> >
> > Reported-by: kernel test robot <[email protected]>
> > Tested-by: Rong Chen <[email protected]>
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > Cc: Linus Torvalds <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Fixes: 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
> > Link: http://lkml.kernel.org/r/[email protected]
> > Signed-off-by: Ingo Molnar <[email protected]>
> >
> > :040000 040000 6aca2dbb84bc33fe442b18b3d0a135c27adff7b9 2710af12d32e4b98df07768716689b213bce45fc M kernel
>
> The bugzilla reports have some additional details:
> * https://bugzilla.redhat.com/show_bug.cgi?id=1671504
> * https://bugzilla.kernel.org/show_bug.cgi?id=202679
> * https://bugzilla.kernel.org/show_bug.cgi?id=202137
>
> I'm happy to provide additional information or test a patch or two (as long as it doesn't
> eat up my notebook ;))

I obviously cannot reproduce :/ Both cpu-hotplug and suspend work just
fine on my test boxes. I even tried my thinkpad (x240) and that too goes
to sleep and wakes up just fine.

What .config do you have? And what, if anything, do you see on the
console when it goes funny?

I think you wrote that hot-un-plug never completes? Is there anything in
dmesg when it's stuck in:

echo 0 > /sys/devices/system/cpu/cpu1/online

?

2019-03-15 11:43:37

by Thomas Müller

Subject: Re: disabling secondary CPU hangs / system fails to suspend with kernel 4.19+

Hi,

On 15.03.19 at 10:09, Peter Zijlstra wrote:
> On Thu, Mar 14, 2019 at 04:17:28PM +0100, Thomas Müller wrote:
>> Hi,
>>
>> starting with kernel 4.19 my Lenovo ThinkPad X1 Carbon 5th no longer properly suspends.
>>
>> This is 100% reproducible and git bisect points to the following commit:
>>> [be45bf5395e0886a93fc816bbe41a008ec2e42e2] watchdog/softlockup: Fix cpu_stop_queue_work() double-queue bug
>>> be45bf5395e0886a93fc816bbe41a008ec2e42e2 is the first bad commit
>>> commit be45bf5395e0886a93fc816bbe41a008ec2e42e2
>>> Author: Peter Zijlstra <[email protected]>
>>> Date: Fri Jul 13 12:42:08 2018 +0200
>>>
>>> watchdog/softlockup: Fix cpu_stop_queue_work() double-queue bug
>>>
>>> When scheduling is delayed for longer than the softlockup interrupt
>>> period it is possible to double-queue the cpu_stop_work, causing list
>>> corruption.
>>>
>>> Cure this by adding a completion to track the cpu_stop_work's
>>> progress.
>>>
>>> Reported-by: kernel test robot <[email protected]>
>>> Tested-by: Rong Chen <[email protected]>
>>> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>>> Cc: Linus Torvalds <[email protected]>
>>> Cc: Peter Zijlstra <[email protected]>
>>> Cc: Thomas Gleixner <[email protected]>
>>> Fixes: 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
>>> Link: http://lkml.kernel.org/r/[email protected]
>>> Signed-off-by: Ingo Molnar <[email protected]>
>>>
>>> :040000 040000 6aca2dbb84bc33fe442b18b3d0a135c27adff7b9 2710af12d32e4b98df07768716689b213bce45fc M kernel
>>
>> The bugzilla reports have some additional details:
>> * https://bugzilla.redhat.com/show_bug.cgi?id=1671504
>> * https://bugzilla.kernel.org/show_bug.cgi?id=202679
>> * https://bugzilla.kernel.org/show_bug.cgi?id=202137
>>
>> I'm happy to provide additional information or test a patch or two (as long as it doesn't
>> eat up my notebook ;))
>
> I obviously cannot reproduce :/ Both cpu-hotplug and suspend work just
> fine on my test boxes. I even tried my thinkpad (x240) and that too goes
> to sleep and wakes up just fine.
>
> What .config do you have?
The one packaged by Fedora. I've attached the one for 4.20.15 as reference.

> And what, if anything, do you see on the
> console when it goes funny?
Nothing unfortunately.
When trying to suspend, the display immediately goes blank, the system becomes unresponsive, and the
status LED within the power button starts flashing rapidly (just like it does when the power cord is
attached).


> I think you wrote that hot-un-plug never completes? Is there anything in
> dmesg when it's stuck in:
>
> echo 0 > /sys/devices/system/cpu/cpu1/online
>
> ?
I've just tried that again and the system immediately froze.
`journalctl -f` was running in a second window but it had no chance to output anything... :/


Best regards
Thomas


Attachments:
config-4.20.15-200.fc29.x86_64 (196.63 kB)

2019-03-15 12:16:41

by Peter Zijlstra

Subject: Re: disabling secondary CPU hangs / system fails to suspend with kernel 4.19+

On Fri, Mar 15, 2019 at 12:41:00PM +0100, Thomas Müller wrote:

> > What .config do you have?
> The one packaged by Fedora. I've attached the one for 4.20.15 as reference.

Thanks, I'll have a poke, see what, if anything, is different from the
kernels I ran.

> > And what, if anything, do you see on the
> > console when it goes funny?
> Nothing unfortunately.
> When trying to suspend, the display immediately goes blank, the system becomes unresponsive, and the
> status LED within the power button starts flashing rapidly (just like it does when the power cord is
> attached).
>
>
> > I think you wrote that hot-un-plug never completes? Is there anything in
> > dmesg when it's stuck in:
> >
> > echo 0 > /sys/devices/system/cpu/cpu1/online
> >
> > ?
> I've just tried that again and the system immediately froze.

Hmm, I thought you said the system remained semi usable, just that reboot
stopped working thereafter and it needed a power cycle.

> `journalctl -f` was running in a second window but it had no chance to output anything... :/

Ah, you're using a GUI!

Stop doing that ;-)

See if you can use the VGA console; not a FB console or a DRM console,
but the real ancient, proper text mode, VGA console.

Now, don't ask me how to do that, because I don't know, I've been
running on pure serial console output for the past 10 years or so, heck
I don't even have systemd.

And you might have to do something like: dmesg -n8, to get the console
to print the kernel messages or something.
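
A minimal sketch of that loglevel check, assuming a util-linux `dmesg` and the standard `/proc/sys/kernel/printk` sysctl, whose first field is the current console loglevel. The helper names are hypothetical, and the file path is a parameter only so that the parsing can be tried against a plain file:

```shell
#!/bin/sh
# Hypothetical helpers, not part of any existing tool.

# Print the current console loglevel (first field of the printk sysctl).
console_loglevel() {
    printk_file=${1:-/proc/sys/kernel/printk}
    read -r current _rest < "$printk_file"
    echo "$current"
}

# Succeed when all priorities 0..7 (emerg..debug) are printed on the console,
# i.e. when the console loglevel is 8 or higher.
console_shows_everything() {
    [ "$(console_loglevel "$1")" -ge 8 ]
}

# Usage (as root), before triggering the hang:
#   console_shows_everything || dmesg -n 8
```

Raising the level only changes what reaches the console; the ring buffer that plain `dmesg` reads is unaffected.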


2019-03-15 20:22:13

by Thomas Müller

Subject: Re: disabling secondary CPU hangs / system fails to suspend with kernel 4.19+

Hi,

On 15.03.19 at 13:15, Peter Zijlstra wrote:
> On Fri, Mar 15, 2019 at 12:41:00PM +0100, Thomas Müller wrote:
>
>>> What .config do you have?
>> The one packaged by Fedora. I've attached the one for 4.20.15 as reference.
>
> Thanks, I'll have a poke, see what, if anything, is different from the
> kernels I ran.
>
>>> And what, if anything, do you see on the
>>> console when it goes funny?
>> Nothing unfortunately.
>> When trying to suspend, the display immediately goes blank, the system becomes unresponsive, and the
>> status LED within the power button starts flashing rapidly (just like it does when the power cord is
>> attached).
>>
>>
>>> I think you wrote that hot-un-plug never completes? Is there anything in
>>> dmesg when it's stuck in:
>>>
>>> echo 0 > /sys/devices/system/cpu/cpu1/online
>>>
>>> ?
>> I've just tried that again and the system immediately froze.
>
> Hmm, I thought you said the system remained semi usable, just that reboot
> stopped working thereafter and it needed a power cycle.
>
>> `journalctl -f` was running in a second window but it had no chance to output anything... :/
>
> Ah, you're using a GUI!
>
> Stop doing that ;-)
Easier said than done ;)

> See if you can use the VGA console; not a FB console or a DRM console,
> but the real ancient, proper text mode, VGA console.
I've just re-tested with runlevel 3.
Not a real VGA console, but at least no Wayland or Gnome to interfere...

`echo 0 > /sys/...` just blocks and no message whatsoever is visible in dmesg.

I've executed `echo 0 > ...` in the background to keep my console functional and I can e.g. echo
something to /dev/kmsg and it shows up, so reading/updating the log buffer appears to be working
just fine.
A power cycle is still necessary to recover the system.
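
The background test described above can be sketched roughly as follows. The function name and the marker-file approach are assumptions for illustration; `/proc/<pid>/stack` is readable only as root and only on kernels built with stack-trace support, so the sketch tolerates its absence. The sysfs path is passed as an argument so that the logic can be dry-run against an ordinary file:

```shell
#!/bin/sh
# Hypothetical helper: write 0 to a CPU's "online" file in the background and
# report whether the write is still blocked after a grace period.
offline_and_watch() {
    online_file=$1            # e.g. /sys/devices/system/cpu/cpu1/online
    wait_s=${2:-5}            # seconds to wait before declaring it stuck
    done_flag=$(mktemp) || return 2

    # Leave a marker in the kernel log; ignore failure when not root.
    { echo "hotplug-test: offlining via $online_file" > /dev/kmsg; } 2>/dev/null || true

    # The subshell records the exit status of the write once it returns.
    ( echo 0 > "$online_file"; echo "$?" > "$done_flag" ) &
    writer=$!

    sleep "$wait_s"
    if [ -s "$done_flag" ]; then
        status=$(cat "$done_flag")
        rm -f "$done_flag"
        echo "write finished with status $status"
        return "$status"
    fi

    echo "writer $writer still blocked after ${wait_s}s"
    # Kernel-side stack of the blocked writer, if available.
    cat "/proc/$writer/stack" 2>/dev/null || true
    return 1
}

# Usage (as root): offline_and_watch /sys/devices/system/cpu/cpu1/online 5
```

If the writer stays blocked, `echo w > /proc/sysrq-trigger` (where SysRq is enabled) asks the kernel to log all tasks in uninterruptible sleep, which may show more than the single stack.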


> Now, don't ask me how to do that, because I don't know, I've been
> running on pure serial console output for the past 10 years or so, heck
> I don't even have systemd.
>
> And you might have to do something like: dmesg -n8, to get the console
> to print the kernel messages or something.
>

2019-03-18 11:58:25

by Peter Zijlstra

Subject: Re: disabling secondary CPU hangs / system fails to suspend with kernel 4.19+

On Fri, Mar 15, 2019 at 09:21:02PM +0100, Thomas Müller wrote:
> I've just re-tested with runlevel 3.
> Not a real VGA console, but at least no Wayland or Gnome to interfere...
>
> `echo 0 > /sys/...` just blocks and no message whatsoever is visible in dmesg.
>
> I've executed `echo 0 > ...` in the background to keep my console functional and I can e.g. echo
> something to /dev/kmsg and it shows up, so reading/updating the log buffer appears to be working
> just fine.

Damn.. Thanks for trying. I'll see if I can come up with something, but
I'm out of ideas for now :/

2019-03-29 09:25:11

by Thomas Müller

Subject: Re: disabling secondary CPU hangs / system fails to suspend with kernel 4.19+

Hi,

On 18.03.19 at 12:57, Peter Zijlstra wrote:
> On Fri, Mar 15, 2019 at 09:21:02PM +0100, Thomas Müller wrote:
>> I've just re-tested with runlevel 3.
>> Not a real VGA console, but at least no Wayland or Gnome to interfere...
>>
>> `echo 0 > /sys/...` just blocks and no message whatsoever is visible in dmesg.
>>
>> I've executed `echo 0 > ...` in the background to keep my console functional and I can e.g. echo
>> something to /dev/kmsg and it shows up, so reading/updating the log buffer appears to be working
>> just fine.
>
> Damn.. Thanks for trying. I'll see if I can come up with something, but
> I'm out of ideas for now :/
>
Any new ideas so far?

For reference:
I've just tested a vanilla 5.0.5 with localmodconfig (attached)... same behavior :(


Best regards
Thomas


Attachments:
.config (153.93 kB)

2019-04-12 05:32:46

by Thomas Müller

Subject: Re: disabling secondary CPU hangs / system fails to suspend with kernel 4.19+

Hi,

good news: starting with 5.0.6 suspend is working again.

Best regards
Thomas

On 29.03.19 at 10:22, Thomas Müller wrote:
> Hi,
>
> On 18.03.19 at 12:57, Peter Zijlstra wrote:
>> On Fri, Mar 15, 2019 at 09:21:02PM +0100, Thomas Müller wrote:
>>> I've just re-tested with runlevel 3.
>>> Not a real VGA console, but at least no Wayland or Gnome to interfere...
>>>
>>> `echo 0 > /sys/...` just blocks and no message whatsoever is visible in dmesg.
>>>
>>> I've executed `echo 0 > ...` in the background to keep my console functional and I can e.g. echo
>>> something to /dev/kmsg and it shows up, so reading/updating the log buffer appears to be working
>>> just fine.
>>
>> Damn.. Thanks for trying. I'll see if I can come up with something, but
>> I'm out of ideas for now :/
>>
> Any new ideas so far?
>
> For reference:
> I've just tested a vanilla 5.0.5 with localmodconfig (attached)... same behavior :(
>
>
> Best regards
> Thomas
>