2024-02-01 20:58:10

by Konrad Dybcio

Subject: Workqueue regression

Hi,

So, commit "Implement system-wide nr_active enforcement for unbound workqueues"
broke *something* and now performing a suspend-wakeup cycle on a Qualcomm
SC8280XP-based (arm64) platform hangs when performing the resume tasks,
presumably somewhere near PCIe reinitialization (but that may be a red herring).

Reverting the commit (and the ones on top of it due to conflicts) fixes
the issue on next-20240130 and later (plus some out-of-tree patches that
are largely unrelated).

Not sure where to start looking.

Konrad


2024-02-02 01:52:31

by Tejun Heo

Subject: Re: Workqueue regression

Hello,

On Thu, Feb 01, 2024 at 09:57:59PM +0100, Konrad Dybcio wrote:
> So, commit "Implement system-wide nr_active enforcement for unbound workqueues"
> broke *something* and now performing a suspend-wakeup cycle on a Qualcomm
> SC8280XP-based (arm64) platform hangs when performing the resume tasks,
> presumably somewhere near PCIe reinitialization (but that may be a red herring).
>
> Reverting the commit (and the ones on top of it due to conflicts) fixes
> the issue on next-20240130 and later (plus some out-of-tree patches that
> are largely unrelated).
>
> Not sure where to start looking.

Hmm... sorry about that. Can you please boot with `no_console_suspend` and
retry? Once the system gets stuck, you can wait for several minutes till the
workqueue watchdog triggers and dumps the state or, if you can, trigger
`sysrq-t` which has workqueue state dump at the end.
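
If a shell is still responsive, the same dump can be requested without a keyboard. The following is only a sketch: it assumes CONFIG_MAGIC_SYSRQ is enabled and that you are root, and the `SYSRQ_ROOT` and `DRY_RUN` variables are made up for illustration so the commands can be previewed without touching the running kernel.

```shell
#!/bin/sh
# Sketch only: SYSRQ_ROOT and DRY_RUN are illustrative knobs, not kernel interfaces.
SYSRQ_ROOT="${SYSRQ_ROOT:-/proc}"
DRY_RUN="${DRY_RUN:-1}"

# Write a value to a file, or just print the command when DRY_RUN=1.
write_file() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "echo $1 > $2"
    else
        echo "$1" > "$2"
    fi
}

# Request the sysrq-t task dump; the output lands in the kernel log.
dump_tasks() {
    # Allow all sysrq functions, then trigger 't' (show task states).
    write_file 1 "$SYSRQ_ROOT/sys/kernel/sysrq"
    write_file t "$SYSRQ_ROOT/sysrq-trigger"
}

dump_tasks
```

Run it with `DRY_RUN=0` to actually write the files; the dump then shows up on the console or in `dmesg`.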

If the system doesn't become live enough after suspend/resume cycle to get
more info, the following might help:

$ echo test_resume > /sys/power/disk
$ echo disk > /sys/power/state

That should walk most of the hibernation/wakeup path, which is pretty similar
to the suspend/resume path, without touching system power state.

Thanks.

--
tejun

2024-02-02 12:32:31

by Konrad Dybcio

Subject: Re: Workqueue regression

On 2.02.2024 02:52, Tejun Heo wrote:
> Hello,
>
> On Thu, Feb 01, 2024 at 09:57:59PM +0100, Konrad Dybcio wrote:
>> So, commit "Implement system-wide nr_active enforcement for unbound workqueues"
>> broke *something* and now performing a suspend-wakeup cycle on a Qualcomm
>> SC8280XP-based (arm64) platform hangs when performing the resume tasks,
>> presumably somewhere near PCIe reinitialization (but that may be a red herring).
>>
>> Reverting the commit (and the ones on top of it due to conflicts) fixes
>> the issue on next-20240130 and later (plus some out-of-tree patches that
>> are largely unrelated).
>>
>> Not sure where to start looking.
>
> Hmm... sorry about that. Can you please boot with `no_console_suspend` and
> retry? Once the system gets stuck, you can wait for several minutes till the
> workqueue watchdog triggers and dumps the state or, if you can, trigger
> `sysrq-t` which has workqueue state dump at the end.
>
> If the system doesn't become live enough after suspend/resume cycle to get
> more info, the following might help:

Looks like it's too far gone indeed..

>
> $ echo test_resume > /sys/power/disk
> $ echo disk > /sys/power/state

Sadly, hibernation is not a thing on this platform.. Without going into much
detail of how messy the power management stuff is, you can either have
"on", "off" or "power collapsed" (bound to s2idle).. Trying to trigger this
sequence makes the thing lock up and die due to unclocked accesses with or
without the WQ regression.

Konrad

2024-02-02 18:50:59

by Tejun Heo

Subject: Re: Workqueue regression

Hello,

On Fri, Feb 02, 2024 at 01:31:01PM +0100, Konrad Dybcio wrote:
> > If the system doesn't become live enough after suspend/resume cycle to get
> > more info, the following might help:
>
> Looks like it's too far gone indeed..
>
> >
> > $ echo test_resume > /sys/power/disk
> > $ echo disk > /sys/power/state
>
> Sadly, hibernation is not a thing on this platform.. Without going into much
> detail of how messy the power management stuff is, you can either have
> "on", "off" or "power collapsed" (bound to s2idle).. Trying to trigger this
> sequence makes the thing lock up and die due to unclocked accesses with or
> without the WQ regression.

I see. So, if you enable CONFIG_PM_DEBUG, CONFIG_PM_ADVANCED_DEBUG and
CONFIG_PM_SLEEP_DEBUG, there will be a /sys/power/pm_test file which lets you
select the stage at which suspend is going to abort. Can you please play
with it and see whether you can reproduce the issue while maintaining the
console output?
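
Something along these lines might work as a starting point. It is a sketch, not a tested recipe: it assumes the platform enters s2idle through the `freeze` state in /sys/power/state, and the `PM_ROOT` and `DRY_RUN` variables are made up for illustration so the sequence can be previewed before running it for real.

```shell
#!/bin/sh
# Sketch only: PM_ROOT and DRY_RUN are illustrative knobs, not kernel interfaces.
PM_ROOT="${PM_ROOT:-/sys/power}"
DRY_RUN="${DRY_RUN:-1}"

# Write a value to a file, or just print the command when DRY_RUN=1.
write_file() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "echo $1 > $2"
    else
        echo "$1" > "$2"
    fi
}

# Run one suspend cycle that aborts at the given pm_test level.
pm_test_cycle() {
    write_file "$1" "$PM_ROOT/pm_test"
    # Assumption: "freeze" (suspend-to-idle) is the state this platform uses.
    write_file freeze "$PM_ROOT/state"
}

# Walk from the shallowest test level to the deepest.
for level in freezer devices platform processors core; do
    pm_test_cycle "$level"
done

# Restore normal (non-test) suspend behaviour.
write_file none "$PM_ROOT/pm_test"
```

The first level at which the hang reproduces with `DRY_RUN=0` narrows down which part of the suspend/resume path is involved.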

Can you also make sure that the system is actually dead, not just the
console? e.g. by pinging from network?
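
A rough way to check from another machine, as a sketch: the address below is a documentation placeholder rather than a real host, and the `HOST` and `PING` variables are made up for illustration.

```shell
#!/bin/sh
# Sketch only: HOST is a placeholder address (TEST-NET-1), PING lets the
# probe command be swapped out; neither name comes from the thread.
HOST="${HOST:-192.0.2.1}"
PING="${PING:-ping}"

# Send a single echo request with a 2 second timeout and report the result.
check_alive() {
    if "$PING" -c 1 -W 2 "$1" >/dev/null 2>&1; then
        echo "$1: alive"
    else
        echo "$1: no reply"
    fi
}

check_alive "$HOST"
```

A reply only shows that the network stack and interrupt handling still work, but that is already enough to tell a dead console from a dead system.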

Thanks.

--
tejun

2024-02-04 21:19:55

by Tejun Heo

Subject: Re: Workqueue regression

Hello,

There was a bug which could easily stall flush_workqueue() which just got
fixed (http://lkml.kernel.org/r/[email protected]). Can you
please see whether the patch fixes the suspend problem?

Thanks.

--
tejun

2024-02-05 11:55:15

by Konrad Dybcio

Subject: Re: Workqueue regression

On 4.02.2024 22:19, Tejun Heo wrote:
> Hello,
>
> There was a bug which could easily stall flush_workqueue() which just got
> fixed (http://lkml.kernel.org/r/[email protected]). Can you
> please see whether the patch fixes the suspend problem?

Thanks for the pointer!

Unfortunately, it doesn't seem to fix my issue :/
I'll try to look into it more in the coming days, though my calendar is
somewhat wavy..

Konrad