LinuxLists.cc - [PATCH] sched/core: fix affine_move

2024-03-18 11:18:59

Subject: [PATCH] sched/core: fix affine_move_task failure case

Bill Peters reported CPU hangs while offlining/onlining CPUs on s390.

Analyzing the vmcore data shows `stop_one_cpu_nowait()` in `affine_move_task()`
can fail when racing with CPU off-/on-lining. The caller then deadlocks waiting
for the completion of a pending migration stop work that was never queued.

Fix this by correctly handling such a condition.

Fixes: 9e81889c7648 ("sched: Fix affine_move_task() self-concurrency")
Cc: [email protected]
Reported-by: Bill Peters <[email protected]>
Tested-by: Bill Peters <[email protected]>
Signed-off-by: Daniel Vacek <[email protected]>
---
kernel/sched/core.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9116bcc903467..d0ff5c611a1c8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3069,8 +3069,17 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 		preempt_disable();
 		task_rq_unlock(rq, p, rf);
 		if (!stop_pending) {
-			stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
-					    &pending->arg, &pending->stop_work);
+			stop_pending =
+				stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
+						    &pending->arg, &pending->stop_work);
+
+			if (!stop_pending) {
+				rq = task_rq_lock(p, rf);
+				pending->stop_pending = false;
+				p->migration_pending = NULL;
+				task_rq_unlock(rq, p, rf);
+				complete_all(&pending->done);
+			}
 		}
 		preempt_enable();

--
2.43.0
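
For readers who want the failure mode in isolation, here is a minimal
user-space sketch of what the patch above handles. It is not kernel code:
`pending_model`, `task_model`, `queue_stop_work()` and `move_task()` are
hypothetical stand-ins, and the stopper callback is modeled as running
immediately. The only behavior taken from the patch is that a failed queue
attempt must clear the pending state and complete the waiter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects; not kernel APIs. */
struct pending_model {
	bool stop_pending;	/* stop work believed to be in flight */
	bool done;		/* models the `done` completion */
};

struct task_model {
	struct pending_model *migration_pending;
};

/* Models stop_one_cpu_nowait(): false means the work was NOT queued. */
static bool queue_stop_work(bool stopper_enabled, struct pending_model *pending)
{
	if (!stopper_enabled)
		return false;
	/*
	 * In the kernel the stopper thread would run the callback later and
	 * complete the request; this model treats that as immediate.
	 */
	pending->done = true;
	return true;
}

/*
 * Models the tail of affine_move_task(). Returns true when a waiter on
 * `done` would be released, false when it would sleep forever.
 */
static bool move_task(struct task_model *p, struct pending_model *pending,
		      bool stopper_enabled, bool handle_failure)
{
	p->migration_pending = pending;
	pending->stop_pending = true;
	pending->done = false;

	if (!queue_stop_work(stopper_enabled, pending) && handle_failure) {
		/* The patched path: undo the bookkeeping and wake waiters. */
		pending->stop_pending = false;
		p->migration_pending = NULL;
		pending->done = true;
	}
	return pending->done;
}
```

With `handle_failure` false and the stopper disabled, `move_task()` returns
false and `stop_pending` stays set, which matches the `stop_pending = 1,
done.done = 0` state reported from the vmcore later in this thread.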

2024-03-18 17:36:41

by Valentin Schneider

Subject: Re: [PATCH] sched/core: fix affine_move_task failure case

On 18/03/24 12:17, Daniel Vacek wrote:
> Bill Peters reported CPU hangs while offlining/onlining CPUs on s390.
>
> Analyzing the vmcore data shows `stop_one_cpu_nowait()` in `affine_move_task()`
> can fail when racing with off-/on-lining resulting in a deadlock waiting for
> the pending migration stop work completion which is never done.
>
> Fix this by correctly handling such a condition.
>

IIUC the problem is that the dest_cpu and its stopper thread can be taken
down by take_cpu_down(), and affine_move_task() currently isn't aware of
that. I thought we had tested this vs hotplug, but oh well...

> Fixes: 9e81889c7648 ("sched: Fix affine_move_task() self-concurrency")
> Cc: [email protected]
> Reported-by: Bill Peters <[email protected]>
> Tested-by: Bill Peters <[email protected]>
> Signed-off-by: Daniel Vacek <[email protected]>
> ---
> kernel/sched/core.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9116bcc903467..d0ff5c611a1c8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3069,8 +3069,17 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
> preempt_disable();
> task_rq_unlock(rq, p, rf);
> if (!stop_pending) {
> - stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
> - &pending->arg, &pending->stop_work);
> + stop_pending =
> + stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
> + &pending->arg, &pending->stop_work);
> +
> + if (!stop_pending) {
> + rq = task_rq_lock(p, rf);
> + pending->stop_pending = false;
> + p->migration_pending = NULL;
> + task_rq_unlock(rq, p, rf);
> + complete_all(&pending->done);
> + }

This can leave the task @p on a now-illegal CPU; consider a task affined to
CPUs 0-1, migrate_disable(); then affined to CPUs 2-3, then in
migrate_enable() the dest_cpu is chosen as 3 but that's racing with it
being brought down. The stop_one_cpu_nowait() fails, and we leave the task
on CPUs 0-1.

Issuing a redo of affine_move_task() with a different dest_cpu doesn't
sound great, and while very unlikely that doesn't have forward progress
guarantees.

Unfortunately we can't hold the hotplug lock during the affinity change of
migrate_enable(), as migrate_enable() isn't allowed to block.

Now, the CPU selection in __set_cpus_allowed_ptr_locked() that is passed
down to affine_move_task() relies on the active mask, which itself is
cleared in sched_cpu_deactivate() and is followed by a
synchronize_rcu().

What if we made the affinity change of migrate_enable() an RCU read-side
section? Then if a CPU X is observed as active in
migrate_enable()->__set_cpus_allowed_ptr_locked(), then its hotplug state
cannot go lower than CPUHP_AP_ACTIVE until the task is migrated away.

Something like the below. Thoughts?
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 929fce69f555e..c6d128711d1a9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2450,8 +2450,11 @@ void migrate_enable(void)
 	 * __set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE) doesn't schedule().
 	 */
 	guard(preempt)();
-	if (p->cpus_ptr != &p->cpus_mask)
+	if (p->cpus_ptr != &p->cpus_mask) {
+		guard(rcu)();
 		__set_cpus_allowed_ptr(p, &ac);
+	}
+
 	/*
 	 * Mustn't clear migration_disabled() until cpus_ptr points back at the
 	 * regular cpus_mask, otherwise things that race (eg.

2024-03-20 14:06:33

by Daniel Vacek

Subject: Re: [PATCH] sched/core: fix affine_move_task failure case

Hi Valentin,

On Mon, Mar 18, 2024 at 6:34 PM Valentin Schneider <[email protected]> wrote:
> On 18/03/24 12:17, Daniel Vacek wrote:
> > Bill Peters reported CPU hangs while offlining/onlining CPUs on s390.
> >
> > Analyzing the vmcore data shows `stop_one_cpu_nowait()` in `affine_move_task()`
> > can fail when racing with off-/on-lining resulting in a deadlock waiting for
> > the pending migration stop work completion which is never done.
> >
> > Fix this by correctly handling such a condition.
> >
>
> IIUC the problem is that the dest_cpu and its stopper thread can be taken
> down by take_cpu_down(), and affine_move_task() currently isn't aware of
> that. I thought we had tested this vs hotplug, but oh well...

I'm sorry, I should have provided more context in the first place. The machine
is an LPAR with 2 CPUs and CPU 0 was onlining (hotplugging?) CPU 1. The traces
show this scenario:

 CPU 0                                      | CPU 1
                                            |
 cpuplugd task 1429                         |
 holds the `cpu_hotplug_lock`               |
 for writing in _cpu_up+0x16a               |
 blocked on `cpuhp_state:1.done_up`         |
 completion in __cpuhp_kick_ap+0x76         |
                                            |
                                            | cpuhp/1 task 17
                                            | supposed to complete bringup of the CPU
                                            | (`cpuhp_state:1.done_up`) in cpuhp_thread_fun+0x108
                                            | blocked on `wq_pool_attach_mutex`
                                            | in workqueue_online_cpu+0x9e
                                            |
 xfs-conv/dm-0 task 745                     |
 holds the `wq_pool_attach_mutex`           |
 in worker_attach_to_pool+0x66              \
 blocked on `task->migration_pending->done` |
 completion in affine_move_task+0x10a       /

~~~
crash> b 1429
PID: 1429 TASK: 99398000 CPU: 0 COMMAND: "cpuplugd"
#0 [997df970] __schedule+0x34c at 3089c424
#1 [997df9e0] schedule+0x7e at 3089cafe
#2 [997dfa20] schedule_timeout+0x26e at 308a1d8e
[inlined] do_wait_for_common
[inlined] __wait_for_common
#3 [997dfad8] wait_for_common+0x14a at 3089d902
[ret call] wait_for_completion+0x1a at 3089d96a

[inlined] wait_for_ap_thread <<< blocked on `cpuhp_state:1.done_up` completion
[ret call] __cpuhp_kick_ap+0x76 at 300c610e
#4 [997dfb58] cpuhp_kick_ap+0xc4 at 300c61dc
[inlined] bringup_wait_for_ap
[ret call] bringup_cpu+0xea at 300c6402
#5 [997dfba8] cpuhp_invoke_callback+0xcc at 300c4f14
#6 [997dfc40] _cpu_up+0x16a at 300c798a <<< holds the `cpu_hotplug_lock` for writing
#7 [997dfc98] do_cpu_up+0xc6 at 300c7b66
#8 [997dfcd8] cpu_subsys_online+0x58 at 305a0a00
#9 [997dfd28] device_online+0x9e at 30598e7e
#10 [997dfd68] online_store+0x88 at 30598f28
#11 [997dfda8] kernfs_fop_write+0xdc at 3040276c
#12 [997dfdf8] vfs_write+0xa8 at 30354760
#13 [997dfe58] ksys_write+0x62 at 30354a32

crash> cpuhp_cpu_state.state cpuhp_state:a | paste - -
[0]: 1aef424e0 state = CPUHP_ONLINE, # (195)
[1]: 1aef654e0 state = CPUHP_AP_WORKQUEUE_ONLINE, # (159)

crash> cpuhp_cpu_state.bringup,thread,done_up.done cpuhp_state:1 -d | paste - - - -
[1]: 1aef654e0 bringup = true, thread = 0x81134400, done_up.done = 0, <<<

crash> b 17
PID: 17 TASK: 81134400 CPU: 1 COMMAND: "cpuhp/1"
#0 [81143b68] __schedule+0x34c at 3089c424
#1 [81143bd8] schedule+0x7e at 3089cafe
#2 [81143c18] schedule_preempt_disabled+0x2a at 3089cfba
#3 [81143c30] __mutex_lock+0x320 at 3089df60

#4 [81143cb0] workqueue_online_cpu+0x9e at 300e847e <<< blocked on `wq_pool_attach_mutex`
#5 [81143d20] cpuhp_invoke_callback+0xcc at 300c4f14
#6 [81143db8] cpuhp_thread_fun+0x108 at 300c6848 <<< supposed to complete the bring-up of the CPU (`cpuhp_state:1.done_up`)

crash> b 745
PID: 745 TASK: 82359100 CPU: 0 COMMAND: "xfs-conv/dm-0"
#0 [8b4bfa20] __schedule+0x34c at 3089c424
#1 [8b4bfa90] schedule+0x7e at 3089cafe
#2 [8b4bfad0] schedule_timeout+0x26e at 308a1d8e
[inlined] do_wait_for_common
[inlined] __wait_for_common
#3 [8b4bfb88] wait_for_common+0x14a at 3089d902
[ret call] wait_for_completion+0x1a at 3089d96a

#4 [8b4bfc08] affine_move_task+0x10a at 300fb51a <<< blocked on `task->migration_pending->done` completion
#5 [8b4bfd08] __set_cpus_allowed_ptr+0x12e at 300fb926
[ret call] set_cpus_allowed_ptr+0xa at 300fba32
#6 [8b4bfd78] worker_attach_to_pool+0x66 at 300e1dae <<< holds the `wq_pool_attach_mutex`
#7 [8b4bfdc8] rescuer_thread+0x12c at 300e5bac

crash> rx 8b4bfea0
8b4bfea0: [863373c0:kmalloc-192]

crash> worker.task,rescue_wq 863373c0
task = 0x82359100,
rescue_wq = 0x8aa44400,

crash> list -s pool_workqueue.pool pool_workqueue.mayday_node -hO workqueue_struct.maydays 0x8aa44400 | paste - -
1fffff7f751900 pool = 0x1aef56a00,

crash> worker_pool.attrs 0x1aef56a00
attrs = 0x80088180,

crash> workqueue_attrs.cpumask[0].bits 0x80088180
cpumask[0].bits = {0x1, 0x0, ...

crash> cpumask.bits __cpu_active_mask
bits = {0x1, 0x0, ...

crash> cpumask.bits __cpu_online_mask
bits = {0x3, 0x0, ...

crash> task_struct.migration_pending,flags 0x82359100
migration_pending = 0x8b4bfce8,
flags = 0x4208060,
^ PF_KTHREAD

crash> pd distribute_cpu_mask_prev:0
per_cpu(distribute_cpu_mask_prev, 0) = 0

crash> set_affinity_pending.refs.refs.counter,arg,stop_pending,done.done 0x8b4bfce8 -d
refs.refs.counter = 1
arg = {
task = 0x82359100,
dest_cpu = 0,
pending = 0x8b4bfce8
}
stop_pending = 1,
done.done = 0,
~~~

In other words, `set_cpus_allowed_ptr()` is called from a worker thread which
tries to migrate. The worker pool is only allowed on CPU 0, and that was supposed
to be the destination as per the stack structure. In this case I thought it's OK
to leave the task on the old CPU, and Bill's testing scenario was successful
with the proposed patch. IIUC, it's exercising the hotplug due to load-balancing.

This was on a RHEL 8.8.z kernel. I see upstream changed a bit, so I'm not sure
it's still reproducible. Also, I'm not sure why this only happens on s390 and not
on x86. I imagine the CPU hotplug slightly differs? Anyway, this seems to be
timing sensitive and the timing will differ greatly for sure.
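
The wait chain read from the traces can be written down as a small table and
walked mechanically. The sketch below is illustrative only: `struct waiter`,
`chain[]` and `root_blocker()` are hypothetical names, and the table just
restates which task is blocked on what, as described above.

```c
#include <assert.h>
#include <stddef.h>

#define NO_OWNER (-1)

/* One blocked task and the index of the task it is waiting for. */
struct waiter {
	const char *name;
	const char *blocked_on;
	int waits_for;	/* index into the table, or NO_OWNER */
};

/* The wait chain as read from the traces; the table is illustrative. */
static const struct waiter chain[] = {
	{ "cpuplugd (1429)",     "cpuhp_state:1.done_up",         1 },
	{ "cpuhp/1 (17)",        "wq_pool_attach_mutex",          2 },
	{ "xfs-conv/dm-0 (745)", "task->migration_pending->done", NO_OWNER },
};

/*
 * Follow the chain from `start` and return the index of the last waiter,
 * i.e. the task whose wakeup everybody else depends on. Returns -1 if the
 * chain loops back on itself (more hops than there are tasks).
 */
static int root_blocker(const struct waiter *w, size_t n, int start)
{
	int cur = start;
	size_t steps = 0;

	while (w[cur].waits_for != NO_OWNER) {
		if (++steps >= n)
			return -1;
		cur = w[cur].waits_for;
	}
	return cur;
}
```

Starting from cpuplugd, the walk ends at xfs-conv/dm-0: nothing in the table
owns the completion it waits on, so the whole chain hangs on a wakeup that
never arrives rather than on a lock cycle.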

> > Fixes: 9e81889c7648 ("sched: Fix affine_move_task() self-concurrency")
> > Cc: [email protected]
> > Reported-by: Bill Peters <[email protected]>
> > Tested-by: Bill Peters <[email protected]>
> > Signed-off-by: Daniel Vacek <[email protected]>
> > ---
> > kernel/sched/core.c | 13 +++++++++++--
> > 1 file changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 9116bcc903467..d0ff5c611a1c8 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3069,8 +3069,17 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
> > preempt_disable();
> > task_rq_unlock(rq, p, rf);
> > if (!stop_pending) {
> > - stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
> > - &pending->arg, &pending->stop_work);
> > + stop_pending =
> > + stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
> > + &pending->arg, &pending->stop_work);
> > +
> > + if (!stop_pending) {
> > + rq = task_rq_lock(p, rf);
> > + pending->stop_pending = false;
> > + p->migration_pending = NULL;
> > + task_rq_unlock(rq, p, rf);
> > + complete_all(&pending->done);
> > + }
>
> This can leave the task @p on a now-illegal CPU; consider a task affined to
> CPUs 0-1, migrate_disable(); then affined to CPUs 2-3, then in
> migrate_enable() the dest_cpu is chosen as 3 but that's racing with it
> being brought down. The stop_one_cpu_nowait() fails, and we leave the task
> on CPUs 0-1.
>
> Issuing a redo of affine_move_task() with a different dest_cpu doesn't
> sound great, and while very unlikely that doesn't have forward progress
> guarantees.
>
> Unfortunately we can't hold the hotplug lock during the affinity change of
> migrate_enable(), as migrate_enable() isn't allowed to block.
>
> Now, the CPU selection in __set_cpus_allowed_ptr_locked() that is passed
> down to affine_move_task() relies on the active mask, which itself is
> cleared in sched_cpu_deactivate() and is followed by a
> synchronize_rcu().
>
> What if we made the affinity change of migrate_enable() an RCU read-side
> section? Then if a CPU X is observed as active in
> migrate_enable()->__set_cpus_allowed_ptr_locked(), then its hotplug state
> cannot go lower than CPUHP_AP_ACTIVE until the task is migrated away.
>
> Something like the below. Thoughts?
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 929fce69f555e..c6d128711d1a9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2450,8 +2450,11 @@ void migrate_enable(void)
> * __set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE) doesn't schedule().
> */
> guard(preempt)();
> - if (p->cpus_ptr != &p->cpus_mask)
> + if (p->cpus_ptr != &p->cpus_mask) {
> + guard(rcu)();
> __set_cpus_allowed_ptr(p, &ac);
> + }
> +
> /*
> * Mustn't clear migration_disabled() until cpus_ptr points back at the
> * regular cpus_mask, otherwise things that race (eg.

With the above being said, I don't see any relation to `migrate_enable()` and
this change looks irrelevant in the context.

Any fresh ideas?

--nX

2024-03-20 18:18:16

by Valentin Schneider

Subject: Re: [PATCH] sched/core: fix affine_move_task failure case

On 20/03/24 15:03, Daniel Vacek wrote:
> Hi Valentin,
>
> On Mon, Mar 18, 2024 at 6:34 PM Valentin Schneider <[email protected]> wrote:
>> On 18/03/24 12:17, Daniel Vacek wrote:
>> > Bill Peters reported CPU hangs while offlining/onlining CPUs on s390.
>> >
>> > Analyzing the vmcore data shows `stop_one_cpu_nowait()` in `affine_move_task()`
>> > can fail when racing with off-/on-lining resulting in a deadlock waiting for
>> > the pending migration stop work completion which is never done.
>> >
>> > Fix this by correctly handling such a condition.
>> >
>>
>> IIUC the problem is that the dest_cpu and its stopper thread can be taken
>> down by take_cpu_down(), and affine_move_task() currently isn't aware of
>> that. I thought we had tested this vs hotplug, but oh well...
>
> I'm sorry, I should have provided more context in the first place. The machine
> is an LPAR with 2 CPUs and CPU 0 was onlining (hotplugging?) CPU 1. The traces
> show this scenario:
>
> CPU 0 | CPU 1
> |
> cpuplugd task 1429 |
> holds the `cpu_hotplug_lock` |
> for writing in _cpu_up+0x16a |
> blocked on `cpuhp_state:1.done_up` |
> completion in __cpuhp_kick_ap+0x76 |
> |
> | cpuhp/1 task 17
> |supposed to complete bringup of the CPU
> | (`cpuhp_state:1.done_up`) in cpuhp_thread_fun+0x108
> |blocked on `wq_pool_attach_mutex`
> | in workqueue_online_cpu+0x9e
> |
> xfs-conv/dm-0 task 745 |
> holds the `wq_pool_attach_mutex` |
> in worker_attach_to_pool+0x66 \
> blocked on `task->migration_pending->done`|
> completion in affine_move_task+0x10a/
>
> ~~~
> crash> b 1429
> PID: 1429 TASK: 99398000 CPU: 0 COMMAND: "cpuplugd"
> #0 [997df970] __schedule+0x34c at 3089c424
> #1 [997df9e0] schedule+0x7e at 3089cafe
> #2 [997dfa20] schedule_timeout+0x26e at 308a1d8e
> [inlined] do_wait_for_common
> [inlined] __wait_for_common
> #3 [997dfad8] wait_for_common+0x14a at 3089d902
> [ret call] wait_for_completion+0x1a at 3089d96a
>
> [inlined] wait_for_ap_thread <<< blocked on `cpuhp_state:1.done_up` completion
> [ret call] __cpuhp_kick_ap+0x76 at 300c610e
> #4 [997dfb58] cpuhp_kick_ap+0xc4 at 300c61dc
> [inlined] bringup_wait_for_ap
> [ret call] bringup_cpu+0xea at 300c6402
> #5 [997dfba8] cpuhp_invoke_callback+0xcc at 300c4f14
> #6 [997dfc40] _cpu_up+0x16a at 300c798a <<< holds the `cpu_hotplug_lock` for writing
> #7 [997dfc98] do_cpu_up+0xc6 at 300c7b66
> #8 [997dfcd8] cpu_subsys_online+0x58 at 305a0a00
> #9 [997dfd28] device_online+0x9e at 30598e7e
> #10 [997dfd68] online_store+0x88 at 30598f28
> #11 [997dfda8] kernfs_fop_write+0xdc at 3040276c
> #12 [997dfdf8] vfs_write+0xa8 at 30354760
> #13 [997dfe58] ksys_write+0x62 at 30354a32
>
> crash> cpuhp_cpu_state.state cpuhp_state:a | paste - -
> [0]: 1aef424e0 state = CPUHP_ONLINE, # (195)
> [1]: 1aef654e0 state = CPUHP_AP_WORKQUEUE_ONLINE, # (159)
>
> crash> cpuhp_cpu_state.bringup,thread,done_up.done cpuhp_state:1 -d | paste - - - -
> [1]: 1aef654e0 bringup = true, thread = 0x81134400, done_up.done = 0, <<<
>
> crash> b 17
> PID: 17 TASK: 81134400 CPU: 1 COMMAND: "cpuhp/1"
> #0 [81143b68] __schedule+0x34c at 3089c424
> #1 [81143bd8] schedule+0x7e at 3089cafe
> #2 [81143c18] schedule_preempt_disabled+0x2a at 3089cfba
> #3 [81143c30] __mutex_lock+0x320 at 3089df60
>
> #4 [81143cb0] workqueue_online_cpu+0x9e at 300e847e <<< blocked on `wq_pool_attach_mutex`
> #5 [81143d20] cpuhp_invoke_callback+0xcc at 300c4f14
> #6 [81143db8] cpuhp_thread_fun+0x108 at 300c6848 <<< supposed to complete the bring-up of the CPU (`cpuhp_state:1.done_up`)
>
> crash> b 745
> PID: 745 TASK: 82359100 CPU: 0 COMMAND: "xfs-conv/dm-0"
> #0 [8b4bfa20] __schedule+0x34c at 3089c424
> #1 [8b4bfa90] schedule+0x7e at 3089cafe
> #2 [8b4bfad0] schedule_timeout+0x26e at 308a1d8e
> [inlined] do_wait_for_common
> [inlined] __wait_for_common
> #3 [8b4bfb88] wait_for_common+0x14a at 3089d902
> [ret call] wait_for_completion+0x1a at 3089d96a
>
> #4 [8b4bfc08] affine_move_task+0x10a at 300fb51a <<< blocked on `task->migration_pending->done` completion
> #5 [8b4bfd08] __set_cpus_allowed_ptr+0x12e at 300fb926
> [ret call] set_cpus_allowed_ptr+0xa at 300fba32
> #6 [8b4bfd78] worker_attach_to_pool+0x66 at 300e1dae <<< holds the `wq_pool_attach_mutex`
> #7 [8b4bfdc8] rescuer_thread+0x12c at 300e5bac
>
> crash> rx 8b4bfea0
> 8b4bfea0: [863373c0:kmalloc-192]
>
> crash> worker.task,rescue_wq 863373c0
> task = 0x82359100,
> rescue_wq = 0x8aa44400,
>
> crash> list -s pool_workqueue.pool pool_workqueue.mayday_node -hO workqueue_struct.maydays 0x8aa44400 | paste - -
> 1fffff7f751900 pool = 0x1aef56a00,
>
> crash> worker_pool.attrs 0x1aef56a00
> attrs = 0x80088180,
>
> crash> workqueue_attrs.cpumask[0].bits 0x80088180
> cpumask[0].bits = {0x1, 0x0, ...
>
> crash> cpumask.bits __cpu_active_mask
> bits = {0x1, 0x0, ...
>
> crash> cpumask.bits __cpu_online_mask
> bits = {0x3, 0x0, ...
>
> crash> task_struct.migration_pending,flags 0x82359100
> migration_pending = 0x8b4bfce8,
> flags = 0x4208060,
> ^ PF_KTHREAD
>
> crash> pd distribute_cpu_mask_prev:0
> per_cpu(distribute_cpu_mask_prev, 0) = 0
>
> crash> set_affinity_pending.refs.refs.counter,arg,stop_pending,done.done 0x8b4bfce8 -d
> refs.refs.counter = 1
> arg = {
> task = 0x82359100,
> dest_cpu = 0,
> pending = 0x8b4bfce8
> }
> stop_pending = 1,
> done.done = 0,
> ~~~
>
> In other words the `set_cpus_allowed_ptr()` is called from a worker thread which
> tries to migrate. The worker pool is only allowed on CPU 0 and that was supposed
> to be the destination as per the stack structure. In this case I thought it's OK
> to leave the task on the old CPU

AFAICT if a call to set_cpus_allowed_ptr() ends up in affine_move_task()
and down to the stopper call, that means the task isn't allowed on its
current CPU and needs to be moved.

> and the Bill's testing scenario was successful
> with the proposed patch. IIUC, it's exercising the hotplug due to load-balancing.
>
> This was on RHEL 8.8.z kernel. I see upstream changed a bit so I'm not sure it's
> still reproducible. Also, I'm not sure why this only happens on s390 and not on
> x86. I imagine the CPU hotplug slightly differs? Anyways this seems to be timing
> sensitive and the timing will differ greatly for sure.
>

Thanks for the extra context!

Double checking what I wrote before, I forgot RCU considers preempt-off
sections as read-side critical sections. __set_cpus_allowed_ptr() already
has preemption disabled all the way from reading the cpu_active_mask to the
stop_one_cpu_nowait() call via task_rq_lock() + preempt_disable().

IOW we have:

  __set_cpus_allowed_ptr()
    task_rq_lock() <-- PREEMPT OFF
    __set_cpus_allowed_ptr_locked()
      cpu_valid_mask = cpu_active_mask;
      dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);
      affine_move_task()
        preempt_disable();
        task_rq_unlock();
        stop_one_cpu_nowait(); <-- preemption still OFF

And, considering:

  sched_cpu_deactivate()
    set_cpu_active(cpu, false);
    synchronize_rcu();

Then, if __set_cpus_allowed_ptr() observes a CPU as being in the
cpu_active_mask and uses that one as a destination CPU, said CPU cannot
reach CPUHP_TEARDOWN_CPU:take_cpu_down() and park the stopper thread
because its hotplug machinery will wait on the synchronize_rcu() in
CPUHP_AP_ACTIVE:sched_cpu_deactivate().

So "in theory", this shouldn't happen upstream.
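
The ordering argument above can be played through with a toy single-threaded
model. Everything here is hypothetical (`hotplug_model`, `reader_enter()`,
`try_park_stopper()` and friends are not kernel APIs), and the grace period is
modeled conservatively as "no reader left at all", which is stricter than what
synchronize_rcu() actually waits for.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model; none of these are kernel APIs. */
struct hotplug_model {
	unsigned long active_mask;	/* models cpu_active_mask */
	int readers;			/* preempt-off sections in flight */
	bool stopper_parked;		/* models take_cpu_down() having run */
};

static void reader_enter(struct hotplug_model *h) { h->readers++; }
static void reader_exit(struct hotplug_model *h)  { h->readers--; }

static bool cpu_is_active(const struct hotplug_model *h, int cpu)
{
	return (h->active_mask & (1UL << cpu)) != 0;
}

/* Models set_cpu_active(cpu, false) in sched_cpu_deactivate(). */
static void deactivate_begin(struct hotplug_model *h, int cpu)
{
	h->active_mask &= ~(1UL << cpu);
}

/*
 * Models the synchronize_rcu() that follows: teardown may only go on to
 * park the stopper once no reader is left. Waiting for all readers, not
 * just pre-existing ones, is a conservative simplification.
 */
static bool try_park_stopper(struct hotplug_model *h)
{
	if (h->readers > 0)
		return false;
	h->stopper_parked = true;
	return true;
}
```

A reader that saw the CPU as active before `deactivate_begin()` keeps
`try_park_stopper()` failing until it calls `reader_exit()`, so in this model
the stopper is still available when the reader queues its work.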

2024-03-20 18:33:18

by Valentin Schneider

Subject: Re: [PATCH] sched/core: fix affine_move_task failure case

On 20/03/24 19:17, Valentin Schneider wrote:
> On 20/03/24 15:03, Daniel Vacek wrote:
>> Hi Valentin,
>>
>> On Mon, Mar 18, 2024 at 6:34 PM Valentin Schneider <[email protected]> wrote:
>>> On 18/03/24 12:17, Daniel Vacek wrote:
>>> > Bill Peters reported CPU hangs while offlining/onlining CPUs on s390.
>>> >
>>> > Analyzing the vmcore data shows `stop_one_cpu_nowait()` in `affine_move_task()`
>>> > can fail when racing with off-/on-lining resulting in a deadlock waiting for
>>> > the pending migration stop work completion which is never done.
>>> >
>>> > Fix this by correctly handling such a condition.
>>> >
>>>
>>> IIUC the problem is that the dest_cpu and its stopper thread can be taken
>>> down by take_cpu_down(), and affine_move_task() currently isn't aware of
>>> that. I thought we had tested this vs hotplug, but oh well...
>>
>> I'm sorry, I should have provided more context in the first place. The machine
>> is an LPAR with 2 CPUs and CPU 0 was onlining (hotplugging?) CPU 1. The traces
>> show this scenario:
>>
>> CPU 0 | CPU 1
>> |
>> cpuplugd task 1429 |
>> holds the `cpu_hotplug_lock` |
>> for writing in _cpu_up+0x16a |
>> blocked on `cpuhp_state:1.done_up` |
>> completion in __cpuhp_kick_ap+0x76 |
>> |
>> | cpuhp/1 task 17
>> |supposed to complete bringup of the CPU
>> | (`cpuhp_state:1.done_up`) in cpuhp_thread_fun+0x108
>> |blocked on `wq_pool_attach_mutex`
>> | in workqueue_online_cpu+0x9e
>> |
>> xfs-conv/dm-0 task 745 |
>> holds the `wq_pool_attach_mutex` |
>> in worker_attach_to_pool+0x66 \
>> blocked on `task->migration_pending->done`|
>> completion in affine_move_task+0x10a/
>>
>> ~~~
>> crash> b 1429
>> PID: 1429 TASK: 99398000 CPU: 0 COMMAND: "cpuplugd"
>> #0 [997df970] __schedule+0x34c at 3089c424
>> #1 [997df9e0] schedule+0x7e at 3089cafe
>> #2 [997dfa20] schedule_timeout+0x26e at 308a1d8e
>> [inlined] do_wait_for_common
>> [inlined] __wait_for_common
>> #3 [997dfad8] wait_for_common+0x14a at 3089d902
>> [ret call] wait_for_completion+0x1a at 3089d96a
>>
>> [inlined] wait_for_ap_thread <<< blocked on `cpuhp_state:1.done_up` completion
>> [ret call] __cpuhp_kick_ap+0x76 at 300c610e
>> #4 [997dfb58] cpuhp_kick_ap+0xc4 at 300c61dc
>> [inlined] bringup_wait_for_ap
>> [ret call] bringup_cpu+0xea at 300c6402
>> #5 [997dfba8] cpuhp_invoke_callback+0xcc at 300c4f14
>> #6 [997dfc40] _cpu_up+0x16a at 300c798a <<< holds the `cpu_hotplug_lock` for writing
>> #7 [997dfc98] do_cpu_up+0xc6 at 300c7b66
>> #8 [997dfcd8] cpu_subsys_online+0x58 at 305a0a00
>> #9 [997dfd28] device_online+0x9e at 30598e7e
>> #10 [997dfd68] online_store+0x88 at 30598f28
>> #11 [997dfda8] kernfs_fop_write+0xdc at 3040276c
>> #12 [997dfdf8] vfs_write+0xa8 at 30354760
>> #13 [997dfe58] ksys_write+0x62 at 30354a32
>>
>> crash> cpuhp_cpu_state.state cpuhp_state:a | paste - -
>> [0]: 1aef424e0 state = CPUHP_ONLINE, # (195)
>> [1]: 1aef654e0 state = CPUHP_AP_WORKQUEUE_ONLINE, # (159)
>>
>> crash> cpuhp_cpu_state.bringup,thread,done_up.done cpuhp_state:1 -d | paste - - - -
>> [1]: 1aef654e0 bringup = true, thread = 0x81134400, done_up.done = 0, <<<
>>
>> crash> b 17
>> PID: 17 TASK: 81134400 CPU: 1 COMMAND: "cpuhp/1"
>> #0 [81143b68] __schedule+0x34c at 3089c424
>> #1 [81143bd8] schedule+0x7e at 3089cafe
>> #2 [81143c18] schedule_preempt_disabled+0x2a at 3089cfba
>> #3 [81143c30] __mutex_lock+0x320 at 3089df60
>>
>> #4 [81143cb0] workqueue_online_cpu+0x9e at 300e847e <<< blocked on `wq_pool_attach_mutex`
>> #5 [81143d20] cpuhp_invoke_callback+0xcc at 300c4f14
>> #6 [81143db8] cpuhp_thread_fun+0x108 at 300c6848 <<< supposed to complete the bring-up of the CPU (`cpuhp_state:1.done_up`)
>>
>> crash> b 745
>> PID: 745 TASK: 82359100 CPU: 0 COMMAND: "xfs-conv/dm-0"
>> #0 [8b4bfa20] __schedule+0x34c at 3089c424
>> #1 [8b4bfa90] schedule+0x7e at 3089cafe
>> #2 [8b4bfad0] schedule_timeout+0x26e at 308a1d8e
>> [inlined] do_wait_for_common
>> [inlined] __wait_for_common
>> #3 [8b4bfb88] wait_for_common+0x14a at 3089d902
>> [ret call] wait_for_completion+0x1a at 3089d96a
>>
>> #4 [8b4bfc08] affine_move_task+0x10a at 300fb51a <<< blocked on `task->migration_pending->done` completion
>> #5 [8b4bfd08] __set_cpus_allowed_ptr+0x12e at 300fb926
>> [ret call] set_cpus_allowed_ptr+0xa at 300fba32
>> #6 [8b4bfd78] worker_attach_to_pool+0x66 at 300e1dae <<< holds the `wq_pool_attach_mutex`
>> #7 [8b4bfdc8] rescuer_thread+0x12c at 300e5bac
>>
>> crash> rx 8b4bfea0
>> 8b4bfea0: [863373c0:kmalloc-192]
>>
>> crash> worker.task,rescue_wq 863373c0
>> task = 0x82359100,
>> rescue_wq = 0x8aa44400,
>>
>> crash> list -s pool_workqueue.pool pool_workqueue.mayday_node -hO workqueue_struct.maydays 0x8aa44400 | paste - -
>> 1fffff7f751900 pool = 0x1aef56a00,
>>
>> crash> worker_pool.attrs 0x1aef56a00
>> attrs = 0x80088180,
>>
>> crash> workqueue_attrs.cpumask[0].bits 0x80088180
>> cpumask[0].bits = {0x1, 0x0, ...
>>
>> crash> cpumask.bits __cpu_active_mask
>> bits = {0x1, 0x0, ...
>>
>> crash> cpumask.bits __cpu_online_mask
>> bits = {0x3, 0x0, ...
>>
>> crash> task_struct.migration_pending,flags 0x82359100
>> migration_pending = 0x8b4bfce8,
>> flags = 0x4208060,
>> ^ PF_KTHREAD
>>
>> crash> pd distribute_cpu_mask_prev:0
>> per_cpu(distribute_cpu_mask_prev, 0) = 0
>>
>> crash> set_affinity_pending.refs.refs.counter,arg,stop_pending,done.done 0x8b4bfce8 -d
>> refs.refs.counter = 1
>> arg = {
>> task = 0x82359100,
>> dest_cpu = 0,
>> pending = 0x8b4bfce8
>> }
>> stop_pending = 1,
>> done.done = 0,
>> ~~~
>>
>> In other words the `set_cpus_allowed_ptr()` is called from a worker thread which
>> tries to migrate. The worker pool is only allowed on CPU 0 and that was supposed
>> to be the destination as per the stack structure. In this case I thought it's OK
>> to leave the task on the old CPU
>
> AFAICT if a call to set_cpus_allowed_ptr() ends up in affine_move_task()
> and down to the stopper call, that means the task isn't allowed on its
> current CPU and needs to be moved.
>
>> and the Bill's testing scenario was successful
>> with the proposed patch. IIUC, it's exercising the hotplug due to load-balancing.
>>
>> This was on RHEL 8.8.z kernel. I see upstream changed a bit so I'm not sure it's
>> still reproducible. Also, I'm not sure why this only happens on s390 and not on
>> x86. I imagine the CPU hotplug slightly differs? Anyways this seems to be timing
>> sensitive and the timing will differ greatly for sure.
>>
>
> Thanks for the extra context!
>
> Double checking what I wrote before, I forgot RCU considers preempt-off
> sections as read-side critical sections. __set_cpus_allowed_ptr() already
> has preemption disabled all the way from reading the cpu_active_mask to the
> stop_one_cpu_nowait() call via task_rq_lock() + preempt_disable().
>
> IOW we have:
>
> __set_cpus_allowed_ptr()
> task_rq_lock() <-- PREEMPT OFF
> __set_cpus_allowed_ptr_locked()
> cpu_valid_mask = cpu_active_mask;
> dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);
> affine_move_task()
> preempt_disable();
> task_rq_unlock();
> stop_one_cpu_nowait(); <-- preemption still OFF
>
> And, considering:
>
> sched_cpu_deactivate()
> set_cpu_active(cpu, false);
> synchronize_rcu();
>
> Then, if __set_cpus_allowed_ptr() observes a CPU as being in the
> cpu_active_mask and uses that one as a destination CPU, said CPU cannot
> reach CPUHP_TEARDOWN_CPU:take_cpu_down() and park the stopper thread
> because its hotplug machinery will wait on the synchronize_rcu() in
> CPUHP_AP_ACTIVE:sched_cpu_deactivate().
>
> So "in theory", this shouldn't happen upstream.

Eh nevermind, in your stacktrace the relevant task is a rescuer thread
which is a kthread, so the cpu_valid_mask in use there is
cpu_online_mask, not cpu_active_mask... Back to reading code :-)
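
To make the difference concrete, here is a small sketch using the mask values
from the crash dump quoted above (online = 0x3, active = 0x1). It is a
simplified model, not the kernel's selection code: `valid_mask()` and
`pick_dest_cpu()` are hypothetical helpers, the lowest set bit stands in for
cpumask_any_and_distribute(), and the requested mask of 0x2 is made up for
illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Pick the mask the destination CPU must come from. */
static unsigned long valid_mask(bool is_kthread, unsigned long online,
				unsigned long active)
{
	return is_kthread ? online : active;
}

/*
 * Lowest-numbered CPU present in both masks, or -1 if there is none.
 * The kernel distributes its choice; taking the lowest bit is a
 * simplification for this sketch.
 */
static int pick_dest_cpu(unsigned long valid, unsigned long requested)
{
	unsigned long both = valid & requested;

	for (int cpu = 0; cpu < (int)(8 * sizeof(both)); cpu++)
		if (both & (1UL << cpu))
			return cpu;
	return -1;
}
```

With those masks a kthread asking for CPU 1 gets it even though CPU 1 is
online but not yet active, while a regular task finds no valid CPU. That is
why the preempt-off reasoning tied to cpu_active_mask does not cover the
rescuer thread.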

2024-03-20 20:19:11

by Daniel Vacek

Subject: Re: [PATCH] sched/core: fix affine_move_task failure case

On Wed, Mar 20, 2024 at 7:33 PM Valentin Schneider <vschneid@redhat.com> wrote:
> On 20/03/24 19:17, Valentin Schneider wrote:
> > On 20/03/24 15:03, Daniel Vacek wrote:
> >> Hi Valentin,
> >>
> >> On Mon, Mar 18, 2024 at 6:34 PM Valentin Schneider <[email protected]> wrote:
> >>> On 18/03/24 12:17, Daniel Vacek wrote:
> >>> > Bill Peters reported CPU hangs while offlining/onlining CPUs on s390.
> >>> >
> >>> > Analyzing the vmcore data shows `stop_one_cpu_nowait()` in `affine_move_task()`
> >>> > can fail when racing with off-/on-lining resulting in a deadlock waiting for
> >>> > the pending migration stop work completion which is never done.
> >>> >
> >>> > Fix this by correctly handling such a condition.
> >>> >
> >>>
> >>> IIUC the problem is that the dest_cpu and its stopper thread can be taken
> >>> down by take_cpu_down(), and affine_move_task() currently isn't aware of
> >>> that. I thought we had tested this vs hotplug, but oh well...
> >>
> >> I'm sorry, I should have provided more context in the first place. The machine
> >> is an LPAR with 2 CPUs and CPU 0 was onlining (hotplugging?) CPU 1. The traces
> >> show this scenario:
> >>
> >> CPU 0 | CPU 1
> >> |
> >> cpuplugd task 1429 |
> >> holds the `cpu_hotplug_lock` |
> >> for writing in _cpu_up+0x16a |
> >> blocked on `cpuhp_state:1.done_up` |
> >> completion in __cpuhp_kick_ap+0x76 |
> >> |
> >> | cpuhp/1 task 17
> >> |supposed to complete bringup of the CPU
> >> | (`cpuhp_state:1.done_up`) in cpuhp_thread_fun+0x108
> >> |blocked on `wq_pool_attach_mutex`
> >> | in workqueue_online_cpu+0x9e
> >> |
> >> xfs-conv/dm-0 task 745 |
> >> holds the `wq_pool_attach_mutex` |
> >> in worker_attach_to_pool+0x66 \
> >> blocked on `task->migration_pending->done`|
> >> completion in affine_move_task+0x10a/
> >>
> >> ~~~
> >> crash> b 1429
> >> PID: 1429 TASK: 99398000 CPU: 0 COMMAND: "cpuplugd"
> >> #0 [997df970] __schedule+0x34c at 3089c424
> >> #1 [997df9e0] schedule+0x7e at 3089cafe
> >> #2 [997dfa20] schedule_timeout+0x26e at 308a1d8e
> >> [inlined] do_wait_for_common
> >> [inlined] __wait_for_common
> >> #3 [997dfad8] wait_for_common+0x14a at 3089d902
> >> [ret call] wait_for_completion+0x1a at 3089d96a
> >>
> >> [inlined] wait_for_ap_thread <<< blocked on `cpuhp_state:1.done_up` completion
> >> [ret call] __cpuhp_kick_ap+0x76 at 300c610e
> >> #4 [997dfb58] cpuhp_kick_ap+0xc4 at 300c61dc
> >> [inlined] bringup_wait_for_ap
> >> [ret call] bringup_cpu+0xea at 300c6402
> >> #5 [997dfba8] cpuhp_invoke_callback+0xcc at 300c4f14
> >> #6 [997dfc40] _cpu_up+0x16a at 300c798a <<< holds the `cpu_hotplug_lock` for writing
> >> #7 [997dfc98] do_cpu_up+0xc6 at 300c7b66
> >> #8 [997dfcd8] cpu_subsys_online+0x58 at 305a0a00
> >> #9 [997dfd28] device_online+0x9e at 30598e7e
> >> #10 [997dfd68] online_store+0x88 at 30598f28
> >> #11 [997dfda8] kernfs_fop_write+0xdc at 3040276c
> >> #12 [997dfdf8] vfs_write+0xa8 at 30354760
> >> #13 [997dfe58] ksys_write+0x62 at 30354a32
> >>
> >> crash> cpuhp_cpu_state.state cpuhp_state:a | paste - -
> >> [0]: 1aef424e0 state = CPUHP_ONLINE, # (195)
> >> [1]: 1aef654e0 state = CPUHP_AP_WORKQUEUE_ONLINE, # (159)
> >>
> >> crash> cpuhp_cpu_state.bringup,thread,done_up.done cpuhp_state:1 -d | paste - - - -
> >> [1]: 1aef654e0 bringup = true, thread = 0x81134400, done_up.done = 0, <<<
> >>
> >> crash> b 17
> >> PID: 17 TASK: 81134400 CPU: 1 COMMAND: "cpuhp/1"
> >> #0 [81143b68] __schedule+0x34c at 3089c424
> >> #1 [81143bd8] schedule+0x7e at 3089cafe
> >> #2 [81143c18] schedule_preempt_disabled+0x2a at 3089cfba
> >> #3 [81143c30] __mutex_lock+0x320 at 3089df60
> >>
> >> #4 [81143cb0] workqueue_online_cpu+0x9e at 300e847e <<< blocked on `wq_pool_attach_mutex`
> >> #5 [81143d20] cpuhp_invoke_callback+0xcc at 300c4f14
> >> #6 [81143db8] cpuhp_thread_fun+0x108 at 300c6848 <<< supposed to complete the bring-up of the CPU (`cpuhp_state:1.done_up`)
> >>
> >> crash> b 745
> >> PID: 745 TASK: 82359100 CPU: 0 COMMAND: "xfs-conv/dm-0"
> >> #0 [8b4bfa20] __schedule+0x34c at 3089c424
> >> #1 [8b4bfa90] schedule+0x7e at 3089cafe
> >> #2 [8b4bfad0] schedule_timeout+0x26e at 308a1d8e
> >> [inlined] do_wait_for_common
> >> [inlined] __wait_for_common
> >> #3 [8b4bfb88] wait_for_common+0x14a at 3089d902
> >> [ret call] wait_for_completion+0x1a at 3089d96a
> >>
> >> #4 [8b4bfc08] affine_move_task+0x10a at 300fb51a <<< blocked on `task->migration_pending->done` completion
> >> #5 [8b4bfd08] __set_cpus_allowed_ptr+0x12e at 300fb926
> >> [ret call] set_cpus_allowed_ptr+0xa at 300fba32
> >> #6 [8b4bfd78] worker_attach_to_pool+0x66 at 300e1dae <<< holds the `wq_pool_attach_mutex`
> >> #7 [8b4bfdc8] rescuer_thread+0x12c at 300e5bac
> >>
> >> crash> rx 8b4bfea0
> >> 8b4bfea0: [863373c0:kmalloc-192]
> >>
> >> crash> worker.task,rescue_wq 863373c0
> >> task = 0x82359100,
> >> rescue_wq = 0x8aa44400,
> >>
> >> crash> list -s pool_workqueue.pool pool_workqueue.mayday_node -hO workqueue_struct.maydays 0x8aa44400 | paste - -
> >> 1fffff7f751900 pool = 0x1aef56a00,
> >>
> >> crash> worker_pool.attrs 0x1aef56a00
> >> attrs = 0x80088180,
> >>
> >> crash> workqueue_attrs.cpumask[0].bits 0x80088180
> >> cpumask[0].bits = {0x1, 0x0, ...
> >>
> >> crash> cpumask.bits __cpu_active_mask
> >> bits = {0x1, 0x0, ...
> >>
> >> crash> cpumask.bits __cpu_online_mask
> >> bits = {0x3, 0x0, ...
> >>
> >> crash> task_struct.migration_pending,flags 0x82359100
> >> migration_pending = 0x8b4bfce8,
> >> flags = 0x4208060,
> >> ^ PF_KTHREAD
> >>
> >> crash> pd distribute_cpu_mask_prev:0
> >> per_cpu(distribute_cpu_mask_prev, 0) = 0
> >>
> >> crash> set_affinity_pending.refs.refs.counter,arg,stop_pending,done.done 0x8b4bfce8 -d
> >> refs.refs.counter = 1
> >> arg = {
> >> task = 0x82359100,
> >> dest_cpu = 0,
> >> pending = 0x8b4bfce8
> >> }
> >> stop_pending = 1,
> >> done.done = 0,
> >> ~~~
> >>
> >> In other words the `set_cpus_allowed_ptr()` is called from a worker thread which
> >> tries to migrate. The worker pool is only allowed on CPU 0 and that was supposed
> >> to be the destination as per the stack structure. In this case I thought it's OK
> >> to leave the task on the old CPU
> >
> > AFAICT if a call to set_cpus_allowed_ptr() ends up in affine_move_task()
> > and down to the stopper call, that means the task isn't allowed on its
> > current CPU and needs to be moved.
> >
> >> and the Bill's testing scenario was successful
> >> with the proposed patch. IIUC, it's exercising the hotplug due to load-balancing.
> >>
> >> This was on RHEL 8.8.z kernel. I see upstream changed a bit so I'm not sure it's
> >> still reproducible. Also, I'm not sure why this only happens on s390 and not on
> >> x86. I imagine the CPU hotplug slightly differs? Anyways this seems to be timing
> >> sensitive and the timing will differ greatly for sure.
> >>
> >
> > Thanks for the extra context!
> >
> > Double checking what I wrote before, I forgot RCU considers preempt-off
> > sections as read-side critical sections. __set_cpus_allowed_ptr() already
> > has preemption disabled all the way from reading the cpu_active_mask to the
> > stop_one_cpu_nowait() call via task_rq_lock() + preempt_disable().
> >
> > IOW we have:
> >
> > __set_cpus_allowed_ptr()
> > task_rq_lock() <-- PREEMPT OFF
> > __set_cpus_allowed_ptr_locked()
> > cpu_valid_mask = cpu_active_mask;
> > dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);
> > affine_move_task()
> > preempt_disable();
> > task_rq_unlock();
> > stop_one_cpu_nowait(); <-- preemption still OFF
> >
> > And, considering:
> >
> > sched_cpu_deactivate()
> > set_cpu_active(cpu, false);
> > synchronize_rcu();
> >
> > Then, if __set_cpus_allowed_ptr() observes a CPU as being in the
> > cpu_active_mask and uses that one as a destination CPU, said CPU cannot
> > reach CPUHP_TEARDOWN_CPU:take_cpu_down() and park the stopper thread
> > because its hotplug machinery will wait on the synchronize_rcu() in
> > CPUHP_AP_ACTIVE:sched_cpu_deactivate().
> >
> > So "in theory", this shouldn't happen upstream.
>
> Eh nevermind, in your stacktrace the relevant task is a rescuer thread
> which is a kthread, so the cpu_valid_mask in use there is
> cpu_online_mask, not cpu_active_mask... Back to reading code :-)

I was just about to correct you on this point, but you figured it out
yourself after all.

On the other hand the `new_mask` (which comes from the worker pool allowed CPUs)

~~~
2102 static void worker_attach_to_pool(struct worker *worker,
2103 struct worker_pool *pool)
2104 {
2105 mutex_lock(&wq_pool_attach_mutex);
--
2117 if (worker->rescue_wq)
2118 set_cpus_allowed_ptr(worker->task,
pool_allowed_cpus(pool));

2827 static int rescuer_thread(void *__rescuer)
2828 {
2829 struct worker *rescuer = __rescuer;
2830 struct workqueue_struct *wq = rescuer->rescue_wq;
--
2856 while (!list_empty(&wq->maydays)) {
2857 struct pool_workqueue *pwq = list_first_entry(&wq->maydays,
2858 struct pool_workqueue,
mayday_node);
2859 struct worker_pool *pool = pwq->pool;
--
2863 list_del_init(&pwq->mayday_node);
2864
2865 raw_spin_unlock_irq(&wq_mayday_lock);
2866
2867 worker_attach_to_pool(rescuer, pool);
2868
2869 raw_spin_lock_irq(&pool->lock);
~~~

matches the `cpu_valid_mask` and is ANDed with the `cpu_online_mask`, so
the final result is the same...
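
To illustrate the mask arithmetic above, here is a minimal userspace
sketch. It is not kernel code: `cpumask_bits`, `valid_mask()`,
`pick_dest_cpu()` and `print_dest_cpus()` are hypothetical names, and
picking the lowest set bit is a simplification of
`cpumask_any_and_distribute()`. The mask values are the ones from the
crash output (`__cpu_online_mask` = 0x3, `__cpu_active_mask` = 0x1, pool
cpumask = 0x1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace model: one bit per CPU, bit 0 = CPU 0. */
typedef unsigned long cpumask_bits;

/*
 * Mirrors the selection discussed in the thread: kthreads are checked
 * against the online mask, other tasks against the active mask.
 */
static cpumask_bits valid_mask(bool is_kthread, cpumask_bits online,
                               cpumask_bits active)
{
    return is_kthread ? online : active;
}

/* Lowest set bit of (valid & new_mask), or -1 if the intersection is empty. */
static int pick_dest_cpu(cpumask_bits valid, cpumask_bits new_mask)
{
    cpumask_bits both = valid & new_mask;

    for (int cpu = 0; cpu < (int)(8 * sizeof(both)); cpu++)
        if (both & (1UL << cpu))
            return cpu;
    return -1;
}

/* Evaluates both cases with the values seen in the vmcore. */
static void print_dest_cpus(void)
{
    const cpumask_bits online = 0x3, active = 0x1, pool = 0x1;
    int as_kthread = pick_dest_cpu(valid_mask(true, online, active), pool);
    int as_user = pick_dest_cpu(valid_mask(false, online, active), pool);

    /* With a pool mask of 0x1 the choice of valid mask makes no difference. */
    assert(as_kthread == as_user);
    printf("kthread: dest_cpu = %d, user task: dest_cpu = %d\n",
           as_kthread, as_user);
}
```

Calling `print_dest_cpus()` from a `main()` reports CPU 0 in both cases,
which agrees with the `dest_cpu = 0` found in the pending structure.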

Actually double checking now, I may have pulled the wrong pool as the
right `pwq` is removed from the `maydays` list on line 2863 (the code
is similar in RHEL 8.8) and I have used whatever `pwq` was remaining
on that list, but possibly unrelated.

~~~
crash> rx 8b4bfe30
8b4bfe30: 00000001aef56a00

crash> worker_pool.attrs 1aef56a00
attrs = 0x80088180,

crash> workqueue_attrs.cpumask[0].bits 0x80088180
cpumask[0].bits = {0x1, 0x0, ...

crash> set_affinity_pending.refs.refs.counter,arg,stop_pending,done.done 0x8b4bfce8
refs.refs.counter = 0x1
arg = {
task = 0x82359100,
dest_cpu = 0x0,
pending = 0x8b4bfce8
}
stop_pending = 0x1,
done.done = 0x0,
~~~

Nah, so not unrelated. It's the right one. The rescuer was deadlocked
for almost 3 days at the time the vmcore was collected, so it seems
the `pwq` returned to the `maydays` list in the meantime. And I guess
the pool cpumask could have changed as well. Nevertheless, the local
`affine_move_task::my_pending.arg.dest_cpu` on the stack still shows
that CPU 0 was selected at that time.

So far I don't see why upstream would not be prone to the same issue.
But I may be missing something.
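
For reference, the failure mode can be modelled in userspace. This is
only a sketch under stated assumptions, not the kernel implementation:
`pending_model`, `task_model`, `queue_stop_work()` and
`move_task_model()` are hypothetical stand-ins, the completion is reduced
to a counter, and the stopper callback is assumed to run immediately when
the work is queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures named in the thread. */
struct pending_model {
    bool stop_pending;   /* stopper work believed to be queued */
    unsigned int done;   /* completion counter; 0 means waiters block */
};

struct task_model {
    struct pending_model *migration_pending;
};

/* Models stop_one_cpu_nowait(): false when the stopper is not enabled. */
static bool queue_stop_work(bool stopper_enabled)
{
    return stopper_enabled;
}

/*
 * Models the tail of affine_move_task(). Returns true if a waiter on
 * pending->done would be released, false if it would block forever.
 */
static bool move_task_model(struct task_model *p, bool stopper_enabled,
                            bool handle_failure)
{
    struct pending_model *pending = p->migration_pending;

    pending->stop_pending = true;   /* set before the work is queued */

    if (queue_stop_work(stopper_enabled)) {
        /* Modelled stopper callback: finish up and wake the waiters. */
        pending->stop_pending = false;
        p->migration_pending = NULL;
        pending->done = 1;
    } else if (handle_failure) {
        /* Failure path as in the patch: undo the state, wake the waiters. */
        pending->stop_pending = false;
        p->migration_pending = NULL;
        pending->done = 1;
    }
    /* Without failure handling nothing ever completes the request. */
    return pending->done != 0;
}
```

With `stopper_enabled = false` and `handle_failure = false` the model
ends with `stop_pending` set and `done` at 0, the same state as in the
vmcore. With `handle_failure = true` the waiter is released.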

--nX