Hi folks,
This should fix the issue reported by Qian at [1]. Dietmar mentioned some
other issue with hotplug & deadline tasks, but that's being investigated by
someone else ATM.
I would like to mention that I suspect this bug comes straight from $hell, as
I've never ever had to fight off so many (mostly unrelated) issues while
looking into it: distro kernel being mangled, git tree corruption, periods of
heisenbugism...
Cheers,
Valentin
[1]: https://lore.kernel.org/r/[email protected]
Valentin Schneider (2):
stop_machine: Add caller debug info to queue_stop_cpus_work
workqueue: Fix affinity of kworkers attached during late hotplug
kernel/stop_machine.c | 1 +
kernel/workqueue.c | 24 +++++++++++++++++-------
2 files changed, 18 insertions(+), 7 deletions(-)
--
2.27.0
Per-CPU kworkers forcefully migrated away by hotplug via
workqueue_offline_cpu() can end up spawning more kworkers via

  manage_workers() -> maybe_create_worker()

Workers created at this point will be bound using

  pool->attrs->cpumask

which in this case is wrong, as the hotplug state machine already migrated
all pinned kworkers away from this CPU. This ends up triggering the BUG_ON
condition in sched_cpu_dying() (i.e. there's a kworker enqueued on the
dying rq).
Special-case workers being attached to DISASSOCIATED pools and bind them to
cpu_active_mask, mimicking them being present when workqueue_offline_cpu()
was invoked.
Link: https://lore.kernel.org/r/[email protected]
Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
Reported-by: Qian Cai <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/workqueue.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9880b6c0e272..fb1418edf85c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1848,19 +1848,29 @@ static void worker_attach_to_pool(struct worker *worker,
{
mutex_lock(&wq_pool_attach_mutex);
- /*
- * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
- * online CPUs. It'll be re-applied when any of the CPUs come up.
- */
- set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
-
/*
* The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
* stable across this function. See the comments above the flag
* definition for details.
+ *
+ * Worker might get attached to a pool *after* workqueue_offline_cpu()
+ * was run - e.g. created by manage_workers() from a kworker which was
+ * forcefully moved away by hotplug. Kworkers created from this point on
+ * need to have their affinity changed as if they were present during
+ * workqueue_offline_cpu().
+ *
+ * This will be resolved in rebind_workers().
*/
- if (pool->flags & POOL_DISASSOCIATED)
+ if (pool->flags & POOL_DISASSOCIATED) {
worker->flags |= WORKER_UNBOUND;
+ set_cpus_allowed_ptr(worker->task, cpu_active_mask);
+ } else {
+ /*
+ * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
+ * online CPUs. It'll be re-applied when any of the CPUs come up.
+ */
+ set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+ }
list_add_tail(&worker->node, &pool->workers);
worker->pool = pool;
--
2.27.0
Most callsites were covered by commit

  a8b62fd08505 ("stop_machine: Add function and caller debug info")

but this skipped queue_stop_cpus_work(). Add caller debug info to it.
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/stop_machine.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 971d8acceaec..cbc30271ea4d 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -409,6 +409,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
work->fn = fn;
work->arg = arg;
work->done = done;
+ work->caller = _RET_IP_;
if (cpu_stop_queue_work(cpu, work))
queued = true;
}
--
2.27.0
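As an aside, _RET_IP_ is simply the caller's return address:

  /* include/linux/kernel.h */
  #define _RET_IP_        (unsigned long)__builtin_return_address(0)

so the stopper work records where it was queued from, and the debug info
added by a8b62fd08505 can be printed with %ps/%pS, e.g. (illustrative
only, not the exact kernel output):

  pr_info("stopper fn=%ps queued from %pS\n",
          work->fn, (void *)work->caller);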
On Fri, Dec 11, 2020 at 11:39:21AM +0000, Vincent Donnefort wrote:
> On Thu, Dec 10, 2020 at 04:38:30PM +0000, Valentin Schneider wrote:
> > + if (pool->flags & POOL_DISASSOCIATED) {
> > worker->flags |= WORKER_UNBOUND;
> > + set_cpus_allowed_ptr(worker->task, cpu_active_mask);
> > + } else {
> > + /*
> > + * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
> > + * online CPUs. It'll be re-applied when any of the CPUs come up.
> > + */
>
> Does this comment still stand? IIUC, we should always be in the
> POOL_DISASSOCIATED case if the CPU from cpumask is offline. Unless a
> pool->attrs->cpumask can have several CPUs. In that case maybe we should check
> for the cpu_active_mask here too?
IIUC it can be a NUMA mask, and would still be valid in that case.
> > + set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> > + }
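To illustrate: for an unbound (e.g. per-node) pool, pool->attrs->cpumask
spans several CPUs, so a check along these lines (a made-up helper, not
something that exists in workqueue.c) still succeeds with one of those
CPUs gone:

  /* Illustrative sketch only. */
  static bool pool_mask_has_online_cpu(const struct worker_pool *pool)
  {
          /* True as long as any CPU in the pool's mask remains online. */
          return cpumask_intersects(pool->attrs->cpumask, cpu_online_mask);
  }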
On Fri, Dec 11, 2020 at 11:39:21AM +0000, Vincent Donnefort wrote:
> Hi Valentin,
>
> On Thu, Dec 10, 2020 at 04:38:30PM +0000, Valentin Schneider wrote:
> > Per-CPU kworkers forcefully migrated away by hotplug via
> > workqueue_offline_cpu() can end up spawning more kworkers via
> >
> >   manage_workers() -> maybe_create_worker()
> >
> > Workers created at this point will be bound using
> >
> >   pool->attrs->cpumask
> >
> > which in this case is wrong, as the hotplug state machine already migrated
> > all pinned kworkers away from this CPU. This ends up triggering the BUG_ON
> > condition in sched_cpu_dying() (i.e. there's a kworker enqueued on the
> > dying rq).
> >
> > Special-case workers being attached to DISASSOCIATED pools and bind them to
> > cpu_active_mask, mimicking them being present when workqueue_offline_cpu()
> > was invoked.
> >
> > Link: https://lore.kernel.org/r/[email protected]
> > Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
>
> Isn't the problem introduced by 1cf12e0 ("sched/hotplug: Consolidate
> task migration on CPU unplug") ?
>
> Previously we had:
>
> AP_WORKQUEUE_ONLINE -> set POOL_DISASSOCIATED
> ...
> TEARDOWN_CPU -> clear CPU in cpu_online_mask
>     |
>     |- AP_SCHED_STARTING -> migrate_tasks()
>     |
> AP_OFFLINE
>
> worker_attach_to_pool() is "protected" by the cpu_online_mask in
> set_cpus_allowed_ptr(). IIUC, now that the tasks are migrated before
> cpu_online_mask is actually flipped, there's a window between
> CPUHP_AP_SCHED_WAIT_EMPTY and CPUHP_TEARDOWN_CPU where a kworker can wake
> up a new one for the hotunplugged pool, which wouldn't be caught by the
> hotunplug migration.
Yes, very much so; however, the commit Valentin picked was supposed to
preemptively fix this. So we can consider this a fix for the fix.
But I don't mind an alternative, or perhaps even a second, Fixes tag on
this.
On Thu, Dec 10, 2020 at 04:38:28PM +0000, Valentin Schneider wrote:
> Valentin Schneider (2):
> stop_machine: Add caller debug info to queue_stop_cpus_work
> workqueue: Fix affinity of kworkers attached during late hotplug
>
> kernel/stop_machine.c | 1 +
> kernel/workqueue.c | 24 +++++++++++++++++-------
> 2 files changed, 18 insertions(+), 7 deletions(-)
TJ, would you be okay if I take these through the sched tree that
contains the migrate_disable() patches, such that problem and fix stay
together?
On Fri, Dec 11, 2020 at 01:13:35PM +0000, Valentin Schneider wrote:
> On 11/12/20 12:51, Valentin Schneider wrote:
>> In that case maybe we should check for the cpu_active_mask here too?
> >
> > Looking at it again, I think we might need to.
> >
> > IIUC you can end up with pools bound to a single NUMA node (?). In that
> > case, say the last CPU of a node is going down, then:
> >
> >   workqueue_offline_cpu()
> >     wq_update_unbound_numa()
> >       alloc_unbound_pwq()
> >         get_unbound_pool()
> >
> > would still pick that node, because it doesn't look at the online / active
> > mask. And at this point, we would affine the
> > kworkers to that node, and we're back to having kworkers enqueued on a
> > (!active, online) CPU that is going down...
>
> Assuming a node covers at least 2 CPUs, that can't actually happen per
> is_cpu_allowed().
Yes indeed, my bad, no problem here.
Hi Valentin,
On Thu, Dec 10, 2020 at 04:38:30PM +0000, Valentin Schneider wrote:
> Per-CPU kworkers forcefully migrated away by hotplug via
> workqueue_offline_cpu() can end up spawning more kworkers via
>
>   manage_workers() -> maybe_create_worker()
>
> Workers created at this point will be bound using
>
>   pool->attrs->cpumask
>
> which in this case is wrong, as the hotplug state machine already migrated
> all pinned kworkers away from this CPU. This ends up triggering the BUG_ON
> condition in sched_cpu_dying() (i.e. there's a kworker enqueued on the
> dying rq).
>
> Special-case workers being attached to DISASSOCIATED pools and bind them to
> cpu_active_mask, mimicking them being present when workqueue_offline_cpu()
> was invoked.
>
> Link: https://lore.kernel.org/r/[email protected]
> Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
Isn't the problem introduced by 1cf12e0 ("sched/hotplug: Consolidate
task migration on CPU unplug") ?
Previously we had:
AP_WORKQUEUE_ONLINE -> set POOL_DISASSOCIATED
...
TEARDOWN_CPU -> clear CPU in cpu_online_mask
    |
    |- AP_SCHED_STARTING -> migrate_tasks()
    |
AP_OFFLINE
worker_attach_to_pool() is "protected" by the cpu_online_mask in
set_cpus_allowed_ptr(). IIUC, now that the tasks are migrated before
cpu_online_mask is actually flipped, there's a window between
CPUHP_AP_SCHED_WAIT_EMPTY and CPUHP_TEARDOWN_CPU where a kworker can wake
up a new one for the hotunplugged pool, which wouldn't be caught by the
hotunplug migration.
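Roughly, the guard in question (a simplified sketch from memory; the real
logic lives in __set_cpus_allowed_ptr() and handles more cases):

  static int set_cpus_allowed_sketch(struct task_struct *p,
                                     const struct cpumask *new_mask)
  {
          /*
           * With the old ordering, TEARDOWN_CPU had already cleared the
           * dying CPU from cpu_online_mask, so a mask containing only
           * that CPU was rejected here.
           */
          if (!cpumask_intersects(new_mask, cpu_online_mask))
                  return -EINVAL;

          /* ... otherwise update p->cpus_ptr and migrate if needed ... */
          return 0;
  }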
> Reported-by: Qian Cai <[email protected]>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> kernel/workqueue.c | 24 +++++++++++++++++-------
> 1 file changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 9880b6c0e272..fb1418edf85c 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1848,19 +1848,29 @@ static void worker_attach_to_pool(struct worker *worker,
> {
> mutex_lock(&wq_pool_attach_mutex);
>
> - /*
> - * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
> - * online CPUs. It'll be re-applied when any of the CPUs come up.
> - */
> - set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> -
> /*
> * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
> * stable across this function. See the comments above the flag
> * definition for details.
> + *
> + * Worker might get attached to a pool *after* workqueue_offline_cpu()
> + * was run - e.g. created by manage_workers() from a kworker which was
> + * forcefully moved away by hotplug. Kworkers created from this point on
> + * need to have their affinity changed as if they were present during
> + * workqueue_offline_cpu().
> + *
> + * This will be resolved in rebind_workers().
> */
> - if (pool->flags & POOL_DISASSOCIATED)
> + if (pool->flags & POOL_DISASSOCIATED) {
> worker->flags |= WORKER_UNBOUND;
> + set_cpus_allowed_ptr(worker->task, cpu_active_mask);
> + } else {
> + /*
> + * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
> + * online CPUs. It'll be re-applied when any of the CPUs come up.
> + */
Does this comment still stand? IIUC, we should always be in the
POOL_DISASSOCIATED case if the CPU from cpumask is offline. Unless a
pool->attrs->cpumask can have several CPUs. In that case maybe we should check
for the cpu_active_mask here too?
--
Vincent
> + set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> + }
>
> list_add_tail(&worker->node, &pool->workers);
> worker->pool = pool;
> --
> 2.27.0
>
Hi Vincent,
On 11/12/20 11:39, Vincent Donnefort wrote:
> Hi Valentin,
>
> On Thu, Dec 10, 2020 at 04:38:30PM +0000, Valentin Schneider wrote:
>> Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
>
> Isn't the problem introduced by 1cf12e0 ("sched/hotplug: Consolidate
> task migration on CPU unplug") ?
>
> Previously we had:
>
> AP_WORKQUEUE_ONLINE -> set POOL_DISASSOCIATED
> ...
> TEARDOWN_CPU -> clear CPU in cpu_online_mask
>     |
>     |- AP_SCHED_STARTING -> migrate_tasks()
>     |
> AP_OFFLINE
>
> worker_attach_to_pool() is "protected" by the cpu_online_mask in
> set_cpus_allowed_ptr(). IIUC, now that the tasks are migrated before
> cpu_online_mask is actually flipped, there's a window between
> CPUHP_AP_SCHED_WAIT_EMPTY and CPUHP_TEARDOWN_CPU where a kworker can wake
> up a new one for the hotunplugged pool, which wouldn't be caught by the
> hotunplug migration.
>
You're right, the splat should only happen with that other commit. That
said, this fix complements the one referred to in Fixes:, which is the
"logic" I went for.
>> Reported-by: Qian Cai <[email protected]>
>> Signed-off-by: Valentin Schneider <[email protected]>
>> ---
>> kernel/workqueue.c | 24 +++++++++++++++++-------
>> 1 file changed, 17 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index 9880b6c0e272..fb1418edf85c 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -1848,19 +1848,29 @@ static void worker_attach_to_pool(struct worker *worker,
>> {
>> mutex_lock(&wq_pool_attach_mutex);
>>
>> - /*
>> - * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
>> - * online CPUs. It'll be re-applied when any of the CPUs come up.
>> - */
>> - set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
>> -
>> /*
>> * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
>> * stable across this function. See the comments above the flag
>> * definition for details.
>> + *
>> + * Worker might get attached to a pool *after* workqueue_offline_cpu()
>> + * was run - e.g. created by manage_workers() from a kworker which was
>> + * forcefully moved away by hotplug. Kworkers created from this point on
>> + * need to have their affinity changed as if they were present during
>> + * workqueue_offline_cpu().
>> + *
>> + * This will be resolved in rebind_workers().
>> */
>> - if (pool->flags & POOL_DISASSOCIATED)
>> + if (pool->flags & POOL_DISASSOCIATED) {
>> worker->flags |= WORKER_UNBOUND;
>> + set_cpus_allowed_ptr(worker->task, cpu_active_mask);
>> + } else {
>> + /*
>> + * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
>> + * online CPUs. It'll be re-applied when any of the CPUs come up.
>> + */
>
> Does this comment still stand? IIUC, we should always be in the
> POOL_DISASSOCIATED case if the CPU from cpumask is offline. Unless a
> pool->attrs->cpumask can have several CPUs.
AIUI that should be the case for unbound pools.
> In that case maybe we should check for the cpu_active_mask here too?
Looking at it again, I think we might need to.
IIUC you can end up with pools bound to a single NUMA node (?). In that
case, say the last CPU of a node is going down, then:
  workqueue_offline_cpu()
    wq_update_unbound_numa()
      alloc_unbound_pwq()
        get_unbound_pool()
would still pick that node, because it doesn't look at the online / active
mask. And at this point, we would affine the
kworkers to that node, and we're back to having kworkers enqueued on a
(!active, online) CPU that is going down...
The annoying thing is we can't just compare attrs->cpumask with
cpu_active_mask, because workqueue_offline_cpu() happens a few steps below
sched_cpu_deactivate() (CPUHP_AP_ACTIVE):
  CPUHP_ONLINE -> CPUHP_AP_ACTIVE      # CPU X is !active
    # Some new kworker gets created here
    worker_attach_to_pool()
      !cpumask_subset(attrs->cpumask, cpu_active_mask)
      -> affine worker to active CPUs
  CPUHP_AP_ACTIVE -> CPUHP_ONLINE      # CPU X is active
  # Nothing will ever correct the kworker's affinity :(
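IOW, a naive version of the check (hypothetical hunk, not something I'm
actually proposing):

  if (!cpumask_subset(pool->attrs->cpumask, cpu_active_mask)) {
          /*
           * Catches attaches during hot-unplug, but also fires in the
           * above window where CPU X is !active, and nothing will ever
           * widen the affinity once X becomes active again.
           */
          set_cpus_allowed_ptr(worker->task, cpu_active_mask);
  } else {
          set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
  }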
On 11/12/20 12:51, Valentin Schneider wrote:
>> In that case maybe we should check for the cpu_active_mask here too?
>
> Looking at it again, I think we might need to.
>
> IIUC you can end up with pools bound to a single NUMA node (?). In that
> case, say the last CPU of a node is going down, then:
>
>   workqueue_offline_cpu()
>     wq_update_unbound_numa()
>       alloc_unbound_pwq()
>         get_unbound_pool()
>
> would still pick that node, because it doesn't look at the online / active
> mask. And at this point, we would affine the
> kworkers to that node, and we're back to having kworkers enqueued on a
> (!active, online) CPU that is going down...
Assuming a node covers at least 2 CPUs, that can't actually happen per
is_cpu_allowed().
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 2a2f80ff63bc36a874ed569bcaef932a8fe43514
Gitweb: https://git.kernel.org/tip/2a2f80ff63bc36a874ed569bcaef932a8fe43514
Author: Valentin Schneider <[email protected]>
AuthorDate: Thu, 10 Dec 2020 16:38:29
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 23 Mar 2021 16:01:58 +01:00
stop_machine: Add caller debug info to queue_stop_cpus_work
Most callsites were covered by commit

  a8b62fd08505 ("stop_machine: Add function and caller debug info")

but this skipped queue_stop_cpus_work(). Add caller debug info to it.
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/stop_machine.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 971d8ac..cbc3027 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -409,6 +409,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
work->fn = fn;
work->arg = arg;
work->done = done;
+ work->caller = _RET_IP_;
if (cpu_stop_queue_work(cpu, work))
queued = true;
}