From: Lai Jiangshan <[email protected]>
Commit 06249738a41a ("workqueue: Manually break affinity on hotplug")
says that the scheduler will no longer forcibly break affinity for us.
But workqueue highly depends on the old behavior. Many parts of the code
rely on it; 06249738a41a ("workqueue: Manually break affinity on hotplug")
is not enough to change that, and the commit has flaws of its own.
It doesn't handle worker detachment.
It doesn't handle worker attachment, mainly worker creation,
which is handled by Valentin Schneider's patch [1].
It doesn't handle unbound workers which might possibly be
per-cpu kthreads.
We need to thoroughly update the way workqueue handles affinity
in cpu hot[un]plug, which is what this patchset intends to do; it
also replaces Valentin Schneider's patch [1]. The equivalent patch
is patch 10.
Patch 1 fixes a flaw reported by Hillf Danton <[email protected]>.
I have to include this fix because later patches depend on it.
The patchset is based on tip/master rather than workqueue tree,
because the patchset is a complement for 06249738a41a ("workqueue:
Manually break affinity on hotplug") which is only in tip/master by now.
And TJ acked routing the series through tip.
Changed from V1:
Add TJ's acked-by for the whole patchset
Add more words to the comments and the changelog, mainly derived
from discussion with Peter.
Update the comments as TJ suggested.
Update a line of code as Valentin suggested.
Add Valentin's ack for patch 10 because of "Seems alright to me." and
add Valentin's comments, which are integral, to the changelog.
[1]: https://lore.kernel.org/r/[email protected]
[V1 patchset]: https://lore.kernel.org/lkml/[email protected]/
Cc: Hillf Danton <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Donnefort <[email protected]>
Cc: Tejun Heo <[email protected]>
Lai Jiangshan (10):
workqueue: restore unbound_workers' cpumask correctly
workqueue: use cpu_possible_mask instead of cpu_active_mask to break
affinity
workqueue: Manually break affinity on pool detachment
workqueue: don't set the worker's cpumask when kthread_bind_mask()
workqueue: introduce wq_online_cpumask
workqueue: use wq_online_cpumask in restore_unbound_workers_cpumask()
workqueue: Manually break affinity on hotplug for unbound pool
workqueue: reorganize workqueue_online_cpu()
workqueue: reorganize workqueue_offline_cpu() unbind_workers()
workqueue: Fix affinity of kworkers when attaching into pool
kernel/workqueue.c | 214 ++++++++++++++++++++++++++++-----------------
1 file changed, 132 insertions(+), 82 deletions(-)
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
The scheduler won't break affinity for us any more, and we should
"emulate" the same behavior as when the scheduler used to break
affinity for us, which is "changing the cpumask to cpu_possible_mask".
Other CPUs might come online later while the worker is still running
with pending work items. The worker should be allowed to use those
later-onlined CPUs as before and process the work items ASAP. Using
cpu_active_mask here can't achieve this goal, but cpu_possible_mask
can.
Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index aba71ab359dd..fa71520822f0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4910,7 +4910,7 @@ static void unbind_workers(int cpu)
raw_spin_unlock_irq(&pool->lock);
for_each_pool_worker(worker, pool)
- WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_active_mask) < 0);
+ WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
mutex_unlock(&wq_pool_attach_mutex);
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
wq_online_cpumask is the cached result of cpu_online_mask with the
going-down cpu cleared. It is needed by later patches for setting the
correct cpumask for workers and for breaking affinity proactively.
The first usage of wq_online_cpumask is also in this patch:
wq_calc_node_cpumask() and wq_update_unbound_numa() can be simplified
a little.
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 34 ++++++++++++++++++----------------
1 file changed, 18 insertions(+), 16 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5f3c86eaed7a..84842f10e6a2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -310,6 +310,9 @@ static bool workqueue_freezing; /* PL: have wqs started freezing? */
/* PL: allowable cpus for unbound wqs and work items */
static cpumask_var_t wq_unbound_cpumask;
+/* PL: online cpus (cpu_online_mask with the going-down cpu cleared) */
+static cpumask_var_t wq_online_cpumask;
+
/* CPU where unbound work was last round robin scheduled from this CPU */
static DEFINE_PER_CPU(int, wq_rr_cpu_last);
@@ -3833,12 +3836,10 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
* wq_calc_node_cpumask - calculate a wq_attrs' cpumask for the specified node
* @attrs: the wq_attrs of the default pwq of the target workqueue
* @node: the target NUMA node
- * @cpu_going_down: if >= 0, the CPU to consider as offline
* @cpumask: outarg, the resulting cpumask
*
- * Calculate the cpumask a workqueue with @attrs should use on @node. If
- * @cpu_going_down is >= 0, that cpu is considered offline during
- * calculation. The result is stored in @cpumask.
+ * Calculate the cpumask a workqueue with @attrs should use on @node.
+ * The result is stored in @cpumask.
*
* If NUMA affinity is not enabled, @attrs->cpumask is always used. If
* enabled and @node has online CPUs requested by @attrs, the returned
@@ -3852,15 +3853,14 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
* %false if equal.
*/
static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
- int cpu_going_down, cpumask_t *cpumask)
+ cpumask_t *cpumask)
{
if (!wq_numa_enabled || attrs->no_numa)
goto use_dfl;
/* does @node have any online CPUs @attrs wants? */
cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
- if (cpu_going_down >= 0)
- cpumask_clear_cpu(cpu_going_down, cpumask);
+ cpumask_and(cpumask, cpumask, wq_online_cpumask);
if (cpumask_empty(cpumask))
goto use_dfl;
@@ -3969,7 +3969,7 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
goto out_free;
for_each_node(node) {
- if (wq_calc_node_cpumask(new_attrs, node, -1, tmp_attrs->cpumask)) {
+ if (wq_calc_node_cpumask(new_attrs, node, tmp_attrs->cpumask)) {
ctx->pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
if (!ctx->pwq_tbl[node])
goto out_free;
@@ -4094,7 +4094,6 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
* wq_update_unbound_numa - update NUMA affinity of a wq for CPU hot[un]plug
* @wq: the target workqueue
* @cpu: the CPU coming up or going down
- * @online: whether @cpu is coming up or going down
*
* This function is to be called from %CPU_DOWN_PREPARE, %CPU_ONLINE and
* %CPU_DOWN_FAILED. @cpu is being hot[un]plugged, update NUMA affinity of
@@ -4112,11 +4111,9 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
* affinity, it's the user's responsibility to flush the work item from
* CPU_DOWN_PREPARE.
*/
-static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
- bool online)
+static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu)
{
int node = cpu_to_node(cpu);
- int cpu_off = online ? -1 : cpu;
struct pool_workqueue *old_pwq = NULL, *pwq;
struct workqueue_attrs *target_attrs;
cpumask_t *cpumask;
@@ -4144,7 +4141,7 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
* and create a new one if they don't match. If the target cpumask
* equals the default pwq's, the default pwq should be used.
*/
- if (wq_calc_node_cpumask(wq->dfl_pwq->pool->attrs, node, cpu_off, cpumask)) {
+ if (wq_calc_node_cpumask(wq->dfl_pwq->pool->attrs, node, cpumask)) {
if (cpumask_equal(cpumask, pwq->pool->attrs->cpumask))
return;
} else {
@@ -5081,6 +5078,7 @@ int workqueue_online_cpu(unsigned int cpu)
int pi;
mutex_lock(&wq_pool_mutex);
+ cpumask_set_cpu(cpu, wq_online_cpumask);
for_each_pool(pool, pi) {
mutex_lock(&wq_pool_attach_mutex);
@@ -5095,7 +5093,7 @@ int workqueue_online_cpu(unsigned int cpu)
/* update NUMA affinity of unbound workqueues */
list_for_each_entry(wq, &workqueues, list)
- wq_update_unbound_numa(wq, cpu, true);
+ wq_update_unbound_numa(wq, cpu);
mutex_unlock(&wq_pool_mutex);
return 0;
@@ -5113,8 +5111,9 @@ int workqueue_offline_cpu(unsigned int cpu)
/* update NUMA affinity of unbound workqueues */
mutex_lock(&wq_pool_mutex);
+ cpumask_clear_cpu(cpu, wq_online_cpumask);
list_for_each_entry(wq, &workqueues, list)
- wq_update_unbound_numa(wq, cpu, false);
+ wq_update_unbound_numa(wq, cpu);
mutex_unlock(&wq_pool_mutex);
return 0;
@@ -5951,6 +5950,9 @@ void __init workqueue_init_early(void)
BUILD_BUG_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
+ BUG_ON(!alloc_cpumask_var(&wq_online_cpumask, GFP_KERNEL));
+ cpumask_copy(wq_online_cpumask, cpu_online_mask);
+
BUG_ON(!alloc_cpumask_var(&wq_unbound_cpumask, GFP_KERNEL));
cpumask_copy(wq_unbound_cpumask, housekeeping_cpumask(hk_flags));
@@ -6047,7 +6049,7 @@ void __init workqueue_init(void)
}
list_for_each_entry(wq, &workqueues, list) {
- wq_update_unbound_numa(wq, smp_processor_id(), true);
+ wq_update_unbound_numa(wq, smp_processor_id());
WARN(init_rescuer(wq),
"workqueue: failed to create early rescuer for %s",
wq->name);
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
The pool->attrs->cpumask might be a single CPU and it may go
down after detachment, and the scheduler won't forcibly break
affinity for us since the worker is a per-cpu kthread. So we have
to do it on our own and unbind this worker, which can't be unbound
by workqueue_offline_cpu() since it doesn't belong to any pool
after detachment. Do it unconditionally since there is no harm in
breaking affinity for a non-per-cpu kthread and we don't need to
rely on the scheduler's policy on when to break affinity.
Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index fa71520822f0..4d7575311198 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1885,6 +1885,19 @@ static void worker_detach_from_pool(struct worker *worker)
if (list_empty(&pool->workers))
detach_completion = pool->detach_completion;
+
+ /*
+ * The pool->attrs->cpumask might be a single CPU and it may go
+ * down after detachment, and the scheduler won't forcibly break
+ * affinity for us since the worker is a per-cpu kthread. So we have
+ * to do it on our own and unbind this worker, which can't be unbound
+ * by workqueue_offline_cpu() since it doesn't belong to any pool
+ * after detachment. Do it unconditionally since there is no harm in
+ * breaking affinity for a non-per-cpu kthread and we don't need to
+ * rely on the scheduler's policy on when to break affinity.
+ */
+ set_cpus_allowed_ptr(worker->task, cpu_possible_mask);
+
mutex_unlock(&wq_pool_attach_mutex);
/* clear leftover flags without pool->lock after it is detached */
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
When we restore the workers' cpumask, we should restore them to the
designated pool->attrs->cpumask. And we only need to do it the first
time (when the first CPU for the pool comes online).
Cc: Hillf Danton <[email protected]>
Reported-by: Hillf Danton <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c71da2a59e12..aba71ab359dd 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5031,9 +5031,13 @@ static void restore_unbound_workers_cpumask(struct worker_pool *pool, int cpu)
cpumask_and(&cpumask, pool->attrs->cpumask, cpu_online_mask);
+ /* is @cpu the first one onlined for the @pool? */
+ if (cpumask_weight(&cpumask) > 1)
+ return;
+
/* as we're called from CPU_ONLINE, the following shouldn't fail */
for_each_pool_worker(worker, pool)
- WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, &cpumask) < 0);
+ WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask) < 0);
}
int workqueue_prepare_cpu(unsigned int cpu)
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
There might be no online CPU in pool->attrs->cpumask.
We will set the worker's cpumask later in worker_attach_to_pool().
Cc: Peter Zijlstra <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4d7575311198..5f3c86eaed7a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1948,7 +1948,15 @@ static struct worker *create_worker(struct worker_pool *pool)
goto fail;
set_user_nice(worker->task, pool->attrs->nice);
- kthread_bind_mask(worker->task, pool->attrs->cpumask);
+
+ /*
+ * Set PF_NO_SETAFFINITY via kthread_bind_mask(). We use
+ * cpu_possible_mask instead of pool->attrs->cpumask, because
+ * there might not be any online cpu in the pool->attrs->cpumask.
+ * The cpumask of the worker will be set properly later in
+ * worker_attach_to_pool().
+ */
+ kthread_bind_mask(worker->task, cpu_possible_mask);
/* successful, attach the worker to the pool */
worker_attach_to_pool(worker, pool);
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
restore_unbound_workers_cpumask() is called at CPU_ONLINE, where
wq_online_cpumask equals cpu_online_mask, so no functionality is
changed.
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 84842f10e6a2..eda293097fe1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5041,13 +5041,14 @@ static void restore_unbound_workers_cpumask(struct worker_pool *pool, int cpu)
static cpumask_t cpumask;
struct worker *worker;
+ lockdep_assert_held(&wq_pool_mutex);
lockdep_assert_held(&wq_pool_attach_mutex);
/* is @cpu allowed for @pool? */
if (!cpumask_test_cpu(cpu, pool->attrs->cpumask))
return;
- cpumask_and(&cpumask, pool->attrs->cpumask, cpu_online_mask);
+ cpumask_and(&cpumask, pool->attrs->cpumask, wq_online_cpumask);
/* is @cpu the first one onlined for the @pool? */
if (cpumask_weight(&cpumask) > 1)
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
It is possible that a per-node pool/worker's affinity is a single
CPU. This can happen when wq_unbound_cpumask is changed by the system
admin via /sys/devices/virtual/workqueue/cpumask: pool->attrs->cpumask
is wq_unbound_cpumask & possible_cpumask_of_the_node, which can be a
single CPU and makes the pool's workers "per cpu kthreads".
And the scheduler won't break affinity on such "per cpu kthread"
workers when the CPU is going down, so we have to do it on our own.
We do it by reusing the existing restore_unbound_workers_cpumask()
and renaming it to update_unbound_workers_cpumask(). When the number
of online CPUs of the pool goes from 1 to 0, we break the affinity
proactively.
Note that we break the affinity even for non-per-cpu-kthread workers:
first, the code path is a slow path which is not worth much
optimization; second, we don't want to rely on the code/conditions
under which the scheduler forces breaking affinity for us.
The way to break affinity is to set the workers' affinity to
cpu_possible_mask, so that we preserve the same behavior as when
the scheduler breaks affinity for us.
Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 49 ++++++++++++++++++++++++++++++++--------------
1 file changed, 34 insertions(+), 15 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index eda293097fe1..c2b66679c0aa 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5027,16 +5027,16 @@ static void rebind_workers(struct worker_pool *pool)
}
/**
- * restore_unbound_workers_cpumask - restore cpumask of unbound workers
+ * update_unbound_workers_cpumask - update cpumask of unbound workers
* @pool: unbound pool of interest
- * @cpu: the CPU which is coming up
+ * @cpu: the CPU which is coming up or going down
*
* An unbound pool may end up with a cpumask which doesn't have any online
- * CPUs. When a worker of such pool get scheduled, the scheduler resets
- * its cpus_allowed. If @cpu is in @pool's cpumask which didn't have any
- * online CPU before, cpus_allowed of all its workers should be restored.
+ * CPUs. We have to reset workers' cpus_allowed of such pool. And we
+ * restore the workers' cpus_allowed when the pool's cpumask has online
+ * CPU for the first time after reset.
*/
-static void restore_unbound_workers_cpumask(struct worker_pool *pool, int cpu)
+static void update_unbound_workers_cpumask(struct worker_pool *pool, int cpu)
{
static cpumask_t cpumask;
struct worker *worker;
@@ -5050,13 +5050,19 @@ static void restore_unbound_workers_cpumask(struct worker_pool *pool, int cpu)
cpumask_and(&cpumask, pool->attrs->cpumask, wq_online_cpumask);
- /* is @cpu the first one onlined for the @pool? */
- if (cpumask_weight(&cpumask) > 1)
- return;
-
- /* as we're called from CPU_ONLINE, the following shouldn't fail */
- for_each_pool_worker(worker, pool)
- WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask) < 0);
+ switch (cpumask_weight(&cpumask)) {
+ case 0: /* @cpu is the last one going down for the @pool. */
+ for_each_pool_worker(worker, pool)
+ WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
+ break;
+ case 1: /* @cpu is the first one onlined for the @pool. */
+ /* as we're called from CPU_ONLINE, the following shouldn't fail */
+ for_each_pool_worker(worker, pool)
+ WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask) < 0);
+ break;
+ default: /* other cases, nothing to do */
+ break;
+ }
}
int workqueue_prepare_cpu(unsigned int cpu)
@@ -5087,7 +5093,7 @@ int workqueue_online_cpu(unsigned int cpu)
if (pool->cpu == cpu)
rebind_workers(pool);
else if (pool->cpu < 0)
- restore_unbound_workers_cpumask(pool, cpu);
+ update_unbound_workers_cpumask(pool, cpu);
mutex_unlock(&wq_pool_attach_mutex);
}
@@ -5102,7 +5108,9 @@ int workqueue_online_cpu(unsigned int cpu)
int workqueue_offline_cpu(unsigned int cpu)
{
+ struct worker_pool *pool;
struct workqueue_struct *wq;
+ int pi;
/* unbinding per-cpu workers should happen on the local CPU */
if (WARN_ON(cpu != smp_processor_id()))
@@ -5110,9 +5118,20 @@ int workqueue_offline_cpu(unsigned int cpu)
unbind_workers(cpu);
- /* update NUMA affinity of unbound workqueues */
mutex_lock(&wq_pool_mutex);
cpumask_clear_cpu(cpu, wq_online_cpumask);
+
+ /* update CPU affinity of workers of unbound pools */
+ for_each_pool(pool, pi) {
+ mutex_lock(&wq_pool_attach_mutex);
+
+ if (pool->cpu < 0)
+ update_unbound_workers_cpumask(pool, cpu);
+
+ mutex_unlock(&wq_pool_attach_mutex);
+ }
+
+ /* update NUMA affinity of unbound workqueues */
list_for_each_entry(wq, &workqueues, list)
wq_update_unbound_numa(wq, cpu);
mutex_unlock(&wq_pool_mutex);
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Just move code around; no functionality changed.
It prepares for a later patch protecting wq_online_cpumask
with wq_pool_attach_mutex.
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c2b66679c0aa..dc891b5c0868 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5087,12 +5087,17 @@ int workqueue_online_cpu(unsigned int cpu)
mutex_lock(&wq_pool_mutex);
cpumask_set_cpu(cpu, wq_online_cpumask);
+ for_each_cpu_worker_pool(pool, cpu) {
+ mutex_lock(&wq_pool_attach_mutex);
+ rebind_workers(pool);
+ mutex_unlock(&wq_pool_attach_mutex);
+ }
+
+ /* update CPU affinity of workers of unbound pools */
for_each_pool(pool, pi) {
mutex_lock(&wq_pool_attach_mutex);
- if (pool->cpu == cpu)
- rebind_workers(pool);
- else if (pool->cpu < 0)
+ if (pool->cpu < 0)
update_unbound_workers_cpumask(pool, cpu);
mutex_unlock(&wq_pool_attach_mutex);
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
When worker_attach_to_pool() is called, we should not set the
workers' cpumask to pool->attrs->cpumask when no CPU in it is online.
We have to use wq_online_cpumask in worker_attach_to_pool() to check
whether pool->attrs->cpumask is valid, rather than cpu_online_mask or
cpu_active_mask, due to gaps between stages in cpu hot[un]plug.
So for that late-spawned per-CPU kworker case: the outgoing CPU should have
already been cleared from wq_online_cpumask, so it gets its affinity reset
to the possible mask and the subsequent wakeup will ensure it's put on an
active CPU.
To use wq_online_cpumask in worker_attach_to_pool(), we need to protect
wq_online_cpumask with wq_pool_attach_mutex, so we modify
workqueue_online_cpu() and workqueue_offline_cpu() to enlarge the
wq_pool_attach_mutex protected region. We also put the update of
wq_online_cpumask and [re|un]bind_workers() in the same
wq_pool_attach_mutex protected region to make the update for per-cpu
workqueues atomic.
Cc: Qian Cai <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Donnefort <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Acked-by: Valentin Schneider <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 32 +++++++++++++++-----------------
1 file changed, 15 insertions(+), 17 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 65270729454c..eeb726598f80 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -310,7 +310,7 @@ static bool workqueue_freezing; /* PL: have wqs started freezing? */
/* PL: allowable cpus for unbound wqs and work items */
static cpumask_var_t wq_unbound_cpumask;
-/* PL: online cpus (cpu_online_mask with the going-down cpu cleared) */
+/* PL&A: online cpus (cpu_online_mask with the going-down cpu cleared) */
static cpumask_var_t wq_online_cpumask;
/* CPU where unbound work was last round robin scheduled from this CPU */
@@ -1848,11 +1848,11 @@ static void worker_attach_to_pool(struct worker *worker,
{
mutex_lock(&wq_pool_attach_mutex);
- /*
- * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
- * online CPUs. It'll be re-applied when any of the CPUs come up.
- */
- set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+ /* Is there any cpu in pool->attrs->cpumask online? */
+ if (cpumask_intersects(pool->attrs->cpumask, wq_online_cpumask))
+ WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask) < 0);
+ else
+ WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
/*
* The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
@@ -5081,13 +5081,12 @@ int workqueue_online_cpu(unsigned int cpu)
int pi;
mutex_lock(&wq_pool_mutex);
- cpumask_set_cpu(cpu, wq_online_cpumask);
- for_each_cpu_worker_pool(pool, cpu) {
- mutex_lock(&wq_pool_attach_mutex);
+ mutex_lock(&wq_pool_attach_mutex);
+ cpumask_set_cpu(cpu, wq_online_cpumask);
+ for_each_cpu_worker_pool(pool, cpu)
rebind_workers(pool);
- mutex_unlock(&wq_pool_attach_mutex);
- }
+ mutex_unlock(&wq_pool_attach_mutex);
/* update CPU affinity of workers of unbound pools */
for_each_pool(pool, pi) {
@@ -5117,14 +5116,13 @@ int workqueue_offline_cpu(unsigned int cpu)
if (WARN_ON(cpu != smp_processor_id()))
return -1;
- for_each_cpu_worker_pool(pool, cpu) {
- mutex_lock(&wq_pool_attach_mutex);
- unbind_workers(pool);
- mutex_unlock(&wq_pool_attach_mutex);
- }
-
mutex_lock(&wq_pool_mutex);
+
+ mutex_lock(&wq_pool_attach_mutex);
cpumask_clear_cpu(cpu, wq_online_cpumask);
+ for_each_cpu_worker_pool(pool, cpu)
+ unbind_workers(pool);
+ mutex_unlock(&wq_pool_attach_mutex);
/* update CPU affinity of workers of unbound pools */
for_each_pool(pool, pi) {
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Just move code around; no functionality changed.
Only the wq_pool_attach_mutex protected region becomes a little larger.
It prepares for a later patch protecting wq_online_cpumask
with wq_pool_attach_mutex.
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 90 +++++++++++++++++++++++-----------------------
1 file changed, 45 insertions(+), 45 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index dc891b5c0868..65270729454c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4904,61 +4904,57 @@ void wq_worker_comm(char *buf, size_t size, struct task_struct *task)
* cpu comes back online.
*/
-static void unbind_workers(int cpu)
+static void unbind_workers(struct worker_pool *pool)
{
- struct worker_pool *pool;
struct worker *worker;
- for_each_cpu_worker_pool(pool, cpu) {
- mutex_lock(&wq_pool_attach_mutex);
- raw_spin_lock_irq(&pool->lock);
+ lockdep_assert_held(&wq_pool_attach_mutex);
- /*
- * We've blocked all attach/detach operations. Make all workers
- * unbound and set DISASSOCIATED. Before this, all workers
- * except for the ones which are still executing works from
- * before the last CPU down must be on the cpu. After
- * this, they may become diasporas.
- */
- for_each_pool_worker(worker, pool)
- worker->flags |= WORKER_UNBOUND;
+ raw_spin_lock_irq(&pool->lock);
- pool->flags |= POOL_DISASSOCIATED;
+ /*
+ * We've blocked all attach/detach operations. Make all workers
+ * unbound and set DISASSOCIATED. Before this, all workers
+ * except for the ones which are still executing works from
+ * before the last CPU down must be on the cpu. After
+ * this, they may become diasporas.
+ */
+ for_each_pool_worker(worker, pool)
+ worker->flags |= WORKER_UNBOUND;
- raw_spin_unlock_irq(&pool->lock);
+ pool->flags |= POOL_DISASSOCIATED;
- for_each_pool_worker(worker, pool)
- WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
+ raw_spin_unlock_irq(&pool->lock);
- mutex_unlock(&wq_pool_attach_mutex);
+ for_each_pool_worker(worker, pool)
+ WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
- /*
- * Call schedule() so that we cross rq->lock and thus can
- * guarantee sched callbacks see the %WORKER_UNBOUND flag.
- * This is necessary as scheduler callbacks may be invoked
- * from other cpus.
- */
- schedule();
+ /*
+ * Call schedule() so that we cross rq->lock and thus can
+ * guarantee sched callbacks see the %WORKER_UNBOUND flag.
+ * This is necessary as scheduler callbacks may be invoked
+ * from other cpus.
+ */
+ schedule();
- /*
- * Sched callbacks are disabled now. Zap nr_running.
- * After this, nr_running stays zero and need_more_worker()
- * and keep_working() are always true as long as the
- * worklist is not empty. This pool now behaves as an
- * unbound (in terms of concurrency management) pool which
- * are served by workers tied to the pool.
- */
- atomic_set(&pool->nr_running, 0);
+ /*
+ * Sched callbacks are disabled now. Zap nr_running.
+ * After this, nr_running stays zero and need_more_worker()
+ * and keep_working() are always true as long as the
+ * worklist is not empty. This pool now behaves as an
+ * unbound (in terms of concurrency management) pool which
+ * are served by workers tied to the pool.
+ */
+ atomic_set(&pool->nr_running, 0);
- /*
- * With concurrency management just turned off, a busy
- * worker blocking could lead to lengthy stalls. Kick off
- * unbound chain execution of currently pending work items.
- */
- raw_spin_lock_irq(&pool->lock);
- wake_up_worker(pool);
- raw_spin_unlock_irq(&pool->lock);
- }
+ /*
+ * With concurrency management just turned off, a busy
+ * worker blocking could lead to lengthy stalls. Kick off
+ * unbound chain execution of currently pending work items.
+ */
+ raw_spin_lock_irq(&pool->lock);
+ wake_up_worker(pool);
+ raw_spin_unlock_irq(&pool->lock);
}
/**
@@ -5121,7 +5117,11 @@ int workqueue_offline_cpu(unsigned int cpu)
if (WARN_ON(cpu != smp_processor_id()))
return -1;
- unbind_workers(cpu);
+ for_each_cpu_worker_pool(pool, cpu) {
+ mutex_lock(&wq_pool_attach_mutex);
+ unbind_workers(pool);
+ mutex_unlock(&wq_pool_attach_mutex);
+ }
mutex_lock(&wq_pool_mutex);
cpumask_clear_cpu(cpu, wq_online_cpumask);
--
2.19.1.6.gb485710b
On 18/12/20 17:09, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> When worker_attach_to_pool() is called, we should not set the
> workers' cpumask to pool->attrs->cpumask when no CPU in it is online.
>
> We have to use wq_online_cpumask in worker_attach_to_pool() to check
> whether pool->attrs->cpumask is valid, rather than cpu_online_mask or
> cpu_active_mask, due to gaps between stages in cpu hot[un]plug.
>
> So for that late-spawned per-CPU kworker case: the outgoing CPU should have
> already been cleared from wq_online_cpumask, so it gets its affinity reset
> to the possible mask and the subsequent wakeup will ensure it's put on an
> active CPU.
>
> To use wq_online_cpumask in worker_attach_to_pool(), we need to protect
> wq_online_cpumask with wq_pool_attach_mutex, so we modify
> workqueue_online_cpu() and workqueue_offline_cpu() to enlarge the
> wq_pool_attach_mutex protected region. We also put the update of
> wq_online_cpumask and [re|un]bind_workers() in the same
> wq_pool_attach_mutex protected region to make the update for per-cpu
> workqueues atomic.
>
> Cc: Qian Cai <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Donnefort <[email protected]>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Acked-by: Valentin Schneider <[email protected]>
So an etiquette thing: I never actually gave an Acked-by. I did say it
looked good to me, and that probably should've been bundled with a
Reviewed-by, but it wasn't (I figured I'd wait for v2). Forging is bad,
m'kay.
When in doubt (e.g. someone says they're ok with your patch but don't give
any Ack/Reviewed-by), just ask via mail or on IRC.
For now, please make this a:
Reviewed-by: Valentin Schneider <[email protected]>
> [...]
On Sat, Dec 19, 2020 at 1:59 AM Valentin Schneider
<[email protected]> wrote:
>
>
> On 18/12/20 17:09, Lai Jiangshan wrote:
> > From: Lai Jiangshan <[email protected]>
> >
> > [...]
> > Acked-by: Valentin Schneider <[email protected]>
>
> So an etiquette thing: I never actually gave an Acked-by. I did say it
> looked good to me, and that probably should've been bundled with a
> Reviewed-by, but it wasn't (I figured I'd wait for v2). Forging is bad,
> m'kay.
>
> When in doubt (e.g. someone says they're ok with your patch but don't give
> any Ack/Reviewed-by), just ask via mail or on IRC.
Hello, Valentin
I'm sorry for not asking for your opinion. When I saw
"Seems alright to me." I felt hugely encouraged and rushed.
I was in doubt whether I should promote "Seems alright to me." to an "Ack".
Instead of asking, I wrongly did it right away. I knew I might
just be forging, and added a note in the cover letter:
> Add Valentin's ack for patch 10 because of "Seems alright to me." and
> add Valentin's comments, which are integral, to the changelog.
Anyway, it is my bad and I learnt.
>
> For now, please make this a:
>
> Reviewed-by: Valentin Schneider <[email protected]>
Hello Peter, could you help change it if there is no other
feedback that requires a V3 patchset to be made?
Thanks
Lai
> [...]
On Fri, Dec 18, 2020 at 8:11 AM Lai Jiangshan <[email protected]> wrote:
>
> From: Lai Jiangshan <[email protected]>
> [...]
Hi,
I tested this patchset on today's tip.git's master branch
(981316394e35 ("Merge branch 'locking/urgent'")).
Every time the kernel boots with 32 CPUs (I'm running the Linux VM on
Hyper-V), I get the below warning.
(BTW, with 8 or 16 CPUs, I don't see the warning).
By printing the cpumasks with "%*pbl", I know the warning happens because:
new_mask = 16-31
cpu_online_mask= 0-16
cpu_active_mask= 0-15
p->nr_cpus_allowed=16
2374 if (p->flags & PF_KTHREAD) {
2375 /*
2376 * For kernel threads that do indeed end up on online &&
2377 * !active we want to ensure they are strict
per-CPU threads.
2378 */
2379 WARN_ON(cpumask_intersects(new_mask, cpu_online_mask) &&
2380 !cpumask_intersects(new_mask, cpu_active_mask) &&
2381 p->nr_cpus_allowed != 1);
2382 }
2383
(FWIW, it looks like this patchset can fix a panic I noticed during
hibernation:
https://lkml.org/lkml/2020/12/22/141, though I see the same warning
during hibernation.)
[ 1.698042] smp: Bringing up secondary CPUs ...
[ 1.701707] x86: Booting SMP configuration:
[ 1.705368] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
#8 #9 #10 #11 #12 #13 #14 #15
[ 1.721589] .... node #1, CPUs: #16
[ 1.013388] smpboot: CPU 16 Converting physical 0 to logical die 1
[ 1.809716] ------------[ cut here ]------------
[ 1.813553] WARNING: CPU: 16 PID: 90 at kernel/sched/core.c:2381
__set_cpus_allowed_ptr+0x19e/0x1b0
[ 1.813553] Modules linked in:
[ 1.813553] CPU: 16 PID: 90 Comm: cpuhp/16 Not tainted 5.10.0+ #1
[ 1.813553] Hardware name: Microsoft Corporation Virtual
Machine/Virtual Machine, BIOS 090008 12/07/2018
[ 1.813553] RIP: 0010:__set_cpus_allowed_ptr+0x19e/0x1b0
[ 1.813553] Code: e8 e7 a3 39 00 85 c0 74 a7 ba 00 02 00 00 48 c7
c6 20 4b 9b 84 4c 89 ff e8 cf a3 39 00 85 c0 75 8f 83 bb a0 03 00 00
01 74 86 <0f> 0b eb 82 e8 49 ba 74 00 66 0f 1f 84 00 00 00 00 00 0f 1f
44 00
[ 1.813553] RSP: 0000:ffffba9bc1ca7cf8 EFLAGS: 00010016
[ 1.813553] RAX: 0000000000000000 RBX: ffff98ed48d58000 RCX: 0000000000000008
[ 1.813553] RDX: 0000000000000200 RSI: ffffffff849b4b20 RDI: ffff98ed48d035a8
[ 1.813553] RBP: ffff98ed42a2ac00 R08: 0000000000000008 R09: 0000000000000008
[ 1.813553] R10: ffff98ed48d035a8 R11: ffffffff8484da40 R12: 0000000000000000
[ 1.813553] R13: 0000000000000010 R14: ffffffff849b4ba0 R15: ffff98ed48d035a8
[ 1.813553] FS: 0000000000000000(0000) GS:ffff98ee3aa00000(0000)
knlGS:0000000000000000
[ 1.813553] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.813553] CR2: 0000000000000000 CR3: 000000019980a001 CR4: 00000000003706e0
[ 1.813553] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1.813553] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1.813553] Call Trace:
[ 1.813553] worker_attach_to_pool+0x53/0xd0
[ 1.813553] create_worker+0xf9/0x190
[ 1.813553] alloc_unbound_pwq+0x3a5/0x3b0
[ 1.813553] wq_update_unbound_numa+0x112/0x1c0
[ 1.813553] workqueue_online_cpu+0x1d0/0x220
[ 1.813553] ? workqueue_prepare_cpu+0x70/0x70
[ 1.813553] cpuhp_invoke_callback+0x82/0x4a0
[ 1.813553] ? sort_range+0x20/0x20
[ 1.813553] cpuhp_thread_fun+0xb8/0x120
[ 1.813553] smpboot_thread_fn+0x198/0x230
[ 1.813553] kthread+0x13d/0x160
[ 1.813553] ? kthread_create_on_node+0x60/0x60
[ 1.813553] ret_from_fork+0x22/0x30
[ 1.813553] ---[ end trace bc73d8bab71235fe ]---
[ 1.817553] #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31
[ 1.826499] smp: Brought up 2 nodes, 32 CPUs
[ 1.833345] smpboot: Max logical packages: 2
[ 1.833574] smpboot: Total of 32 processors activated (146959.07 BogoMIPS)
Thanks,
Dexuan
On Wed, Dec 23, 2020 at 5:39 AM Dexuan-Linux Cui <[email protected]> wrote:
>
> On Fri, Dec 18, 2020 at 8:11 AM Lai Jiangshan <[email protected]> wrote:
> >
> > From: Lai Jiangshan <[email protected]>
> > [...]
>
> Hi,
Hello,
thanks for reporting.
I have only been debugging it for a short time; I will continue tomorrow.
> I tested this patchset on today's tip.git's master branch
> (981316394e35 ("Merge branch 'locking/urgent'")).
>
> Every time the kernel boots with 32 CPUs (I'm running the Linux VM on
> Hyper-V), I get the below warning.
> (BTW, with 8 or 16 CPUs, I don't see the warning).
> By printing the cpumasks with "%*pbl", I know the warning happens because:
> new_mask = 16-31
> cpu_online_mask= 0-16
> cpu_active_mask= 0-15
> p->nr_cpus_allowed=16
From the call stack, we can see that we are bringing cpu#16 up:
workqueue_online_cpu() is being called and sched_cpu_activate()
has not been called yet, so cpu_online_mask = 0-16 and
cpu_active_mask = 0-15.
Why isn't it legitimate to set the worker's cpumask to
new_mask (16-31) when cpu#16 is being brought up?
Anyway, it reveals that there must be a problem in the patchset
which raised the warning.
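To see why the check fires with these values, here is a minimal
user-space sketch (illustrative only; plain bitmasks instead of the
kernel's cpumask API) that evaluates the same condition Dexuan quoted:

/* Illustrative sketch, not kernel code: evaluate the WARN_ON()
 * condition from __set_cpus_allowed_ptr() with the reported masks,
 * using plain bitmasks in place of struct cpumask.
 */
#include <stdio.h>

int main(void)
{
	unsigned long new_mask    = 0xffff0000UL;  /* CPUs 16-31 */
	unsigned long online_mask = 0x0001ffffUL;  /* CPUs 0-16  */
	unsigned long active_mask = 0x0000ffffUL;  /* CPUs 0-15  */
	int nr_cpus_allowed = 16;

	int warn = (new_mask & online_mask) != 0 &&  /* intersects online (CPU 16) */
		   (new_mask & active_mask) == 0 &&  /* but not active */
		   nr_cpus_allowed != 1;

	printf("WARN_ON fires: %d\n", warn);  /* prints 1 */
	return 0;
}

With CPU 16 online but not yet active, all three sub-conditions are
true, which matches the warning in the trace above.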
> [...]
On Wed, Dec 23, 2020 at 5:39 AM Dexuan-Linux Cui <[email protected]> wrote:
>
> On Fri, Dec 18, 2020 at 8:11 AM Lai Jiangshan <[email protected]> wrote:
> >
> > From: Lai Jiangshan <[email protected]>
> > [...]
>
> Hi,
> I tested this patchset on today's tip.git's master branch
> (981316394e35 ("Merge branch 'locking/urgent'")).
>
> Every time the kernel boots with 32 CPUs (I'm running the Linux VM on
> Hyper-V), I get the below warning.
> (BTW, with 8 or 16 CPUs, I don't see the warning).
> By printing the cpumasks with "%*pbl", I know the warning happens because:
> new_mask = 16-31
> cpu_online_mask= 0-16
> cpu_active_mask= 0-15
> p->nr_cpus_allowed=16
>
> [...]
>
Hello, Dexuan
Could you omit patch4 of the patchset and test it again, please?
("workqueue: don't set the worker's cpumask when kthread_bind_mask()")
kthread_bind_mask() sets the worker task's cpumask to the pool's cpumask
without any check. And set_cpus_allowed_ptr() finds that the task's
cpumask is unchanged (already set by kthread_bind_mask()) and skips all
the checks.
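For reference, a simplified sketch of the call chain I have in mind
(not exact kernel code; the early-return behavior is as described
above):

/*
 * Simplified sketch, not exact kernel code.
 * Without patch 4 (i.e. the original kthread_bind_mask() call):
 *
 *   create_worker(pool)
 *       kthread_bind_mask(worker->task, pool->attrs->cpumask);
 *           // cpumask written directly, no online/active check here
 *       worker_attach_to_pool(worker, pool)
 *           set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
 *               // new mask == current mask, so it returns early and
 *               // the online/active sanity checks never run
 *
 * With patch 4, kthread_bind_mask() uses cpu_possible_mask, so the
 * set_cpus_allowed_ptr() above really goes through the checks, which
 * is presumably where the reported warning fires.
 */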
And I found that numa=fake=2U seems to be broken with respect to
cpumask_of_node() on my box.
Thanks,
Lai
On Sat, Dec 19, 2020 at 01:09:09AM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> [...]
And rcutorture hits this, so thank you for the fix!
Tested-by: Paul E. McKenney <[email protected]>
>
> From: Lai Jiangshan <[email protected]>
> Sent: Wednesday, December 23, 2020 7:02 AM
> >
> > Hi,
> > I tested this patchset on today's tip.git's master branch
> > (981316394e35 ("Merge branch 'locking/urgent'")).
> >
> > Every time the kernel boots with 32 CPUs (I'm running the Linux VM on
> > Hyper-V), I get the below warning.
> > (BTW, with 8 or 16 CPUs, I don't see the warning).
> > By printing the cpumasks with "%*pbl", I know the warning happens
> > because:
> > new_mask = 16-31
> > cpu_online_mask= 0-16
> > cpu_active_mask= 0-15
> > p->nr_cpus_allowed=16
> >
>
> Hello, Dexuan
>
> Could you omit patch 4 of the patchset
> ("workqueue: don't set the worker's cpumask when kthread_bind_mask()")
> and test again, please?
>
> kthread_bind_mask() sets the worker's cpumask to the pool's cpumask without
> any check. A later set_cpus_allowed_ptr() then finds the task's cpumask
> unchanged (it was already set by kthread_bind_mask()) and skips all the checks.
>
> I also found that numa=fake=2U seems to break cpumask_of_node() on my box.
>
> Thanks,
> Lai
Looks like your analysis is correct: the warning does not reproduce if I
configure all 32 vCPUs into one virtual NUMA node (and I don't see the
message "smpboot: CPU 16 Converting physical 0 to logical die 1"):
[ 1.495440] smp: Bringing up secondary CPUs ...
[ 1.499207] x86: Booting SMP configuration:
[ 1.503038] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
#8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26
#27 #28 #29 #30 #31
[ 1.531930] smp: Brought up 1 node, 32 CPUs
[ 1.538779] smpboot: Max logical packages: 1
[ 1.539041] smpboot: Total of 32 processors activated (146859.90 BogoMIPS)
The warning only reproduces with more than one node, and it prints only once
for the first vCPU of the second node (i.e. node #1).

With more than one node, the warning does not reproduce if I drop patch 4.

Thanks,
-- Dexuan
> From: Dexuan Cui
> Sent: Wednesday, December 23, 2020 12:27 PM
> ...
> The warning only reproduces with more than one node, and it prints only once
> for the first vCPU of the second node (i.e. node #1).

A correction: if I configure the 32 vCPUs evenly into 4 nodes, I get the
warning once for each of nodes #1, #2 and #3.
Thanks,
-- Dexuan
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2376,9 +2376,14 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
* For kernel threads that do indeed end up on online &&
* !active we want to ensure they are strict per-CPU threads.
*/
- WARN_ON(cpumask_intersects(new_mask, cpu_online_mask) &&
+ WARN(cpumask_intersects(new_mask, cpu_online_mask) &&
!cpumask_intersects(new_mask, cpu_active_mask) &&
- p->nr_cpus_allowed != 1);
+ p->nr_cpus_allowed != 1, "%*pbl, %*pbl, %*pbl, %d\n",
+ cpumask_pr_args(new_mask),
+ cpumask_pr_args(cpu_online_mask),
+ cpumask_pr_args(cpu_active_mask),
+ p->nr_cpus_allowed
+ );
}
[ 1.791611] smp: Bringing up secondary CPUs ...
[ 1.795225] x86: Booting SMP configuration:
[ 1.798964] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
[ 1.807068] .... node #1, CPUs: #8
[ 1.094226] smpboot: CPU 8 Converting physical 0 to logical die 1
[ 1.895211] ------------[ cut here ]------------
[ 1.899058] 8-15, 0-8, 0-7, 8
[ 1.899058] WARNING: CPU: 8 PID: 50 at kernel/sched/core.c:2386 __set_cpus_allowed_ptr+0x1c7/0x1e0
[ 1.899058] CPU: 8 PID: 50 Comm: cpuhp/8 Not tainted 5.10.0+ #4
[ 1.899058] RIP: 0010:__set_cpus_allowed_ptr+0x1c7/0x1e0
[ 1.899058] Call Trace:
[ 1.899058] worker_attach_to_pool+0x53/0xd0
[ 1.899058] create_worker+0xf9/0x190
[ 1.899058] alloc_unbound_pwq+0x3a5/0x3b0
[ 1.899058] wq_update_unbound_numa+0x112/0x1c0
[ 1.899058] workqueue_online_cpu+0x1d0/0x220
[ 1.899058] cpuhp_invoke_callback+0x82/0x4a0
[ 1.899058] cpuhp_thread_fun+0xb8/0x120
[ 1.899058] smpboot_thread_fn+0x198/0x230
[ 1.899058] kthread+0x13d/0x160
[ 1.899058] ret_from_fork+0x22/0x30
[ 1.903058] #9 #10 #11 #12 #13 #14 #15
[ 1.907092] .... node #2, CPUs: #16
[ 1.094226] smpboot: CPU 16 Converting physical 0 to logical die 2
[ 1.995205] ------------[ cut here ]------------
[ 1.999058] 16-23, 0-16, 0-15, 8
[ 1.999058] WARNING: CPU: 16 PID: 91 at kernel/sched/core.c:2386 __set_cpus_allowed_ptr+0x1c7/0x1e0
[ 1.999058] CPU: 16 PID: 91 Comm: cpuhp/16 Tainted: G W 5.10.0+ #4
[ 1.999058] RIP: 0010:__set_cpus_allowed_ptr+0x1c7/0x1e0
[ 1.999058] Call Trace:
[ 1.999058] worker_attach_to_pool+0x53/0xd0
[ 1.999058] create_worker+0xf9/0x190
[ 1.999058] alloc_unbound_pwq+0x3a5/0x3b0
[ 1.999058] wq_update_unbound_numa+0x112/0x1c0
[ 1.999058] workqueue_online_cpu+0x1d0/0x220
[ 1.999058] cpuhp_invoke_callback+0x82/0x4a0
[ 1.999058] cpuhp_thread_fun+0xb8/0x120
[ 1.999058] smpboot_thread_fn+0x198/0x230
[ 1.999058] kthread+0x13d/0x160
[ 1.999058] ret_from_fork+0x22/0x30
[ 2.003058] #17 #18 #19 #20 #21 #22 #23
[ 2.007092] .... node #3, CPUs: #24
[ 1.094226] smpboot: CPU 24 Converting physical 0 to logical die 3
[ 2.095220] ------------[ cut here ]------------
[ 2.099058] 24-31, 0-24, 0-23, 8
[ 2.099058] WARNING: CPU: 24 PID: 132 at kernel/sched/core.c:2386 __set_cpus_allowed_ptr+0x1c7/0x1e0
[ 2.099058] CPU: 24 PID: 132 Comm: cpuhp/24 Tainted: G W 5.10.0+ #4
[ 2.099058] Call Trace:
[ 2.099058] worker_attach_to_pool+0x53/0xd0
[ 2.099058] create_worker+0xf9/0x190
[ 2.099058] alloc_unbound_pwq+0x3a5/0x3b0
[ 2.099058] wq_update_unbound_numa+0x112/0x1c0
[ 2.099058] workqueue_online_cpu+0x1d0/0x220
[ 2.099058] cpuhp_invoke_callback+0x82/0x4a0
[ 2.099058] cpuhp_thread_fun+0xb8/0x120
[ 2.099058] smpboot_thread_fn+0x198/0x230
[ 2.099058] kthread+0x13d/0x160
[ 2.099058] ret_from_fork+0x22/0x30
[ 2.103058] #25 #26 #27 #28 #29 #30 #31
[ 2.108091] smp: Brought up 4 nodes, 32 CPUs
[ 2.115065] smpboot: Max logical packages: 4
[ 2.119067] smpboot: Total of 32 processors activated (146992.31 BogoMIPS)
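The dependence on the node count can be sanity-checked outside the kernel.
The sketch below is only a model (it assumes 32 CPUs split evenly across the
nodes, 64-bit masks, and that each node's unbound pool is affine to exactly
that node's CPUs); it re-evaluates the warning condition at the moment each
CPU comes online and reproduces the pattern above: no hit with one node, and
one hit at the first CPU of every node after node #0.

#include <stdint.h>
#include <stdio.h>

/* bitmask with bits first..last set (assumes 0 <= first <= last <= 63) */
static uint64_t cpu_range(int first, int last)
{
	uint64_t upto_last = (last >= 63) ? ~0ULL : ((1ULL << (last + 1)) - 1);

	return upto_last & ~((1ULL << first) - 1);
}

int main(void)
{
	const int nr_cpus = 32;

	for (int nodes = 1; nodes <= 4; nodes *= 2) {
		int per_node = nr_cpus / nodes;

		printf("%d node(s):\n", nodes);
		for (int cpu = 1; cpu < nr_cpus; cpu++) {
			/* masks at the moment `cpu` is being brought online */
			uint64_t online = cpu_range(0, cpu);
			uint64_t active = cpu_range(0, cpu - 1);

			/* cpumask of the unbound pool for this CPU's node */
			int node = cpu / per_node;
			uint64_t new_mask = cpu_range(node * per_node,
						      (node + 1) * per_node - 1);

			/* per_node stands in for the worker's nr_cpus_allowed */
			if ((new_mask & online) && !(new_mask & active) &&
			    per_node != 1)
				printf("  WARN at CPU %d (node #%d)\n",
				       cpu, node);
		}
	}
	return 0;
}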
Greetings,
FYI, we noticed the following commit (built with gcc-9):
commit: 6094661b1679bd1d59eb620371dfbf327e9feca7 ("[PATCH -tip V2 04/10] workqueue: don't set the worker's cpumask when kthread_bind_mask()")
url: https://github.com/0day-ci/linux/commits/Lai-Jiangshan/workqueue-break-affinity-initiatively/20201219-001548
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git a409ed156a90093a03fe6a93721ddf4c591eac87
in testcase: will-it-scale
version: will-it-scale-x86_64-b695a1b-1_20201217
with the following parameters:
nr_task: 50%
mode: process
test: context_switch1
cpufreq_governor: performance
ucode: 0x2006a08
test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both process-based and thread-based tests in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale
on test machine: 104 threads Skylake with 192G memory
caused the changes below (please refer to the attached dmesg/kmsg for the entire log/backtrace):

If you fix the issue, kindly add the following tag:
Reported-by: kernel test robot <[email protected]>
kern :warn : [ 2.900184] WARNING: CPU: 26 PID: 140 at kernel/sched/core.c:2379 __set_cpus_allowed_ptr+0x1c4/0x1e0
kern :warn : [ 2.900184] Modules linked in:
kern :warn : [ 2.900184] CPU: 26 PID: 140 Comm: cpuhp/26 Not tainted 5.10.0-11829-g6094661b1679 #1
kern :warn : [ 2.900184] RIP: 0010:__set_cpus_allowed_ptr+0x1c4/0x1e0
kern :warn : [ 2.900184] Code: 74 99 8b 15 d2 d6 ad 01 48 c7 c6 a0 30 bf 82 4c 89 f7 e8 8f bd 48 00 85 c0 75 80 41 83 bc 24 c0 03 00 00 01 0f 84 71 ff ff ff <0f> 0b e9 6a ff ff ff e8 b0 5c ac 00 66 66 2e 0f 1f 84 00 00 00 00
kern :warn : [ 2.900184] RSP: 0000:ffffc9000cebfd00 EFLAGS: 00010006
kern :warn : [ 2.900184] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000ffffff98
kern :warn : [ 2.900184] RDX: 00000000ffffff98 RSI: 0000000000000000 RDI: ffff889846721ed0
kern :warn : [ 2.900184] RBP: ffffc9000cebfd40 R08: 0000000000000001 R09: 0000000000000002
kern :warn : [ 2.900184] R10: ffff889846721ed0 R11: ffffc9010cebfcd7 R12: ffff889846730000
kern :warn : [ 2.900184] R13: ffff8897e042b040 R14: ffff889846721ed0 R15: 000000000000001a
kern :warn : [ 2.900184] FS: 0000000000000000(0000) GS:ffff88afbc600000(0000) knlGS:0000000000000000
kern :warn : [ 2.900184] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kern :warn : [ 2.900184] CR2: 0000000000000000 CR3: 000000303e20a001 CR4: 00000000007706e0
kern :warn : [ 2.900184] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kern :warn : [ 2.900184] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kern :warn : [ 2.900184] PKRU: 55555554
kern :warn : [ 2.900184] Call Trace:
kern :warn : [ 2.900184] worker_attach_to_pool+0x33/0xa0
kern :warn : [ 2.900184] create_worker+0xff/0x1a0
kern :warn : [ 2.900184] alloc_unbound_pwq+0x444/0x460
kern :warn : [ 2.900184] wq_update_unbound_numa+0x1a3/0x1e0
kern :warn : [ 2.900184] workqueue_online_cpu+0x1eb/0x240
kern :warn : [ 2.900184] ? workqueue_prepare_cpu+0x80/0x80
kern :warn : [ 2.900184] cpuhp_invoke_callback+0x82/0x440
kern :warn : [ 2.900184] ? smpboot_thread_fn+0x26/0x1e0
kern :warn : [ 2.900184] cpuhp_thread_fun+0xa4/0x100
kern :warn : [ 2.900184] smpboot_thread_fn+0x10b/0x1e0
kern :warn : [ 2.900184] ? sort_range+0x20/0x20
kern :warn : [ 2.900184] kthread+0x116/0x160
kern :warn : [ 2.900184] ? kthread_park+0xa0/0xa0
kern :warn : [ 2.900184] ret_from_fork+0x22/0x30
kern :warn : [ 2.900184] ---[ end trace ee287c8df3029920 ]---
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml
Thanks,
Oliver Sang
On Sat, Dec 26, 2020 at 06:34:21PM +0800, Hillf Danton wrote:
> On Wed, 23 Dec 2020 11:49:51 -0800 "Paul E. McKenney" wrote:
> >On Sat, Dec 19, 2020 at 01:09:09AM +0800, Lai Jiangshan wrote:
> >> From: Lai Jiangshan <[email protected]>
> >>
> >> [...]
> >
> >And rcutorture hits this, so thank you for the fix!
>
> Can you please say a bit more about what you encountered in rcutorture
> before this patchset? You know we can't properly estimate the scope of
> the fix without your help.
It triggers the following in sched_cpu_dying() in kernel/sched/core.c,
exactly the same as for Lai Jiangshan:
BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq))
Which is in fact the "this" in my earlier "rcutorture hits this". ;-)
Thanx, Paul
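For reference, here is a minimal sketch of the invariant behind that
BUG_ON(), assuming a simplified runqueue model (a plain counter stands in
for rq_has_pinned_tasks(); this is not the scheduler's real data structure):

#include <stdio.h>

/* toy runqueue: by this point the dying CPU should be fully vacated */
struct fake_rq {
	int nr_running;		/* runnable tasks still on the dead CPU */
	int nr_pinned;		/* tasks that cannot be migrated away   */
};

/* mirrors: BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq)) */
static int dying_cpu_not_vacated(const struct fake_rq *rq)
{
	return rq->nr_running != 1 || rq->nr_pinned != 0;
}

int main(void)
{
	/* only the task tearing the CPU down remains: OK */
	struct fake_rq vacated = { .nr_running = 1, .nr_pinned = 0 };
	/* a kworker whose affinity was never broken is still runnable: BUG */
	struct fake_rq stuck_kworker = { .nr_running = 2, .nr_pinned = 0 };

	printf("vacated rq trips the check:     %d\n",
	       dying_cpu_not_vacated(&vacated));
	printf("rq with stuck kworker trips it: %d\n",
	       dying_cpu_not_vacated(&stuck_kworker));
	return 0;
}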
> >Tested-by: Paul E. McKenney <[email protected]>
> >
On Sat, Dec 26, 2020 at 10:52 PM Paul E. McKenney <[email protected]> wrote:
> >
> > Can you please say a bit more about what you encountered in rcutorture
> > before this patchset? You know we can't properly estimate the scope of
> > the fix without your help.
>
> It triggers the following in sched_cpu_dying() in kernel/sched/core.c,
> exactly the same as for Lai Jiangshan:
>
> BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq))
>
> Which is in fact the "this" in my earlier "rcutorture hits this". ;-)
>
> Thanx, Paul
>
Hi, Hillf

https://lkml.org/lkml/2020/12/22/141

From that email, I think rcutorture hit the same problem.

Hi, Paul

I'm sorry I forgot to add your Tested-by.

Thanks,
Lai
On Sun, Dec 27, 2020 at 10:08:51PM +0800, Lai Jiangshan wrote:
> [...]
> Hi, Paul
>
> I'm sorry I forgot to add your Tested-by.
No need to apologize, especially given that I didn't get around to
testing it until after it was pulled into -tip. ;-)
Thank you for the patch series!
Thanx, Paul