From: Lai Jiangshan <[email protected]>
The commit 6d25be5782e4 ("sched/core, workqueues: Distangle worker
accounting from rq lock") changed the schedule callbacks for workqueue
and simplified the connection between the scheduler and workqueue. But
it left some code useless and some comments outdated in workqueue. This
patchset cleans them up.
Patch 1 is unrelated to 6d25be5782e4; it follows a recent change that
made wq_worker_sleeping() no longer be called with preemption disabled.
Patch 2 is a cleanup made possible because 6d25be5782e4 no longer calls
the schedule callbacks deep in the sleeping path in the local-wake-up
fashion.
Patch 3 is unrelated to 6d25be5782e4, but loosely prepares for patch 4.
Patches 4-6 are cleanups made possible because 6d25be5782e4 no longer
calls the schedule callbacks from the wakeup code, so the
cacheline_aligned annotation for nr_running and the schedule() call in
unbind_workers() are unneeded.
6d25be5782e4 also changed wq_worker_sleeping() to use the pool lock, and
patch 7 changes it back to rely on disabled preemption. That patch is
marked RFC because using the pool lock in the slow (sleeping) path is
also fine with me and would pave the road to removing the "X:"
protection.
Several further cleanups depend on whether patch 7 is accepted. For
example, the mb() in insert_work() can be removed if the pool lock wins.
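For reference, here is a rough sketch of the pairing that mb() currently
serves (heavily simplified, not the exact upstream code; surrounding
checks and comments are elided):

/* queueing side: insert_work(), runs with pool->lock held */
list_add_tail(&work->entry, head);
smp_mb();       /* pairs with the barrier implied by atomic_dec_and_test()
                 * on the sleeping side */
if (!atomic_read(&pool->nr_running))    /* __need_more_worker() */
        wake_up_worker(pool);

/* sleeping side: wq_worker_sleeping() */
if (atomic_dec_and_test(&pool->nr_running) &&   /* full barrier */
    !list_empty(&pool->worklist)) {
        struct worker *next = first_idle_worker(pool);

        if (next)
                wake_up_process(next->task);
}

If the sleeping side keeps taking pool->lock (i.e. patch 7 is rejected),
both sides are serialized by the lock and the explicit barrier pairing
becomes redundant; with the preemption-only scheme of patch 7 it is
still required.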
Lai Jiangshan (7):
workqueue: Remove the outdated comment before wq_worker_sleeping()
workqueue: Remove the advanced kicking of the idle workers in
rebind_workers()
workqueue: Remove outdated comment about exceptional workers in
unbind_workers()
workqueue: Remove schedule() in unbind_workers()
workqueue: Move the code of waking a worker up in unbind_workers()
workqueue: Remove the cacheline_aligned for nr_running
workqueue: Replace pool lock with preemption disabling in
wq_worker_sleeping()
kernel/workqueue.c | 102 +++++++++++++++++----------------------------
1 file changed, 38 insertions(+), 64 deletions(-)
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
wq_worker_sleeping() is no longer called with preemption disabled, so
the comment about preemption protecting the ->sleeping assignment is
outdated.
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5557d19ea81c..312b806b3956 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -887,8 +887,7 @@ void wq_worker_running(struct task_struct *task)
* @task: task going to sleep
*
* This function is called from schedule() when a busy worker is
- * going to sleep. Preemption needs to be disabled to protect ->sleeping
- * assignment.
+ * going to sleep.
*/
void wq_worker_sleeping(struct task_struct *task)
{
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
The commit 6d25be5782e4 ("sched/core, workqueues: Distangle worker
accounting from rq lock") changed the schedule callbacks for workqueue
and removed the local-wake-up functionality.
Workers are now woken up in the normal fashion, and workers in a
concurrency-managed pool that have not yet migrated to the associated
CPU can also be woken up by workers already bound to that CPU.
So this advanced kicking of the idle workers just to migrate them to the
associated CPU is unneeded now.
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 11 -----------
1 file changed, 11 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 312b806b3956..ba8cf5f5e7fa 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5077,17 +5077,6 @@ static void rebind_workers(struct worker_pool *pool)
for_each_pool_worker(worker, pool) {
unsigned int worker_flags = worker->flags;
- /*
- * A bound idle worker should actually be on the runqueue
- * of the associated CPU for local wake-ups targeting it to
- * work. Kick all idle workers so that they migrate to the
- * associated CPU. Doing this in the same loop as
- * replacing UNBOUND with REBOUND is safe as no worker will
- * be bound before @pool->lock is released.
- */
- if (worker_flags & WORKER_IDLE)
- wake_up_process(worker->task);
-
/*
* We want to clear UNBOUND but can't directly call
* worker_clr_flags() or adjust nr_running. Atomically
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Long ago, workers were not ALL bound after CPU_ONLINE; they could still
be running on other CPUs before rebinding themselves.
But the commit a9ab775bcadf ("workqueue: directly restore CPU affinity
of workers from CPU_ONLINE") made rebind_workers() bind them all.
So all workers are on the CPU before the CPU is brought down, and the
comment in unbind_workers() about the workers "which are still executing
works from before the last CPU down" is outdated.
Just remove it.
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ba8cf5f5e7fa..ad663853bb78 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4998,9 +4998,7 @@ static void unbind_workers(int cpu)
/*
* We've blocked all attach/detach operations. Make all workers
* unbound and set DISASSOCIATED. Before this, all workers
- * except for the ones which are still executing works from
- * before the last CPU down must be on the cpu. After
- * this, they may become diasporas.
+ * must be on the cpu. After this, they may become diasporas.
*/
for_each_pool_worker(worker, pool)
worker->flags |= WORKER_UNBOUND;
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
The commit 6d25be5782e4 ("sched/core, workqueues: Distangle worker
accounting from rq lock") changed the schedule callbacks for workqueue
and moved the schedule callback from the wakeup code to the end of
schedule(), in the worker's own process context.
This means the callback wq_worker_running() is guaranteed to see the
%WORKER_UNBOUND flag after being scheduled back, since unbind_workers()
runs on the same CPU that all the pool's workers are bound to, and the
schedule() call in unbind_workers() is therefore no longer needed.
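To make the ordering argument explicit, here is a simplified sketch
(assuming, as the patch does, that unbind_workers() and all of the
pool's workers run on the same CPU; this is not verbatim kernel code):

/* CPU X: the CPU that all of the pool's workers are bound to */

/* unbind_workers(), running on CPU X with pool->lock held */
for_each_pool_worker(worker, pool)
        worker->flags |= WORKER_UNBOUND;                /* (1) */

/* a worker, also on CPU X, later goes through schedule() and its sched
 * callback runs in the worker's own context on the same CPU:
 */
if (worker->flags & WORKER_NOT_RUNNING)                 /* (2) */
        return;         /* leave nr_running alone */

Because (2) executes after (1) on the same CPU, it observes
%WORKER_UNBOUND by program order alone; no rq->lock crossing, and hence
no schedule() call in unbind_workers(), is needed to publish the flag.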
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ad663853bb78..595f9bd5ad29 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4999,6 +4999,9 @@ static void unbind_workers(int cpu)
* We've blocked all attach/detach operations. Make all workers
* unbound and set DISASSOCIATED. Before this, all workers
* must be on the cpu. After this, they may become diasporas.
+ * And the preemption disabled sections in their sched callbacks
+ * are guaranteed to see WORKER_UNBOUND since the code here
+ * is on the same cpu.
*/
for_each_pool_worker(worker, pool)
worker->flags |= WORKER_UNBOUND;
@@ -5014,14 +5017,6 @@ static void unbind_workers(int cpu)
mutex_unlock(&wq_pool_attach_mutex);
- /*
- * Call schedule() so that we cross rq->lock and thus can
- * guarantee sched callbacks see the %WORKER_UNBOUND flag.
- * This is necessary as scheduler callbacks may be invoked
- * from other cpus.
- */
- schedule();
-
/*
* Sched callbacks are disabled now. Zap nr_running.
* After this, nr_running stays zero and need_more_worker()
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
In unbind_workers(), there are two pool->lock held sections separated by
the code that zaps nr_running. wake_up_worker() needs to run inside a
pool->lock held section and after nr_running is zapped, and zapping
nr_running had to happen after schedule() while the local wake-up
functionality was in use. Now that the call to schedule() has been
removed along with the local wake-up functionality, the code can be
merged into a single pool->lock held section.
The diffstat makes it look as if unrelated code were moved down, because
the diff tool cannot express merging two lock sections by swapping two
code blocks.
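Since the diff is hard to read for that reason, here is a rough sketch
of the resulting function (comments elided; the diff below is the
authoritative change):

static void unbind_workers(int cpu)
{
        struct worker_pool *pool;
        struct worker *worker;

        for_each_cpu_worker_pool(pool, cpu) {
                mutex_lock(&wq_pool_attach_mutex);
                raw_spin_lock_irq(&pool->lock);

                for_each_pool_worker(worker, pool)
                        worker->flags |= WORKER_UNBOUND;

                pool->flags |= POOL_DISASSOCIATED;

                /* zap nr_running and kick pending work while still
                 * holding pool->lock */
                atomic_set(&pool->nr_running, 0);
                wake_up_worker(pool);

                raw_spin_unlock_irq(&pool->lock);

                for_each_pool_worker(worker, pool) {
                        kthread_set_per_cpu(worker->task, -1);
                        WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
                                                          cpu_possible_mask) < 0);
                }

                mutex_unlock(&wq_pool_attach_mutex);
        }
}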
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 38 +++++++++++++++-----------------------
1 file changed, 15 insertions(+), 23 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 595f9bd5ad29..256f552e9513 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1830,14 +1830,8 @@ static void worker_enter_idle(struct worker *worker)
if (too_many_workers(pool) && !timer_pending(&pool->idle_timer))
mod_timer(&pool->idle_timer, jiffies + IDLE_WORKER_TIMEOUT);
- /*
- * Sanity check nr_running. Because unbind_workers() releases
- * pool->lock between setting %WORKER_UNBOUND and zapping
- * nr_running, the warning may trigger spuriously. Check iff
- * unbind is not in progress.
- */
- WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
- pool->nr_workers == pool->nr_idle &&
+ /* Sanity check nr_running. */
+ WARN_ON_ONCE(pool->nr_workers == pool->nr_idle &&
atomic_read(&pool->nr_running));
}
@@ -5008,21 +5002,12 @@ static void unbind_workers(int cpu)
pool->flags |= POOL_DISASSOCIATED;
- raw_spin_unlock_irq(&pool->lock);
-
- for_each_pool_worker(worker, pool) {
- kthread_set_per_cpu(worker->task, -1);
- WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
- }
-
- mutex_unlock(&wq_pool_attach_mutex);
-
/*
- * Sched callbacks are disabled now. Zap nr_running.
- * After this, nr_running stays zero and need_more_worker()
- * and keep_working() are always true as long as the
- * worklist is not empty. This pool now behaves as an
- * unbound (in terms of concurrency management) pool which
+ * The handling of nr_running in sched callbacks is disabled
+ * now. Zap nr_running. After this, nr_running stays zero and
+ * need_more_worker() and keep_working() are always true as
+ * long as the worklist is not empty. This pool now behaves as
+ * an unbound (in terms of concurrency management) pool which
* are served by workers tied to the pool.
*/
atomic_set(&pool->nr_running, 0);
@@ -5032,9 +5017,16 @@ static void unbind_workers(int cpu)
* worker blocking could lead to lengthy stalls. Kick off
* unbound chain execution of currently pending work items.
*/
- raw_spin_lock_irq(&pool->lock);
wake_up_worker(pool);
+
raw_spin_unlock_irq(&pool->lock);
+
+ for_each_pool_worker(worker, pool) {
+ kthread_set_per_cpu(worker->task, -1);
+ WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
+ }
+
+ mutex_unlock(&wq_pool_attach_mutex);
}
}
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
nr_running is never modified remotely now that the schedule callback in
the wakeup path has been removed.
Rather, nr_running is often accessed together with other fields of the
pool, so the cacheline_aligned annotation for nr_running isn't needed.
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 256f552e9513..33f1106b4f99 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -154,6 +154,9 @@ struct worker_pool {
unsigned long watchdog_ts; /* L: watchdog timestamp */
+ /* The current concurrency level. */
+ atomic_t nr_running;
+
struct list_head worklist; /* L: list of pending works */
int nr_workers; /* L: total number of workers */
@@ -177,19 +180,12 @@ struct worker_pool {
struct hlist_node hash_node; /* PL: unbound_pool_hash node */
int refcnt; /* PL: refcnt for unbound pools */
- /*
- * The current concurrency level. As it's likely to be accessed
- * from other CPUs during try_to_wake_up(), put it in a separate
- * cacheline.
- */
- atomic_t nr_running ____cacheline_aligned_in_smp;
-
/*
* Destruction of pool is RCU protected to allow dereferences
* from get_work_pool().
*/
struct rcu_head rcu;
-} ____cacheline_aligned_in_smp;
+};
/*
* The per-pool workqueue. While queued, the lower WORK_STRUCT_FLAG_BITS
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Once upon a time, wq_worker_sleeping() was called with the rq lock held,
so it could not take the pool lock. Instead it relied on "X:"
protection: preemption disabled on the local CPU, which worked even with
the rq lock held.
Now wq_worker_sleeping() is no longer called with the rq lock held and
uses the pool lock instead. But the "X:" protection machinery was never
removed, and wq_worker_running() still relies on it.
So wq_worker_sleeping() can also use "X:" protection and avoid taking
the pool lock.
This patch also documents that only idle_list.next is under "X:"
protection, not the whole idle_list, because destroy_worker() called
from the idle timer can remove non-first idle workers. The idle timer
may fire on a different CPU, or even on the same CPU, where it can
interrupt a preemption-disabled context.
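To make the resulting scheme concrete, a condensed sketch of the two
sides (simplified; the real changes are in the diff below):

/* wq_worker_sleeping(): local CPU, "X:" protection only, no pool->lock */
preempt_disable();
/* WORKER_NOT_RUNNING recheck elided */
if (atomic_dec_and_test(&pool->nr_running) &&
    !list_empty(&pool->worklist)) {
        /* only idle_list.next is dereferenced here */
        struct worker *next = first_idle_worker(pool);

        if (next)
                wake_up_process(next->task);
}
preempt_enable();

/* idle_worker_timeout(): may run concurrently on another CPU, or as an
 * interrupt on this CPU; holds pool->lock; expiry checks elided
 */
while (too_many_workers(pool)) {        /* requires nr_idle > 2 */
        /* always unlinks the last entry, never idle_list.next */
        worker = list_entry(pool->idle_list.prev, struct worker, entry);
        destroy_worker(worker);
}

As I read it, everything else that can unlink the first idle worker runs
in a worker bound to this CPU, which disabling preemption excludes; the
idle timer is the only exception and, as shown, it never removes
idle_list.next.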
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 33f1106b4f99..6c30edbe2fc2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -162,7 +162,8 @@ struct worker_pool {
int nr_workers; /* L: total number of workers */
int nr_idle; /* L: currently idle workers */
- struct list_head idle_list; /* X: list of idle workers */
+ /* idle_list.next and first_idle_worker(): X: first idle worker */
+ struct list_head idle_list; /* L: list of idle workers */
struct timer_list idle_timer; /* L: worker idle timeout */
struct timer_list mayday_timer; /* L: SOS timer for workers */
@@ -819,6 +820,11 @@ static bool too_many_workers(struct worker_pool *pool)
int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
int nr_busy = pool->nr_workers - nr_idle;
+ /*
+ * nr_idle must be at least 2 to allow a manager and at least an idle
+ * worker in idle_list so that idle_worker_timeout() doesn't touch
+ * idle_list.next.
+ */
return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
}
@@ -826,7 +832,7 @@ static bool too_many_workers(struct worker_pool *pool)
* Wake up functions.
*/
-/* Return the first idle worker. Safe with preemption disabled */
+/* Return the first idle worker. Safe with "X:" protection */
static struct worker *first_idle_worker(struct worker_pool *pool)
{
if (unlikely(list_empty(&pool->idle_list)))
@@ -905,7 +911,7 @@ void wq_worker_sleeping(struct task_struct *task)
return;
worker->sleeping = 1;
- raw_spin_lock_irq(&pool->lock);
+ preempt_disable();
/*
* Recheck in case unbind_workers() preempted us. We don't
@@ -913,7 +919,7 @@ void wq_worker_sleeping(struct task_struct *task)
* and nr_running has been reset.
*/
if (worker->flags & WORKER_NOT_RUNNING) {
- raw_spin_unlock_irq(&pool->lock);
+ preempt_enable();
return;
}
@@ -923,10 +929,9 @@ void wq_worker_sleeping(struct task_struct *task)
* Please read comment there.
*
* NOT_RUNNING is clear. This means that we're bound to and
- * running on the local cpu w/ rq lock held and preemption
- * disabled, which in turn means that none else could be
- * manipulating idle_list, so dereferencing idle_list without pool
- * lock is safe.
+ * running on the local cpu with preemption disabled, which in turn
+ * means that no one else could be manipulating idle_list.next
+ * so dereferencing idle_list.next without pool lock is safe.
*/
if (atomic_dec_and_test(&pool->nr_running) &&
!list_empty(&pool->worklist)) {
@@ -934,7 +939,7 @@ void wq_worker_sleeping(struct task_struct *task)
if (next)
wake_up_process(next->task);
}
- raw_spin_unlock_irq(&pool->lock);
+ preempt_enable();
}
/**
--
2.19.1.6.gb485710b
On Tue, Dec 07, 2021 at 03:35:42PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> nr_running is never modified remotely now that the schedule callback in
> the wakeup path has been removed.
>
> Rather, nr_running is often accessed together with other fields of the
> pool, so the cacheline_aligned annotation for nr_running isn't needed.
Does it even need to be atomic anymore?
Thanks.
--
tejun
Hello,
On Tue, Dec 07, 2021 at 03:35:43PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> Once upon a time, wq_worker_sleeping() was called with the rq lock
> held, so it could not take the pool lock. Instead it relied on "X:"
> protection: preemption disabled on the local CPU, which worked even
> with the rq lock held.
>
> Now wq_worker_sleeping() is no longer called with the rq lock held and
> uses the pool lock instead. But the "X:" protection machinery was never
> removed, and wq_worker_running() still relies on it.
>
> So wq_worker_sleeping() can also use "X:" protection and avoid taking
> the pool lock.
>
> This patch also documents that only idle_list.next is under "X:"
> protection, not the whole idle_list, because destroy_worker() called
> from the idle timer can remove non-first idle workers. The idle timer
> may fire on a different CPU, or even on the same CPU, where it can
> interrupt a preemption-disabled context.
It's nice to go back to not needing to grab pool lock in the worker sleeping
path but I'm not sure it actually matters. This isn't in a red-hot path and
we're touching a bunch of stuff in the pool anyway, so the overhead of
grabbing a lock which likely isn't too contended shouldn't matter all that
much. So, maybe it'd be better to just keep things simple?
Thanks.
--
tejun
On Tue, Dec 07, 2021 at 03:35:39PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> Long ago, workers were not ALL bound after CPU_ONLINE; they could still
> be running on other CPUs before rebinding themselves.
>
> But the commit a9ab775bcadf ("workqueue: directly restore CPU affinity
> of workers from CPU_ONLINE") made rebind_workers() bind them all.
>
> So all workers are on the CPU before the CPU is brought down, and the
> comment in unbind_workers() about the workers "which are still
> executing works from before the last CPU down" is outdated.
>
> Just remove it.
>
> Signed-off-by: Lai Jiangshan <[email protected]>
Applied 1-3 to wq/for-5.17.
Thanks.
--
tejun
On Tue, Dec 07, 2021 at 03:35:42PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> nr_running is never modified remotely now that the schedule callback in
> the wakeup path has been removed.
>
> Rather, nr_running is often accessed together with other fields of the
> pool, so the cacheline_aligned annotation for nr_running isn't needed.
>
> Signed-off-by: Lai Jiangshan <[email protected]>
Applied 4-6 to cgroup/for-5.17.
Thanks.
--
tejun
On 2021/12/10 06:07, Tejun Heo wrote:
> On Tue, Dec 07, 2021 at 03:35:42PM +0800, Lai Jiangshan wrote:
>> From: Lai Jiangshan <[email protected]>
>>
>> nr_running is never modified remotely now that the schedule callback
>> in the wakeup path has been removed.
>>
>> Rather, nr_running is often accessed together with other fields of the
>> pool, so the cacheline_aligned annotation for nr_running isn't needed.
>
> Does it even need to be atomic anymore?
>
It doesn't need to be atomic; it is only modified on its associated CPU,
in process context.
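If that holds, a follow-up could be as small as the sketch below
(hypothetical, not a posted patch; the comment wording is mine):

/* in struct worker_pool, replacing the atomic_t: */
int                     nr_running;     /* modified only on the pool's CPU,
                                         * in process context, so plain
                                         * increments/decrements suffice;
                                         * remote readers such as the
                                         * queueing path only need a
                                         * tearing-free load of the value.
                                         */

with the atomic_read()/atomic_inc()/atomic_dec_and_test() callers
converted to plain accesses accordingly.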