2024-02-06 08:01:47

by Yunlong Xing

[permalink] [raw]
Subject: [PATCH] workqueue: Fix pool->nr_running type back to atomic

In CPU-hotplug test, when plug the core, set_cpus_allowed_ptr() restoring
the cpus_mask of the per-cpu worker may fail, the cpus_mask of the worker
remain wq_unbound_cpumask until the core hotpluged next time. so, workers
in the same per-cpu pool can run concurrently and change nr_running at the
same time, atomic problem occur.

Crash ps info:
[RU] PID: 19966 TASK: ffffff802a71a580 CPU: 6 COMMAND: "kworker/6:1"
[ID] PID: 2620 TASK: ffffff80d451a580 CPU: 0 COMMAND: "kworker/6:2"

workqueue_online_cpu()
->rebind_workers()
->set_cpus_allowed_ptr()
// restore cpus_mask failed
kworker/6:2 cpus_mask is 0xFF
worker enter idle

T1:kworker/6:1(CPU6) T2:kworker/6:2(CPU0)

1:worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND)
->pool->nr_running++; (1)

2:wq_worker_sleeping()
->pool->nr_running--; (0)

3:wq_worker_running() worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND)
->pool->nr_running++; (1) ->pool->nr_running++; (1)

//Two workers that running on two CPUs modify the nr_running at the same
time, atomic problems(race condition) may occur. the nr_running should
be 2, but in this case it is 1.

4: worker_set_flags(worker, WORKER_PREP)
->pool->nr_running--; (0)

5:worker_set_flags(worker, WORKER_PREP)
->pool->nr_running--; (-1)

The complete following debug log:

[70285.393470] wqdb kworker/6:1:19966 cpu6 pool 6 ffffff817f311900 nr_running++ is 1, by clr 264
[70285.393484] wqdb kworker/6:1:19966 cpu6 pool 6 ffffff817f311900 nr_running-- is 0, by sleep
[70285.399883] wqdb kworker/6:1:19966 cpu6 pool 6 ffffff817f311900 nr_running++ is 1, by run
[70285.399882] wqdb kworker/6:2:2620 cpu0 pool 6 ffffff817f311900 nr_running++ is 1, by clr 264
[70285.399894] wqdb kworker/6:2:2620 cpu0 pool 6 ffffff817f311900 nr_running-- is 0, by set 8
[70285.400013] wqdb kworker/6:1:19966 cpu6 pool 6 ffffff817f311900 nr_running-- is -1, by set 8
[70285.400017] wqdb kworker/6:1:19966 cpu6 pool 6 ffffff817f311900 nr_running_error is -1

Recover nr_running to the atomic variable type and use atomic method
where nr_running is accessed.

Fixes: bc35f7ef9628 ("workqueue: Convert the type of pool->nr_running to int")
Cc: [email protected]
Signed-off-by: Yunlong Xing <[email protected]>
---
kernel/workqueue.c | 24 +++++++++++-------------
1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 76e60faed892..e74d9b83322c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -161,12 +161,10 @@ struct worker_pool {
bool cpu_stall; /* WD: stalled cpu bound pool */

/*
- * The counter is incremented in a process context on the associated CPU
- * w/ preemption disabled, and decremented or reset in the same context
- * but w/ pool->lock held. The readers grab pool->lock and are
- * guaranteed to see if the counter reached zero.
+ * The workers associated the same CPU maybe running on different CPU,
+ * so need use atomic_t.
*/
- int nr_running;
+ atomic_t nr_running;

struct list_head worklist; /* L: list of pending works */

@@ -832,7 +830,7 @@ static bool work_is_canceling(struct work_struct *work)
*/
static bool need_more_worker(struct worker_pool *pool)
{
- return !list_empty(&pool->worklist) && !pool->nr_running;
+ return !list_empty(&pool->worklist) && !atomic_read(&pool->nr_running);
}

/* Can I start working? Called from busy but !running workers. */
@@ -844,7 +842,7 @@ static bool may_start_working(struct worker_pool *pool)
/* Do I need to keep working? Called from currently running workers. */
static bool keep_working(struct worker_pool *pool)
{
- return !list_empty(&pool->worklist) && (pool->nr_running <= 1);
+ return !list_empty(&pool->worklist) && (atomic_read(&pool->nr_running) <= 1);
}

/* Do we need a new worker? Called from manager. */
@@ -879,7 +877,7 @@ static inline void worker_set_flags(struct worker *worker, unsigned int flags)
/* If transitioning into NOT_RUNNING, adjust nr_running. */
if ((flags & WORKER_NOT_RUNNING) &&
!(worker->flags & WORKER_NOT_RUNNING)) {
- pool->nr_running--;
+ atomic_dec(&pool->nr_running);
}

worker->flags |= flags;
@@ -908,7 +906,7 @@ static inline void worker_clr_flags(struct worker *worker, unsigned int flags)
*/
if ((flags & WORKER_NOT_RUNNING) && (oflags & WORKER_NOT_RUNNING))
if (!(worker->flags & WORKER_NOT_RUNNING))
- pool->nr_running++;
+ atomic_inc(&pool->nr_running);
}

/* Return the first idle worker. Called with pool->lock held. */
@@ -951,7 +949,7 @@ static void worker_enter_idle(struct worker *worker)
mod_timer(&pool->idle_timer, jiffies + IDLE_WORKER_TIMEOUT);

/* Sanity check nr_running. */
- WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);
+ WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && atomic_read(&pool->nr_running));
}

/**
@@ -1262,7 +1260,7 @@ void wq_worker_running(struct task_struct *task)
*/
preempt_disable();
if (!(worker->flags & WORKER_NOT_RUNNING))
- worker->pool->nr_running++;
+ atomic_inc(&worker->pool->nr_running);
preempt_enable();

/*
@@ -1313,7 +1311,7 @@ void wq_worker_sleeping(struct task_struct *task)
return;
}

- pool->nr_running--;
+ atomic_dec(&pool->nr_running);
if (kick_pool(pool))
worker->current_pwq->stats[PWQ_STAT_CM_WAKEUP]++;

@@ -5418,7 +5416,7 @@ static void unbind_workers(int cpu)
* an unbound (in terms of concurrency management) pool which
* are served by workers tied to the pool.
*/
- pool->nr_running = 0;
+ atomic_set(&pool->nr_running, 0);

/*
* With concurrency management just turned off, a busy
--
2.25.1



2024-02-06 16:52:52

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] workqueue: Fix pool->nr_running type back to atomic

Hello,

On Tue, Feb 06, 2024 at 04:00:24PM +0800, Yunlong Xing wrote:
> In CPU-hotplug test, when plug the core, set_cpus_allowed_ptr() restoring
> the cpus_mask of the per-cpu worker may fail, the cpus_mask of the worker
> remain wq_unbound_cpumask until the core hotpluged next time. so, workers
> in the same per-cpu pool can run concurrently and change nr_running at the
> same time, atomic problem occur.

How would set_cpus_allowed_ptr() fail? That should trigger WARN_ON, right?
If set_cpus_allowed_ptr() fails, nr_running getting desynchronized is only a
part of the problem. We will end up running per-cpu work items which must
execute on the same CPU on foreign CPUs.

Thanks.

--
tejun

2024-02-07 02:44:40

by yunlong xing

[permalink] [raw]
Subject: Re: [PATCH] workqueue: Fix pool->nr_running type back to atomic

On Wed, Feb 7, 2024 at 12:52 AM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Tue, Feb 06, 2024 at 04:00:24PM +0800, Yunlong Xing wrote:
> > In CPU-hotplug test, when plug the core, set_cpus_allowed_ptr() restoring
> > the cpus_mask of the per-cpu worker may fail, the cpus_mask of the worker
> > remain wq_unbound_cpumask until the core hotpluged next time. so, workers
> > in the same per-cpu pool can run concurrently and change nr_running at the
> > same time, atomic problem occur.
>
> How would set_cpus_allowed_ptr() fail? That should trigger WARN_ON, right?
> If set_cpus_allowed_ptr() fails, nr_running getting desynchronized is only a
> part of the problem. We will end up running per-cpu work items which must
> execute on the same CPU on foreign CPUs.

Hi tejun,

Yes, WARN_ON is triggered. The reason of set_cpus_allowed_ptr() fail still
needs to be further investigated.

I was originally planning to inquire workqueue whether or not allows the worker
that associated with one cpu running on the other cpus?From your reply,the
answer of my question is not allow,right?

>
> Thanks.
>
> --
> tejun

2024-02-07 03:13:27

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] workqueue: Fix pool->nr_running type back to atomic

Hello,

On Wed, Feb 07, 2024 at 10:06:56AM +0800, yunlong xing wrote:
> Yes, WARN_ON is triggered. The reason of set_cpus_allowed_ptr() fail still
> needs to be further investigated.

I see. Please include context like the above when posting patches.

> I was originally planning to inquire workqueue whether or not allows the worker
> that associated with one cpu running on the other cpus?From your reply,the
> answer of my question is not allow,right?

Yes, you're right. set_cpus_allowed_ptr() shouldn't fail.

Thanks.

--
tejun