2022-08-02 08:48:57

by Valentin Schneider

Subject: [RFC PATCH v3 0/3] workqueue: destroy_worker() vs isolated CPUs

Hi folks,

Using a work struct from within the workqueue code itself is a bit scary, but
it seems to be holding up (at the very least on the locking side of things).

Note that this affects all kworkers (not just percpu ones) for the sake of
consistency and to prevent adding extra corner cases. kthread_set_per_cpu(p, -1)
is a no-op for unbound kworkers, and IIUC the affinity change is not required
since unbound workers have to be affined to a subset of wq_unbound_cpumask, but
it shouldn't be harmful either.

3/3 (not for merging!) is a simple and stupid stresser that forces extra pcpu
kworkers to be spawned on a specific CPU - I can then quickly test this on QEMU
by making sure said CPU is isolated on the cmdline.
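
For illustration only (the exact QEMU/kernel invocation isn't given here), a
cmdline isolating CPU 3 to match the stresser's TARGET_CPU could look like:

  isolcpus=3 nohz_full=3 rcu_nocbs=3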

Thanks to Tejun & Lai for the discussion thus far.

Revisions
=========

RFCv2 -> RFCv3
++++++++++++++

o Rebase onto v5.19
o Add new patch (1/3) around accessing wq_unbound_cpumask

o Prevent WORKER_DIE workers from kfree()'ing themselves before the idle reaper
gets to handle them (Tejun)

Bit of an aside on that: I've been struggling to convince myself this can
happen due to spurious wakeups and would like some help here.

Idle workers are TASK_UNINTERRUPTIBLE, so they can't be woken up by
signals. That state is set *under* pool->lock, and all wakeups (before this
patch) are also done while holding pool->lock.

wake_up_worker() is done under pool->lock AND only wakes a worker on the
pool->idle_list. Thus the to-be-woken worker *cannot* have WORKER_DIE, though
it could gain it *after* being woken but *before* it runs, e.g.:

                    LOCK pool->lock
                    wake_up_worker(pool)
                        wake_up_process(p)
                    UNLOCK pool->lock
idle_reaper_fn()
  LOCK pool->lock
  destroy_worker(worker, list);
  UNLOCK pool->lock
                                            worker_thread()
                                              goto woke_up;
                                              LOCK pool->lock
                                              READ worker->flags & WORKER_DIE
                                              UNLOCK pool->lock
                                              ...
                                              kfree(worker);
  reap_worker(worker);
  // Uh-oh

... But IMO that's not a spurious wakeup, that's a concurrency issue. I don't
see any spurious/unexpected worker wakeup happening once a worker is off the
pool->idle_list.


RFCv1 -> RFCv2
++++++++++++++

o Change the pool->timer into a delayed_work to have a sleepable context for
unbinding kworkers

Cheers,
Valentin

Valentin Schneider (3):
workqueue: Hold wq_pool_mutex while affining tasks to
wq_unbound_cpumask
workqueue: Unbind workers before sending them to exit()
DEBUG-DO-NOT-MERGE: workqueue: kworker spawner

kernel/Makefile | 2 +-
kernel/workqueue.c | 169 +++++++++++++++++++++++++++++-------
kernel/workqueue_internal.h | 1 +
kernel/wqstress.c | 69 +++++++++++++++
4 files changed, 207 insertions(+), 34 deletions(-)
create mode 100644 kernel/wqstress.c

--
2.31.1



2022-08-02 09:01:33

by Valentin Schneider

Subject: [RFC PATCH v3 3/3] DEBUG-DO-NOT-MERGE: workqueue: kworker spawner

---
kernel/Makefile | 2 +-
kernel/workqueue.c | 9 +++++-
kernel/wqstress.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 78 insertions(+), 2 deletions(-)
create mode 100644 kernel/wqstress.c

diff --git a/kernel/Makefile b/kernel/Makefile
index a7e1f49ab2b3..860133f7bca5 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -10,7 +10,7 @@ obj-y = fork.o exec_domain.o panic.o \
extable.o params.o platform-feature.o \
kthread.o sys_ni.o nsproxy.o \
notifier.o ksysfs.o cred.o reboot.o \
- async.o range.o smpboot.o ucount.o regset.o
+ async.o range.o smpboot.o ucount.o regset.o wqstress.o

obj-$(CONFIG_USERMODE_DRIVER) += usermode_driver.o
obj-$(CONFIG_MODULES) += kmod.o
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 28cd58c684ee..4ffd50a3db46 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -91,7 +91,7 @@ enum {
BUSY_WORKER_HASH_ORDER = 6, /* 64 pointers */

MAX_IDLE_WORKERS_RATIO = 4, /* 1/4 of busy can be idle */
- IDLE_WORKER_TIMEOUT = 300 * HZ, /* keep idle ones for 5 mins */
+ IDLE_WORKER_TIMEOUT = 3 * HZ, /* keep idle ones for 3s (debug) */

MAYDAY_INITIAL_TIMEOUT = HZ / 100 >= 2 ? HZ / 100 : 2,
/* call for help after 10ms
@@ -1996,6 +1996,10 @@ static void reap_workers(struct list_head *reaplist)
struct worker *worker, *tmp;

list_for_each_entry_safe(worker, tmp, reaplist, entry) {
+ pr_info("WORKER_REAP: task=%s cpu=%d this_task=%s this_cpu=%d\n",
+ worker->task->comm, task_cpu(worker->task),
+ current->comm, raw_smp_processor_id());
+
list_del_init(&worker->entry);
unbind_worker(worker);

@@ -2489,6 +2493,9 @@ static int worker_thread(void *__worker)
WARN_ON_ONCE(!list_empty(&worker->entry));
set_pf_worker(false);

+ pr_info("WORKER_DIE: task=%s this_cpu=%d\n",
+ current->comm, raw_smp_processor_id());
+
set_task_comm(worker->task, "kworker/dying");
ida_free(&pool->worker_ida, worker->id);
worker_detach_from_pool(worker);
diff --git a/kernel/wqstress.c b/kernel/wqstress.c
new file mode 100644
index 000000000000..16a3771027cd
--- /dev/null
+++ b/kernel/wqstress.c
@@ -0,0 +1,69 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+
+MODULE_AUTHOR("Valentin Schneider <[email protected]>");
+MODULE_LICENSE("GPL");
+
+#define TARGET_CPU 3
+
+static void wqstress_workfn(struct work_struct *work)
+{
+ schedule_timeout_interruptible(10 * HZ);
+}
+
+#define DECL_WORK(n) static DECLARE_WORK(wqstress_work_##n, wqstress_workfn)
+#define KICK_WORK(n) do { \
+ schedule_work_on(TARGET_CPU, &wqstress_work_##n); \
+ } while (0);
+#define FLUSH_WORK(n) do { \
+ flush_work(&wqstress_work_##n); \
+ } while (0);
+
+DECL_WORK(0);
+DECL_WORK(1);
+DECL_WORK(2);
+DECL_WORK(3);
+DECL_WORK(4);
+DECL_WORK(5);
+DECL_WORK(6);
+DECL_WORK(7);
+DECL_WORK(8);
+DECL_WORK(9);
+
+/*
+ * This should create ≈(N-1) extra kworkers for N kicked work
+ */
+static int __init wqstress_init(void)
+{
+ pr_info("WQSTRESS START\n");
+
+ sched_set_fifo_low(current);
+
+ KICK_WORK(0);
+ KICK_WORK(1);
+ KICK_WORK(2);
+ KICK_WORK(3);
+ KICK_WORK(4);
+ KICK_WORK(5);
+ KICK_WORK(6);
+ KICK_WORK(7);
+ KICK_WORK(8);
+ KICK_WORK(9);
+
+ FLUSH_WORK(0);
+ FLUSH_WORK(1);
+ FLUSH_WORK(2);
+ FLUSH_WORK(3);
+ FLUSH_WORK(4);
+ FLUSH_WORK(5);
+ FLUSH_WORK(6);
+ FLUSH_WORK(7);
+ FLUSH_WORK(8);
+ FLUSH_WORK(9);
+
+ return 0;
+}
+
+late_initcall_sync(wqstress_init);
--
2.31.1


2022-08-02 09:11:17

by Valentin Schneider

Subject: [RFC PATCH v3 1/3] workqueue: Hold wq_pool_mutex while affining tasks to wq_unbound_cpumask

When unbind_workers() reads wq_unbound_cpumask to set the affinity of
freshly-unbound kworkers, it only holds wq_pool_attach_mutex. This isn't
sufficient as wq_unbound_cpumask is only protected by wq_pool_mutex.

This is made more obvious as of recent commit

46a4d679ef88 ("workqueue: Avoid a false warning in unbind_workers()")

e.g.

  unbind_workers()                                workqueue_set_unbound_cpumask()
    kthread_set_per_cpu(p, -1);
    if (cpumask_intersects(wq_unbound_cpumask, cpu_active_mask))
                                                    cpumask_copy(wq_unbound_cpumask, cpumask);
    WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);

Make workqueue_offline_cpu() invoke unbind_workers() with wq_pool_mutex
held.

Fixes: 10a5a651e3af ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs")
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/workqueue.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index aa8a82bc6738..97cc41430a76 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5143,14 +5143,15 @@ int workqueue_offline_cpu(unsigned int cpu)
if (WARN_ON(cpu != smp_processor_id()))
return -1;

+ mutex_lock(&wq_pool_mutex);
+
unbind_workers(cpu);

/* update NUMA affinity of unbound workqueues */
- mutex_lock(&wq_pool_mutex);
list_for_each_entry(wq, &workqueues, list)
wq_update_unbound_numa(wq, cpu, false);
- mutex_unlock(&wq_pool_mutex);

+ mutex_unlock(&wq_pool_mutex);
return 0;
}

--
2.31.1


2022-08-02 09:16:50

by Valentin Schneider

Subject: [RFC PATCH v3 2/3] workqueue: Unbind workers before sending them to exit()

It has been reported that isolated CPUs can suffer from interference due to
per-CPU kworkers waking up just to die.

A surge of workqueue activity during initial setup of a latency-sensitive
application (refresh_vm_stats() being one of the culprits) can cause extra
per-CPU kworkers to be spawned. Then, said latency-sensitive task can be
running merrily on an isolated CPU only to be interrupted sometime later by
a kworker marked for death (cf. IDLE_WORKER_TIMEOUT, 5 minutes after last
kworker activity).

Prevent this by affining kworkers to the wq_unbound_cpumask (which doesn't
contain isolated CPUs, cf. HK_TYPE_WQ) before waking them up after marking
them with WORKER_DIE.

Changing the affinity does require a sleepable context, so get rid of the
pool->idle_timer and use a delayed_work instead. Ensure kworkers do not
free their resources before the new kworker reaper has handled them by
introducing a new struct worker.reaped field - this new field fills in a 4
byte hole in the second cacheline of struct worker.

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/workqueue.c | 155 +++++++++++++++++++++++++++++-------
kernel/workqueue_internal.h | 1 +
2 files changed, 126 insertions(+), 30 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 97cc41430a76..28cd58c684ee 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -167,9 +167,9 @@ struct worker_pool {
int nr_workers; /* L: total number of workers */
int nr_idle; /* L: currently idle workers */

- struct list_head idle_list; /* L: list of idle workers */
- struct timer_list idle_timer; /* L: worker idle timeout */
- struct timer_list mayday_timer; /* L: SOS timer for workers */
+ struct list_head idle_list; /* L: list of idle workers */
+ struct delayed_work idle_reaper_work; /* L: worker idle timeout */
+ struct timer_list mayday_timer; /* L: SOS timer for workers */

/* a workers is either on busy_hash or idle_list, or the manager */
DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
@@ -1806,8 +1806,10 @@ static void worker_enter_idle(struct worker *worker)
/* idle_list is LIFO */
list_add(&worker->entry, &pool->idle_list);

- if (too_many_workers(pool) && !timer_pending(&pool->idle_timer))
- mod_timer(&pool->idle_timer, jiffies + IDLE_WORKER_TIMEOUT);
+ if (too_many_workers(pool) && !delayed_work_pending(&pool->idle_reaper_work))
+ mod_delayed_work(system_unbound_wq,
+ &pool->idle_reaper_work,
+ IDLE_WORKER_TIMEOUT);

/* Sanity check nr_running. */
WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);
@@ -1972,9 +1974,72 @@ static struct worker *create_worker(struct worker_pool *pool)
return NULL;
}

+static void unbind_worker(struct worker *worker)
+{
+ lockdep_assert_held(&wq_pool_mutex);
+
+ kthread_set_per_cpu(worker->task, -1);
+ if (cpumask_intersects(wq_unbound_cpumask, cpu_active_mask))
+ WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);
+ else
+ WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
+}
+
+static void rebind_worker(struct worker *worker, struct worker_pool *pool)
+{
+ kthread_set_per_cpu(worker->task, pool->cpu);
+ WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask) < 0);
+}
+
+static void reap_workers(struct list_head *reaplist)
+{
+ struct worker *worker, *tmp;
+
+ list_for_each_entry_safe(worker, tmp, reaplist, entry) {
+ list_del_init(&worker->entry);
+ unbind_worker(worker);
+
+ /*
+ * If the worker was somehow already running, then it had to be
+ * in pool->idle_list when destroy_worker() happened or we
+ * wouldn't have gotten here.
+ *
+ * Thus, the worker must either have observed the WORKER_DIE
+ * flag, or have set its state to TASK_IDLE. Either way, the
+ * below will be observed by the worker and is safe to do
+ * outside of pool->lock.
+ */
+ WRITE_ONCE(worker->reaped, true);
+ wake_up_process(worker->task);
+ }
+}
+
+/*
+ * Unlikely as it may be, a worker could wake after destroy_worker() has
+ * happened but before reap_workers(). WORKER_DIE would be set in worker->flags,
+ * so it would be able to kfree(worker) and head out to do_exit().
+ *
+ * Rather than make the reaper wait for each to-be-reaped kworker to exit and
+ * kfree(worker) itself, make the kworkers (which have nothing to do but go
+ * do_exit() anyway) wait for the reaper to be done with them.
+ */
+static void worker_wait_reaped(struct worker *worker)
+{
+ WARN_ON_ONCE(current != worker->task);
+
+ for (;;) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (READ_ONCE(worker->reaped))
+ break;
+ schedule();
+ }
+ __set_current_state(TASK_RUNNING);
+}
+
/**
* destroy_worker - destroy a workqueue worker
* @worker: worker to be destroyed
+ * @list: transfer worker away from its pool->idle_list and into list
*
* Destroy @worker and adjust @pool stats accordingly. The worker should
* be idle.
@@ -1982,7 +2047,7 @@ static struct worker *create_worker(struct worker_pool *pool)
* CONTEXT:
* raw_spin_lock_irq(pool->lock).
*/
-static void destroy_worker(struct worker *worker)
+static void destroy_worker(struct worker *worker, struct list_head *list)
{
struct worker_pool *pool = worker->pool;

@@ -1997,34 +2062,64 @@ static void destroy_worker(struct worker *worker)
pool->nr_workers--;
pool->nr_idle--;

- list_del_init(&worker->entry);
worker->flags |= WORKER_DIE;
- wake_up_process(worker->task);
+
+ list_move(&worker->entry, list);
}

-static void idle_worker_timeout(struct timer_list *t)
+/**
+ * idle_reaper_fn - reap workers that have been idle for too long.
+ *
+ * Unbinding marked-for-destruction workers requires a sleepable context, as
+ * changing a task's affinity is not an atomic operation, and we don't want
+ * to disturb isolated CPUs IDLE_WORKER_TIMEOUT in the future just for a kworker
+ * to do_exit().
+ *
+ * Percpu kworkers should meet the conditions for the affinity change to not
+ * block (not migration-disabled and not running), but there is no *hard*
+ * guarantee that they are not running when we get here.
+ *
+ * The delayed_work is only ever modified under raw_spin_lock_irq(pool->lock).
+ */
+static void idle_reaper_fn(struct work_struct *work)
{
- struct worker_pool *pool = from_timer(pool, t, idle_timer);
+ struct delayed_work *dwork = to_delayed_work(work);
+ struct worker_pool *pool = container_of(dwork, struct worker_pool, idle_reaper_work);
+ struct list_head reaplist;
+ struct worker *worker;
+
+ INIT_LIST_HEAD(&reaplist);

raw_spin_lock_irq(&pool->lock);

while (too_many_workers(pool)) {
- struct worker *worker;
unsigned long expires;
+ unsigned long now = jiffies;

/* idle_list is kept in LIFO order, check the last one */
worker = list_entry(pool->idle_list.prev, struct worker, entry);
expires = worker->last_active + IDLE_WORKER_TIMEOUT;

- if (time_before(jiffies, expires)) {
- mod_timer(&pool->idle_timer, expires);
+ /*
+ * Careful: queueing a work item from here can and will cause a
+ * self-deadlock when dealing with an unbound pool. However,
+ * here the delay *cannot* be zero and *has* to be in the
+ * future, which works.
+ */
+ if (time_before(now, expires)) {
+ mod_delayed_work(system_unbound_wq,
+ &pool->idle_reaper_work,
+ expires - now);
break;
}

- destroy_worker(worker);
+ destroy_worker(worker, &reaplist);
}
-
raw_spin_unlock_irq(&pool->lock);
+
+ mutex_lock(&wq_pool_mutex);
+ reap_workers(&reaplist);
+ mutex_unlock(&wq_pool_mutex);
}

static void send_mayday(struct work_struct *work)
@@ -2388,6 +2483,9 @@ static int worker_thread(void *__worker)
/* am I supposed to die? */
if (unlikely(worker->flags & WORKER_DIE)) {
raw_spin_unlock_irq(&pool->lock);
+
+ worker_wait_reaped(worker);
+
WARN_ON_ONCE(!list_empty(&worker->entry));
set_pf_worker(false);

@@ -3454,7 +3552,7 @@ static int init_worker_pool(struct worker_pool *pool)
INIT_LIST_HEAD(&pool->idle_list);
hash_init(pool->busy_hash);

- timer_setup(&pool->idle_timer, idle_worker_timeout, TIMER_DEFERRABLE);
+ INIT_DEFERRABLE_WORK(&pool->idle_reaper_work, idle_reaper_fn);

timer_setup(&pool->mayday_timer, pool_mayday_timeout, 0);

@@ -3559,8 +3657,11 @@ static bool wq_manager_inactive(struct worker_pool *pool)
static void put_unbound_pool(struct worker_pool *pool)
{
DECLARE_COMPLETION_ONSTACK(detach_completion);
+ struct list_head reaplist;
struct worker *worker;

+ INIT_LIST_HEAD(&reaplist);
+
lockdep_assert_held(&wq_pool_mutex);

if (--pool->refcnt)
@@ -3588,10 +3689,12 @@ static void put_unbound_pool(struct worker_pool *pool)
pool->flags |= POOL_MANAGER_ACTIVE;

while ((worker = first_idle_worker(pool)))
- destroy_worker(worker);
+ destroy_worker(worker, &reaplist);
WARN_ON(pool->nr_workers || pool->nr_idle);
raw_spin_unlock_irq(&pool->lock);

+ reap_workers(&reaplist);
+
mutex_lock(&wq_pool_attach_mutex);
if (!list_empty(&pool->workers))
pool->detach_completion = &detach_completion;
@@ -3601,7 +3704,7 @@ static void put_unbound_pool(struct worker_pool *pool)
wait_for_completion(pool->detach_completion);

/* shut down the timers */
- del_timer_sync(&pool->idle_timer);
+ cancel_delayed_work_sync(&pool->idle_reaper_work);
del_timer_sync(&pool->mayday_timer);

/* RCU protected to allow dereferences from get_work_pool() */
@@ -4999,13 +5102,8 @@ static void unbind_workers(int cpu)

raw_spin_unlock_irq(&pool->lock);

- for_each_pool_worker(worker, pool) {
- kthread_set_per_cpu(worker->task, -1);
- if (cpumask_intersects(wq_unbound_cpumask, cpu_active_mask))
- WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);
- else
- WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
- }
+ for_each_pool_worker(worker, pool)
+ unbind_worker(worker);

mutex_unlock(&wq_pool_attach_mutex);
}
@@ -5030,11 +5128,8 @@ static void rebind_workers(struct worker_pool *pool)
* of all workers first and then clear UNBOUND. As we're called
* from CPU_ONLINE, the following shouldn't fail.
*/
- for_each_pool_worker(worker, pool) {
- kthread_set_per_cpu(worker->task, pool->cpu);
- WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
- pool->attrs->cpumask) < 0);
- }
+ for_each_pool_worker(worker, pool)
+ rebind_worker(worker, pool);

raw_spin_lock_irq(&pool->lock);

diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index e00b1204a8e9..a3d60e10a76f 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -46,6 +46,7 @@ struct worker {
unsigned int flags; /* X: flags */
int id; /* I: worker id */
int sleeping; /* None */
+ int reaped; /* None */

/*
* Opaque string set with work_set_desc(). Printed out with task
--
2.31.1


2022-08-03 04:14:22

by Lai Jiangshan

Subject: Re: [RFC PATCH v3 1/3] workqueue: Hold wq_pool_mutex while affining tasks to wq_unbound_cpumask



On 2022/8/2 16:41, Valentin Schneider wrote:
> When unbind_workers() reads wq_unbound_cpumask to set the affinity of
> freshly-unbound kworkers, it only holds wq_pool_attach_mutex. This isn't
> sufficient as wq_unbound_cpumask is only protected by wq_pool_mutex.
>
> This is made more obvious as of recent commit
>
> 46a4d679ef88 ("workqueue: Avoid a false warning in unbind_workers()")
>
> e.g.
>
> unbind_workers() workqueue_set_unbound_cpumask()
> kthread_set_per_cpu(p, -1);
> if (cpumask_intersects(wq_unbound_cpumask, cpu_active_mask))
> cpumask_copy(wq_unbound_cpumask, cpumask);
> WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);
>
> Make workqueue_offline_cpu() invoke unbind_workers() with wq_pool_mutex
> held.

I would prefer to protect wq_unbound_cpumask with wq_pool_attach_mutex.

From df7b4672db4dfd3e480b1873b9d346e8a7dfc69f Mon Sep 17 00:00:00 2001
From: Lai Jiangshan <[email protected]>
Date: Wed, 3 Aug 2022 10:52:04 +0800
Subject: [PATCH] workqueue: Protects wq_unbound_cpumask with
wq_pool_attach_mutex

When unbind_workers() reads wq_unbound_cpumask to set the affinity of
freshly-unbound kworkers, it only holds wq_pool_attach_mutex. This isn't
sufficient as wq_unbound_cpumask is only protected by wq_pool_mutex.

Make wq_unbound_cpumask protected with wq_pool_attach_mutex and also
remove the need of temporary saved_cpumask.

Fixes: 10a5a651e3af ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs")
Reported-by: Valentin Schneider <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 41 ++++++++++++++++-------------------------
1 file changed, 16 insertions(+), 25 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6b2b66940530..eaea73e7e365 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -327,7 +327,7 @@ static struct rcuwait manager_wait = __RCUWAIT_INITIALIZER(manager_wait);
static LIST_HEAD(workqueues); /* PR: list of all workqueues */
static bool workqueue_freezing; /* PL: have wqs started freezing? */

-/* PL: allowable cpus for unbound wqs and work items */
+/* PL&A: allowable cpus for unbound wqs and work items */
static cpumask_var_t wq_unbound_cpumask;

/* CPU where unbound work was last round robin scheduled from this CPU */
@@ -3933,7 +3933,8 @@ static void apply_wqattrs_cleanup(struct apply_wqattrs_ctx *ctx)
/* allocate the attrs and pwqs for later installation */
static struct apply_wqattrs_ctx *
apply_wqattrs_prepare(struct workqueue_struct *wq,
- const struct workqueue_attrs *attrs)
+ const struct workqueue_attrs *attrs,
+ const cpumask_var_t unbound_cpumask)
{
struct apply_wqattrs_ctx *ctx;
struct workqueue_attrs *new_attrs, *tmp_attrs;
@@ -3949,14 +3950,15 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
goto out_free;

/*
- * Calculate the attrs of the default pwq.
+ * Calculate the attrs of the default pwq with unbound_cpumask
+ * which is wq_unbound_cpumask or to set to wq_unbound_cpumask.
* If the user configured cpumask doesn't overlap with the
* wq_unbound_cpumask, we fallback to the wq_unbound_cpumask.
*/
copy_workqueue_attrs(new_attrs, attrs);
- cpumask_and(new_attrs->cpumask, new_attrs->cpumask, wq_unbound_cpumask);
+ cpumask_and(new_attrs->cpumask, new_attrs->cpumask, unbound_cpumask);
if (unlikely(cpumask_empty(new_attrs->cpumask)))
- cpumask_copy(new_attrs->cpumask, wq_unbound_cpumask);
+ cpumask_copy(new_attrs->cpumask, unbound_cpumask);

/*
* We may create multiple pwqs with differing cpumasks. Make a
@@ -4053,7 +4055,7 @@ static int apply_workqueue_attrs_locked(struct workqueue_struct *wq,
wq->flags &= ~__WQ_ORDERED;
}

- ctx = apply_wqattrs_prepare(wq, attrs);
+ ctx = apply_wqattrs_prepare(wq, attrs, wq_unbound_cpumask);
if (!ctx)
return -ENOMEM;

@@ -5311,7 +5313,7 @@ void thaw_workqueues(void)
}
#endif /* CONFIG_FREEZER */

-static int workqueue_apply_unbound_cpumask(void)
+static int workqueue_apply_unbound_cpumask(const cpumask_var_t unbound_cpumask)
{
LIST_HEAD(ctxs);
int ret = 0;
@@ -5327,7 +5329,7 @@ static int workqueue_apply_unbound_cpumask(void)
if (wq->flags & __WQ_ORDERED)
continue;

- ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs);
+ ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs, unbound_cpumask);
if (!ctx) {
ret = -ENOMEM;
break;
@@ -5342,6 +5344,11 @@ static int workqueue_apply_unbound_cpumask(void)
apply_wqattrs_cleanup(ctx);
}

+ if (!ret) {
+ mutex_lock(&wq_pool_attach_mutex);
+ cpumask_copy(wq_unbound_cpumask, unbound_cpumask);
+ mutex_unlock(&wq_pool_attach_mutex);
+ }
return ret;
}

@@ -5360,7 +5367,6 @@ static int workqueue_apply_unbound_cpumask(void)
int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
{
int ret = -EINVAL;
- cpumask_var_t saved_cpumask;

/*
* Not excluding isolated cpus on purpose.
@@ -5374,23 +5380,8 @@ int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
goto out_unlock;
}

- if (!zalloc_cpumask_var(&saved_cpumask, GFP_KERNEL)) {
- ret = -ENOMEM;
- goto out_unlock;
- }
-
- /* save the old wq_unbound_cpumask. */
- cpumask_copy(saved_cpumask, wq_unbound_cpumask);
-
- /* update wq_unbound_cpumask at first and apply it to wqs. */
- cpumask_copy(wq_unbound_cpumask, cpumask);
- ret = workqueue_apply_unbound_cpumask();
-
- /* restore the wq_unbound_cpumask when failed. */
- if (ret < 0)
- cpumask_copy(wq_unbound_cpumask, saved_cpumask);
+ ret = workqueue_apply_unbound_cpumask(cpumask);

- free_cpumask_var(saved_cpumask);
out_unlock:
apply_wqattrs_unlock();
}
--
2.19.1.6.gb485710b



2022-08-04 12:02:17

by Valentin Schneider

Subject: Re: [RFC PATCH v3 1/3] workqueue: Hold wq_pool_mutex while affining tasks to wq_unbound_cpumask

On 03/08/22 11:40, Lai Jiangshan wrote:
> On 2022/8/2 16:41, Valentin Schneider wrote:
>> When unbind_workers() reads wq_unbound_cpumask to set the affinity of
>> freshly-unbound kworkers, it only holds wq_pool_attach_mutex. This isn't
>> sufficient as wq_unbound_cpumask is only protected by wq_pool_mutex.
>>
>> This is made more obvious as of recent commit
>>
>> 46a4d679ef88 ("workqueue: Avoid a false warning in unbind_workers()")
>>
>> e.g.
>>
>> unbind_workers() workqueue_set_unbound_cpumask()
>> kthread_set_per_cpu(p, -1);
>> if (cpumask_intersects(wq_unbound_cpumask, cpu_active_mask))
>> cpumask_copy(wq_unbound_cpumask, cpumask);
>> WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);
>>
>> Make workqueue_offline_cpu() invoke unbind_workers() with wq_pool_mutex
>> held.
>
> I would prefer to protect wq_unbound_cpumask with wq_pool_attach_mutex.

That looks alright to me, do you want to push that separately as it's a
standalone patch, or should I carry it with this series?


2022-08-05 02:47:53

by Lai Jiangshan

Subject: Re: [RFC PATCH v3 1/3] workqueue: Hold wq_pool_mutex while affining tasks to wq_unbound_cpumask

On Thu, Aug 4, 2022 at 7:40 PM Valentin Schneider <[email protected]> wrote:
>
> On 03/08/22 11:40, Lai Jiangshan wrote:
> > On 2022/8/2 16:41, Valentin Schneider wrote:
> >> When unbind_workers() reads wq_unbound_cpumask to set the affinity of
> >> freshly-unbound kworkers, it only holds wq_pool_attach_mutex. This isn't
> >> sufficient as wq_unbound_cpumask is only protected by wq_pool_mutex.
> >>
> >> This is made more obvious as of recent commit
> >>
> >> 46a4d679ef88 ("workqueue: Avoid a false warning in unbind_workers()")
> >>
> >> e.g.
> >>
> >> unbind_workers() workqueue_set_unbound_cpumask()
> >> kthread_set_per_cpu(p, -1);
> >> if (cpumask_intersects(wq_unbound_cpumask, cpu_active_mask))
> >> cpumask_copy(wq_unbound_cpumask, cpumask);
> >> WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);
> >>
> >> Make workqueue_offline_cpu() invoke unbind_workers() with wq_pool_mutex
> >> held.
> >
> > I would prefer to protect wq_unbound_cpumask with wq_pool_attach_mutex.
>
> That looks alright to me, do you want to push that separately as it's a
> standalone patch, or should I carry it with this series?
>

I'm Okay with both.

It needs review from Tejun. If Tejun has not queued it before you send
a new update of this series, I will be glad if you carry it.

2022-08-05 03:26:00

by Lai Jiangshan

Subject: Re: [RFC PATCH v3 2/3] workqueue: Unbind workers before sending them to exit()

On Tue, Aug 2, 2022 at 4:42 PM Valentin Schneider <[email protected]> wrote:
>
> It has been reported that isolated CPUs can suffer from interference due to
> per-CPU kworkers waking up just to die.
>
> A surge of workqueue activity during initial setup of a latency-sensitive
> application (refresh_vm_stats() being one of the culprits) can cause extra
> per-CPU kworkers to be spawned. Then, said latency-sensitive task can be
> running merrily on an isolated CPU only to be interrupted sometime later by
> a kworker marked for death (cf. IDLE_WORKER_TIMEOUT, 5 minutes after last
> kworker activity).
>
> Prevent this by affining kworkers to the wq_unbound_cpumask (which doesn't
> contain isolated CPUs, cf. HK_TYPE_WQ) before waking them up after marking
> them with WORKER_DIE.
>
> Changing the affinity does require a sleepable context, so get rid of the
> pool->idle_timer and use a delayed_work instead. Ensure kworkers do not
> free their resources before the new kworker reaper has handled them by
> introducing a new struct worker.reaper field - this new field fills in a 4
> byte hole in the second cacheline of struct worker.
>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> kernel/workqueue.c | 155 +++++++++++++++++++++++++++++-------
> kernel/workqueue_internal.h | 1 +
> 2 files changed, 126 insertions(+), 30 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 97cc41430a76..28cd58c684ee 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -167,9 +167,9 @@ struct worker_pool {
> int nr_workers; /* L: total number of workers */
> int nr_idle; /* L: currently idle workers */
>
> - struct list_head idle_list; /* L: list of idle workers */
> - struct timer_list idle_timer; /* L: worker idle timeout */
> - struct timer_list mayday_timer; /* L: SOS timer for workers */
> + struct list_head idle_list; /* L: list of idle workers */
> + struct delayed_work idle_reaper_work; /* L: worker idle timeout */
> + struct timer_list mayday_timer; /* L: SOS timer for workers */
>
> /* a workers is either on busy_hash or idle_list, or the manager */
> DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
> @@ -1806,8 +1806,10 @@ static void worker_enter_idle(struct worker *worker)
> /* idle_list is LIFO */
> list_add(&worker->entry, &pool->idle_list);
>
> - if (too_many_workers(pool) && !timer_pending(&pool->idle_timer))
> - mod_timer(&pool->idle_timer, jiffies + IDLE_WORKER_TIMEOUT);
> + if (too_many_workers(pool) && !delayed_work_pending(&pool->idle_reaper_work))
> + mod_delayed_work(system_unbound_wq,
> + &pool->idle_reaper_work,
> + IDLE_WORKER_TIMEOUT);
>
> /* Sanity check nr_running. */
> WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);
> @@ -1972,9 +1974,72 @@ static struct worker *create_worker(struct worker_pool *pool)
> return NULL;
> }
>
> +static void unbind_worker(struct worker *worker)
> +{
> + lockdep_assert_held(&wq_pool_mutex);
> +
> + kthread_set_per_cpu(worker->task, -1);
> + if (cpumask_intersects(wq_unbound_cpumask, cpu_active_mask))
> + WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);
> + else
> + WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
> +}
> +
> +static void rebind_worker(struct worker *worker, struct worker_pool *pool)
> +{
> + kthread_set_per_cpu(worker->task, pool->cpu);
> + WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask) < 0);
> +}
> +
> +static void reap_workers(struct list_head *reaplist)
> +{
> + struct worker *worker, *tmp;
> +
> + list_for_each_entry_safe(worker, tmp, reaplist, entry) {
> + list_del_init(&worker->entry);
> + unbind_worker(worker);
> +
> + /*
> + * If the worker was somehow already running, then it had to be
> + * in pool->idle_list when destroy_worker() happened or we
> + * wouldn't have gotten here.
> + *
> + * Thus, the worker must either have observed the WORKER_DIE
> + * flag, or have set its state to TASK_IDLE. Either way, the
> + * below will be observed by the worker and is safe to do
> + * outside of pool->lock.
> + */
> + WRITE_ONCE(worker->reaped, true);
> + wake_up_process(worker->task);
> + }
> +}
> +
> +/*
> + * Unlikely as it may be, a worker could wake after destroy_worker() has
> + * happened but before reap_workers(). WORKER_DIE would be set in worker->flags,
> + * so it would be able to kfree(worker) and head out to do_exit().
> + *
> + * Rather than make the reaper wait for each to-be-reaped kworker to exit and
> + * kfree(worker) itself, make the kworkers (which have nothing to do but go
> + * do_exit() anyway) wait for the reaper to be done with them.
> + */
> +static void worker_wait_reaped(struct worker *worker)
> +{
> + WARN_ON_ONCE(current != worker->task);
> +
> + for (;;) {
> + set_current_state(TASK_INTERRUPTIBLE);
> + if (READ_ONCE(worker->reaped))
> + break;
> + schedule();
> + }
> + __set_current_state(TASK_RUNNING);
> +}


It is not a good idea to add this scheduler-ist code here.

Using wq_pool_attach_mutex to protect the whole body of idle_reaper_fn()
can stop the worker from freeing itself since the worker has to
get the mutex before exiting.

And I don't think batching destruction is a good idea since
it is not a hot path.
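
A rough sketch of that idea (hypothetical, reusing this series' helpers;
unbind_worker()'s lockdep_assert_held(&wq_pool_mutex) would need adjusting,
and wq_unbound_cpumask would have to be readable under wq_pool_attach_mutex
as per the separate patch in this thread):

static void idle_reaper_fn(struct work_struct *work)
{
	struct delayed_work *dwork = to_delayed_work(work);
	struct worker_pool *pool = container_of(dwork, struct worker_pool,
						idle_reaper_work);
	struct list_head reaplist;
	struct worker *worker;

	INIT_LIST_HEAD(&reaplist);

	/*
	 * Holding wq_pool_attach_mutex across the whole operation pins the
	 * to-be-reaped workers: a dying worker must take this mutex in
	 * worker_detach_from_pool() before it can kfree() itself, so no
	 * worker->reaped handshake is needed.
	 */
	mutex_lock(&wq_pool_attach_mutex);
	raw_spin_lock_irq(&pool->lock);

	while (too_many_workers(pool)) {
		unsigned long now = jiffies;
		unsigned long expires;

		/* idle_list is kept in LIFO order, check the last one */
		worker = list_entry(pool->idle_list.prev, struct worker, entry);
		expires = worker->last_active + IDLE_WORKER_TIMEOUT;

		if (time_before(now, expires)) {
			mod_delayed_work(system_unbound_wq,
					 &pool->idle_reaper_work,
					 expires - now);
			break;
		}

		destroy_worker(worker, &reaplist);
	}
	raw_spin_unlock_irq(&pool->lock);

	/* Sleepable context: unbind and wake the WORKER_DIE workers */
	reap_workers(&reaplist);

	mutex_unlock(&wq_pool_attach_mutex);
}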

> +
> /**
> * destroy_worker - destroy a workqueue worker
> * @worker: worker to be destroyed
> + * @list: transfer worker away from its pool->idle_list and into list
> *
> * Destroy @worker and adjust @pool stats accordingly. The worker should
> * be idle.
> @@ -1982,7 +2047,7 @@ static struct worker *create_worker(struct worker_pool *pool)
> * CONTEXT:
> * raw_spin_lock_irq(pool->lock).
> */
> -static void destroy_worker(struct worker *worker)
> +static void destroy_worker(struct worker *worker, struct list_head *list)
> {
> struct worker_pool *pool = worker->pool;
>
> @@ -1997,34 +2062,64 @@ static void destroy_worker(struct worker *worker)
> pool->nr_workers--;
> pool->nr_idle--;
>
> - list_del_init(&worker->entry);
> worker->flags |= WORKER_DIE;
> - wake_up_process(worker->task);
> +
> + list_move(&worker->entry, list);
> }
>
> -static void idle_worker_timeout(struct timer_list *t)
> +/**
> + * idle_reaper_fn - reap workers that have been idle for too long.
> + *
> + * Unbinding marked-for-destruction workers requires a sleepable context, as
> + * changing a task's affinity is not an atomic operation, and we don't want
> + * to disturb isolated CPUs IDLE_WORKER_TIMEOUT in the future just for a kworker
> + * to do_exit().
> + *
> + * Percpu kworkers should meet the conditions for the affinity change to not
> + * block (not migration-disabled and not running), but there is no *hard*
> + * guarantee that they are not running when we get here.
> + *
> + * The delayed_work is only ever modified under raw_spin_lock_irq(pool->lock).
> + */
> +static void idle_reaper_fn(struct work_struct *work)
> {
> - struct worker_pool *pool = from_timer(pool, t, idle_timer);
> + struct delayed_work *dwork = to_delayed_work(work);
> + struct worker_pool *pool = container_of(dwork, struct worker_pool, idle_reaper_work);
> + struct list_head reaplist;
> + struct worker *worker;
> +
> + INIT_LIST_HEAD(&reaplist);
>
> raw_spin_lock_irq(&pool->lock);
>
> while (too_many_workers(pool)) {
> - struct worker *worker;
> unsigned long expires;
> + unsigned long now = jiffies;
>
> /* idle_list is kept in LIFO order, check the last one */
> worker = list_entry(pool->idle_list.prev, struct worker, entry);
> expires = worker->last_active + IDLE_WORKER_TIMEOUT;
>
> - if (time_before(jiffies, expires)) {
> - mod_timer(&pool->idle_timer, expires);
> + /*
> + * Careful: queueing a work item from here can and will cause a
> + * self-deadlock when dealing with an unbound pool. However,
> + * here the delay *cannot* be zero and *has* to be in the
> + * future, which works.
> + */
> + if (time_before(now, expires)) {

IMHO, using raw_spin_unlock_irq(&pool->lock) here is better than
violating locking rules *overtly* and documenting that the rule cannot
really be violated. But it would bring a "goto" statement.
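
A rough sketch of that shape (hypothetical, not what was posted):

		if (time_before(now, expires)) {
			raw_spin_unlock_irq(&pool->lock);
			/* pool->lock is no longer held, queueing is fine */
			mod_delayed_work(system_unbound_wq,
					 &pool->idle_reaper_work,
					 expires - now);
			goto reap;
		}

		destroy_worker(worker, &reaplist);
	}
	raw_spin_unlock_irq(&pool->lock);
reap:
	mutex_lock(&wq_pool_mutex);
	reap_workers(&reaplist);
	mutex_unlock(&wq_pool_mutex);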

> + mod_delayed_work(system_unbound_wq,
> + &pool->idle_reaper_work,
> + expires - now);
> break;
> }
>
> - destroy_worker(worker);
> + destroy_worker(worker, &reaplist);
> }
> -
> raw_spin_unlock_irq(&pool->lock);
> +
> + mutex_lock(&wq_pool_mutex);
> + reap_workers(&reaplist);
> + mutex_unlock(&wq_pool_mutex);
> }
>
> static void send_mayday(struct work_struct *work)
> @@ -2388,6 +2483,9 @@ static int worker_thread(void *__worker)
> /* am I supposed to die? */
> if (unlikely(worker->flags & WORKER_DIE)) {
> raw_spin_unlock_irq(&pool->lock);
> +
> + worker_wait_reaped(worker);
> +
> WARN_ON_ONCE(!list_empty(&worker->entry));
> set_pf_worker(false);
>
> @@ -3454,7 +3552,7 @@ static int init_worker_pool(struct worker_pool *pool)
> INIT_LIST_HEAD(&pool->idle_list);
> hash_init(pool->busy_hash);
>
> - timer_setup(&pool->idle_timer, idle_worker_timeout, TIMER_DEFERRABLE);
> + INIT_DEFERRABLE_WORK(&pool->idle_reaper_work, idle_reaper_fn);
>
> timer_setup(&pool->mayday_timer, pool_mayday_timeout, 0);
>
> @@ -3559,8 +3657,11 @@ static bool wq_manager_inactive(struct worker_pool *pool)
> static void put_unbound_pool(struct worker_pool *pool)
> {
> DECLARE_COMPLETION_ONSTACK(detach_completion);
> + struct list_head reaplist;
> struct worker *worker;
>
> + INIT_LIST_HEAD(&reaplist);
> +
> lockdep_assert_held(&wq_pool_mutex);
>
> if (--pool->refcnt)
> @@ -3588,10 +3689,12 @@ static void put_unbound_pool(struct worker_pool *pool)
> pool->flags |= POOL_MANAGER_ACTIVE;
>
> while ((worker = first_idle_worker(pool)))
> - destroy_worker(worker);
> + destroy_worker(worker, &reaplist);
> WARN_ON(pool->nr_workers || pool->nr_idle);
> raw_spin_unlock_irq(&pool->lock);
>
> + reap_workers(&reaplist);
> +
> mutex_lock(&wq_pool_attach_mutex);
> if (!list_empty(&pool->workers))
> pool->detach_completion = &detach_completion;
> @@ -3601,7 +3704,7 @@ static void put_unbound_pool(struct worker_pool *pool)
> wait_for_completion(pool->detach_completion);
>
> /* shut down the timers */
> - del_timer_sync(&pool->idle_timer);
> + cancel_delayed_work_sync(&pool->idle_reaper_work);
> del_timer_sync(&pool->mayday_timer);
>
> /* RCU protected to allow dereferences from get_work_pool() */
> @@ -4999,13 +5102,8 @@ static void unbind_workers(int cpu)
>
> raw_spin_unlock_irq(&pool->lock);
>
> - for_each_pool_worker(worker, pool) {
> - kthread_set_per_cpu(worker->task, -1);
> - if (cpumask_intersects(wq_unbound_cpumask, cpu_active_mask))
> - WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);
> - else
> - WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
> - }
> + for_each_pool_worker(worker, pool)
> + unbind_worker(worker);
>
> mutex_unlock(&wq_pool_attach_mutex);
> }
> @@ -5030,11 +5128,8 @@ static void rebind_workers(struct worker_pool *pool)
> * of all workers first and then clear UNBOUND. As we're called
> * from CPU_ONLINE, the following shouldn't fail.
> */
> - for_each_pool_worker(worker, pool) {
> - kthread_set_per_cpu(worker->task, pool->cpu);
> - WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
> - pool->attrs->cpumask) < 0);
> - }
> + for_each_pool_worker(worker, pool)
> + rebind_worker(worker, pool);


It is better to skip the workers which are WORKER_DIE.
Or just detach the worker when reaping it.
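
For instance (a hypothetical tweak, glossing over how worker->flags would be
read safely there):

	for_each_pool_worker(worker, pool) {
		if (worker->flags & WORKER_DIE)
			continue;
		rebind_worker(worker, pool);
	}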

>
> raw_spin_lock_irq(&pool->lock);
>
> diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
> index e00b1204a8e9..a3d60e10a76f 100644
> --- a/kernel/workqueue_internal.h
> +++ b/kernel/workqueue_internal.h
> @@ -46,6 +46,7 @@ struct worker {
> unsigned int flags; /* X: flags */
> int id; /* I: worker id */
> int sleeping; /* None */
> + int reaped; /* None */
>
> /*
> * Opaque string set with work_set_desc(). Printed out with task
> --
> 2.31.1
>

2022-08-05 16:51:40

by Valentin Schneider

Subject: Re: [RFC PATCH v3 2/3] workqueue: Unbind workers before sending them to exit()

On 05/08/22 11:16, Lai Jiangshan wrote:
> On Tue, Aug 2, 2022 at 4:42 PM Valentin Schneider <[email protected]> wrote:
>> +/*
>> + * Unlikely as it may be, a worker could wake after destroy_worker() has
>> + * happened but before reap_workers(). WORKER_DIE would be set in worker->flags,
>> + * so it would be able to kfree(worker) and head out to do_exit().
>> + *
>> + * Rather than make the reaper wait for each to-be-reaped kworker to exit and
>> + * kfree(worker) itself, make the kworkers (which have nothing to do but go
>> + * do_exit() anyway) wait for the reaper to be done with them.
>> + */
>> +static void worker_wait_reaped(struct worker *worker)
>> +{
>> + WARN_ON_ONCE(current != worker->task);
>> +
>> + for (;;) {
>> + set_current_state(TASK_INTERRUPTIBLE);
>> + if (READ_ONCE(worker->reaped))
>> + break;
>> + schedule();
>> + }
>> + __set_current_state(TASK_RUNNING);
>> +}
>
>
> It is not a good idea to add this scheduler-ist code here.
>
> Using wq_pool_attach_mutex to protects the whole body of idle_reaper_fn()
> can stop the worker from freeing itself since the worker has to
> get the mutex before exiting.
>

Right, there's worker_detach_from_pool() before kfree(worker), hadn't
thought of that. I want to limit how many locks I'm hoarding with the
reaper, but given that one is for attach/detach I think that's OK - and I
also really don't like this worker_wait_reaped() function, so will be happy
to get rid of it. I'll give this a try, thanks!

> And I don't think batching destruction is a good idea since
> it is not a hot path.
>

The batching is mostly there because checking & removing a worker from its
pool->idle_list has to be done under pool->lock, but changing its affinity
requires a sleepable context, so I batched that outside of the spinlock
section.

>> while (too_many_workers(pool)) {
>> - struct worker *worker;
>> unsigned long expires;
>> + unsigned long now = jiffies;
>>
>> /* idle_list is kept in LIFO order, check the last one */
>> worker = list_entry(pool->idle_list.prev, struct worker, entry);
>> expires = worker->last_active + IDLE_WORKER_TIMEOUT;
>>
>> - if (time_before(jiffies, expires)) {
>> - mod_timer(&pool->idle_timer, expires);
>> + /*
>> + * Careful: queueing a work item from here can and will cause a
>> + * self-deadlock when dealing with an unbound pool. However,
>> + * here the delay *cannot* be zero and *has* to be in the
>> + * future, which works.
>> + */
>> + if (time_before(now, expires)) {
>
> IMHO, using raw_spin_unlock_irq(&pool->lock) here is better than
> violating locking rules *overtly* and documenting that it can not be
> really violated. But It would bring a "goto" statement.

I was worried about serializing accesses to pool->idle_reaper_work and its
underlying timer (worker_enter_idle() vs idle_reaper_fn()), though I think
the worst that can happen if idle_reaper_fn() does that without holding
pool->lock is worker_enter_idle() pushing back the timer to
IDLE_WORKER_TIMEOUT (rather than (last_active + IDLE_WORKER_TIMEOUT) -
now).

>> + mod_delayed_work(system_unbound_wq,
>> + &pool->idle_reaper_work,
>> + expires - now);
>> break;
>> }

>> @@ -5030,11 +5128,8 @@ static void rebind_workers(struct worker_pool *pool)
>> * of all workers first and then clear UNBOUND. As we're called
>> * from CPU_ONLINE, the following shouldn't fail.
>> */
>> - for_each_pool_worker(worker, pool) {
>> - kthread_set_per_cpu(worker->task, pool->cpu);
>> - WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
>> - pool->attrs->cpumask) < 0);
>> - }
>> + for_each_pool_worker(worker, pool)
>> + rebind_worker(worker, pool);
>
>
> It is better to skip the workers which are WORKER_DIE.
> Or just detach the worker when reaping it.

Hadn't even thought about this racing with to-be-destroyed workers. Having
worker_detach_from_pool() done by the worker itself is convenient for the
serialization with wq_pool_attach_mutex as you suggested, let me scratch my
head some more.

>
>>
>> raw_spin_lock_irq(&pool->lock);
>>
>> diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
>> index e00b1204a8e9..a3d60e10a76f 100644
>> --- a/kernel/workqueue_internal.h
>> +++ b/kernel/workqueue_internal.h
>> @@ -46,6 +46,7 @@ struct worker {
>> unsigned int flags; /* X: flags */
>> int id; /* I: worker id */
>> int sleeping; /* None */
>> + int reaped; /* None */
>>
>> /*
>> * Opaque string set with work_set_desc(). Printed out with task
>> --
>> 2.31.1
>>

2022-08-16 03:40:29

by Tejun Heo

Subject: Re: [RFC PATCH v3 1/3] workqueue: Hold wq_pool_mutex while affining tasks to wq_unbound_cpumask

On Wed, Aug 03, 2022 at 11:40:28AM +0800, Lai Jiangshan wrote:
> From df7b4672db4dfd3e480b1873b9d346e8a7dfc69f Mon Sep 17 00:00:00 2001
> From: Lai Jiangshan <[email protected]>
> Date: Wed, 3 Aug 2022 10:52:04 +0800
> Subject: [PATCH] workqueue: Protects wq_unbound_cpumask with
> wq_pool_attach_mutex
>
> When unbind_workers() reads wq_unbound_cpumask to set the affinity of
> freshly-unbound kworkers, it only holds wq_pool_attach_mutex. This isn't
> sufficient as wq_unbound_cpumask is only protected by wq_pool_mutex.
>
> Make wq_unbound_cpumask protected with wq_pool_attach_mutex and also
> remove the need of temporary saved_cpumask.
>
> Fixes: 10a5a651e3af ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs")
> Reported-by: Valentin Schneider <[email protected]>
> Signed-off-by: Lai Jiangshan <[email protected]>

This patch looks fine to me but is whitespace corrupted (two leading spaces
on all context lines). Can you please resend?

Thanks.

--
tejun

2022-08-18 15:15:41

by Lai Jiangshan

Subject: [PATCH] workqueue: Protects wq_unbound_cpumask with wq_pool_attach_mutex

From: Lai Jiangshan <[email protected]>

When unbind_workers() reads wq_unbound_cpumask to set the affinity of
freshly-unbound kworkers, it only holds wq_pool_attach_mutex. This isn't
sufficient as wq_unbound_cpumask is only protected by wq_pool_mutex.

Make wq_unbound_cpumask protected with wq_pool_attach_mutex and also
remove the need of temporary saved_cpumask.

Fixes: 10a5a651e3af ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs")
Reported-by: Valentin Schneider <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 41 ++++++++++++++++-------------------------
1 file changed, 16 insertions(+), 25 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6b2b66940530..eaea73e7e365 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -327,7 +327,7 @@ static struct rcuwait manager_wait = __RCUWAIT_INITIALIZER(manager_wait);
static LIST_HEAD(workqueues); /* PR: list of all workqueues */
static bool workqueue_freezing; /* PL: have wqs started freezing? */

-/* PL: allowable cpus for unbound wqs and work items */
+/* PL&A: allowable cpus for unbound wqs and work items */
static cpumask_var_t wq_unbound_cpumask;

/* CPU where unbound work was last round robin scheduled from this CPU */
@@ -3933,7 +3933,8 @@ static void apply_wqattrs_cleanup(struct apply_wqattrs_ctx *ctx)
/* allocate the attrs and pwqs for later installation */
static struct apply_wqattrs_ctx *
apply_wqattrs_prepare(struct workqueue_struct *wq,
- const struct workqueue_attrs *attrs)
+ const struct workqueue_attrs *attrs,
+ const cpumask_var_t unbound_cpumask)
{
struct apply_wqattrs_ctx *ctx;
struct workqueue_attrs *new_attrs, *tmp_attrs;
@@ -3949,14 +3950,15 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
goto out_free;

/*
- * Calculate the attrs of the default pwq.
+ * Calculate the attrs of the default pwq with unbound_cpumask
+ * which is wq_unbound_cpumask or to set to wq_unbound_cpumask.
* If the user configured cpumask doesn't overlap with the
* wq_unbound_cpumask, we fallback to the wq_unbound_cpumask.
*/
copy_workqueue_attrs(new_attrs, attrs);
- cpumask_and(new_attrs->cpumask, new_attrs->cpumask, wq_unbound_cpumask);
+ cpumask_and(new_attrs->cpumask, new_attrs->cpumask, unbound_cpumask);
if (unlikely(cpumask_empty(new_attrs->cpumask)))
- cpumask_copy(new_attrs->cpumask, wq_unbound_cpumask);
+ cpumask_copy(new_attrs->cpumask, unbound_cpumask);

/*
* We may create multiple pwqs with differing cpumasks. Make a
@@ -4053,7 +4055,7 @@ static int apply_workqueue_attrs_locked(struct workqueue_struct *wq,
wq->flags &= ~__WQ_ORDERED;
}

- ctx = apply_wqattrs_prepare(wq, attrs);
+ ctx = apply_wqattrs_prepare(wq, attrs, wq_unbound_cpumask);
if (!ctx)
return -ENOMEM;

@@ -5311,7 +5313,7 @@ void thaw_workqueues(void)
}
#endif /* CONFIG_FREEZER */

-static int workqueue_apply_unbound_cpumask(void)
+static int workqueue_apply_unbound_cpumask(const cpumask_var_t unbound_cpumask)
{
LIST_HEAD(ctxs);
int ret = 0;
@@ -5327,7 +5329,7 @@ static int workqueue_apply_unbound_cpumask(void)
if (wq->flags & __WQ_ORDERED)
continue;

- ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs);
+ ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs, unbound_cpumask);
if (!ctx) {
ret = -ENOMEM;
break;
@@ -5342,6 +5344,11 @@ static int workqueue_apply_unbound_cpumask(void)
apply_wqattrs_cleanup(ctx);
}

+ if (!ret) {
+ mutex_lock(&wq_pool_attach_mutex);
+ cpumask_copy(wq_unbound_cpumask, unbound_cpumask);
+ mutex_unlock(&wq_pool_attach_mutex);
+ }
return ret;
}

@@ -5360,7 +5367,6 @@ static int workqueue_apply_unbound_cpumask(void)
int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
{
int ret = -EINVAL;
- cpumask_var_t saved_cpumask;

/*
* Not excluding isolated cpus on purpose.
@@ -5374,23 +5380,8 @@ int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
goto out_unlock;
}

- if (!zalloc_cpumask_var(&saved_cpumask, GFP_KERNEL)) {
- ret = -ENOMEM;
- goto out_unlock;
- }
-
- /* save the old wq_unbound_cpumask. */
- cpumask_copy(saved_cpumask, wq_unbound_cpumask);
-
- /* update wq_unbound_cpumask at first and apply it to wqs. */
- cpumask_copy(wq_unbound_cpumask, cpumask);
- ret = workqueue_apply_unbound_cpumask();
-
- /* restore the wq_unbound_cpumask when failed. */
- if (ret < 0)
- cpumask_copy(wq_unbound_cpumask, saved_cpumask);
+ ret = workqueue_apply_unbound_cpumask(cpumask);

- free_cpumask_var(saved_cpumask);
out_unlock:
apply_wqattrs_unlock();
}
--
2.19.1.6.gb485710b

2022-08-27 00:49:36

by Tejun Heo

Subject: Re: [PATCH] workqueue: Protects wq_unbound_cpumask with wq_pool_attach_mutex

Hello,

On Thu, Aug 18, 2022 at 10:33:48PM +0800, Lai Jiangshan wrote:
> @@ -5342,6 +5344,11 @@ static int workqueue_apply_unbound_cpumask(void)
> apply_wqattrs_cleanup(ctx);
> }
>
> + if (!ret) {
> + mutex_lock(&wq_pool_attach_mutex);
> + cpumask_copy(wq_unbound_cpumask, unbound_cpumask);
> + mutex_unlock(&wq_pool_attach_mutex);

Is this enough? Shouldn't the lock be protecting a wider scope? If there's
someone reading the flag with just pool_attach_mutex, what prevents them
reading it right before the new value is committed and keeps using the stale
value?

Thanks.

--
tejun

2022-08-30 09:40:13

by Lai Jiangshan

Subject: Re: [PATCH] workqueue: Protects wq_unbound_cpumask with wq_pool_attach_mutex

On Sat, Aug 27, 2022 at 8:33 AM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Thu, Aug 18, 2022 at 10:33:48PM +0800, Lai Jiangshan wrote:
> > @@ -5342,6 +5344,11 @@ static int workqueue_apply_unbound_cpumask(void)
> > apply_wqattrs_cleanup(ctx);
> > }
> >
> > + if (!ret) {
> > + mutex_lock(&wq_pool_attach_mutex);
> > + cpumask_copy(wq_unbound_cpumask, unbound_cpumask);
> > + mutex_unlock(&wq_pool_attach_mutex);
>
> Is this enough? Shouldn't the lock be protecting a wider scope? If there's
> someone reading the flag with just pool_attach_mutex, what prevents them
> reading it right before the new value is committed and keeps using the stale
> value?

Which "flag"? wq_unbound_cpumask?

This code is adding protection for wq_unbound_cpumask and makes
unbind_workers() use a stable version of wq_unbound_cpumask during
operation.

It doesn't really matter if the pool's mask becomes stale again later
with respect to wq_unbound_cpumask.

No code has ensured the disassociated pool's mask is kept in sync with the
newest wq_unbound_cpumask since 10a5a651e3af ("workqueue: Restrict kworker
in the offline CPU pool running on housekeeping CPUs") first used
wq_unbound_cpumask for the disassociated pools.

What matters is that the pool's mask should be the wq_unbound_cpumask at
the time the pool becomes disassociated, which contains no isolated CPUs.

I don't like 10a5a651e3af for not keeping the pool's mask in sync with
wq_unbound_cpumask, but I think it works anyway.

>
> Thanks.
>
> --
> tejun

2022-08-30 15:09:46

by Lai Jiangshan

Subject: Re: [RFC PATCH v3 1/3] workqueue: Hold wq_pool_mutex while affining tasks to wq_unbound_cpumask

On Tue, Aug 2, 2022 at 4:42 PM Valentin Schneider <[email protected]> wrote:
>
> When unbind_workers() reads wq_unbound_cpumask to set the affinity of
> freshly-unbound kworkers, it only holds wq_pool_attach_mutex. This isn't
> sufficient as wq_unbound_cpumask is only protected by wq_pool_mutex.
>

Hello Valentin,

Updating wq_unbound_cpumask requires cpus_read_lock(), and
unbind_workers() is in the CPU hotplug path, so that is sufficient
protection for accessing wq_unbound_cpumask in unbind_workers().

The extra protection is only required when the logic is also moved to
destroy_worker().

Thanks
Lai

2022-09-04 20:52:10

by Tejun Heo

Subject: Re: [PATCH] workqueue: Protects wq_unbound_cpumask with wq_pool_attach_mutex

Hello,

On Tue, Aug 30, 2022 at 05:32:17PM +0800, Lai Jiangshan wrote:
> > Is this enough? Shouldn't the lock be protecting a wider scope? If there's
> > someone reading the flag with just pool_attach_mutex, what prevents them
> > reading it right before the new value is committed and keeps using the stale
> > value?
>
> Which "flag"? wq_unbound_cpumask?

Oh, yeah, sorry.

> This code is adding protection for wq_unbound_cpumask and makes
> unbind_workers() use a stable version of wq_unbound_cpumask during
> operation.
>
> It doesn't really matter if pool's mask becomes stale later again
> with respect to wq_unbound_cpumask.
>
> No code ensures the disassociated pool's mask is kept with the newest
> wq_unbound_cpumask since the 10a5a651e3af ("workqueue: Restrict kworker
> in the offline CPU pool running on housekeeping CPUs") first uses
> wq_unbound_cpumask for the disassociated pools.
>
> What matters is that the pool's mask should the wq_unbound_cpumask
> at the time when it becomes disassociated which has no isolated CPUs.
>
> I don't like 10a5a651e3af for it not synching the pool's mask
> with wq_unbound_cpumask. But I think it works anyway.

Hmm... I see. Can you add a comment explaining why we're grabbing
wq_pool_attach_mutex there?

Thanks.

--
tejun