2023-05-10 03:09:31

by Tejun Heo

Subject: [PATCH 4/6] workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE

If a per-cpu work item hogs the CPU, concurrency management keeps other
work items on the same CPU from starting. A per-cpu workqueue which intends
to host such CPU-hogging work items can opt out of concurrency management
by setting %WQ_CPU_INTENSIVE; however, this is error-prone and, when the
flag is missed, difficult to debug.
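
For reference, the manual opt-out that this makes less necessary looks
roughly like the following (illustrative sketch only; the workqueue and
work-function names here are made up):

#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;

static void heavy_work_fn(struct work_struct *work)
{
        /* long-running, CPU-bound processing */
}
static DECLARE_WORK(heavy_work, heavy_work_fn);

static int __init example_init(void)
{
        /* Opt this per-cpu workqueue out of concurrency management. */
        example_wq = alloc_workqueue("example_wq", WQ_CPU_INTENSIVE, 0);
        if (!example_wq)
                return -ENOMEM;

        queue_work(example_wq, &heavy_work);
        return 0;
}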

This patch adds automatic detection based on CPU usage. If a
concurrency-managed work item consumes more CPU time than the threshold
(5ms by default), it is marked CPU_INTENSIVE automatically on schedule-out.
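
As an illustration (hypothetical example, not part of the patch), a per-cpu
work item like the one below, which busy-waits for ~10ms on a regular
concurrency-managed workqueue, would be flagged CPU_INTENSIVE at its first
schedule-out after consuming ~5ms of CPU time, instead of stalling other
per-cpu work items for its whole runtime:

#include <linux/delay.h>
#include <linux/workqueue.h>

static void hog_fn(struct work_struct *work)
{
        /* Busy-wait for ~10ms without sleeping, hogging this CPU. */
        mdelay(10);
}
static DECLARE_WORK(hog_work, hog_fn);

static void queue_hog(void)
{
        /* system_wq is per-cpu and concurrency-managed by default. */
        schedule_work(&hog_work);
}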

The mechanism isn't foolproof in that the 5ms detection delays can add up if
many CPU-hogging work items are queued at the same time. However, in such
situations, the bigger problem is likely that the CPU is saturated with
per-cpu work items, and the solution would be to make them UNBOUND.

For occasional CPU hogging, the new automatic mechanism should provide
reasonable protection with minimal increase in code complexity.

v2: Lai pointed out that wq_worker_stopping() also needs to be called from
    the preemption and rtlock paths; an earlier patch was updated
    accordingly. This patch adds a comment describing the risk of infinite
    recursion and how it is avoided.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
kernel/workqueue.c | 82 +++++++++++++++++++++++++++----------
kernel/workqueue_internal.h | 1 +
2 files changed, 61 insertions(+), 22 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 31f1618d98c2..b63bbd22f756 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -310,6 +310,14 @@ static struct kmem_cache *pwq_cache;
static cpumask_var_t *wq_numa_possible_cpumask;
/* possible CPUs of each node */

+/*
+ * Per-cpu work items which run for longer than the following threshold are
+ * automatically considered CPU intensive and excluded from concurrency
+ * management to prevent them from noticeably delaying other per-cpu work items.
+ */
+static unsigned long wq_cpu_intensive_thresh_us = 5000;
+module_param_named(cpu_intensive_thresh_us, wq_cpu_intensive_thresh_us, ulong, 0644);
+
static bool wq_disable_numa;
module_param_named(disable_numa, wq_disable_numa, bool, 0444);

@@ -963,9 +971,6 @@ void notrace wq_worker_stopping(struct task_struct *task, bool voluntary)
struct worker *worker = kthread_data(task);
struct worker_pool *pool;

- if (!voluntary || task_is_running(task))
- return;
-
/*
* Rescuers, which may not have all the fields set up like normal
* workers, also reach here, let's not access anything before
@@ -976,24 +981,54 @@ void notrace wq_worker_stopping(struct task_struct *task, bool voluntary)

pool = worker->pool;

- /* Return if preempted before wq_worker_running() was reached */
- if (worker->sleeping)
- return;
+ if (!voluntary || task_is_running(task)) {
+ /*
+ * Concurrency-managed @worker is still RUNNING. See if the
+ * current work is hogging CPU stalling other per-cpu work
+ * items. If so, mark @worker CPU_INTENSIVE to exclude it from
+ * concurrency management. @worker->current_* are stable as they
+ * can only be modified by @task which is %current.
+ */
+ if (!worker->current_work ||
+ task->se.sum_exec_runtime - worker->current_at <
+ wq_cpu_intensive_thresh_us * NSEC_PER_USEC)
+ return;

- worker->sleeping = 1;
- raw_spin_lock_irq(&pool->lock);
+ /*
+ * As this function may be called from preempt_enable_notrace(),
+ * all operations up to this point must not be traceable to avoid
+ * infinite recursions. We're safe once IRQ is disabled.
+ */
+ raw_spin_lock_irq(&pool->lock);
+ worker_set_flags(worker, WORKER_CPU_INTENSIVE);
+ } else {
+ /*
+ * Concurrency-managed @worker is sleeping. Decrement the
+ * associated pool's nr_running accordingly.
+ */

- /*
- * Recheck in case unbind_workers() preempted us. We don't
- * want to decrement nr_running after the worker is unbound
- * and nr_running has been reset.
- */
- if (worker->flags & WORKER_NOT_RUNNING) {
- raw_spin_unlock_irq(&pool->lock);
- return;
+ /* Return if preempted before wq_worker_running() was reached */
+ if (worker->sleeping)
+ return;
+
+ worker->sleeping = 1;
+ raw_spin_lock_irq(&pool->lock);
+
+ /*
+ * Recheck in case unbind_workers() preempted us. We don't want
+ * to decrement nr_running after the worker is unbound and
+ * nr_running has been reset. As a running worker never sleeps
+ * in between work items, we know that @worker->current_* are
+ * valid after the following check.
+ */
+ if (worker->flags & WORKER_NOT_RUNNING) {
+ raw_spin_unlock_irq(&pool->lock);
+ return;
+ }
+
+ pool->nr_running--;
}

- pool->nr_running--;
if (need_more_worker(pool))
wake_up_worker(pool);
raw_spin_unlock_irq(&pool->lock);
@@ -2311,7 +2346,6 @@ __acquires(&pool->lock)
{
struct pool_workqueue *pwq = get_work_pwq(work);
struct worker_pool *pool = worker->pool;
- bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
unsigned long work_data;
struct worker *collision;
#ifdef CONFIG_LOCKDEP
@@ -2348,6 +2382,7 @@ __acquires(&pool->lock)
worker->current_work = work;
worker->current_func = work->func;
worker->current_pwq = pwq;
+ worker->current_at = worker->task->se.sum_exec_runtime;
work_data = *work_data_bits(work);
worker->current_color = get_work_color(work_data);

@@ -2365,7 +2400,7 @@ __acquires(&pool->lock)
* of concurrency management and the next code block will chain
* execution of the pending work items.
*/
- if (unlikely(cpu_intensive))
+ if (unlikely(pwq->wq->flags & WQ_CPU_INTENSIVE))
worker_set_flags(worker, WORKER_CPU_INTENSIVE);

/*
@@ -2443,9 +2478,12 @@ __acquires(&pool->lock)

raw_spin_lock_irq(&pool->lock);

- /* clear cpu intensive status */
- if (unlikely(cpu_intensive))
- worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
+ /*
+ * In addition to %WQ_CPU_INTENSIVE, @worker may also have been marked
+ * CPU intensive by wq_worker_stopping() if @work consumed more than
+ * wq_cpu_intensive_thresh_us. Clear it.
+ */
+ worker_clr_flags(worker, WORKER_CPU_INTENSIVE);

/* tag the worker for identification in schedule() */
worker->last_func = worker->current_func;
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index a20b4d052a45..a1f012accce2 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -31,6 +31,7 @@ struct worker {
struct work_struct *current_work; /* L: work being processed */
work_func_t current_func; /* L: current_work's fn */
struct pool_workqueue *current_pwq; /* L: current_work's pwq */
+ u64 current_at; /* L: runtime when current_work started */
unsigned int current_color; /* L: current_work's color */
int sleeping; /* None */

--
2.40.1


2023-05-10 15:40:04

by Peter Zijlstra

Subject: Re: [PATCH 4/6] workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE

On Tue, May 09, 2023 at 05:07:50PM -1000, Tejun Heo wrote:

> @@ -976,24 +981,54 @@ void notrace wq_worker_stopping(struct task_struct *task, bool voluntary)
>
> pool = worker->pool;
>
> - /* Return if preempted before wq_worker_running() was reached */
> - if (worker->sleeping)
> - return;
> + if (!voluntary || task_is_running(task)) {
> + /*
> + * Concurrency-managed @worker is still RUNNING. See if the
> + * current work is hogging CPU stalling other per-cpu work
> + * items. If so, mark @worker CPU_INTENSIVE to exclude it from
> + * concurrency management. @worker->current_* are stable as they
> + * can only be modified by @task which is %current.
> + */
> + if (!worker->current_work ||
> + task->se.sum_exec_runtime - worker->current_at <
> + wq_cpu_intensive_thresh_us * NSEC_PER_USEC)
> + return;
>

> @@ -2348,6 +2382,7 @@ __acquires(&pool->lock)
> worker->current_work = work;
> worker->current_func = work->func;
> worker->current_pwq = pwq;
> + worker->current_at = worker->task->se.sum_exec_runtime;

That only gets updated at scheduling events; it's not a 'running' clock.

> work_data = *work_data_bits(work);
> worker->current_color = get_work_color(work_data);
>


Anyway, it occurs to me that if all you want is to measure long running
works, then would it not be much easier to simply forward the tick?

Something like the below. Then this tick handler (which will have just
updated ->sum_exec_runtime, BTW) can do the above 'work-be-long-running'
check.

Or am I missing something? Seems simpler than hijacking preempt-out.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 944c3ae39861..3484cada9a4a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5632,6 +5632,9 @@ void scheduler_tick(void)

perf_event_task_tick();

+ if (curr->flags & PF_WQ_WORKER)
+ wq_worker_tick(curr);
+
#ifdef CONFIG_SMP
rq->idle_balance = idle_cpu(cpu);
trigger_load_balance(rq);
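
A rough sketch of what the workqueue side of such a hook might look like,
reusing the threshold check from the patch above (illustrative only, not
code from the posted series):

/* Sketch: called from scheduler_tick() for PF_WQ_WORKER tasks. */
void wq_worker_tick(struct task_struct *task)
{
        struct worker *worker = kthread_data(task);
        struct worker_pool *pool = worker->pool;

        /* No work running, or already excluded: nothing to do. */
        if (!worker->current_work ||
            (worker->flags & WORKER_CPU_INTENSIVE))
                return;

        /* Same CPU-time threshold as the preempt-out based check. */
        if (task->se.sum_exec_runtime - worker->current_at <
            wq_cpu_intensive_thresh_us * NSEC_PER_USEC)
                return;

        /* Tick context: IRQs are already disabled. */
        raw_spin_lock(&pool->lock);
        worker_set_flags(worker, WORKER_CPU_INTENSIVE);
        if (need_more_worker(pool))
                wake_up_worker(pool);
        raw_spin_unlock(&pool->lock);
}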

2023-05-10 16:21:46

by Tejun Heo

Subject: Re: [PATCH 4/6] workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE

Hello, Peter.

On Wed, May 10, 2023 at 05:09:46PM +0200, Peter Zijlstra wrote:
> > @@ -2348,6 +2382,7 @@ __acquires(&pool->lock)
> > worker->current_work = work;
> > worker->current_func = work->func;
> > worker->current_pwq = pwq;
> > + worker->current_at = worker->task->se.sum_exec_runtime;
>
> That only gets updated at scheduling events, it's not a 'running' clock.

I think it gets updated on each tick and at preemption checks, right? That
should be more than enough here. Just using jiffies would be fine on higher
HZ machines; however, when the threshold isn't significantly longer than a
tick, it becomes a bit problematic. Reading the highres clock separately is
an option but a bit more expensive. So, sum_exec_runtime seems to fit pretty
well. FWIW, it's already used outside the scheduler proper too.
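
For reference, sum_exec_runtime is accumulated by CFS in update_curr(),
which runs on every tick for the running task as well as on preemption
checks and at switch-out; a simplified paraphrase of kernel/sched/fair.c
(stats, vruntime and bandwidth accounting elided):

static void update_curr(struct cfs_rq *cfs_rq)
{
        struct sched_entity *curr = cfs_rq->curr;
        u64 now = rq_clock_task(rq_of(cfs_rq));
        u64 delta_exec;

        if (unlikely(!curr))
                return;

        delta_exec = now - curr->exec_start;
        if (unlikely((s64)delta_exec <= 0))
                return;
        curr->exec_start = now;

        /* The counter sampled above via task->se.sum_exec_runtime. */
        curr->sum_exec_runtime += delta_exec;

        /* ... further accounting elided ... */
}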

> > work_data = *work_data_bits(work);
> > worker->current_color = get_work_color(work_data);
> >
>
> Anyway, it occurs to me that if all you want is to measure long running
> works, then would it not be much easier to simply forward the tick?

Ah, that reminds me that I forgot to update the origin timestamp on
sleeping. It should be tracking the continuous CPU consumption between
sleeps.
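
One possible shape of that follow-up, assuming the reset is done when the
worker comes back from sleep in wq_worker_running() (sketch only, not the
actual fix):

void wq_worker_running(struct task_struct *task)
{
        struct worker *worker = kthread_data(task);

        if (!worker->sleeping)
                return;

        preempt_disable();
        if (!(worker->flags & WORKER_NOT_RUNNING))
                worker->pool->nr_running++;
        preempt_enable();

        /*
         * Only uninterrupted CPU consumption should count toward the
         * CPU-intensive threshold, so restart the measurement window
         * after every sleep.
         */
        worker->current_at = worker->task->se.sum_exec_runtime;

        worker->sleeping = 0;
}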

> Something like the below.. Then this tick handler (which will have just
> updated ->sum_exec_runtime BTW) can do that above 'work-be-long-running'
> check.
>
> Or am I missing something? Seems simpler than hijacking preempt-out.

One advantage of doing it from preempt-out is that workqueue can release
other per-cpu work items immediately, without any delay: the moment the
violating work item leaves the CPU is exactly the right point for both
detection and action.

That said, this mechanism doesn't have to be super accurate, so simpler code
has its benefits. I'll take a stab at it.

Thanks.

--
tejun