2023-04-18 20:57:48

by Tejun Heo

Subject: [PATCHSET wq/for-6.5] workqueue: Implement automatic CPU intensive detection and add monitoring

Hello,

To reduce the number of concurrent worker threads, workqueue holds back
starting per-cpu work items while the previous work item stays in the
RUNNING state. As such, a per-cpu work item which consumes a lot of CPU
cycles can stall other per-cpu work items, even if it calls cond_resched()
in all the right places.
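
As a hypothetical illustration (not from this series), a work item like the
following keeps its worker in RUNNING for the whole loop when queued on a
per-cpu workqueue, so the next per-cpu work item on that CPU is held back
even though the CPU itself is yielded regularly:

  /*
   * Hypothetical CPU hog. cond_resched() lets other tasks run but
   * doesn't take the worker out of the RUNNING state, so concurrency
   * management keeps holding back the next per-cpu work item.
   */
  static void hog_workfn(struct work_struct *work)
  {
  	int i;

  	for (i = 0; i < 1000; i++) {
  		crunch_one_chunk(i);	/* hypothetical CPU-bound helper */
  		cond_resched();
  	}
  }
  static DECLARE_WORK(hog_work, hog_workfn);
  /* queued with: queue_work(system_wq, &hog_work); */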

To support per-cpu work items that may occupy the CPU for a substantial
period of time, workqueue has the WQ_CPU_INTENSIVE flag, which exempts work
items issued through the marked workqueue from concurrency management - they're
started immediately and don't block other work items. While this works, it's
error-prone in that a workqueue user can easily forget to set the flag or
set it unnecessarily. Furthermore, the impacts of the wrong flag setting can
be rather indirect and challenging to root-cause.
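
For reference, opting a workqueue out of concurrency management is a single
flag at allocation time - a minimal sketch (workqueue name hypothetical):

  struct workqueue_struct *hog_wq;

  /* work items on this wq don't block other per-cpu work items */
  hog_wq = alloc_workqueue("hog_wq", WQ_CPU_INTENSIVE, 0);
  if (!hog_wq)
  	return -ENOMEM;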

This patchset makes workqueue auto-detect CPU intensive work items based on
CPU consumption. If a work item consumes more than the threshold (5ms by
default) of CPU time, it's automatically marked as CPU intensive when it
gets scheduled out, which unblocks the starting of pending per-cpu work items.
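
The core of the detection, mirroring the hunk in patch 0004 below, is a
simple runtime comparison on the schedule-out path (simplified sketch):

  /* has the current work item run for at least the threshold? */
  if (worker->current_work &&
      task->se.sum_exec_runtime - worker->current_at >=
      wq_cpu_intensive_thresh_us * NSEC_PER_USEC)
  	worker_set_flags(worker, WORKER_CPU_INTENSIVE);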

The mechanism isn't foolproof in that the detection delays can add up if
many CPU-hogging work items are queued at the same time. However, in such
situations, the bigger problem likely is the CPU being saturated with
per-cpu work items and the solution would be making them UNBOUND. Future
changes will make UNBOUND workqueues more attractive by improving their
locality behaviors and eventually remove the explicit WQ_CPU_INTENSIVE flag.

While at it, add statistics and a monitoring script. Lack of visibility has
always been a bit of a pain point when debugging workqueue related issues,
and with this change and the more drastic ones planned for workqueue, this is
a good time to address the shortcoming.

This patchset was born out of the discussion in the following thread:

https://lkml.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com

and contains the following five patches:

0001-workqueue-sched-Notify-workqueue-of-scheduling-of-RU.patch
0002-workqueue-Re-order-struct-worker-fields.patch
0003-workqueue-Move-worker_set-clr_flags-upwards.patch
0004-workqueue-Automatically-mark-CPU-hogging-work-items-.patch
0005-workqueue-Add-pwq-stats-and-a-monitoring-script.patch

and also available in the following git branch:

git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git auto-cpu-intensive

diffstat follows. Thanks.

 Documentation/core-api/workqueue.rst |  30 ++++++++
 kernel/sched/core.c                  |  18 +----
 kernel/workqueue.c                   | 221 +++++++++++++++++++++++++++++++++++++++-----------------------
 kernel/workqueue_internal.h          |  14 +--
 tools/workqueue/wq_monitor.py        | 148 +++++++++++++++++++++++++++++++++++
 5 files changed, 333 insertions(+), 98 deletions(-)

--
tejun


2023-04-18 20:58:47

by Tejun Heo

Subject: [PATCH 3/5] workqueue: Move worker_set/clr_flags() upwards

They're going to be used by wq_worker_stopping(). Move them upwards.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 108 ++++++++++++++++++++++-----------------------
1 file changed, 54 insertions(+), 54 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6199fbf10cec..b9e8dc54272d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -859,6 +859,60 @@ static void wake_up_worker(struct worker_pool *pool)
 		wake_up_process(worker->task);
 }
 
+/**
+ * worker_set_flags - set worker flags and adjust nr_running accordingly
+ * @worker: self
+ * @flags: flags to set
+ *
+ * Set @flags in @worker->flags and adjust nr_running accordingly.
+ *
+ * CONTEXT:
+ * raw_spin_lock_irq(pool->lock)
+ */
+static inline void worker_set_flags(struct worker *worker, unsigned int flags)
+{
+	struct worker_pool *pool = worker->pool;
+
+	WARN_ON_ONCE(worker->task != current);
+
+	/* If transitioning into NOT_RUNNING, adjust nr_running. */
+	if ((flags & WORKER_NOT_RUNNING) &&
+	    !(worker->flags & WORKER_NOT_RUNNING)) {
+		pool->nr_running--;
+	}
+
+	worker->flags |= flags;
+}
+
+/**
+ * worker_clr_flags - clear worker flags and adjust nr_running accordingly
+ * @worker: self
+ * @flags: flags to clear
+ *
+ * Clear @flags in @worker->flags and adjust nr_running accordingly.
+ *
+ * CONTEXT:
+ * raw_spin_lock_irq(pool->lock)
+ */
+static inline void worker_clr_flags(struct worker *worker, unsigned int flags)
+{
+	struct worker_pool *pool = worker->pool;
+	unsigned int oflags = worker->flags;
+
+	WARN_ON_ONCE(worker->task != current);
+
+	worker->flags &= ~flags;
+
+	/*
+	 * If transitioning out of NOT_RUNNING, increment nr_running. Note
+	 * that the nested NOT_RUNNING is not a noop. NOT_RUNNING is mask
+	 * of multiple flags, not a single flag.
+	 */
+	if ((flags & WORKER_NOT_RUNNING) && (oflags & WORKER_NOT_RUNNING))
+		if (!(worker->flags & WORKER_NOT_RUNNING))
+			pool->nr_running++;
+}
+
 /**
  * wq_worker_running - a worker is running again
  * @task: task waking up
@@ -964,60 +1018,6 @@ work_func_t wq_worker_last_func(struct task_struct *task)
 	return worker->last_func;
 }
 
-/**
- * worker_set_flags - set worker flags and adjust nr_running accordingly
- * @worker: self
- * @flags: flags to set
- *
- * Set @flags in @worker->flags and adjust nr_running accordingly.
- *
- * CONTEXT:
- * raw_spin_lock_irq(pool->lock)
- */
-static inline void worker_set_flags(struct worker *worker, unsigned int flags)
-{
-	struct worker_pool *pool = worker->pool;
-
-	WARN_ON_ONCE(worker->task != current);
-
-	/* If transitioning into NOT_RUNNING, adjust nr_running. */
-	if ((flags & WORKER_NOT_RUNNING) &&
-	    !(worker->flags & WORKER_NOT_RUNNING)) {
-		pool->nr_running--;
-	}
-
-	worker->flags |= flags;
-}
-
-/**
- * worker_clr_flags - clear worker flags and adjust nr_running accordingly
- * @worker: self
- * @flags: flags to clear
- *
- * Clear @flags in @worker->flags and adjust nr_running accordingly.
- *
- * CONTEXT:
- * raw_spin_lock_irq(pool->lock)
- */
-static inline void worker_clr_flags(struct worker *worker, unsigned int flags)
-{
-	struct worker_pool *pool = worker->pool;
-	unsigned int oflags = worker->flags;
-
-	WARN_ON_ONCE(worker->task != current);
-
-	worker->flags &= ~flags;
-
-	/*
-	 * If transitioning out of NOT_RUNNING, increment nr_running. Note
-	 * that the nested NOT_RUNNING is not a noop. NOT_RUNNING is mask
-	 * of multiple flags, not a single flag.
-	 */
-	if ((flags & WORKER_NOT_RUNNING) && (oflags & WORKER_NOT_RUNNING))
-		if (!(worker->flags & WORKER_NOT_RUNNING))
-			pool->nr_running++;
-}
-
 /**
  * find_worker_executing_work - find worker which is executing a work
  * @pool: pool of interest
--
2.40.0

2023-04-18 21:00:35

by Tejun Heo

Subject: [PATCH 5/5] workqueue: Add pwq->stats[] and a monitoring script

Currently, the only way to peer into workqueue operations is through
tracing. While possible, it isn't easy or convenient to monitor
per-workqueue behaviors over time this way. Let's add pwq->stats[] that
track relevant events and a drgn monitoring script -
tools/workqueue/wq_monitor.py.

It's arguable whether this needs to be configurable. However, it currently
only has seven counters and the runtime overhead shouldn't be noticeable
given that they're on pwq's which are per-cpu on per-cpu workqueues and
per-numa-node on unbound ones. Let's keep it simple for the time being.
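
As a usage note, the script prints a table by default; passing -j makes it
emit the same per-workqueue counters as one python dict per sampling
interval instead (see the argument parsing below).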

Signed-off-by: Tejun Heo <[email protected]>
Cc: Lai Jiangshan <[email protected]>
---
 Documentation/core-api/workqueue.rst |  30 ++++++
 kernel/workqueue.c                   |  29 +++++-
 tools/workqueue/wq_monitor.py        | 148 +++++++++++++++++++++
 3 files changed, 206 insertions(+), 1 deletion(-)
create mode 100755 tools/workqueue/wq_monitor.py

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 8ec4d6270b24..a4bb2208100e 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -348,6 +348,35 @@ Guidelines
   level of locality in wq operations and work item execution.


+Monitoring
+==========
+
+Use tools/workqueue/wq_monitor.py to monitor workqueue operations: ::
+
+  $ tools/workqueue/wq_monitor.py events
+                                 total  infl  CPUtime  CPUhog  CMwake  mayday rescued
+  events                         18545     0      6.1       0       5       -       -
+  events_highpri                     8     0      0.0       0       0       -       -
+  events_long                        3     0      0.0       0       0       -       -
+  events_unbound                 38306     0      0.1       -       -       -       -
+  events_freezable                   0     0      0.0       0       0       -       -
+  events_power_efficient         29598     0      0.2       0       0       -       -
+  events_freezable_power_           10     0      0.0       0       0       -       -
+  sock_diag_events                   0     0      0.0       0       0       -       -
+
+                                 total  infl  CPUtime  CPUhog  CMwake  mayday rescued
+  events                         18548     0      6.1       0       5       -       -
+  events_highpri                     8     0      0.0       0       0       -       -
+  events_long                        3     0      0.0       0       0       -       -
+  events_unbound                 38322     0      0.1       -       -       -       -
+  events_freezable                   0     0      0.0       0       0       -       -
+  events_power_efficient         29603     0      0.2       0       0       -       -
+  events_freezable_power_           10     0      0.0       0       0       -       -
+  sock_diag_events                   0     0      0.0       0       0       -       -
+
+  ...
+
+
 Debugging
 =========
 
@@ -387,6 +416,7 @@ For the second type of problems it should be possible to just check
 The work item's function should be trivially visible in the stack
 trace.
 
+
 Non-reentrance Conditions
 =========================

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d24b887ddd86..c95e80de5ec3 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -195,6 +195,22 @@ struct worker_pool {
 	struct rcu_head		rcu;
 };
 
+/*
+ * Per-pool_workqueue statistics. These can be monitored using
+ * tools/workqueue/wq_monitor.py.
+ */
+enum pool_workqueue_stats {
+	PWQ_STAT_STARTED,	/* work items started execution */
+	PWQ_STAT_COMPLETED,	/* work items completed execution */
+	PWQ_STAT_CPU_TIME,	/* total CPU time consumed */
+	PWQ_STAT_CPU_INTENSIVE,	/* wq_cpu_intensive_thresh_us expirations */
+	PWQ_STAT_CM_WAKEUP,	/* concurrency-management worker wakeups */
+	PWQ_STAT_MAYDAY,	/* maydays to rescuer */
+	PWQ_STAT_RESCUED,	/* linked work items executed by rescuer */
+
+	PWQ_NR_STATS,
+};
+
 /*
  * The per-pool workqueue. While queued, the lower WORK_STRUCT_FLAG_BITS
  * of work_struct->data are used for flags and the remaining high bits
@@ -232,6 +248,8 @@ struct pool_workqueue {
 	struct list_head	pwqs_node;	/* WR: node on wq->pwqs */
 	struct list_head	mayday_node;	/* MD: node on wq->maydays */
 
+	u64			stats[PWQ_NR_STATS];
+
 	/*
 	 * Release of unbound pwq is punted to system_wq. See put_pwq()
 	 * and pwq_unbound_release_workfn() for details. pool_workqueue
@@ -984,6 +1002,7 @@ void wq_worker_stopping(struct task_struct *task)
 
 		raw_spin_lock_irq(&pool->lock);
 		worker_set_flags(worker, WORKER_CPU_INTENSIVE);
+		worker->current_pwq->stats[PWQ_STAT_CPU_INTENSIVE]++;
 	} else {
 		/*
 		 * Concurrency-managed @worker is sleeping. Decrement the
@@ -1012,8 +1031,10 @@ void wq_worker_stopping(struct task_struct *task)
 		pool->nr_running--;
 	}
 
-	if (need_more_worker(pool))
+	if (need_more_worker(pool)) {
+		worker->current_pwq->stats[PWQ_STAT_CM_WAKEUP]++;
 		wake_up_worker(pool);
+	}
 	raw_spin_unlock_irq(&pool->lock);
 }
 
@@ -2181,6 +2202,7 @@ static void send_mayday(struct work_struct *work)
 		get_pwq(pwq);
 		list_add_tail(&pwq->mayday_node, &wq->maydays);
 		wake_up_process(wq->rescuer->task);
+		pwq->stats[PWQ_STAT_MAYDAY]++;
 	}
 }
 
@@ -2419,6 +2441,7 @@ __acquires(&pool->lock)
 	 * workqueues), so hiding them isn't a problem.
 	 */
 	lockdep_invariant_state(true);
+	pwq->stats[PWQ_STAT_STARTED]++;
 	trace_workqueue_execute_start(work);
 	worker->current_func(work);
 	/*
@@ -2426,6 +2449,9 @@ __acquires(&pool->lock)
 	 * point will only record its address.
 	 */
 	trace_workqueue_execute_end(work, worker->current_func);
+	pwq->stats[PWQ_STAT_COMPLETED]++;
+	pwq->stats[PWQ_STAT_CPU_TIME] +=
+		worker->task->se.sum_exec_runtime - worker->current_at;
 	lock_map_release(&lockdep_map);
 	lock_map_release(&pwq->wq->lockdep_map);
 
@@ -2672,6 +2698,7 @@ static int rescuer_thread(void *__rescuer)
 				if (first)
 					pool->watchdog_ts = jiffies;
 				move_linked_works(work, scheduled, &n);
+				pwq->stats[PWQ_STAT_RESCUED]++;
 			}
 			first = false;
 		}
diff --git a/tools/workqueue/wq_monitor.py b/tools/workqueue/wq_monitor.py
new file mode 100755
index 000000000000..faea81df431c
--- /dev/null
+++ b/tools/workqueue/wq_monitor.py
@@ -0,0 +1,148 @@
+#!/usr/bin/env drgn
+#
+# Copyright (C) 2023 Tejun Heo <[email protected]>
+# Copyright (C) 2023 Meta Platforms, Inc. and affiliates.
+
+desc = """
+This is a drgn script to monitor workqueues.
+See the definition of pool_workqueue_stats enums.
+For drgn, visit https://github.com/osandov/drgn.
+"""
+
+import sys
+import signal
+import os
+import re
+import time
+import json
+
+import drgn
+from drgn.helpers.linux.list import list_for_each_entry,list_empty
+from drgn.helpers.linux.cpumask import for_each_possible_cpu
+
+import argparse
+parser = argparse.ArgumentParser(description=desc,
+                                 formatter_class=argparse.RawTextHelpFormatter)
+parser.add_argument('workqueue', metavar='REGEX', nargs='*',
+                    help='Target workqueue name patterns (all if empty)')
+parser.add_argument('-i', '--interval', metavar='SECS', type=float, default=1,
+                    help='Monitoring interval (0 to print once and exit)')
+parser.add_argument('-j', '--json', action='store_true',
+                    help='Output in json')
+args = parser.parse_args()
+
+def err(s):
+    print(s, file=sys.stderr, flush=True)
+    sys.exit(1)
+
+workqueues = prog['workqueues']
+
+WQ_UNBOUND = prog['WQ_UNBOUND']
+WQ_MEM_RECLAIM = prog['WQ_MEM_RECLAIM']
+
+PWQ_STAT_STARTED        = prog['PWQ_STAT_STARTED']        # work items started execution
+PWQ_STAT_COMPLETED      = prog['PWQ_STAT_COMPLETED']      # work items completed execution
+PWQ_STAT_CPU_TIME       = prog['PWQ_STAT_CPU_TIME']       # total CPU time consumed
+PWQ_STAT_CPU_INTENSIVE  = prog['PWQ_STAT_CPU_INTENSIVE']  # wq_cpu_intensive_thresh_us expirations
+PWQ_STAT_CM_WAKEUP      = prog['PWQ_STAT_CM_WAKEUP']      # concurrency-management worker wakeups
+PWQ_STAT_MAYDAY         = prog['PWQ_STAT_MAYDAY']         # maydays to rescuer
+PWQ_STAT_RESCUED        = prog['PWQ_STAT_RESCUED']        # linked work items executed by rescuer
+PWQ_NR_STATS            = prog['PWQ_NR_STATS']
+
+class WqStats:
+    def __init__(self, wq):
+        self.name = wq.name.string_().decode()
+        self.unbound = wq.flags & WQ_UNBOUND != 0
+        self.mem_reclaim = wq.flags & WQ_MEM_RECLAIM != 0
+        self.stats = [0] * PWQ_NR_STATS
+        for pwq in list_for_each_entry('struct pool_workqueue', wq.pwqs.address_of_(), 'pwqs_node'):
+            for i in range(PWQ_NR_STATS):
+                self.stats[i] += int(pwq.stats[i])
+
+    def dict(self, now):
+        return { 'timestamp'      : now,
+                 'name'           : self.name,
+                 'unbound'        : self.unbound,
+                 'mem_reclaim'    : self.mem_reclaim,
+                 'started'        : self.stats[PWQ_STAT_STARTED],
+                 'completed'      : self.stats[PWQ_STAT_COMPLETED],
+                 'cputime'        : self.stats[PWQ_STAT_CPU_TIME],
+                 'cpu_intensive'  : self.stats[PWQ_STAT_CPU_INTENSIVE],
+                 'cm_wakeup'      : self.stats[PWQ_STAT_CM_WAKEUP],
+                 'mayday'         : self.stats[PWQ_STAT_MAYDAY],
+                 'rescued'        : self.stats[PWQ_STAT_RESCUED], }
+
+    def table_header_str():
+        return f'{"":>24} {"total":>8} {"infl":>5} {"CPUtime":>8} {"CPUhog":>7} ' \
+               f'{"CMwake":>7} {"mayday":>7} {"rescued":>7}'
+
+    def table_row_str(self):
+        cpu_intensive = '-'
+        cm_wakeup = '-'
+        mayday = '-'
+        rescued = '-'
+
+        if not self.unbound:
+            cpu_intensive = str(self.stats[PWQ_STAT_CPU_INTENSIVE])
+            cm_wakeup = str(self.stats[PWQ_STAT_CM_WAKEUP])
+
+        if self.mem_reclaim:
+            mayday = str(self.stats[PWQ_STAT_MAYDAY])
+            rescued = str(self.stats[PWQ_STAT_RESCUED])
+
+        out = f'{self.name[-24:]:24} ' \
+              f'{self.stats[PWQ_STAT_STARTED]:8} ' \
+              f'{max(self.stats[PWQ_STAT_STARTED] - self.stats[PWQ_STAT_COMPLETED], 0):5} ' \
+              f'{self.stats[PWQ_STAT_CPU_TIME] / 1000000000:8.1f} ' \
+              f'{cpu_intensive:>7} ' \
+              f'{cm_wakeup:>7} ' \
+              f'{mayday:>7} ' \
+              f'{rescued:>7} '
+        return out.rstrip(':')
+
+exit_req = False
+
+def sigint_handler(signr, frame):
+    global exit_req
+    exit_req = True
+
+def main():
+    # handle args
+    table_fmt = not args.json
+    interval = args.interval
+
+    re_str = None
+    if args.workqueue:
+        for r in args.workqueue:
+            if re_str is None:
+                re_str = r
+            else:
+                re_str += '|' + r
+
+    filter_re = re.compile(re_str) if re_str else None
+
+    # monitoring loop
+    signal.signal(signal.SIGINT, sigint_handler)
+
+    while not exit_req:
+        now = time.time()
+
+        if table_fmt:
+            print()
+            print(WqStats.table_header_str())
+
+        for wq in list_for_each_entry('struct workqueue_struct', workqueues.address_of_(), 'list'):
+            stats = WqStats(wq)
+            if filter_re and not filter_re.search(stats.name):
+                continue
+            if table_fmt:
+                print(stats.table_row_str())
+            else:
+                print(stats.dict(now))
+
+        if interval == 0:
+            break
+        time.sleep(interval)
+
+if __name__ == "__main__":
+    main()
--
2.40.0

2023-04-18 21:00:35

by Tejun Heo

Subject: [PATCH 4/5] workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE

If a per-cpu work item hogs the CPU, it can prevent other work items from
starting through concurrency management. A per-cpu workqueue which intends
to host such CPU-hogging work items can choose to not participate in
concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be
error-prone and difficult to debug when missed.

This patch adds an automatic CPU usage based detection. If a
concurrency-managed work item consumes more CPU time than the threshold (5ms
by default), it's marked CPU_INTENSIVE automatically on schedule-out.

The mechanism isn't foolproof in that the 5ms detection delays can add up if
many CPU-hogging work items are queued at the same time. However, in such
situations, the bigger problem likely is the CPU being saturated with
per-cpu work items and the solution would be making them UNBOUND.

For occasional CPU hogging, the new automatic mechanism should provide
reasonable protection with minimal increase in code complexity.
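
Since the threshold is exposed below as a 0644 module parameter, it can also
be adjusted on a running system, e.g. to raise it to 10ms:

  # echo 10000 > /sys/module/workqueue/parameters/cpu_intensive_thresh_us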

Signed-off-by: Tejun Heo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Lai Jiangshan <[email protected]>
---
 kernel/workqueue.c          | 77 ++++++++++++++++++++++-----------
 kernel/workqueue_internal.h |  1 +
 2 files changed, 56 insertions(+), 22 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b9e8dc54272d..d24b887ddd86 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -306,6 +306,14 @@ static struct kmem_cache *pwq_cache;
 static cpumask_var_t *wq_numa_possible_cpumask;
 					/* possible CPUs of each node */
 
+/*
+ * Per-cpu work items which run for longer than the following threshold are
+ * automatically considered CPU intensive and excluded from concurrency
+ * management to prevent them from noticeably delaying other per-cpu work items.
+ */
+static unsigned long wq_cpu_intensive_thresh_us = 5000;
+module_param_named(cpu_intensive_thresh_us, wq_cpu_intensive_thresh_us, ulong, 0644);
+
 static bool wq_disable_numa;
 module_param_named(disable_numa, wq_disable_numa, bool, 0444);
 
@@ -951,9 +959,6 @@ void wq_worker_stopping(struct task_struct *task)
 	struct worker *worker = kthread_data(task);
 	struct worker_pool *pool;
 
-	if (task_is_running(task))
-		return;
-
 	/*
 	 * Rescuers, which may not have all the fields set up like normal
 	 * workers, also reach here, let's not access anything before
@@ -964,24 +969,49 @@
 
 	pool = worker->pool;
 
-	/* Return if preempted before wq_worker_running() was reached */
-	if (worker->sleeping)
-		return;
+	if (task_is_running(task)) {
+		/*
+		 * Concurrency-managed @worker is still RUNNING. See if the
+		 * current work is hogging CPU stalling other per-cpu work
+		 * items. If so, mark @worker CPU_INTENSIVE to exclude it from
+		 * concurrency management. @worker->current_* are stable as they
+		 * can only be modified by @task which is %current.
+		 */
+		if (!worker->current_work ||
+		    task->se.sum_exec_runtime - worker->current_at <
+		    wq_cpu_intensive_thresh_us * NSEC_PER_USEC)
+			return;
 
-	worker->sleeping = 1;
-	raw_spin_lock_irq(&pool->lock);
+
+		raw_spin_lock_irq(&pool->lock);
+		worker_set_flags(worker, WORKER_CPU_INTENSIVE);
+	} else {
+		/*
+		 * Concurrency-managed @worker is sleeping. Decrement the
+		 * associated pool's nr_running accordingly.
+		 */
 
-	/*
-	 * Recheck in case unbind_workers() preempted us. We don't
-	 * want to decrement nr_running after the worker is unbound
-	 * and nr_running has been reset.
-	 */
-	if (worker->flags & WORKER_NOT_RUNNING) {
-		raw_spin_unlock_irq(&pool->lock);
-		return;
+		/* Return if preempted before wq_worker_running() was reached */
+		if (worker->sleeping)
+			return;
+
+		worker->sleeping = 1;
+		raw_spin_lock_irq(&pool->lock);
+
+		/*
+		 * Recheck in case unbind_workers() preempted us. We don't want
+		 * to decrement nr_running after the worker is unbound and
+		 * nr_running has been reset. As a running worker never sleeps
+		 * inbetween work items, we know that @worker->current_* are
+		 * valid after the following check.
+		 */
+		if (worker->flags & WORKER_NOT_RUNNING) {
+			raw_spin_unlock_irq(&pool->lock);
+			return;
+		}
+
+		pool->nr_running--;
 	}
 
-	pool->nr_running--;
 	if (need_more_worker(pool))
 		wake_up_worker(pool);
 	raw_spin_unlock_irq(&pool->lock);
{
struct pool_workqueue *pwq = get_work_pwq(work);
struct worker_pool *pool = worker->pool;
- bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
unsigned long work_data;
struct worker *collision;
#ifdef CONFIG_LOCKDEP
@@ -2325,6 +2354,7 @@ __acquires(&pool->lock)
worker->current_work = work;
worker->current_func = work->func;
worker->current_pwq = pwq;
+ worker->current_at = worker->task->se.sum_exec_runtime;
work_data = *work_data_bits(work);
worker->current_color = get_work_color(work_data);

@@ -2342,7 +2372,7 @@ __acquires(&pool->lock)
* of concurrency management and the next code block will chain
* execution of the pending work items.
*/
- if (unlikely(cpu_intensive))
+ if (unlikely(pwq->wq->flags & WQ_CPU_INTENSIVE))
worker_set_flags(worker, WORKER_CPU_INTENSIVE);

/*
@@ -2420,9 +2450,12 @@ __acquires(&pool->lock)

raw_spin_lock_irq(&pool->lock);

- /* clear cpu intensive status */
- if (unlikely(cpu_intensive))
- worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
+ /*
+ * In addition to %WQ_CPU_INTENSIVE, @worker may also have been marked
+ * CPU intensive by wq_worker_stopping() if @work consumed more than
+ * wq_cpu_intensive_thresh_us. Clear it.
+ */
+ worker_clr_flags(worker, WORKER_CPU_INTENSIVE);

/* tag the worker for identification in schedule() */
worker->last_func = worker->current_func;
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index 902459b328de..08ffed41f104 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -31,6 +31,7 @@ struct worker {
 	struct work_struct	*current_work;	/* L: work being processed */
 	work_func_t		current_func;	/* L: current_work's fn */
 	struct pool_workqueue	*current_pwq;	/* L: current_work's pwq */
+	u64			current_at;	/* L: runtime when current_work started */
 	unsigned int		current_color;	/* L: current_work's color */
 	int			sleeping;	/* None */

--
2.40.0

2023-04-19 15:57:21

by Tejun Heo

Subject: Re: [PATCH 4/5] workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE

Hello,

On Wed, Apr 19, 2023 at 09:45:52AM +0800, Hillf Danton wrote:
...
> Need to wake up another worker in case the current one with
> WORKER_CPU_INTENSIVE set is going to sleep with works left behind.

Please see below.

> > +
> > +		pool->nr_running--;
> >  	}
> >
> > -	pool->nr_running--;
> >  	if (need_more_worker(pool))
> >  		wake_up_worker(pool);

That's what this block is for.

Thanks.

--
tejun

2023-04-23 03:26:03

by Lai Jiangshan

Subject: Re: [PATCH 4/5] workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE

On Wed, Apr 19, 2023 at 4:52 AM Tejun Heo <[email protected]> wrote:
>
> If a per-cpu work item hogs the CPU, it can prevent other work items from
> starting through concurrency management. A per-cpu workqueue which intends
> to host such CPU-hogging work items can choose to not participate in
> concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be
> error-prone and difficult to debug when missed.
>
> This patch adds an automatic CPU usage based detection. If a
> concurrency-managed work item consumes more CPU time than the threshold (5ms
> by default), it's marked CPU_INTENSIVE automatically on schedule-out.
>
> The mechanism isn't foolproof in that the 5ms detection delays can add up if
> many CPU-hogging work items are queued at the same time. However, in such
> situations, the bigger problem likely is the CPU being saturated with
> per-cpu work items and the solution would be making them UNBOUND.
>
> For occasional CPU hogging, the new automatic mechanism should provide
> reasonable protection with minimal increase in code complexity.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Lai Jiangshan <[email protected]>
> ---
>  kernel/workqueue.c          | 77 ++++++++++++++++++++++-----------
>  kernel/workqueue_internal.h |  1 +
>  2 files changed, 56 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index b9e8dc54272d..d24b887ddd86 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -306,6 +306,14 @@ static struct kmem_cache *pwq_cache;
>  static cpumask_var_t *wq_numa_possible_cpumask;
>  					/* possible CPUs of each node */
>
> +/*
> + * Per-cpu work items which run for longer than the following threshold are
> + * automatically considered CPU intensive and excluded from concurrency
> + * management to prevent them from noticeably delaying other per-cpu work items.
> + */
> +static unsigned long wq_cpu_intensive_thresh_us = 5000;
> +module_param_named(cpu_intensive_thresh_us, wq_cpu_intensive_thresh_us, ulong, 0644);
> +
>  static bool wq_disable_numa;
>  module_param_named(disable_numa, wq_disable_numa, bool, 0444);
>
> @@ -951,9 +959,6 @@ void wq_worker_stopping(struct task_struct *task)
>  	struct worker *worker = kthread_data(task);
>  	struct worker_pool *pool;
>
> -	if (task_is_running(task))
> -		return;
> -
>  	/*
>  	 * Rescuers, which may not have all the fields set up like normal
>  	 * workers, also reach here, let's not access anything before
> @@ -964,24 +969,49 @@
>
>  	pool = worker->pool;
>
> -	/* Return if preempted before wq_worker_running() was reached */
> -	if (worker->sleeping)
> -		return;
> +	if (task_is_running(task)) {
> +		/*
> +		 * Concurrency-managed @worker is still RUNNING. See if the
> +		 * current work is hogging CPU stalling other per-cpu work
> +		 * items. If so, mark @worker CPU_INTENSIVE to exclude it from
> +		 * concurrency management. @worker->current_* are stable as they
> +		 * can only be modified by @task which is %current.

Hello

wq_worker_stopping() and sched_submit_work() are only called from
schedule() and are not called for the various other kinds of scheduling,
such as schedule_rtlock(), preempt_schedule_*(), __cond_resched().

A work item hogging the CPU may never call the bare schedule(). To make
the new wq_worker_stopping() work, it has to be added to those other
kinds of scheduling as well, IMO.
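
For reference, the mainline structure looks roughly like this (a simplified
sketch of kernel/sched/core.c, paraphrased rather than quoted):

  asmlinkage __visible void __sched schedule(void)
  {
  	struct task_struct *tsk = current;

  	sched_submit_work(tsk);		/* -> wq_worker_stopping() with this series */
  	do {
  		preempt_disable();
  		__schedule(SM_NONE);
  		sched_preempt_enable_no_resched();
  	} while (need_resched());
  	sched_update_worker(tsk);	/* -> wq_worker_running() */
  }

  /*
   * preempt_schedule*() and schedule_rtlock() call __schedule()
   * directly and never pass through sched_submit_work(), so a worker
   * that is only ever preempted never reaches the new check.
   */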

Thanks
Lai

2023-04-24 15:33:39

by Tejun Heo

Subject: Re: [PATCH 4/5] workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE

Hello, Lai.

On Sun, Apr 23, 2023 at 11:23:28AM +0800, Lai Jiangshan wrote:
> wq_worker_stopping() and sched_submit_work() are only called from
> schedule() and are not called for the various other kinds of scheduling,
> such as schedule_rtlock(), preempt_schedule_*(), __cond_resched().
>
> A work item hogging the CPU may never call the bare schedule(). To make
> the new wq_worker_stopping() work, it has to be added to those other
> kinds of scheduling as well, IMO.

Yeah, you're right. The proposed code would work fine only on !preempt
kernels. I guess the right thing to do is splitting out a hook for the
non-sleeping scheduling paths and adding it there. I'll look into it.

Thanks for spotting the issue.

--
tejun

2023-04-25 13:16:58

by Peter Zijlstra

Subject: Re: [PATCH 4/5] workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE

On Tue, Apr 18, 2023 at 10:51:58AM -1000, Tejun Heo wrote:
> If a per-cpu work item hogs the CPU, it can prevent other work items from
> starting through concurrency management. A per-cpu workqueue which intends
> to host such CPU-hogging work items can choose to not participate in
> concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be
> error-prone and difficult to debug when missed.

Well; you could use this very heuristic, and instead of silently trying
to fix up, complain about the missing CPU_INTENSIVE thing.

> This patch adds an automatic CPU usage based detection. If a
> concurrency-managed work item consumes more CPU time than the threshold (5ms
> by default), it's marked CPU_INTENSIVE automatically on schedule-out.
>
> The mechanism isn't foolproof in that the 5ms detection delays can add up if
> many CPU-hogging work items are queued at the same time. However, in such
> situations, the bigger problem likely is the CPU being saturated with
> per-cpu work items and the solution would be making them UNBOUND.
>
> For occasional CPU hogging, the new automatic mechanism should provide
> reasonable protection with minimal increase in code complexity.

But why not keep it a debug mechanism? Now you've got a heuristic with
all the downsides that they bring.
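
Concretely, the warning-only variant could be as simple as the following at
the same detection point - a hypothetical sketch, not code from the series:

  if (task->se.sum_exec_runtime - worker->current_at >=
      wq_cpu_intensive_thresh_us * NSEC_PER_USEC)
  	pr_warn_once("workqueue: %ps hogged CPU for >%luus, consider WQ_CPU_INTENSIVE\n",
  		     worker->current_func, wq_cpu_intensive_thresh_us);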

2023-04-28 15:22:43

by Tejun Heo

Subject: Re: [PATCH 4/5] workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE

Hello, Peter.

On Tue, Apr 25, 2023 at 03:12:54PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 18, 2023 at 10:51:58AM -1000, Tejun Heo wrote:
> > If a per-cpu work item hogs the CPU, it can prevent other work items from
> > starting through concurrency management. A per-cpu workqueue which intends
> > to host such CPU-hogging work items can choose to not participate in
> > concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be
> > error-prone and difficult to debug when missed.
>
> Well; you could use this very heuristic, and instead of silently trying
> to fix up, complain about the missing CPU_INTENSIVE thing.
...
> But why not keep it a debug mechanism? Now you've got a heuristic with
> all the downsides that they bring.

I'm working on improving the locality of unbound workqueues and it isn't
clear whether there's enough space for per-cpu CPU_INTENSIVE work items -
ie. if it's gonna saturate the CPU for extended periods of time to the point
of requiring CPU_INTENSIVE, it might as well be an unbound work item if the
baseline locality can be good enough. There aren't that many users of
CPU_INTENSIVE in tree and we can decide based on how unbound workqueues
actually work out for them.

As for warnings, yeah, that'd be great to have regardless of how
CPU_INTENSIVE turns out. Just gotta make sure it doesn't fire spuriously and
become a nuisance.

Thanks.

--
tejun