2023-05-10 03:09:57

by Tejun Heo

Subject: [PATCHSET v2 wq/for-6.5] workqueue: Implement automatic CPU intensive detection and add monitoring

v2: * Lai pointed out that !SM_NONE cases should also be notified. 0001 and
0004 are updated accordingly.

* PeterZ suggested reporting on work items that trigger the auto CPU
intensive mechanism. 0005 adds reporting, with exponential backoff, of
work functions that trigger the mechanism repeatedly.

Hello,

To reduce the number of concurrent worker threads, workqueue holds back
starting per-cpu work items while the previous work item stays in the
RUNNING state. As a result, a per-cpu work item which consumes a lot of CPU
cycles, even if it has cond_resched()'s in the right places, can stall other
per-cpu work items.
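
As an illustration (hypothetical module code, names made up for the
example), consider two work items forced onto CPU 0 of the
concurrency-managed system_wq. While hog_fn() keeps its worker in the
RUNNING state, quick_fn() is held back even though hog_fn() yields
regularly:

#include <linux/workqueue.h>
#include <linux/jiffies.h>
#include <linux/sched.h>
#include <linux/printk.h>

static void hog_fn(struct work_struct *work)
{
        unsigned long end = jiffies + 10 * HZ;

        /* yields the CPU regularly, but the worker stays RUNNING */
        while (time_before(jiffies, end))
                cond_resched();
}

static void quick_fn(struct work_struct *work)
{
        pr_info("quick work finally ran\n");
}

static DECLARE_WORK(hog_work, hog_fn);
static DECLARE_WORK(quick_work, quick_fn);

static void queue_demo(void)
{
        /* same per-cpu pool: quick_work waits until hog_work finishes */
        queue_work_on(0, system_wq, &hog_work);
        queue_work_on(0, system_wq, &quick_work);
}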

To support per-cpu work items that may occupy the CPU for a substantial
period of time, workqueue has the WQ_CPU_INTENSIVE flag which exempts work items
issued through the marked workqueue from concurrency management - they're
started immediately and don't block other work items. While this works, it's
error-prone in that a workqueue user can easily forget to set the flag or
set it unnecessarily. Furthermore, the impacts of the wrong flag setting can
be rather indirect and challenging to root-cause.
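
For reference, the existing opt-in is just a workqueue flag at allocation
time (hypothetical name below), and whether it's actually warranted is left
entirely to the user, which is what makes it error-prone:

static struct workqueue_struct *demo_wq;

static int __init demo_init(void)
{
        /* work items on this workqueue skip concurrency management and
         * start immediately */
        demo_wq = alloc_workqueue("demo_cpu_intensive", WQ_CPU_INTENSIVE, 0);
        if (!demo_wq)
                return -ENOMEM;
        return 0;
}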

This patchset makes workqueue auto-detect CPU intensive work items based on
CPU consumption. If a work item consumes more than the threshold (5ms by
default) of CPU time, it's automatically marked as CPU intensive when it
gets scheduled out, which unblocks the starting of pending per-cpu work items.
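
Patch 0004 itself isn't quoted in this excerpt; pieced together from the
hunks that are (wq_worker_stopping(), worker->current_at,
wq_cpu_intensive_thresh_us), the schedule-out side amounts to roughly the
following simplified sketch, not the literal patch:

/* simplified sketch; the exact logic and locking live in patch 0004 */
static void notrace wq_worker_stopping_sketch(struct task_struct *task,
                                              bool voluntary)
{
        struct worker *worker = kthread_data(task);
        struct worker_pool *pool = worker->pool;

        if (!voluntary || task_is_running(task)) {
                /* preempted or still runnable: check the current work
                 * item's accumulated CPU time against the threshold */
                if (worker->task->se.sum_exec_runtime - worker->current_at <
                    wq_cpu_intensive_thresh_us * NSEC_PER_USEC)
                        return;

                raw_spin_lock_irq(&pool->lock);
                /* exempt from concurrency management and let pending
                 * per-cpu work items start */
                worker_set_flags(worker, WORKER_CPU_INTENSIVE);
                if (need_more_worker(pool))
                        wake_up_worker(pool);
                raw_spin_unlock_irq(&pool->lock);
                return;
        }

        /* voluntary sleep: previous wq_worker_sleeping() handling */
}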

The mechanism isn't foolproof in that the detection delays can add up if
many CPU-hogging work items are queued at the same time. However, in such
situations, the bigger problem is likely that the CPU is saturated with
per-cpu work items, and the solution would be making them UNBOUND. Future
changes will make UNBOUND workqueues more attractive by improving their
locality behaviors and configurability. We might eventually remove the
explicit WQ_CPU_INTENSIVE flag.
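
For such users the eventual fix is on the queueing side, e.g. switching the
hypothetical allocation above to an unbound workqueue, whose work items
aren't concurrency-managed per-cpu and therefore can't stall per-cpu work
items:

demo_wq = alloc_workqueue("demo_cpu_intensive", WQ_UNBOUND, 0);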

While at it, add statistics and a monitoring script. Lack of visibility has
always been a bit of a pain point when debugging workqueue-related issues,
and with this change and the more drastic ones planned for workqueue, this
is a good time to address the shortcoming.

This patchset was born out of the discussion in the following thread:

https://lkml.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com

and contains the following six patches:

0001-workqueue-sched-Notify-workqueue-of-scheduling-of-RU.patch
0002-workqueue-Re-order-struct-worker-fields.patch
0003-workqueue-Move-worker_set-clr_flags-upwards.patch
0004-workqueue-Automatically-mark-CPU-hogging-work-items-.patch
0005-workqueue-Report-work-funcs-that-trigger-automatic-C.patch
0006-workqueue-Add-pwq-stats-and-a-monitoring-script.patch

and also available in the following git branch:

git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git auto-cpu-intensive-v2

diffstat follows. Thanks.

Documentation/core-api/workqueue.rst | 30 ++
kernel/sched/core.c | 48 ++--
kernel/workqueue.c | 364 +++++++++++++++++++++++++++--------
kernel/workqueue_internal.h | 14 -
lib/Kconfig.debug | 13 +
tools/workqueue/wq_monitor.py | 148 ++++++++++++++
6 files changed, 513 insertions(+), 104 deletions(-)

--
tejun


2023-05-10 03:19:38

by Tejun Heo

Subject: [PATCH 5/6] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism

Workqueue now automatically marks per-cpu work items that hog the CPU for
too long as CPU_INTENSIVE, which excludes them from concurrency management
and prevents them from stalling other concurrency-managed work items. If a
work function keeps running over the threshold, it likely needs to be
switched to use an unbound workqueue.

This patch adds a debug mechanism which tracks the work functions that
trigger the automatic CPU_INTENSIVE mechanism and reports them using
pr_warn() with exponential backoff.

Signed-off-by: Tejun Heo <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
---
kernel/workqueue.c | 128 +++++++++++++++++++++++++++++++++++++++++++++
lib/Kconfig.debug | 13 +++++
2 files changed, 141 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b63bbd22f756..fa535d820093 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -954,6 +954,132 @@ void wq_worker_running(struct task_struct *task)
worker->sleeping = 0;
}

+#ifdef CONFIG_WQ_CPU_INTENSIVE_REPORT
+
+/*
+ * Concurrency-managed per-cpu work items that hog CPU for longer than
+ * wq_cpu_intensive_thresh_us trigger the automatic CPU_INTENSIVE mechanism,
+ * which prevents them from stalling other concurrency-managed work items. If a
+ * work function keeps triggering this mechanism, it's likely that the work item
+ * should be using an unbound workqueue instead.
+ *
+ * wq_cpu_intensive_report() tracks work functions which trigger such conditions
+ * and reports them so that they can be examined and converted to use unbound
+ * workqueues as appropriate. To avoid flooding the console, each violating work
+ * function is tracked and reported with exponential backoff.
+ */
+#define WCI_MAX_ENTS 128
+
+struct wci_ent {
+ work_func_t func;
+ atomic64_t cnt;
+ struct hlist_node hash_node;
+ struct list_head report_node;
+};
+
+static struct wci_ent wci_ents[WCI_MAX_ENTS];
+static int wci_nr_ents;
+static DEFINE_RAW_SPINLOCK(wci_lock);
+static DEFINE_HASHTABLE(wci_hash, ilog2(WCI_MAX_ENTS));
+static LIST_HEAD(wci_report_list);
+static struct kthread_worker *wci_report_worker;
+
+static void wci_report_workfn(struct kthread_work *work)
+{
+ struct wci_ent *ent, *n;
+
+ raw_spin_lock_irq(&wci_lock);
+ list_for_each_entry_safe(ent, n, &wci_report_list, report_node) {
+ pr_warn("workqueue: %ps hogged CPU for >%luus %llu times, consider switching to WQ_UNBOUND\n",
+ ent->func, wq_cpu_intensive_thresh_us,
+ atomic64_read(&ent->cnt));
+ list_del_init(&ent->report_node);
+ }
+ raw_spin_unlock_irq(&wci_lock);
+}
+
+static DEFINE_KTHREAD_WORK(wci_report_work, wci_report_workfn);
+
+static struct wci_ent *wci_find_ent(work_func_t func)
+{
+ struct wci_ent *ent;
+
+ hash_for_each_possible_rcu(wci_hash, ent, hash_node,
+ (unsigned long)func) {
+ if (ent->func == func)
+ return ent;
+ }
+ return NULL;
+}
+
+static void wq_cpu_intensive_report(work_func_t func)
+{
+ struct wci_ent *ent;
+
+restart:
+ ent = wci_find_ent(func);
+ if (ent) {
+ u64 cnt;
+
+ /*
+ * Start reporting from the fourth time and back off
+ * exponentially.
+ */
+ cnt = atomic64_inc_return_relaxed(&ent->cnt);
+ if (cnt < 4 || !is_power_of_2(cnt))
+ return;
+
+ raw_spin_lock(&wci_lock);
+ if (list_empty(&ent->report_node)) {
+ list_add_tail(&ent->report_node, &wci_report_list);
+ if (wci_report_worker)
+ kthread_queue_work(wci_report_worker,
+ &wci_report_work);
+ }
+ raw_spin_unlock(&wci_lock);
+ return;
+ }
+
+ /*
+ * @func is a new violation. Allocate a new entry for it. If wci_ents[]
+ * is exhausted, something went really wrong and we probably made enough
+ * noise already.
+ */
+ if (wci_nr_ents >= WCI_MAX_ENTS)
+ return;
+
+ raw_spin_lock(&wci_lock);
+
+ if (wci_nr_ents >= WCI_MAX_ENTS) {
+ raw_spin_unlock(&wci_lock);
+ return;
+ }
+
+ if (wci_find_ent(func)) {
+ raw_spin_unlock(&wci_lock);
+ goto restart;
+ }
+
+ ent = &wci_ents[wci_nr_ents++];
+ ent->func = func;
+ atomic64_set(&ent->cnt, 1);
+ INIT_LIST_HEAD(&ent->report_node);
+ hash_add_rcu(wci_hash, &ent->hash_node, (unsigned long)func);
+
+ raw_spin_unlock(&wci_lock);
+}
+
+static void __init wq_cpu_intensive_report_init(void)
+{
+ wci_report_worker = kthread_create_worker(0, "wq_cpu_intensive_report");
+ WARN_ON(!wci_report_worker);
+}
+
+#else /* CONFIG_WQ_CPU_INTENSIVE_REPORT */
+static void wq_cpu_intensive_report(work_func_t func) {}
+static void __init wq_cpu_intensive_report_init(void) {}
+#endif /* CONFIG_WQ_CPU_INTENSIVE_REPORT */
+
/**
* wq_worker_stopping - a worker is stopping
* @task: task stopping
@@ -1001,6 +1127,7 @@ void notrace wq_worker_stopping(struct task_struct *task, bool voluntary)
*/
raw_spin_lock_irq(&pool->lock);
worker_set_flags(worker, WORKER_CPU_INTENSIVE);
+ wq_cpu_intensive_report(worker->current_func);
} else {
/*
* Concurrency-managed @worker is sleeping. Decrement the
@@ -6421,6 +6548,7 @@ void __init workqueue_init(void)

wq_online = true;
wq_watchdog_init();
+ wq_cpu_intensive_report_init();
}

/*
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ce51d4dc6803..2f3619662840 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1134,6 +1134,19 @@ config WQ_WATCHDOG
state. This can be configured through kernel parameter
"workqueue.watchdog_thresh" and its sysfs counterpart.

+config WQ_CPU_INTENSIVE_REPORT
+ bool "Report per-cpu work items which are CPU intensive"
+ depends on DEBUG_KERNEL
+ help
+ Say Y here to enable reporting of concurrency-managed per-cpu work
+ items that hog CPUs for longer than
+ workqueue.cpu_intensive_threshold_us. Workqueue automatically
+ detects and excludes them from concurrency management to prevent
+ them from stalling other per-cpu work items. Occassional
+ triggering may not necessarily indicate a problem. Repeated
+ triggering likely indicates that the work item should be switched
+ to use an unbound workqueue.
+
config TEST_LOCKUP
tristate "Test module to generate lockups"
depends on m
--
2.40.1

2023-05-10 03:21:11

by Tejun Heo

Subject: [PATCH 6/6] workqueue: Add pwq->stats[] and a monitoring script

Currently, the only way to peer into workqueue operations is through
tracing. While possible, it isn't easy or convenient to monitor
per-workqueue behaviors over time this way. Let's add pwq->stats[], which
tracks relevant events, and a drgn monitoring script -
tools/workqueue/wq_monitor.py.

It's arguable whether this needs to be configurable. However, it currently
only has seven counters and the runtime overhead shouldn't be noticeable
given that they're on pwqs, which are per-cpu on per-cpu workqueues and
per-NUMA-node on unbound ones. Let's keep it simple for the time being.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Lai Jiangshan <[email protected]>
---
Documentation/core-api/workqueue.rst | 30 ++++++
kernel/workqueue.c | 29 +++++-
tools/workqueue/wq_monitor.py | 148 +++++++++++++++++++++++++++
3 files changed, 206 insertions(+), 1 deletion(-)
create mode 100644 tools/workqueue/wq_monitor.py

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 8ec4d6270b24..a4bb2208100e 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -348,6 +348,35 @@ Guidelines
level of locality in wq operations and work item execution.


+Monitoring
+==========
+
+Use tools/workqueue/wq_monitor.py to monitor workqueue operations: ::
+
+ $ tools/workqueue/wq_monitor.py events
+ total infl CPUtime CPUhog CMwake mayday rescued
+ events 18545 0 6.1 0 5 - -
+ events_highpri 8 0 0.0 0 0 - -
+ events_long 3 0 0.0 0 0 - -
+ events_unbound 38306 0 0.1 - - - -
+ events_freezable 0 0 0.0 0 0 - -
+ events_power_efficient 29598 0 0.2 0 0 - -
+ events_freezable_power_ 10 0 0.0 0 0 - -
+ sock_diag_events 0 0 0.0 0 0 - -
+
+ total infl CPUtime CPUhog CMwake mayday rescued
+ events 18548 0 6.1 0 5 - -
+ events_highpri 8 0 0.0 0 0 - -
+ events_long 3 0 0.0 0 0 - -
+ events_unbound 38322 0 0.1 - - - -
+ events_freezable 0 0 0.0 0 0 - -
+ events_power_efficient 29603 0 0.2 0 0 - -
+ events_freezable_power_ 10 0 0.0 0 0 - -
+ sock_diag_events 0 0 0.0 0 0 - -
+
+ ...
+
+
Debugging
=========

@@ -387,6 +416,7 @@ For the second type of problems it should be possible to just check
The work item's function should be trivially visible in the stack
trace.

+
Non-reentrance Conditions
=========================

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index fa535d820093..09cbe85947cd 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -199,6 +199,22 @@ struct worker_pool {
struct rcu_head rcu;
};

+/*
+ * Per-pool_workqueue statistics. These can be monitored using
+ * tools/workqueue/wq_monitor.py.
+ */
+enum pool_workqueue_stats {
+ PWQ_STAT_STARTED, /* work items started execution */
+ PWQ_STAT_COMPLETED, /* work items completed execution */
+ PWQ_STAT_CPU_TIME, /* total CPU time consumed */
+ PWQ_STAT_CPU_INTENSIVE, /* wq_cpu_intensive_thresh_us expirations */
+ PWQ_STAT_CM_WAKEUP, /* concurrency-management worker wakeups */
+ PWQ_STAT_MAYDAY, /* maydays to rescuer */
+ PWQ_STAT_RESCUED, /* linked work items executed by rescuer */
+
+ PWQ_NR_STATS,
+};
+
/*
* The per-pool workqueue. While queued, the lower WORK_STRUCT_FLAG_BITS
* of work_struct->data are used for flags and the remaining high bits
@@ -236,6 +252,8 @@ struct pool_workqueue {
struct list_head pwqs_node; /* WR: node on wq->pwqs */
struct list_head mayday_node; /* MD: node on wq->maydays */

+ u64 stats[PWQ_NR_STATS];
+
/*
* Release of unbound pwq is punted to system_wq. See put_pwq()
* and pwq_unbound_release_workfn() for details. pool_workqueue
@@ -1127,6 +1145,7 @@ void notrace wq_worker_stopping(struct task_struct *task, bool voluntary)
*/
raw_spin_lock_irq(&pool->lock);
worker_set_flags(worker, WORKER_CPU_INTENSIVE);
+ worker->current_pwq->stats[PWQ_STAT_CPU_INTENSIVE]++;
wq_cpu_intensive_report(worker->current_func);
} else {
/*
@@ -1156,8 +1175,10 @@ void notrace wq_worker_stopping(struct task_struct *task, bool voluntary)
pool->nr_running--;
}

- if (need_more_worker(pool))
+ if (need_more_worker(pool)) {
+ worker->current_pwq->stats[PWQ_STAT_CM_WAKEUP]++;
wake_up_worker(pool);
+ }
raw_spin_unlock_irq(&pool->lock);
}

@@ -2336,6 +2357,7 @@ static void send_mayday(struct work_struct *work)
get_pwq(pwq);
list_add_tail(&pwq->mayday_node, &wq->maydays);
wake_up_process(wq->rescuer->task);
+ pwq->stats[PWQ_STAT_MAYDAY]++;
}
}

@@ -2574,6 +2596,7 @@ __acquires(&pool->lock)
* workqueues), so hiding them isn't a problem.
*/
lockdep_invariant_state(true);
+ pwq->stats[PWQ_STAT_STARTED]++;
trace_workqueue_execute_start(work);
worker->current_func(work);
/*
@@ -2581,6 +2604,9 @@ __acquires(&pool->lock)
* point will only record its address.
*/
trace_workqueue_execute_end(work, worker->current_func);
+ pwq->stats[PWQ_STAT_COMPLETED]++;
+ pwq->stats[PWQ_STAT_CPU_TIME] +=
+ worker->task->se.sum_exec_runtime - worker->current_at;
lock_map_release(&lockdep_map);
lock_map_release(&pwq->wq->lockdep_map);

@@ -2827,6 +2853,7 @@ static int rescuer_thread(void *__rescuer)
if (first)
pool->watchdog_ts = jiffies;
move_linked_works(work, scheduled, &n);
+ pwq->stats[PWQ_STAT_RESCUED]++;
}
first = false;
}
diff --git a/tools/workqueue/wq_monitor.py b/tools/workqueue/wq_monitor.py
new file mode 100644
index 000000000000..faea81df431c
--- /dev/null
+++ b/tools/workqueue/wq_monitor.py
@@ -0,0 +1,148 @@
+#!/usr/bin/env drgn
+#
+# Copyright (C) 2023 Tejun Heo <[email protected]>
+# Copyright (C) 2023 Meta Platforms, Inc. and affiliates.
+
+desc = """
+This is a drgn script to monitor workqueues.
+See the definition of pool_workqueue_stats enums.
+For drgn, visit https://github.com/osandov/drgn.
+"""
+
+import sys
+import signal
+import os
+import re
+import time
+import json
+
+import drgn
+from drgn.helpers.linux.list import list_for_each_entry,list_empty
+from drgn.helpers.linux.cpumask import for_each_possible_cpu
+
+import argparse
+parser = argparse.ArgumentParser(description=desc,
+ formatter_class=argparse.RawTextHelpFormatter)
+parser.add_argument('workqueue', metavar='REGEX', nargs='*',
+ help='Target workqueue name patterns (all if empty)')
+parser.add_argument('-i', '--interval', metavar='SECS', type=float, default=1,
+ help='Monitoring interval (0 to print once and exit)')
+parser.add_argument('-j', '--json', action='store_true',
+ help='Output in json')
+args = parser.parse_args()
+
+def err(s):
+ print(s, file=sys.stderr, flush=True)
+ sys.exit(1)
+
+workqueues = prog['workqueues']
+
+WQ_UNBOUND = prog['WQ_UNBOUND']
+WQ_MEM_RECLAIM = prog['WQ_MEM_RECLAIM']
+
+PWQ_STAT_STARTED = prog['PWQ_STAT_STARTED'] # work items started execution
+PWQ_STAT_COMPLETED = prog['PWQ_STAT_COMPLETED'] # work items completed execution
+PWQ_STAT_CPU_TIME = prog['PWQ_STAT_CPU_TIME'] # total CPU time consumed
+PWQ_STAT_CPU_INTENSIVE = prog['PWQ_STAT_CPU_INTENSIVE'] # wq_cpu_intensive_thresh_us expirations
+PWQ_STAT_CM_WAKEUP = prog['PWQ_STAT_CM_WAKEUP'] # concurrency-management worker wakeups
+PWQ_STAT_MAYDAY = prog['PWQ_STAT_MAYDAY'] # maydays to rescuer
+PWQ_STAT_RESCUED = prog['PWQ_STAT_RESCUED'] # linked work items executed by rescuer
+PWQ_NR_STATS = prog['PWQ_NR_STATS']
+
+class WqStats:
+ def __init__(self, wq):
+ self.name = wq.name.string_().decode()
+ self.unbound = wq.flags & WQ_UNBOUND != 0
+ self.mem_reclaim = wq.flags & WQ_MEM_RECLAIM != 0
+ self.stats = [0] * PWQ_NR_STATS
+ for pwq in list_for_each_entry('struct pool_workqueue', wq.pwqs.address_of_(), 'pwqs_node'):
+ for i in range(PWQ_NR_STATS):
+ self.stats[i] += int(pwq.stats[i])
+
+ def dict(self, now):
+ return { 'timestamp' : now,
+ 'name' : self.name,
+ 'unbound' : self.unbound,
+ 'mem_reclaim' : self.mem_reclaim,
+ 'started' : self.stats[PWQ_STAT_STARTED],
+ 'completed' : self.stats[PWQ_STAT_COMPLETED],
+ 'cputime' : self.stats[PWQ_STAT_CPU_TIME],
+ 'cpu_intensive' : self.stats[PWQ_STAT_CPU_INTENSIVE],
+ 'cm_wakeup' : self.stats[PWQ_STAT_CM_WAKEUP],
+ 'mayday' : self.stats[PWQ_STAT_MAYDAY],
+ 'rescued' : self.stats[PWQ_STAT_RESCUED], }
+
+ def table_header_str():
+ return f'{"":>24} {"total":>8} {"infl":>5} {"CPUtime":>8} {"CPUhog":>7} ' \
+ f'{"CMwake":>7} {"mayday":>7} {"rescued":>7}'
+
+ def table_row_str(self):
+ cpu_intensive = '-'
+ cm_wakeup = '-'
+ mayday = '-'
+ rescued = '-'
+
+ if not self.unbound:
+ cpu_intensive = str(self.stats[PWQ_STAT_CPU_INTENSIVE])
+ cm_wakeup = str(self.stats[PWQ_STAT_CM_WAKEUP])
+
+ if self.mem_reclaim:
+ mayday = str(self.stats[PWQ_STAT_MAYDAY])
+ rescued = str(self.stats[PWQ_STAT_RESCUED])
+
+ out = f'{self.name[-24:]:24} ' \
+ f'{self.stats[PWQ_STAT_STARTED]:8} ' \
+ f'{max(self.stats[PWQ_STAT_STARTED] - self.stats[PWQ_STAT_COMPLETED], 0):5} ' \
+ f'{self.stats[PWQ_STAT_CPU_TIME] / 1000000000:8.1f} ' \
+ f'{cpu_intensive:>7} ' \
+ f'{cm_wakeup:>7} ' \
+ f'{mayday:>7} ' \
+ f'{rescued:>7} '
+ return out.rstrip(':')
+
+exit_req = False
+
+def sigint_handler(signr, frame):
+ global exit_req
+ exit_req = True
+
+def main():
+ # handle args
+ table_fmt = not args.json
+ interval = args.interval
+
+ re_str = None
+ if args.workqueue:
+ for r in args.workqueue:
+ if re_str is None:
+ re_str = r
+ else:
+ re_str += '|' + r
+
+ filter_re = re.compile(re_str) if re_str else None
+
+ # monitoring loop
+ signal.signal(signal.SIGINT, sigint_handler)
+
+ while not exit_req:
+ now = time.time()
+
+ if table_fmt:
+ print()
+ print(WqStats.table_header_str())
+
+ for wq in list_for_each_entry('struct workqueue_struct', workqueues.address_of_(), 'list'):
+ stats = WqStats(wq)
+ if filter_re and not filter_re.search(stats.name):
+ continue
+ if table_fmt:
+ print(stats.table_row_str())
+ else:
+ print(stats.dict(now))
+
+ if interval == 0:
+ break
+ time.sleep(interval)
+
+if __name__ == "__main__":
+ main()
--
2.40.1

2023-05-10 03:21:26

by Tejun Heo

Subject: [PATCH 1/6] workqueue, sched: Notify workqueue of scheduling of RUNNING and preempted tasks

When a workqueue kworker goes to sleep, wq_worker_sleeping() is called so
that workqueue can manage concurrency. This patch renames
wq_worker_sleeping() to wq_worker_stopping() and calls it whenever a kworker
is scheduled out, whether it's going to sleep or not.

Workqueue will use the schedule-out events of RUNNING tasks to automatically
detect CPU hogging work items and exclude them from concurrency management
so that they can't stall other work items.

sched_submit_work() and sched_update_worker() were only called by
schedule(). As workqueue would need to be notified of all scheduling events
including preemptions, make the two functions notrace, add @sched_mode and
call them from all schedule functions. As the renamed wq_worker_stopping()
is now called from the preemption path, it's also marked notrace.

As sched_update_worker() is a noop when called with !SM_NONE, it's
debatable whether it needs to be called from !SM_NONE paths. However, it
shouldn't add any actual overhead, and keeping the symmetry is possibly
less confusing. This can go either way.

The only functional change is that the renamed wq_worker_stopping() is now
called from all scheduling paths. As this patch adds early exit conditions
which make it a noop in all the new invocations, there shouldn't be any
behavior change. While
at it, remove the already outdated comment which doesn't cover the io_wq
case.

v2: Lai pointed out that !SM_NONE cases should also be notified. Updated
accordingly.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
kernel/sched/core.c | 48 ++++++++++++++++++++++++-------------
kernel/workqueue.c | 23 +++++++++++++-----
kernel/workqueue_internal.h | 2 +-
3 files changed, 49 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 944c3ae39861..ab4317a8e11a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6692,24 +6692,20 @@ void __noreturn do_task_dead(void)
cpu_relax();
}

-static inline void sched_submit_work(struct task_struct *tsk)
+static inline notrace void sched_submit_work(struct task_struct *tsk,
+ unsigned int sched_mode)
{
- unsigned int task_flags;
+ unsigned int task_flags = tsk->flags;
+ bool voluntary = sched_mode == SM_NONE;

- if (task_is_running(tsk))
+ if (task_flags & PF_WQ_WORKER)
+ wq_worker_stopping(tsk, voluntary);
+
+ if (!voluntary || task_is_running(tsk))
return;

- task_flags = tsk->flags;
- /*
- * If a worker goes to sleep, notify and ask workqueue whether it
- * wants to wake up a task to maintain concurrency.
- */
- if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
- if (task_flags & PF_WQ_WORKER)
- wq_worker_sleeping(tsk);
- else
- io_wq_worker_sleeping(tsk);
- }
+ if (task_flags & PF_IO_WORKER)
+ io_wq_worker_sleeping(tsk);

/*
* spinlock and rwlock must not flush block requests. This will
@@ -6725,8 +6721,12 @@ static inline void sched_submit_work(struct task_struct *tsk)
blk_flush_plug(tsk->plug, true);
}

-static void sched_update_worker(struct task_struct *tsk)
+static notrace void sched_update_worker(struct task_struct *tsk,
+ unsigned int sched_mode)
{
+ if (sched_mode != SM_NONE)
+ return;
+
if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
if (tsk->flags & PF_WQ_WORKER)
wq_worker_running(tsk);
@@ -6739,13 +6739,13 @@ asmlinkage __visible void __sched schedule(void)
{
struct task_struct *tsk = current;

- sched_submit_work(tsk);
+ sched_submit_work(tsk, SM_NONE);
do {
preempt_disable();
__schedule(SM_NONE);
sched_preempt_enable_no_resched();
} while (need_resched());
- sched_update_worker(tsk);
+ sched_update_worker(tsk, SM_NONE);
}
EXPORT_SYMBOL(schedule);

@@ -6808,17 +6808,24 @@ void __sched schedule_preempt_disabled(void)
#ifdef CONFIG_PREEMPT_RT
void __sched notrace schedule_rtlock(void)
{
+ struct task_struct *tsk = current;
+
+ sched_submit_work(tsk, SM_RTLOCK_WAIT);
do {
preempt_disable();
__schedule(SM_RTLOCK_WAIT);
sched_preempt_enable_no_resched();
} while (need_resched());
+ sched_update_worker(tsk, SM_RTLOCK_WAIT);
}
NOKPROBE_SYMBOL(schedule_rtlock);
#endif

static void __sched notrace preempt_schedule_common(void)
{
+ struct task_struct *tsk = current;
+
+ sched_submit_work(tsk, SM_PREEMPT);
do {
/*
* Because the function tracer can trace preempt_count_sub()
@@ -6844,6 +6851,7 @@ static void __sched notrace preempt_schedule_common(void)
* between schedule and now.
*/
} while (need_resched());
+ sched_update_worker(tsk, SM_PREEMPT);
}

#ifdef CONFIG_PREEMPTION
@@ -6901,11 +6909,13 @@ EXPORT_SYMBOL(dynamic_preempt_schedule);
*/
asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
{
+ struct task_struct *tsk = current;
enum ctx_state prev_ctx;

if (likely(!preemptible()))
return;

+ sched_submit_work(tsk, SM_PREEMPT);
do {
/*
* Because the function tracer can trace preempt_count_sub()
@@ -6934,6 +6944,7 @@ asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
preempt_latency_stop(1);
preempt_enable_no_resched_notrace();
} while (need_resched());
+ sched_update_worker(tsk, SM_PREEMPT);
}
EXPORT_SYMBOL_GPL(preempt_schedule_notrace);

@@ -6968,11 +6979,13 @@ EXPORT_SYMBOL(dynamic_preempt_schedule_notrace);
*/
asmlinkage __visible void __sched preempt_schedule_irq(void)
{
+ struct task_struct *tsk = current;
enum ctx_state prev_state;

/* Catch callers which need to be fixed */
BUG_ON(preempt_count() || !irqs_disabled());

+ sched_submit_work(tsk, SM_PREEMPT);
prev_state = exception_enter();

do {
@@ -6984,6 +6997,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
} while (need_resched());

exception_exit(prev_state);
+ sched_update_worker(tsk, SM_PREEMPT);
}

int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4666a1a92a31..cbcdc11adabd 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -867,7 +867,10 @@ static void wake_up_worker(struct worker_pool *pool)
* wq_worker_running - a worker is running again
* @task: task waking up
*
- * This function is called when a worker returns from schedule()
+ * This function is called when a worker returns from schedule().
+ *
+ * Unlike wq_worker_stopping(), this function is only called from schedule() but
+ * not other scheduling paths including preemption and can be traced safely.
*/
void wq_worker_running(struct task_struct *task)
{
@@ -890,17 +893,25 @@ void wq_worker_running(struct task_struct *task)
}

/**
- * wq_worker_sleeping - a worker is going to sleep
- * @task: task going to sleep
+ * wq_worker_stopping - a worker is stopping
+ * @task: task stopping
+ * @voluntary: being called from schedule()
+ *
+ * This function is called from scheduling paths including schedule() and
+ * preemption when a busy worker is going off the CPU.
*
- * This function is called from schedule() when a busy worker is
- * going to sleep.
+ * As this function may be called from preempt_enable_notrace() and friends when
+ * !@voluntary, it must be notrace and limit reentrancy when @voluntary to avoid
+ * infinite recursions.
*/
-void wq_worker_sleeping(struct task_struct *task)
+void notrace wq_worker_stopping(struct task_struct *task, bool voluntary)
{
struct worker *worker = kthread_data(task);
struct worker_pool *pool;

+ if (!voluntary || task_is_running(task))
+ return;
+
/*
* Rescuers, which may not have all the fields set up like normal
* workers, also reach here, let's not access anything before
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index e00b1204a8e9..c34b876af16d 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -75,7 +75,7 @@ static inline struct worker *current_wq_worker(void)
* sched/ and workqueue.c.
*/
void wq_worker_running(struct task_struct *task);
-void wq_worker_sleeping(struct task_struct *task);
+void wq_worker_stopping(struct task_struct *task, bool voluntary);
work_func_t wq_worker_last_func(struct task_struct *task);

#endif /* _KERNEL_WORKQUEUE_INTERNAL_H */
--
2.40.1