Hello,
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per CPU for
per-cpu workqueues and one per NUMA node for unbound workqueues, a natural
result of per-cpu workqueues being served by per-cpu pools and unbound
workqueues by per-node pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound workqueues, it wasn't great:
the effective max_active was multiplied by the number of nodes. However, this
didn't cause huge problems because NUMA machines are relatively rare and the
node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating in 8639ecebc9b1 ("workqueue: Implement non-strict
affinity scope for unbound workqueues") implemented a more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate their concurrency level, which are the vast
majority; however, there are enough use cases which actually depend on
max_active to keep the level of concurrency from going bonkers, including
several IO handling workqueues that can issue a work item for each in-flight
IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of the in-flight IOs, so we don't
want to set max_active too low, but as soon as we increase max_active a bit,
we can end up with an unreasonable number of in-flight work items when many
CPUs issue IOs at the same time. i.e. the lowest acceptable max_active is
higher than the highest acceptable max_active.
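The bind is easy to quantify with assumed numbers (the CPU count and limits
below are illustrative, not taken from the report):

```c
#include <assert.h>

/*
 * With a per-cpu max_active, the system-wide worst case scales with the
 * number of CPUs that happen to issue IOs at the same time.
 */
static int worst_case_in_flight(int per_cpu_max_active, int nr_issuing_cpus)
{
	return per_cpu_max_active * nr_issuing_cpus;
}
```

e.g. a per-cpu limit large enough for one busy CPU (say 16) still permits
16 * 128 = 2048 in-flight work items when all 128 CPUs issue IOs at once,
which is exactly the "lowest acceptable is higher than highest acceptable"
situation.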
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement is decoupled from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations, which
would involve lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patchset implements system-wide nr_active mechanism
with the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of online
CPUs per node. e.g. A node with twice as many online CPUs will get twice
the portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how the max_active distribution turns out.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to a higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
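As a sketch of the split described above (the rounding direction and the
clamping order are assumptions for illustration, not the kernel's exact
formula):

```c
#include <assert.h>

#define WQ_DFL_MIN_ACTIVE	8	/* mirrors the constant named above */

/*
 * Give a node a share of the configured max_active proportional to its
 * online CPU count (rounded up), but never less than min_active.
 */
static int node_max_active(int max_active, int node_cpus, int total_cpus)
{
	int share = (max_active * node_cpus + total_cpus - 1) / total_cpus;
	int min_active = max_active < WQ_DFL_MIN_ACTIVE ?
			 max_active : WQ_DFL_MIN_ACTIVE;

	return share > min_active ? share : min_active;
}
```

e.g. with max_active=16 on a two-node machine with 12 and 4 online CPUs,
the nodes get 12 and 8 respectively; the min_active floor is why the
effective total (20) can exceed the configured 16.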
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When a pwq wants to run a work item, it has to obtain
the matching node's nr_active count. If the node is over its max_active,
the pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs, activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
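The round-robin completion path can be sketched in miniature, with plain
arrays standing in for pending_pwqs and the per-pwq inactive work lists
(all names here are illustrative, not the kernel's):

```c
#include <assert.h>

/*
 * Visit each pending pwq in turn, activating one inactive work item per
 * visit, until the freed-up nr_active budget is spent or everything is
 * drained. Returns the number of work items activated.
 */
static int activate_pending_round_robin(int *inactive_works, int nr_pwqs,
					int budget)
{
	int activated = 1, total = 0;

	while (budget > 0 && activated) {
		activated = 0;
		for (int i = 0; i < nr_pwqs && budget > 0; i++) {
			if (inactive_works[i] > 0) {
				inactive_works[i]--;	/* activate one item */
				budget--;
				total++;
				activated = 1;
			}
		}
	}
	return total;
}
```

With pending counts {3, 1, 0} and a budget of 3, the first pass takes one
item from each non-empty pwq and the second pass takes one more from the
first, rather than draining pwq 0 completely before touching pwq 1 — the
fairness property the round-robin is after.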
This patchset includes the following patches:
0001-workqueue-Move-pwq-max_active-to-wq-max_active.patch
0002-workqueue-Factor-out-pwq_is_empty.patch
0003-workqueue-Replace-pwq_activate_inactive_work-with-__.patch
0004-workqueue-Move-nr_active-handling-into-helpers.patch
0005-workqueue-Make-wq_adjust_max_active-round-robin-pwqs.patch
0006-workqueue-Add-first_possible_node-and-node_nr_cpus.patch
0007-workqueue-Move-pwq_dec_nr_in_flight-to-the-end-of-wo.patch
0008-workqueue-Introduce-struct-wq_node_nr_active.patch
0009-workqueue-Implement-system-wide-nr_active-enforcemen.patch
0010-workqueue-Reimplement-ordered-workqueue-using-shared.patch
This patchset is also available in the following git branch.
https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git unbound-system-wide-max_active
diffstat follows.
include/linux/workqueue.h | 35 ++-
kernel/workqueue.c | 669 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
2 files changed, 565 insertions(+), 139 deletions(-)
--
tejun
max_active is a workqueue-wide setting and the configured value is stored in
wq->saved_max_active; however, the effective value was stored in
pwq->max_active. While this is harmless, it makes max_active update process
more complicated and gets in the way of the planned max_active semantic
updates for unbound workqueues.
This patch moves pwq->max_active to wq->max_active. This simplifies the
code and makes freezing and noop max_active updates cheaper too. No
user-visible behavior change is intended.
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 119 +++++++++++++++++++--------------------------
1 file changed, 51 insertions(+), 68 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2989b57e154a..0e5dbeeb5778 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -250,7 +250,6 @@ struct pool_workqueue {
* is marked with WORK_STRUCT_INACTIVE iff it is in pwq->inactive_works.
*/
int nr_active; /* L: nr of active works */
- int max_active; /* L: max active works */
struct list_head inactive_works; /* L: inactive works */
struct list_head pwqs_node; /* WR: node on wq->pwqs */
struct list_head mayday_node; /* MD: node on wq->maydays */
@@ -298,7 +297,8 @@ struct workqueue_struct {
struct worker *rescuer; /* MD: rescue worker */
int nr_drainers; /* WQ: drain in progress */
- int saved_max_active; /* WQ: saved pwq max_active */
+ int max_active; /* WQ: max active works */
+ int saved_max_active; /* WQ: saved max_active */
struct workqueue_attrs *unbound_attrs; /* PW: only for unbound wqs */
struct pool_workqueue *dfl_pwq; /* PW: only for unbound wqs */
@@ -1486,7 +1486,7 @@ static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, unsigned long work_
pwq->nr_active--;
if (!list_empty(&pwq->inactive_works)) {
/* one down, submit an inactive one */
- if (pwq->nr_active < pwq->max_active)
+ if (pwq->nr_active < pwq->wq->max_active)
pwq_activate_first_inactive(pwq);
}
}
@@ -1787,7 +1787,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
pwq->nr_in_flight[pwq->work_color]++;
work_flags = work_color_to_flags(pwq->work_color);
- if (likely(pwq->nr_active < pwq->max_active)) {
+ if (likely(pwq->nr_active < pwq->wq->max_active)) {
if (list_empty(&pool->worklist))
pool->watchdog_ts = jiffies;
@@ -4136,50 +4136,6 @@ static void pwq_release_workfn(struct kthread_work *work)
}
}
-/**
- * pwq_adjust_max_active - update a pwq's max_active to the current setting
- * @pwq: target pool_workqueue
- *
- * If @pwq isn't freezing, set @pwq->max_active to the associated
- * workqueue's saved_max_active and activate inactive work items
- * accordingly. If @pwq is freezing, clear @pwq->max_active to zero.
- */
-static void pwq_adjust_max_active(struct pool_workqueue *pwq)
-{
- struct workqueue_struct *wq = pwq->wq;
- bool freezable = wq->flags & WQ_FREEZABLE;
- unsigned long flags;
-
- /* for @wq->saved_max_active */
- lockdep_assert_held(&wq->mutex);
-
- /* fast exit for non-freezable wqs */
- if (!freezable && pwq->max_active == wq->saved_max_active)
- return;
-
- /* this function can be called during early boot w/ irq disabled */
- raw_spin_lock_irqsave(&pwq->pool->lock, flags);
-
- /*
- * During [un]freezing, the caller is responsible for ensuring that
- * this function is called at least once after @workqueue_freezing
- * is updated and visible.
- */
- if (!freezable || !workqueue_freezing) {
- pwq->max_active = wq->saved_max_active;
-
- while (!list_empty(&pwq->inactive_works) &&
- pwq->nr_active < pwq->max_active)
- pwq_activate_first_inactive(pwq);
-
- kick_pool(pwq->pool);
- } else {
- pwq->max_active = 0;
- }
-
- raw_spin_unlock_irqrestore(&pwq->pool->lock, flags);
-}
-
/* initialize newly allocated @pwq which is associated with @wq and @pool */
static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
struct worker_pool *pool)
@@ -4212,9 +4168,6 @@ static void link_pwq(struct pool_workqueue *pwq)
/* set the matching work_color */
pwq->work_color = wq->work_color;
- /* sync max_active to the current setting */
- pwq_adjust_max_active(pwq);
-
/* link in @pwq */
list_add_rcu(&pwq->pwqs_node, &wq->pwqs);
}
@@ -4665,6 +4618,46 @@ static int init_rescuer(struct workqueue_struct *wq)
return 0;
}
+/**
+ * wq_adjust_max_active - update a wq's max_active to the current setting
+ * @wq: target workqueue
+ *
+ * If @wq isn't freezing, set @wq->max_active to the saved_max_active and
+ * activate inactive work items accordingly. If @wq is freezing, clear
+ * @wq->max_active to zero.
+ */
+static void wq_adjust_max_active(struct workqueue_struct *wq)
+{
+ struct pool_workqueue *pwq;
+
+ lockdep_assert_held(&wq->mutex);
+
+ if ((wq->flags & WQ_FREEZABLE) && workqueue_freezing) {
+ wq->max_active = 0;
+ return;
+ }
+
+ if (wq->max_active == wq->saved_max_active)
+ return;
+
+ wq->max_active = wq->saved_max_active;
+
+ for_each_pwq(pwq, wq) {
+ unsigned long flags;
+
+ /* this function can be called during early boot w/ irq disabled */
+ raw_spin_lock_irqsave(&pwq->pool->lock, flags);
+
+ while (!list_empty(&pwq->inactive_works) &&
+ pwq->nr_active < wq->max_active)
+ pwq_activate_first_inactive(pwq);
+
+ kick_pool(pwq->pool);
+
+ raw_spin_unlock_irqrestore(&pwq->pool->lock, flags);
+ }
+}
+
__printf(1, 4)
struct workqueue_struct *alloc_workqueue(const char *fmt,
unsigned int flags,
@@ -4672,7 +4665,6 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
{
va_list args;
struct workqueue_struct *wq;
- struct pool_workqueue *pwq;
/*
* Unbound && max_active == 1 used to imply ordered, which is no longer
@@ -4707,6 +4699,7 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
/* init wq */
wq->flags = flags;
+ wq->max_active = max_active;
wq->saved_max_active = max_active;
mutex_init(&wq->mutex);
atomic_set(&wq->nr_pwqs_to_flush, 0);
@@ -4735,8 +4728,7 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
mutex_lock(&wq_pool_mutex);
mutex_lock(&wq->mutex);
- for_each_pwq(pwq, wq)
- pwq_adjust_max_active(pwq);
+ wq_adjust_max_active(wq);
mutex_unlock(&wq->mutex);
list_add_tail_rcu(&wq->list, &workqueues);
@@ -4874,8 +4866,6 @@ EXPORT_SYMBOL_GPL(destroy_workqueue);
*/
void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
{
- struct pool_workqueue *pwq;
-
/* disallow meddling with max_active for ordered workqueues */
if (WARN_ON(wq->flags & __WQ_ORDERED_EXPLICIT))
return;
@@ -4886,10 +4876,7 @@ void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
wq->flags &= ~__WQ_ORDERED;
wq->saved_max_active = max_active;
-
- for_each_pwq(pwq, wq)
- pwq_adjust_max_active(pwq);
-
+ wq_adjust_max_active(wq);
mutex_unlock(&wq->mutex);
}
EXPORT_SYMBOL_GPL(workqueue_set_max_active);
@@ -5135,8 +5122,8 @@ static void show_pwq(struct pool_workqueue *pwq)
pr_info(" pwq %d:", pool->id);
pr_cont_pool_info(pool);
- pr_cont(" active=%d/%d refcnt=%d%s\n",
- pwq->nr_active, pwq->max_active, pwq->refcnt,
+ pr_cont(" active=%d refcnt=%d%s\n",
+ pwq->nr_active, pwq->refcnt,
!list_empty(&pwq->mayday_node) ? " MAYDAY" : "");
hash_for_each(pool->busy_hash, bkt, worker, hentry) {
@@ -5684,7 +5671,6 @@ EXPORT_SYMBOL_GPL(work_on_cpu_safe_key);
void freeze_workqueues_begin(void)
{
struct workqueue_struct *wq;
- struct pool_workqueue *pwq;
mutex_lock(&wq_pool_mutex);
@@ -5693,8 +5679,7 @@ void freeze_workqueues_begin(void)
list_for_each_entry(wq, &workqueues, list) {
mutex_lock(&wq->mutex);
- for_each_pwq(pwq, wq)
- pwq_adjust_max_active(pwq);
+ wq_adjust_max_active(wq);
mutex_unlock(&wq->mutex);
}
@@ -5759,7 +5744,6 @@ bool freeze_workqueues_busy(void)
void thaw_workqueues(void)
{
struct workqueue_struct *wq;
- struct pool_workqueue *pwq;
mutex_lock(&wq_pool_mutex);
@@ -5771,8 +5755,7 @@ void thaw_workqueues(void)
/* restore max_active and repopulate worklist */
list_for_each_entry(wq, &workqueues, list) {
mutex_lock(&wq->mutex);
- for_each_pwq(pwq, wq)
- pwq_adjust_max_active(pwq);
+ wq_adjust_max_active(wq);
mutex_unlock(&wq->mutex);
}
--
2.43.0
The "!pwq->nr_active && list_empty(&pwq->inactive_works)" test is repeated
multiple times. Let's factor it out into pwq_is_empty().
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0e5dbeeb5778..c6c90af89d3d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1447,6 +1447,11 @@ static void put_pwq_unlocked(struct pool_workqueue *pwq)
}
}
+static bool pwq_is_empty(struct pool_workqueue *pwq)
+{
+ return !pwq->nr_active && list_empty(&pwq->inactive_works);
+}
+
static void pwq_activate_inactive_work(struct work_struct *work)
{
struct pool_workqueue *pwq = get_work_pwq(work);
@@ -3310,7 +3315,7 @@ void drain_workqueue(struct workqueue_struct *wq)
bool drained;
raw_spin_lock_irq(&pwq->pool->lock);
- drained = !pwq->nr_active && list_empty(&pwq->inactive_works);
+ drained = pwq_is_empty(pwq);
raw_spin_unlock_irq(&pwq->pool->lock);
if (drained)
@@ -4760,7 +4765,7 @@ static bool pwq_busy(struct pool_workqueue *pwq)
if ((pwq != pwq->wq->dfl_pwq) && (pwq->refcnt > 1))
return true;
- if (pwq->nr_active || !list_empty(&pwq->inactive_works))
+ if (!pwq_is_empty(pwq))
return true;
return false;
@@ -5197,7 +5202,7 @@ void show_one_workqueue(struct workqueue_struct *wq)
unsigned long flags;
for_each_pwq(pwq, wq) {
- if (pwq->nr_active || !list_empty(&pwq->inactive_works)) {
+ if (!pwq_is_empty(pwq)) {
idle = false;
break;
}
@@ -5209,7 +5214,7 @@ void show_one_workqueue(struct workqueue_struct *wq)
for_each_pwq(pwq, wq) {
raw_spin_lock_irqsave(&pwq->pool->lock, flags);
- if (pwq->nr_active || !list_empty(&pwq->inactive_works)) {
+ if (!pwq_is_empty(pwq)) {
/*
* Defer printing to avoid deadlocks in console
* drivers that queue work while holding locks
--
2.43.0
To prepare for unbound nr_active handling improvements, move the work
activation part of pwq_activate_inactive_work() into __pwq_activate_work()
and add pwq_activate_work() which tests WORK_STRUCT_INACTIVE and updates
nr_active.
pwq_activate_first_inactive() and try_to_grab_pending() are updated to use
pwq_activate_work(). The latter conversion is functionally identical. For
the former, this conversion adds an unnecessary WORK_STRUCT_INACTIVE test.
This is temporary and will be removed by the next patch.
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 31 +++++++++++++++++++++++++------
1 file changed, 25 insertions(+), 6 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c6c90af89d3d..adec03e63d50 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1452,16 +1452,36 @@ static bool pwq_is_empty(struct pool_workqueue *pwq)
return !pwq->nr_active && list_empty(&pwq->inactive_works);
}
-static void pwq_activate_inactive_work(struct work_struct *work)
+static void __pwq_activate_work(struct pool_workqueue *pwq,
+ struct work_struct *work)
{
- struct pool_workqueue *pwq = get_work_pwq(work);
-
trace_workqueue_activate_work(work);
if (list_empty(&pwq->pool->worklist))
pwq->pool->watchdog_ts = jiffies;
move_linked_works(work, &pwq->pool->worklist, NULL);
__clear_bit(WORK_STRUCT_INACTIVE_BIT, work_data_bits(work));
+}
+
+/**
+ * pwq_activate_work - Activate a work item if inactive
+ * @pwq: pool_workqueue @work belongs to
+ * @work: work item to activate
+ *
+ * Returns %true if activated. %false if already active.
+ */
+static bool pwq_activate_work(struct pool_workqueue *pwq,
+ struct work_struct *work)
+{
+ struct worker_pool *pool = pwq->pool;
+
+ lockdep_assert_held(&pool->lock);
+
+ if (!(*work_data_bits(work) & WORK_STRUCT_INACTIVE))
+ return false;
+
pwq->nr_active++;
+ __pwq_activate_work(pwq, work);
+ return true;
}
static void pwq_activate_first_inactive(struct pool_workqueue *pwq)
@@ -1469,7 +1489,7 @@ static void pwq_activate_first_inactive(struct pool_workqueue *pwq)
struct work_struct *work = list_first_entry(&pwq->inactive_works,
struct work_struct, entry);
- pwq_activate_inactive_work(work);
+ pwq_activate_work(pwq, work);
}
/**
@@ -1607,8 +1627,7 @@ static int try_to_grab_pending(struct work_struct *work, bool is_dwork,
* management later on and cause stall. Make sure the work
* item is activated before grabbing.
*/
- if (*work_data_bits(work) & WORK_STRUCT_INACTIVE)
- pwq_activate_inactive_work(work);
+ pwq_activate_work(pwq, work);
list_del_init(&work->entry);
pwq_dec_nr_in_flight(pwq, *work_data_bits(work));
--
2.43.0
__queue_work(), pwq_dec_nr_in_flight() and wq_adjust_max_active() were
open-coding nr_active handling, which is fine given that the operations are
trivial. However, the planned unbound nr_active update will make them more
complicated, so let's move them into helpers.
- pwq_tryinc_nr_active() is added. It increments nr_active if under the
max_active limit and returns a boolean indicating whether the inc was
successful. Note that the function is structured to accommodate future
changes. __queue_work() is updated to use the new helper.
- pwq_activate_first_inactive() is updated to use pwq_tryinc_nr_active() and
thus no longer assumes that nr_active is under max_active and returns a
boolean to indicate whether a work item has been activated.
- wq_adjust_max_active() no longer tests directly whether a work item can be
activated. Instead, it's updated to use the return value of
pwq_activate_first_inactive() to tell whether a work item has been
activated.
- nr_active decrement and activating the first inactive work item is
factored into pwq_dec_nr_active().
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 80 ++++++++++++++++++++++++++++++++++++----------
1 file changed, 63 insertions(+), 17 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index adec03e63d50..21bd4e4351f7 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1484,12 +1484,66 @@ static bool pwq_activate_work(struct pool_workqueue *pwq,
return true;
}
-static void pwq_activate_first_inactive(struct pool_workqueue *pwq)
+/**
+ * pwq_tryinc_nr_active - Try to increment nr_active for a pwq
+ * @pwq: pool_workqueue of interest
+ *
+ * Try to increment nr_active for @pwq. Returns %true if an nr_active count is
+ * successfully obtained. %false otherwise.
+ */
+static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq)
{
- struct work_struct *work = list_first_entry(&pwq->inactive_works,
- struct work_struct, entry);
+ struct workqueue_struct *wq = pwq->wq;
+ struct worker_pool *pool = pwq->pool;
+ bool obtained;
- pwq_activate_work(pwq, work);
+ lockdep_assert_held(&pool->lock);
+
+ obtained = pwq->nr_active < wq->max_active;
+
+ if (obtained)
+ pwq->nr_active++;
+ return obtained;
+}
+
+/**
+ * pwq_activate_first_inactive - Activate the first inactive work item on a pwq
+ * @pwq: pool_workqueue of interest
+ *
+ * Activate the first inactive work item of @pwq if available and allowed by
+ * max_active limit.
+ *
+ * Returns %true if an inactive work item has been activated. %false if no
+ * inactive work item is found or max_active limit is reached.
+ */
+static bool pwq_activate_first_inactive(struct pool_workqueue *pwq)
+{
+ struct work_struct *work =
+ list_first_entry_or_null(&pwq->inactive_works,
+ struct work_struct, entry);
+
+ if (work && pwq_tryinc_nr_active(pwq)) {
+ __pwq_activate_work(pwq, work);
+ return true;
+ } else {
+ return false;
+ }
+}
+
+/**
+ * pwq_dec_nr_active - Retire an active count
+ * @pwq: pool_workqueue of interest
+ *
+ * Decrement @pwq's nr_active and try to activate the first inactive work item.
+ */
+static void pwq_dec_nr_active(struct pool_workqueue *pwq)
+{
+ struct worker_pool *pool = pwq->pool;
+
+ lockdep_assert_held(&pool->lock);
+
+ pwq->nr_active--;
+ pwq_activate_first_inactive(pwq);
}
/**
@@ -1507,14 +1561,8 @@ static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, unsigned long work_
{
int color = get_work_color(work_data);
- if (!(work_data & WORK_STRUCT_INACTIVE)) {
- pwq->nr_active--;
- if (!list_empty(&pwq->inactive_works)) {
- /* one down, submit an inactive one */
- if (pwq->nr_active < pwq->wq->max_active)
- pwq_activate_first_inactive(pwq);
- }
- }
+ if (!(work_data & WORK_STRUCT_INACTIVE))
+ pwq_dec_nr_active(pwq);
pwq->nr_in_flight[color]--;
@@ -1811,12 +1859,11 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
pwq->nr_in_flight[pwq->work_color]++;
work_flags = work_color_to_flags(pwq->work_color);
- if (likely(pwq->nr_active < pwq->wq->max_active)) {
+ if (pwq_tryinc_nr_active(pwq)) {
if (list_empty(&pool->worklist))
pool->watchdog_ts = jiffies;
trace_workqueue_activate_work(work);
- pwq->nr_active++;
insert_work(pwq, work, &pool->worklist, work_flags);
kick_pool(pool);
} else {
@@ -4672,9 +4719,8 @@ static void wq_adjust_max_active(struct workqueue_struct *wq)
/* this function can be called during early boot w/ irq disabled */
raw_spin_lock_irqsave(&pwq->pool->lock, flags);
- while (!list_empty(&pwq->inactive_works) &&
- pwq->nr_active < wq->max_active)
- pwq_activate_first_inactive(pwq);
+ while (pwq_activate_first_inactive(pwq))
+ ;
kick_pool(pwq->pool);
--
2.43.0
wq_adjust_max_active() needs to activate work items after max_active is
increased. Previously, it did that by visiting each pwq once, activating all
work items that could be activated. While this makes sense with per-pwq
nr_active, nr_active will be shared across multiple pwqs for unbound wqs.
Then, we'd want to round-robin through pwqs to be fairer.
In preparation, this patch makes wq_adjust_max_active() round-robin pwqs
while activating. While the activation ordering changes, this shouldn't
cause user-noticeable behavior changes.
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 31 +++++++++++++++++++------------
1 file changed, 19 insertions(+), 12 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 21bd4e4351f7..143c0b31505e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4699,7 +4699,7 @@ static int init_rescuer(struct workqueue_struct *wq)
*/
static void wq_adjust_max_active(struct workqueue_struct *wq)
{
- struct pool_workqueue *pwq;
+ bool activated;
lockdep_assert_held(&wq->mutex);
@@ -4713,19 +4713,26 @@ static void wq_adjust_max_active(struct workqueue_struct *wq)
wq->max_active = wq->saved_max_active;
- for_each_pwq(pwq, wq) {
- unsigned long flags;
-
- /* this function can be called during early boot w/ irq disabled */
- raw_spin_lock_irqsave(&pwq->pool->lock, flags);
-
- while (pwq_activate_first_inactive(pwq))
- ;
+ /*
+ * Round-robin through pwq's activating the first inactive work item
+ * until max_active is filled.
+ */
+ do {
+ struct pool_workqueue *pwq;
- kick_pool(pwq->pool);
+ activated = false;
+ for_each_pwq(pwq, wq) {
+ unsigned long flags;
- raw_spin_unlock_irqrestore(&pwq->pool->lock, flags);
- }
+ /* can be called during early boot w/ irq disabled */
+ raw_spin_lock_irqsave(&pwq->pool->lock, flags);
+ if (pwq_activate_first_inactive(pwq)) {
+ activated = true;
+ kick_pool(pwq->pool);
+ }
+ raw_spin_unlock_irqrestore(&pwq->pool->lock, flags);
+ }
+ } while (activated);
}
__printf(1, 4)
--
2.43.0
Add first_possible_node and node_nr_cpus[] to track the first possible NUMA
node and the number of online CPUs in each NUMA node, respectively. These
will be used to implement system-wide nr_active for unbound workqueues.
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 143c0b31505e..a9f63cd61454 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -53,6 +53,7 @@
#include <linux/nmi.h>
#include <linux/kvm_para.h>
#include <linux/delay.h>
+#include <linux/topology.h>
#include "workqueue_internal.h"
@@ -338,6 +339,8 @@ struct wq_pod_type {
int *cpu_pod; /* cpu -> pod */
};
+static int first_possible_node __read_mostly;
+static int node_nr_cpus[MAX_NUMNODES] __read_mostly;
static struct wq_pod_type wq_pod_types[WQ_AFFN_NR_TYPES];
static enum wq_affn_scope wq_affn_dfl = WQ_AFFN_CACHE;
@@ -5610,6 +5613,8 @@ int workqueue_online_cpu(unsigned int cpu)
struct workqueue_struct *wq;
int pi;
+ node_nr_cpus[cpu_to_node(cpu)]++;
+
mutex_lock(&wq_pool_mutex);
for_each_pool(pool, pi) {
@@ -5665,6 +5670,8 @@ int workqueue_offline_cpu(unsigned int cpu)
}
mutex_unlock(&wq_pool_mutex);
+ node_nr_cpus[cpu_to_node(cpu)]--;
+
return 0;
}
@@ -6601,6 +6608,9 @@ void __init workqueue_init_early(void)
BUILD_BUG_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
+ first_possible_node = first_node(node_states[N_POSSIBLE]);
+ node_nr_cpus[cpu_to_node(0)]++;
+
BUG_ON(!alloc_cpumask_var(&wq_unbound_cpumask, GFP_KERNEL));
cpumask_copy(wq_unbound_cpumask, cpu_possible_mask);
restrict_unbound_cpumask("HK_TYPE_WQ", housekeeping_cpumask(HK_TYPE_WQ));
--
2.43.0
The planned shared nr_active handling for unbound workqueues will make
pwq_dec_nr_active() sometimes drop the pool lock temporarily to acquire
other pool locks, which is necessary as retirement of an nr_active count
from one pool may need to kick off an inactive work item in another pool.
This patch moves pwq_dec_nr_in_flight() call in try_to_grab_pending() to the
end of work item handling so that work item state changes stay atomic.
process_one_work() which is the other user of pwq_dec_nr_in_flight() already
calls it at the end of work item handling. Comments are added to both call
sites and pwq_dec_nr_in_flight().
This shouldn't cause any behavior changes.
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a9f63cd61454..9982c63470e5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1557,6 +1557,11 @@ static void pwq_dec_nr_active(struct pool_workqueue *pwq)
* A work either has completed or is removed from pending queue,
* decrement nr_in_flight of its pwq and handle workqueue flushing.
*
+ * NOTE:
+ * For unbound workqueues, this function may temporarily drop @pwq->pool->lock
+ * and thus should be called after all other state updates for the in-flight
+ * work item is complete.
+ *
* CONTEXT:
* raw_spin_lock_irq(pool->lock).
*/
@@ -1681,11 +1686,13 @@ static int try_to_grab_pending(struct work_struct *work, bool is_dwork,
pwq_activate_work(pwq, work);
list_del_init(&work->entry);
- pwq_dec_nr_in_flight(pwq, *work_data_bits(work));
/* work->data points to pwq iff queued, point to pool */
set_work_pool_and_keep_pending(work, pool->id);
+ /* must be the last step, see the function comment */
+ pwq_dec_nr_in_flight(pwq, *work_data_bits(work));
+
raw_spin_unlock(&pool->lock);
rcu_read_unlock();
return 1;
@@ -2745,6 +2752,8 @@ __acquires(&pool->lock)
worker->current_func = NULL;
worker->current_pwq = NULL;
worker->current_color = INT_MAX;
+
+ /* must be the last step, see the function comment */
pwq_dec_nr_in_flight(pwq, work_data);
}
--
2.43.0
Currently, for both percpu and unbound workqueues, max_active applies
per-cpu, which is a recent change for unbound workqueues. The change for
unbound workqueues was a significant departure from the previous behavior of
per-node application. It made some use cases create an undesirable number of
concurrent work items and left no good way of fixing them. To address the
problem, workqueue is implementing a NUMA node segmented global nr_active
mechanism, which will be explained further in the next patch.
As a preparation, this patch introduces struct wq_node_nr_active. It's a
data structure allocated for each workqueue and NUMA node pair and currently
only tracks the workqueue's number of active work items on the node. This is
split out from the next patch to make it easier to understand and review.
Note that there is an extra wq_node_nr_active allocated for the invalid node
nr_node_ids, which is used to track nr_active for pools which don't have a
NUMA node associated, such as the default fallback system-wide pool.
This doesn't cause any behavior changes visible to userland yet. The next
patch will expand to implement the control mechanism on top.
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 132 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 129 insertions(+), 3 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9982c63470e5..6aa6f2eee94e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -278,6 +278,16 @@ struct wq_flusher {
struct wq_device;
+/*
+ * Unlike in a per-cpu workqueue where max_active limits its concurrency level
+ * on each CPU, in an unbound workqueue, max_active applies to the whole system.
+ * As sharing a single nr_active across multiple sockets can be very expensive,
+ * the counting and enforcement is per NUMA node.
+ */
+struct wq_node_nr_active {
+ atomic_t count; /* per-node nr_active count */
+};
+
/*
* The externally visible workqueue. It relays the issued work items to
* the appropriate worker_pool through its pool_workqueues.
@@ -324,6 +334,7 @@ struct workqueue_struct {
/* hot fields used during command issue, aligned to cacheline */
unsigned int flags ____cacheline_aligned; /* WQ: WQ_* flags */
struct pool_workqueue __percpu __rcu **cpu_pwq; /* I: per-cpu pwqs */
+ struct wq_node_nr_active **node_nr_active; /* I: per-node nr_active */
};
static struct kmem_cache *pwq_cache;
@@ -1398,6 +1409,31 @@ work_func_t wq_worker_last_func(struct task_struct *task)
return worker->last_func;
}
+/**
+ * wq_node_nr_active - Determine wq_node_nr_active to use
+ * @wq: workqueue of interest
+ * @node: NUMA node, can be %NUMA_NO_NODE
+ *
+ * Determine wq_node_nr_active to use for @wq on @node. Returns:
+ *
+ * - %NULL for per-cpu workqueues as they don't need to use shared nr_active.
+ *
+ * - node_nr_active[nr_node_ids] if @node is %NUMA_NO_NODE.
+ *
+ * - Otherwise, node_nr_active[@node].
+ */
+static struct wq_node_nr_active *wq_node_nr_active(struct workqueue_struct *wq,
+ int node)
+{
+ if (!(wq->flags & WQ_UNBOUND))
+ return NULL;
+
+ if (node == NUMA_NO_NODE)
+ node = nr_node_ids;
+
+ return wq->node_nr_active[node];
+}
+
/**
* get_pwq - get an extra reference on the specified pool_workqueue
* @pwq: pool_workqueue to get
@@ -1476,12 +1512,17 @@ static bool pwq_activate_work(struct pool_workqueue *pwq,
struct work_struct *work)
{
struct worker_pool *pool = pwq->pool;
+ struct wq_node_nr_active *nna;
lockdep_assert_held(&pool->lock);
if (!(*work_data_bits(work) & WORK_STRUCT_INACTIVE))
return false;
+ nna = wq_node_nr_active(pwq->wq, pool->node);
+ if (nna)
+ atomic_inc(&nna->count);
+
pwq->nr_active++;
__pwq_activate_work(pwq, work);
return true;
@@ -1498,12 +1539,21 @@ static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq)
{
struct workqueue_struct *wq = pwq->wq;
struct worker_pool *pool = pwq->pool;
- bool obtained;
+ struct wq_node_nr_active *nna = wq_node_nr_active(wq, pool->node);
+ bool obtained = false;
lockdep_assert_held(&pool->lock);
- obtained = pwq->nr_active < wq->max_active;
+ if (!nna) {
+ /* per-cpu workqueue, pwq->nr_active is sufficient */
+ obtained = pwq->nr_active < wq->max_active;
+ goto out;
+ }
+
+ atomic_inc(&nna->count);
+ obtained = true;
+out:
if (obtained)
pwq->nr_active++;
return obtained;
@@ -1542,10 +1592,26 @@ static bool pwq_activate_first_inactive(struct pool_workqueue *pwq)
static void pwq_dec_nr_active(struct pool_workqueue *pwq)
{
struct worker_pool *pool = pwq->pool;
+ struct wq_node_nr_active *nna = wq_node_nr_active(pwq->wq, pool->node);
lockdep_assert_held(&pool->lock);
+ /*
+ * @pwq->nr_active should be decremented for both percpu and unbound
+ * workqueues.
+ */
pwq->nr_active--;
+
+ /*
+ * For a percpu workqueue, it's simple. Just need to kick the first
+ * inactive work item on @pwq itself.
+ */
+ if (!nna) {
+ pwq_activate_first_inactive(pwq);
+ return;
+ }
+
+ atomic_dec(&nna->count);
pwq_activate_first_inactive(pwq);
}
@@ -4004,12 +4070,64 @@ static void wq_free_lockdep(struct workqueue_struct *wq)
}
#endif
+static void free_node_nr_active(struct wq_node_nr_active **ptr_ar)
+{
+ int i;
+
+ if (!ptr_ar)
+ return;
+ for (i = 0; i < nr_node_ids + 1; i++)
+ kfree(ptr_ar[i]);
+ kfree(ptr_ar);
+}
+
+static void init_node_nr_active(struct wq_node_nr_active *nna)
+{
+ atomic_set(&nna->count, 0);
+}
+
+/*
+ * Each node's nr_active counter will be accessed mostly from its own node and
+ * should be allocated in the node.
+ */
+static struct wq_node_nr_active **alloc_node_nr_active(void)
+{
+ struct wq_node_nr_active **nna_ar, *nna;
+ int node;
+
+ nna_ar = kcalloc(nr_node_ids + 1, sizeof(*nna_ar), GFP_KERNEL);
+ if (!nna_ar)
+ return NULL;
+
+ for_each_node(node) {
+ nna = kzalloc_node(sizeof(*nna), GFP_KERNEL, node);
+ if (!nna)
+ goto err_free;
+ init_node_nr_active(nna);
+ nna_ar[node] = nna;
+ }
+
+ /* [nr_node_ids] is used as the fallback */
+ nna = kzalloc_node(sizeof(*nna), GFP_KERNEL, NUMA_NO_NODE);
+ if (!nna)
+ goto err_free;
+ init_node_nr_active(nna);
+ nna_ar[nr_node_ids] = nna;
+
+ return nna_ar;
+
+err_free:
+ free_node_nr_active(nna_ar);
+ return NULL;
+}
+
static void rcu_free_wq(struct rcu_head *rcu)
{
struct workqueue_struct *wq =
container_of(rcu, struct workqueue_struct, rcu);
wq_free_lockdep(wq);
+ free_node_nr_active(wq->node_nr_active);
free_percpu(wq->cpu_pwq);
free_workqueue_attrs(wq->unbound_attrs);
kfree(wq);
@@ -4800,8 +4918,14 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
wq_init_lockdep(wq);
INIT_LIST_HEAD(&wq->list);
+ if (flags & WQ_UNBOUND) {
+ wq->node_nr_active = alloc_node_nr_active();
+ if (!wq->node_nr_active)
+ goto err_unreg_lockdep;
+ }
+
if (alloc_and_link_pwqs(wq) < 0)
- goto err_unreg_lockdep;
+ goto err_free_node_nr_active;
if (wq_online && init_rescuer(wq) < 0)
goto err_destroy;
@@ -4826,6 +4950,8 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
return wq;
+err_free_node_nr_active:
+ free_node_nr_active(wq->node_nr_active);
err_unreg_lockdep:
wq_unregister_lockdep(wq);
wq_free_lockdep(wq);
--
2.43.0
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per CPU for
per-cpu workqueues and one per NUMA node for unbound workqueues, which was a
natural result of per-cpu workqueues being served by per-cpu pools and
unbound workqueues by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound workqueues, it wasn't great
in that NUMA machines would get an effective max_active multiplied by the
number of nodes, but it didn't cause huge problems because NUMA machines are
relatively rare and the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating in 8639ecebc9b1 ("workqueue: Implement non-strict
affinity scope for unbound workqueues") implemented a more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users that self-regulate their concurrency level, which are the vast
majority; however, there are enough use cases which actually depend on
max_active to keep the level of concurrency from going bonkers, including
several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of the in-flight IOs, so we don't
want to set max_active too low, but as soon as we increase max_active a bit,
we can end up with an unreasonable number of in-flight work items when many
CPUs issue IOs at the same time. i.e. the lowest acceptable max_active is
higher than the highest acceptable max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement is decoupled from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements a system-wide nr_active mechanism
with the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of online
CPUs per node. e.g. A node with twice as many online CPUs will get twice
the portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to a higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Naohiro Aota <[email protected]>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
---
include/linux/workqueue.h | 35 ++++-
kernel/workqueue.c | 268 ++++++++++++++++++++++++++++++++++----
2 files changed, 278 insertions(+), 25 deletions(-)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 24b1e5070f4d..ad97453e7c3a 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -405,6 +405,13 @@ enum {
WQ_MAX_ACTIVE = 512, /* I like 512, better ideas? */
WQ_UNBOUND_MAX_ACTIVE = WQ_MAX_ACTIVE,
WQ_DFL_ACTIVE = WQ_MAX_ACTIVE / 2,
+
+ /*
+ * Per-node default cap on min_active. Unless explicitly set, min_active
+ * is set to min(max_active, WQ_DFL_MIN_ACTIVE). For more details, see
+ * workqueue_struct->min_active definition.
+ */
+ WQ_DFL_MIN_ACTIVE = 8,
};
/*
@@ -447,11 +454,33 @@ extern struct workqueue_struct *system_freezable_power_efficient_wq;
* alloc_workqueue - allocate a workqueue
* @fmt: printf format for the name of the workqueue
* @flags: WQ_* flags
- * @max_active: max in-flight work items per CPU, 0 for default
+ * @max_active: max in-flight work items, 0 for default
* remaining args: args for @fmt
*
- * Allocate a workqueue with the specified parameters. For detailed
- * information on WQ_* flags, please refer to
+ * For a per-cpu workqueue, @max_active limits the number of in-flight work
+ * items for each CPU. e.g. @max_active of 1 indicates that each CPU can be
+ * executing at most one work item for the workqueue.
+ *
+ * For unbound workqueues, @max_active limits the number of in-flight work items
+ * for the whole system. e.g. @max_active of 16 indicates that there can be
+ * at most 16 work items executing for the workqueue in the whole system.
+ *
+ * As sharing the same active counter for an unbound workqueue across multiple
+ * NUMA nodes can be expensive, @max_active is distributed to each NUMA node
+ * according to the proportion of the number of online CPUs and enforced
+ * independently.
+ *
+ * Depending on online CPU distribution, a node may end up with per-node
+ * max_active which is significantly lower than @max_active, which can lead to
+ * deadlocks if the per-node concurrency limit is lower than the maximum number
+ * of interdependent work items for the workqueue.
+ *
+ * To guarantee forward progress regardless of online CPU distribution, the
+ * concurrency limit on every node is guaranteed to be equal to or greater than
+ * min_active which is set to min(@max_active, %WQ_DFL_MIN_ACTIVE). This means
+ * that the sum of per-node max_active's may be larger than @max_active.
+ *
+ * For detailed information on %WQ_* flags, please refer to
* Documentation/core-api/workqueue.rst.
*
* RETURNS:
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6aa6f2eee94e..0017e9094034 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -123,6 +123,9 @@ enum {
*
* L: pool->lock protected. Access with pool->lock held.
*
+ * LN: pool->lock and wq_node_nr_active->lock protected for writes. Either for
+ * reads.
+ *
* K: Only modified by worker while holding pool->lock. Can be safely read by
* self, while holding pool->lock or from IRQ context if %current is the
* kworker.
@@ -241,17 +244,18 @@ struct pool_workqueue {
* pwq->inactive_works instead of pool->worklist and marked with
* WORK_STRUCT_INACTIVE.
*
- * All work items marked with WORK_STRUCT_INACTIVE do not participate
- * in pwq->nr_active and all work items in pwq->inactive_works are
- * marked with WORK_STRUCT_INACTIVE. But not all WORK_STRUCT_INACTIVE
- * work items are in pwq->inactive_works. Some of them are ready to
- * run in pool->worklist or worker->scheduled. Those work itmes are
- * only struct wq_barrier which is used for flush_work() and should
- * not participate in pwq->nr_active. For non-barrier work item, it
- * is marked with WORK_STRUCT_INACTIVE iff it is in pwq->inactive_works.
+ * All work items marked with WORK_STRUCT_INACTIVE do not participate in
+ * nr_active and all work items in pwq->inactive_works are marked with
+ * WORK_STRUCT_INACTIVE. But not all WORK_STRUCT_INACTIVE work items are
+ * in pwq->inactive_works. Some of them are ready to run in
+ * pool->worklist or worker->scheduled. Those work items are only struct
+ * wq_barrier which is used for flush_work() and should not participate
+ * in nr_active. For non-barrier work item, it is marked with
+ * WORK_STRUCT_INACTIVE iff it is in pwq->inactive_works.
*/
int nr_active; /* L: nr of active works */
struct list_head inactive_works; /* L: inactive works */
+ struct list_head pending_node; /* LN: node on wq_node_nr_active->pending_pwqs */
struct list_head pwqs_node; /* WR: node on wq->pwqs */
struct list_head mayday_node; /* MD: node on wq->maydays */
@@ -283,9 +287,18 @@ struct wq_device;
* on each CPU, in an unbound workqueue, max_active applies to the whole system.
* As sharing a single nr_active across multiple sockets can be very expensive,
* the counting and enforcement is per NUMA node.
+ *
+ * The following struct is used to enforce per-node max_active. When a pwq wants
+ * to start executing a work item, it should increment ->count using
+ * tryinc_node_nr_active(). If acquisition fails due to the count already being
+ * over wq_node_max_active(), the pwq is queued on ->pending_pwqs. As in-flight
+ * work items finish and decrement ->count, node_activate_pending_pwq()
+ * activates the pending pwqs in round-robin order.
*/
struct wq_node_nr_active {
atomic_t count; /* per-node nr_active count */
+ raw_spinlock_t lock; /* nests inside pool locks */
+ struct list_head pending_pwqs; /* LN: pwqs with inactive works */
};
/*
@@ -308,8 +321,12 @@ struct workqueue_struct {
struct worker *rescuer; /* MD: rescue worker */
int nr_drainers; /* WQ: drain in progress */
+
+ /* See alloc_workqueue() function comment for info on min/max_active */
int max_active; /* WQ: max active works */
+ int min_active; /* WQ: min active works */
int saved_max_active; /* WQ: saved max_active */
+ int saved_min_active; /* WQ: saved min_active */
struct workqueue_attrs *unbound_attrs; /* PW: only for unbound wqs */
struct pool_workqueue *dfl_pwq; /* PW: only for unbound wqs */
@@ -1434,6 +1451,31 @@ static struct wq_node_nr_active *wq_node_nr_active(struct workqueue_struct *wq,
return wq->node_nr_active[node];
}
+/**
+ * wq_node_max_active - Determine max_active to use
+ * @wq: workqueue of interest
+ * @node: NUMA node, can be %NUMA_NO_NODE
+ *
+ * Determine the max_active @wq should use on @node. @wq must be unbound.
+ * max_active is distributed among nodes according to the proportions of numbers
+ * of online cpus. The result is always between @wq->min_active and max_active.
+ */
+static int wq_node_max_active(struct workqueue_struct *wq, int node)
+{
+ int node_max_active;
+
+ if (node == NUMA_NO_NODE)
+ return wq->min_active;
+
+ node_max_active = DIV_ROUND_UP(wq->max_active * node_nr_cpus[node],
+ num_online_cpus());
+ /*
+ * We aren't synchronizing against CPU hotplug operations and can get
+ * stray numbers. Clamp between min and max.
+ */
+ return clamp(node_max_active, wq->min_active, wq->max_active);
+}
+
/**
* get_pwq - get an extra reference on the specified pool_workqueue
* @pwq: pool_workqueue to get
@@ -1528,19 +1570,35 @@ static bool pwq_activate_work(struct pool_workqueue *pwq,
return true;
}
+static bool tryinc_node_nr_active(struct wq_node_nr_active *nna, int max)
+{
+ while (true) {
+ int old, tmp;
+
+ old = atomic_read(&nna->count);
+ if (old >= max)
+ return false;
+ tmp = atomic_cmpxchg_relaxed(&nna->count, old, old + 1);
+ if (tmp == old)
+ return true;
+ }
+}
+
/**
* pwq_tryinc_nr_active - Try to increment nr_active for a pwq
* @pwq: pool_workqueue of interest
+ * @fill: max_active may have increased, try to increase concurrency level
*
* Try to increment nr_active for @pwq. Returns %true if an nr_active count is
* successfully obtained. %false otherwise.
*/
-static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq)
+static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq, bool fill)
{
struct workqueue_struct *wq = pwq->wq;
struct worker_pool *pool = pwq->pool;
struct wq_node_nr_active *nna = wq_node_nr_active(wq, pool->node);
bool obtained = false;
+ int node_max_active;
lockdep_assert_held(&pool->lock);
@@ -1550,9 +1608,51 @@ static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq)
goto out;
}
- atomic_inc(&nna->count);
- obtained = true;
+ /*
+ * Unbound workqueue uses per-node shared nr_active $nna. If @pwq is
+ * already waiting on $nna, pwq_dec_nr_active() will maintain the
+ * concurrency level. Don't jump the line.
+ *
+ * We need to ignore the pending test after max_active has increased as
+ * pwq_dec_nr_active() can only maintain the concurrency level but not
+ * increase it. This is indicated by @fill.
+ */
+ if (!list_empty(&pwq->pending_node) && likely(!fill))
+ goto out;
+
+ node_max_active = wq_node_max_active(wq, pool->node);
+
+ obtained = tryinc_node_nr_active(nna, node_max_active);
+ if (obtained)
+ goto out;
+
+ /*
+ * Lockless acquisition failed. Lock, add ourself to $nna->pending_pwqs
+ * and try again. The smp_mb() is paired with the implied memory barrier
+ * of atomic_dec_return() in pwq_dec_nr_active() to ensure that either
+ * we see the decremented $nna->count or they see non-empty
+ * $nna->pending_pwqs.
+ */
+ raw_spin_lock(&nna->lock);
+
+ if (list_empty(&pwq->pending_node))
+ list_add_tail(&pwq->pending_node, &nna->pending_pwqs);
+ else if (likely(!fill))
+ goto out_unlock;
+
+ smp_mb();
+
+ obtained = tryinc_node_nr_active(nna, node_max_active);
+ /*
+ * If @fill, @pwq might have already been pending. Being spuriously
+ * pending in cold paths doesn't affect anything. Let's leave it be.
+ */
+ if (obtained && likely(!fill))
+ list_del_init(&pwq->pending_node);
+
+out_unlock:
+ raw_spin_unlock(&nna->lock);
out:
if (obtained)
pwq->nr_active++;
@@ -1562,6 +1662,7 @@ static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq)
/**
* pwq_activate_first_inactive - Activate the first inactive work item on a pwq
* @pwq: pool_workqueue of interest
+ * @fill: max_active may have increased, try to increase concurrency level
*
* Activate the first inactive work item of @pwq if available and allowed by
* max_active limit.
@@ -1569,13 +1670,13 @@ static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq)
* Returns %true if an inactive work item has been activated. %false if no
* inactive work item is found or max_active limit is reached.
*/
-static bool pwq_activate_first_inactive(struct pool_workqueue *pwq)
+static bool pwq_activate_first_inactive(struct pool_workqueue *pwq, bool fill)
{
struct work_struct *work =
list_first_entry_or_null(&pwq->inactive_works,
struct work_struct, entry);
- if (work && pwq_tryinc_nr_active(pwq)) {
+ if (work && pwq_tryinc_nr_active(pwq, fill)) {
__pwq_activate_work(pwq, work);
return true;
} else {
@@ -1583,11 +1684,95 @@ static bool pwq_activate_first_inactive(struct pool_workqueue *pwq)
}
}
+/**
+ * node_activate_pending_pwq - Activate a pending pwq on a wq_node_nr_active
+ * @nna: wq_node_nr_active to activate a pending pwq for
+ * @caller_pool: worker_pool the caller is locking
+ *
+ * Activate a pwq in @nna->pending_pwqs. Called with @caller_pool locked.
+ * @caller_pool may be unlocked and relocked to lock other worker_pools.
+ */
+static void node_activate_pending_pwq(struct wq_node_nr_active *nna,
+ struct worker_pool *caller_pool)
+{
+ struct worker_pool *locked_pool = caller_pool;
+ struct pool_workqueue *pwq;
+ struct work_struct *work;
+ int node_max_active;
+
+ lockdep_assert_held(&caller_pool->lock);
+
+ raw_spin_lock(&nna->lock);
+retry:
+ pwq = list_first_entry_or_null(&nna->pending_pwqs,
+ struct pool_workqueue, pending_node);
+ if (!pwq)
+ goto out_unlock;
+
+ /*
+ * If @pwq is for a different pool than @locked_pool, we need to lock
+ * @pwq->pool->lock. Let's trylock first. If unsuccessful, do the unlock
+ * / lock dance. For that, we also need to release @nna->lock as it's
+ * nested inside pool locks.
+ */
+ if (pwq->pool != locked_pool) {
+ raw_spin_unlock(&locked_pool->lock);
+ locked_pool = pwq->pool;
+ if (!raw_spin_trylock(&locked_pool->lock)) {
+ raw_spin_unlock(&nna->lock);
+ raw_spin_lock(&locked_pool->lock);
+ raw_spin_lock(&nna->lock);
+ goto retry;
+ }
+ }
+
+ /*
+ * $pwq may not have any inactive work items due to e.g. cancellations.
+ * Drop it from pending_pwqs and see if there's another one.
+ */
+ work = list_first_entry_or_null(&pwq->inactive_works,
+ struct work_struct, entry);
+ if (!work) {
+ list_del_init(&pwq->pending_node);
+ goto retry;
+ }
+
+ /*
+ * Acquire an nr_active count and activate the inactive work item. If
+ * $pwq still has inactive work items, rotate it to the end of the
+ * pending_pwqs so that we round-robin through them. This means that
+ * inactive work items are not activated in queueing order which is fine
+ * given that there has never been any ordering across different pwqs.
+ */
+ node_max_active = wq_node_max_active(pwq->wq, pwq->pool->node);
+ if (likely(tryinc_node_nr_active(nna, node_max_active))) {
+ pwq->nr_active++;
+ __pwq_activate_work(pwq, work);
+
+ if (list_empty(&pwq->inactive_works))
+ list_del_init(&pwq->pending_node);
+ else
+ list_move_tail(&pwq->pending_node, &nna->pending_pwqs);
+
+ /* if activating a foreign pool, make sure it's running */
+ if (pwq->pool != caller_pool)
+ kick_pool(pwq->pool);
+ }
+
+out_unlock:
+ raw_spin_unlock(&nna->lock);
+ if (locked_pool != caller_pool) {
+ raw_spin_unlock(&locked_pool->lock);
+ raw_spin_lock(&caller_pool->lock);
+ }
+}
+
/**
* pwq_dec_nr_active - Retire an active count
* @pwq: pool_workqueue of interest
*
* Decrement @pwq's nr_active and try to activate the first inactive work item.
+ * For unbound workqueues, this function may temporarily drop @pwq->pool->lock.
*/
static void pwq_dec_nr_active(struct pool_workqueue *pwq)
{
@@ -1607,12 +1792,30 @@ static void pwq_dec_nr_active(struct pool_workqueue *pwq)
* inactive work item on @pwq itself.
*/
if (!nna) {
- pwq_activate_first_inactive(pwq);
+ pwq_activate_first_inactive(pwq, false);
return;
}
- atomic_dec(&nna->count);
- pwq_activate_first_inactive(pwq);
+ /*
+ * If @pwq is for an unbound workqueue, it's more complicated because
+ * multiple pwqs and pools may be sharing the nr_active count. When a
+ * pwq needs to wait for an nr_active count, it puts itself on
+ * $nna->pending_pwqs. The following atomic_dec_return()'s implied
+ * memory barrier is paired with smp_mb() in pwq_tryinc_nr_active() to
+ * guarantee that either we see non-empty pending_pwqs or they see
+ * decremented $nna->count.
+ *
+ * wq_node_max_active() may change as CPUs come online/offline and
+ * @pwq->wq's max_active gets updated. However, it is guaranteed to be
+ * equal to or larger than @pwq->wq->min_active which is above zero
+ * unless freezing. This maintains the forward progress guarantee.
+ */
+ if (atomic_dec_return(&nna->count) >=
+ wq_node_max_active(pwq->wq, pool->node))
+ return;
+
+ if (!list_empty(&nna->pending_pwqs))
+ node_activate_pending_pwq(nna, pool);
}
/**
@@ -1935,7 +2138,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
pwq->nr_in_flight[pwq->work_color]++;
work_flags = work_color_to_flags(pwq->work_color);
- if (pwq_tryinc_nr_active(pwq)) {
+ if (pwq_tryinc_nr_active(pwq, false)) {
if (list_empty(&pool->worklist))
pool->watchdog_ts = jiffies;
@@ -3170,7 +3373,7 @@ static void insert_wq_barrier(struct pool_workqueue *pwq,
barr->task = current;
- /* The barrier work item does not participate in pwq->nr_active. */
+ /* The barrier work item does not participate in nr_active. */
work_flags |= WORK_STRUCT_INACTIVE;
/*
@@ -4084,6 +4287,8 @@ static void free_node_nr_active(struct wq_node_nr_active **ptr_ar)
static void init_node_nr_active(struct wq_node_nr_active *nna)
{
atomic_set(&nna->count, 0);
+ raw_spin_lock_init(&nna->lock);
+ INIT_LIST_HEAD(&nna->pending_pwqs);
}
/*
@@ -4325,6 +4530,15 @@ static void pwq_release_workfn(struct kthread_work *work)
mutex_unlock(&wq_pool_mutex);
}
+ if (!list_empty(&pwq->pending_node)) {
+ struct wq_node_nr_active *nna =
+ wq_node_nr_active(pwq->wq, pwq->pool->node);
+
+ raw_spin_lock_irq(&nna->lock);
+ list_del_init(&pwq->pending_node);
+ raw_spin_unlock_irq(&nna->lock);
+ }
+
call_rcu(&pwq->rcu, rcu_free_pwq);
/*
@@ -4350,6 +4564,7 @@ static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
pwq->flush_color = -1;
pwq->refcnt = 1;
INIT_LIST_HEAD(&pwq->inactive_works);
+ INIT_LIST_HEAD(&pwq->pending_node);
INIT_LIST_HEAD(&pwq->pwqs_node);
INIT_LIST_HEAD(&pwq->mayday_node);
kthread_init_work(&pwq->release_work, pwq_release_workfn);
@@ -4835,13 +5050,16 @@ static void wq_adjust_max_active(struct workqueue_struct *wq)
if ((wq->flags & WQ_FREEZABLE) && workqueue_freezing) {
wq->max_active = 0;
+ wq->min_active = 0;
return;
}
- if (wq->max_active == wq->saved_max_active)
+ if (wq->max_active == wq->saved_max_active &&
+ wq->min_active == wq->saved_min_active)
return;
wq->max_active = wq->saved_max_active;
+ wq->min_active = wq->saved_min_active;
/*
* Round-robin through pwq's activating the first inactive work item
@@ -4856,7 +5074,7 @@ static void wq_adjust_max_active(struct workqueue_struct *wq)
/* can be called during early boot w/ irq disabled */
raw_spin_lock_irqsave(&pwq->pool->lock, flags);
- if (pwq_activate_first_inactive(pwq)) {
+ if (pwq_activate_first_inactive(pwq, true)) {
activated = true;
kick_pool(pwq->pool);
}
@@ -4907,7 +5125,9 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
/* init wq */
wq->flags = flags;
wq->max_active = max_active;
- wq->saved_max_active = max_active;
+ wq->min_active = min(max_active, WQ_DFL_MIN_ACTIVE);
+ wq->saved_max_active = wq->max_active;
+ wq->saved_min_active = wq->min_active;
mutex_init(&wq->mutex);
atomic_set(&wq->nr_pwqs_to_flush, 0);
INIT_LIST_HEAD(&wq->pwqs);
@@ -5074,7 +5294,8 @@ EXPORT_SYMBOL_GPL(destroy_workqueue);
* @wq: target workqueue
* @max_active: new max_active value.
*
- * Set max_active of @wq to @max_active.
+ * Set max_active of @wq to @max_active. See the alloc_workqueue() function
+ * comment.
*
* CONTEXT:
* Don't call from IRQ context.
@@ -5091,6 +5312,9 @@ void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
wq->flags &= ~__WQ_ORDERED;
wq->saved_max_active = max_active;
+ if (wq->flags & WQ_UNBOUND)
+ wq->saved_min_active = min(wq->saved_min_active, max_active);
+
wq_adjust_max_active(wq);
mutex_unlock(&wq->mutex);
}
--
2.43.0
Because nr_active used to be tied to pwq, an ordered workqueue had to have a
single pwq to guarantee strict ordering. This led to several contortions to
avoid creating multiple pwqs.
Now that nr_active can be shared across multiple pwqs, we can simplify
ordered workqueue implementation. All that's necessary is ensuring that a
single wq_node_nr_active is shared across all pwqs, which is achieved by
making wq_node_nr_active() always return wq->node_nr_active[nr_node_ids] for
ordered workqueues.
The new implementation is simpler and allows ordered workqueues to share
locality aware worker_pools with other unbound workqueues which should
improve execution locality.
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 44 ++++++--------------------------------------
1 file changed, 6 insertions(+), 38 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0017e9094034..bae7ed9cd1b4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -441,9 +441,6 @@ static DEFINE_HASHTABLE(unbound_pool_hash, UNBOUND_POOL_HASH_ORDER);
/* I: attributes used when instantiating standard unbound pools on demand */
static struct workqueue_attrs *unbound_std_wq_attrs[NR_STD_WORKER_POOLS];
-/* I: attributes used when instantiating ordered pools on demand */
-static struct workqueue_attrs *ordered_wq_attrs[NR_STD_WORKER_POOLS];
-
/*
* I: kthread_worker to release pwq's. pwq release needs to be bounced to a
* process context while holding a pool lock. Bounce to a dedicated kthread
@@ -1435,6 +1432,9 @@ work_func_t wq_worker_last_func(struct task_struct *task)
*
* - %NULL for per-cpu workqueues as they don't need to use shared nr_active.
*
+ * - node_nr_active[nr_node_ids] if the associated workqueue is ordered so that
+ * all pwq's are limited by the same nr_active.
+ *
* - node_nr_active[nr_node_ids] if @node is %NUMA_NO_NODE.
*
* - Otherwise, node_nr_active[@node].
@@ -1445,7 +1445,7 @@ static struct wq_node_nr_active *wq_node_nr_active(struct workqueue_struct *wq,
if (!(wq->flags & WQ_UNBOUND))
return NULL;
- if (node == NUMA_NO_NODE)
+ if ((wq->flags & __WQ_ORDERED) || node == NUMA_NO_NODE)
node = nr_node_ids;
return wq->node_nr_active[node];
@@ -4312,7 +4312,7 @@ static struct wq_node_nr_active **alloc_node_nr_active(void)
nna_ar[node] = nna;
}
- /* [nr_node_ids] is used as the fallback */
+ /* [nr_node_ids] is used for ordered workqueues and as the fallback */
nna = kzalloc_node(sizeof(*nna), GFP_KERNEL, NUMA_NO_NODE);
if (!nna)
goto err_free;
@@ -4799,14 +4799,6 @@ static int apply_workqueue_attrs_locked(struct workqueue_struct *wq,
if (WARN_ON(!(wq->flags & WQ_UNBOUND)))
return -EINVAL;
- /* creating multiple pwqs breaks ordering guarantee */
- if (!list_empty(&wq->pwqs)) {
- if (WARN_ON(wq->flags & __WQ_ORDERED_EXPLICIT))
- return -EINVAL;
-
- wq->flags &= ~__WQ_ORDERED;
- }
-
ctx = apply_wqattrs_prepare(wq, attrs, wq_unbound_cpumask);
if (IS_ERR(ctx))
return PTR_ERR(ctx);
@@ -4955,15 +4947,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
}
cpus_read_lock();
- if (wq->flags & __WQ_ORDERED) {
- ret = apply_workqueue_attrs(wq, ordered_wq_attrs[highpri]);
- /* there should only be single pwq for ordering guarantee */
- WARN(!ret && (wq->pwqs.next != &wq->dfl_pwq->pwqs_node ||
- wq->pwqs.prev != &wq->dfl_pwq->pwqs_node),
- "ordering guarantee broken for workqueue %s\n", wq->name);
- } else {
- ret = apply_workqueue_attrs(wq, unbound_std_wq_attrs[highpri]);
- }
+ ret = apply_workqueue_attrs(wq, unbound_std_wq_attrs[highpri]);
cpus_read_unlock();
/* for unbound pwq, flush the pwq_release_worker ensures that the
@@ -6220,13 +6204,6 @@ static int workqueue_apply_unbound_cpumask(const cpumask_var_t unbound_cpumask)
if (!(wq->flags & WQ_UNBOUND))
continue;
- /* creating multiple pwqs breaks ordering guarantee */
- if (!list_empty(&wq->pwqs)) {
- if (wq->flags & __WQ_ORDERED_EXPLICIT)
- continue;
- wq->flags &= ~__WQ_ORDERED;
- }
-
ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs, unbound_cpumask);
if (IS_ERR(ctx)) {
ret = PTR_ERR(ctx);
@@ -7023,15 +7000,6 @@ void __init workqueue_init_early(void)
BUG_ON(!(attrs = alloc_workqueue_attrs()));
attrs->nice = std_nice[i];
unbound_std_wq_attrs[i] = attrs;
-
- /*
- * An ordered wq should have only one pwq as ordering is
- * guaranteed by max_active which is enforced by pwqs.
- */
- BUG_ON(!(attrs = alloc_workqueue_attrs()));
- attrs->nice = std_nice[i];
- attrs->ordered = true;
- ordered_wq_attrs[i] = attrs;
}
system_wq = alloc_workqueue("events", 0, 0);
--
2.43.0
On Wed, Dec 20, 2023 at 3:25 PM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> A pool_workqueue (pwq) represents the connection between a workqueue and a
> worker_pool. One of the roles that a pwq plays is enforcement of the
> max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
> workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
> for per-cpu workqueues and per each NUMA node for unbound workqueues, which
> was a natural result of per-cpu workqueues being served by per-cpu pools and
> unbound by per-NUMA pools.
>
> In terms of max_active enforcement, this was, while not perfect, workable.
> For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
> NUMA machines would get max_active that's multiplied by the number of nodes
> but didn't cause huge problems because NUMA machines are relatively rare and
> the node count is usually pretty low.
>
Hello, Tejun
The patchset seems complicated to me. Partially reverting the behavior of
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues"), as in the following code (untested, just to show the
idea), seems simpler.
max_active will have the same behavior as before if the wq is configured
with WQ_AFFN_NUMA. For WQ_AFFN_{CPU|SMT|CACHE}, the problem
isn't fixed.
Thanks
Lai
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2989b57e154a..d0f133f1441b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4364,8 +4364,16 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
ctx->dfl_pwq->refcnt++;
ctx->pwq_tbl[cpu] = ctx->dfl_pwq;
} else {
+ int head_cpu;
+
wq_calc_pod_cpumask(new_attrs, cpu, -1);
- ctx->pwq_tbl[cpu] = alloc_unbound_pwq(wq, new_attrs);
+ head_cpu = cpumask_first(new_attrs->__pod_cpumask);
+ if (!ctx->pwq_tbl[head_cpu]) {
+ ctx->pwq_tbl[cpu] = alloc_unbound_pwq(wq, new_attrs);
+ } else {
+ ctx->pwq_tbl[head_cpu]->refcnt++;
+ ctx->pwq_tbl[cpu] = ctx->pwq_tbl[head_cpu];
+ }
if (!ctx->pwq_tbl[cpu])
goto out_free;
}
Hello, Lai.
On Wed, Dec 20, 2023 at 05:20:18PM +0800, Lai Jiangshan wrote:
> The patchset seems complicated to me. For me, reverting a bit to the behavior
> of 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
> pool_workqueues"), like the following code (untested, just for showing
> the idea),
> seems simpler.
>
> max_active will have the same behavior as before if the wq is configured
> with WQ_AFFN_NUMA. For WQ_AFFN_{CPU|SMT|CACHE}, the problem
> isn't fixed.
Yeah, it is complicated but the complications come from the fact that the
domain we count nr_active can't match the worker_pools, and that's because
unbound workqueue behavior is noticeably worse if we let them roam across L3
boundaries on modern processors with multiple chiplets or otherwise
segmented L3 caches.
We need WQ_AFFN_CACHE to behave well on these chips and max_active
enforcement breaks if we keep them tied to pool in such cases, so I'm afraid
our hands are tied here. The hardware has changed and we have to adapt to
it. In this case, that comes at the cost of extra complexity to divorce
max_active enforcement domain from worker_pool boundaries.
Thanks.
--
tejun
On Fri, Dec 22, 2023 at 7:01 AM Tejun Heo <[email protected]> wrote:
>
> Hello, Lai.
>
> On Wed, Dec 20, 2023 at 05:20:18PM +0800, Lai Jiangshan wrote:
> > The patchset seems complicated to me. For me, reverting a bit to the behavior
> > of 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
> > pool_workqueues"), like the following code (untested, just for showing
> > the idea),
> > seems simpler.
> >
> > max_active will have the same behavior as before if the wq is configured
> > with WQ_AFFN_NUMA. For WQ_AFFN_{CPU|SMT|CACHE}, the problem
> > isn't fixed.
>
> Yeah, it is complicated but the complications come from the fact that the
> domain we count nr_active can't match the worker_pools, and that's because
> unbound workqueue behavior is noticeably worse if we let them roam across L3
> boundaries on modern processors with multiple chiplets or otherwise
> segmented L3 caches.
>
> We need WQ_AFFN_CACHE to behave well on these chips and max_active
> enforcement breaks if we keep them tied to pool in such cases, so I'm afraid
> our hands are tied here. The hardware has changed and we have to adapt to
> it. In this case, that comes at the cost of extra complexity to divorce
> max_active enforcement domain from worker_pool boundaries.
>
Hello, Tejun
The current "max_active enforcement domain" is just a historical accident.
IMO, it is better to change the semantics of max_active and the related
alloc_workqueue() callers rather than add a large bulk of complicated code.
Most alloc_workqueue() calls are made with max_active=0, so it is feasible.
Thanks
Lai
Hello,
On Fri, Dec 22, 2023 at 04:04:27PM +0800, Lai Jiangshan wrote:
> The current "max_active enforcement domain" is just a historical accident.
I mean, the part that it's bound to NUMA is accidental but the fact that we
need an enforcement domain which is closer to the whole machine isn't. Most
users don't depend on max_active limit but there are a number of users that
do and what they usually want to express is "I want to be able to saturate
the whole machine but avoid creating pointless over-saturation", which
usually comes down to some multiples of the number of CPUs.
Note that there may be a single issuer or multiple issuers, and we want to be
able to saturate the machine while avoiding over-saturation in both cases. If
we segment max_active enforcement into smaller units like CPUs or L3 caches,
there's no good way to express these constraints. A number which is too
small for the single-issuer case is already too big for the multiple-issuer case.
I tried to explain the conundrum in the cover letter but if that's not
sufficient, we can keep discussing. It'd be also useful to read what Naohiro
reported as that shows the problem pretty well.
> IMO, it is better to change the semantics of max_active and the related
> alloc_workqueue() callers rather than add a large bulk of complicated
> code. Most of alloc_workqueue() are called with max_active=0, so it is
> feasible.
Yeah, I mean, if we can, that'd be simpler. I don't think we can. It's a
rather fundamental problem. If you have concrete suggestions, please feel
free to share.
Thanks.
--
tejun
On Wed, Dec 20, 2023 at 3:25 PM Tejun Heo <[email protected]> wrote:
> +static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq)
> {
> - struct work_struct *work = list_first_entry(&pwq->inactive_works,
> - struct work_struct, entry);
> + struct workqueue_struct *wq = pwq->wq;
> + struct worker_pool *pool = pwq->pool;
> + bool obtained;
>
> - pwq_activate_work(pwq, work);
> + lockdep_assert_held(&pool->lock);
> +
> + obtained = pwq->nr_active < wq->max_active;
It is better to use READ_ONCE(wq->max_active) here in case
the compiler emits code that evaluates "obtained" multiple times.

Theoretically, READ_ONCE(wq->max_active) is also recommended
starting from patch 1 in pwq_dec_nr_in_flight() and __queue_work(),
since wq->mutex is not held there.
> +
> + if (obtained)
> + pwq->nr_active++;
> + return obtained;
> +}
> +
On Wed, Dec 20, 2023 at 3:26 PM Tejun Heo <[email protected]> wrote:
> @@ -1498,12 +1539,21 @@ static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq)
> {
> struct workqueue_struct *wq = pwq->wq;
> struct worker_pool *pool = pwq->pool;
> - bool obtained;
> + struct wq_node_nr_active *nna = wq_node_nr_active(wq, pool->node);
> + bool obtained = false;
>
> lockdep_assert_held(&pool->lock);
>
> - obtained = pwq->nr_active < wq->max_active;
> + if (!nna) {
> + /* per-cpu workqueue, pwq->nr_active is sufficient */
> + obtained = pwq->nr_active < wq->max_active;
> + goto out;
> + }
For unbound workqueues, nr_active is no longer checked against wq->max_active
and is increased unconditionally. Is it by design?
> +
> + atomic_inc(&nna->count);
> + obtained = true;
>
> +out:
> if (obtained)
> pwq->nr_active++;
> return obtained;
On Tue, Dec 26, 2023 at 05:12:55PM +0800, Lai Jiangshan wrote:
> On Wed, Dec 20, 2023 at 3:25 PM Tejun Heo <[email protected]> wrote:
>
> > +static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq)
> > {
> > - struct work_struct *work = list_first_entry(&pwq->inactive_works,
> > - struct work_struct, entry);
> > + struct workqueue_struct *wq = pwq->wq;
> > + struct worker_pool *pool = pwq->pool;
> > + bool obtained;
> >
> > - pwq_activate_work(pwq, work);
> > + lockdep_assert_held(&pool->lock);
> > +
> > + obtained = pwq->nr_active < wq->max_active;
>
> It is better to use READ_ONCE(wq->max_active) here in case
> the compiler emits code that evaluates "obtained" multiple times.
>
> Theoretically, READ_ONCE(wq->max_active) is also recommended
> starting from patch 1 in pwq_dec_nr_in_flight() and __queue_work(),
> since wq->mutex is not held there.
Yeah, good point. Lemme add WRITE/READ_ONCE() around max_active accesses.
Thanks.
--
tejun
On Tue, Dec 26, 2023 at 05:14:18PM +0800, Lai Jiangshan wrote:
> On Wed, Dec 20, 2023 at 3:26 PM Tejun Heo <[email protected]> wrote:
>
> > @@ -1498,12 +1539,21 @@ static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq)
> > {
> > struct workqueue_struct *wq = pwq->wq;
> > struct worker_pool *pool = pwq->pool;
> > - bool obtained;
> > + struct wq_node_nr_active *nna = wq_node_nr_active(wq, pool->node);
> > + bool obtained = false;
> >
> > lockdep_assert_held(&pool->lock);
> >
> > - obtained = pwq->nr_active < wq->max_active;
> > + if (!nna) {
> > + /* per-cpu workqueue, pwq->nr_active is sufficient */
> > + obtained = pwq->nr_active < wq->max_active;
> > + goto out;
> > + }
>
> For unbound workqueues, nr_active is no longer checked against wq->max_active
> and is increased unconditionally. Is it by design?
Ah, I made a mistake while splitting the patches. This gets added by a later
patch but this step should have an explicit check against wq->max_active.
Lemme add a check for the unbound path.
Thanks.
--
tejun
Hello,
On Tue, Dec 26, 2023 at 05:13:40PM +0800, Lai Jiangshan wrote:
> On Wed, Dec 20, 2023 at 3:25 PM Tejun Heo <[email protected]> wrote:
>
> > +static void wq_adjust_max_active(struct workqueue_struct *wq)
> > +{
> > + struct pool_workqueue *pwq;
> > +
> > + lockdep_assert_held(&wq->mutex);
> > +
> > + if ((wq->flags & WQ_FREEZABLE) && workqueue_freezing) {
> > + wq->max_active = 0;
> > + return;
> > + }
> > +
> > + if (wq->max_active == wq->saved_max_active)
> > + return;
> > +
> > + wq->max_active = wq->saved_max_active;
> > +
>
> If a work item gets queued now, it will get scheduled earlier than a
> previously queued one which is still in the inactive list.
Is that a problem tho? There's no execution order guarantee except for
ordered workqueues which is not affected by this. In a later change, we
switch to list of pending pwqs instead of work items and the issue ordering
is lost anyway. This isn't a significant departure from previous behaviors
either given that there has never been ordering across pwq boundaries.
> To solve it, I recommend adding wq->queue_max_active which will be
> updated after the following code and used only when queue_work().
> But it requires round-robin through PWQs the second time after
> wq->queue_max_active is updated to catch the new inactivated items.
I'm reluctant to add complications for this given that it's not a real
problem to begin with and the operation is pretty cold.
> Or just keep pwq->max_active and will be
> updated after activating inactivated items and used only when queue_work().
This probably is simpler but would make things more complicated. I'm not
sure it's worth it.
Thanks.
--
tejun
On Wed, Dec 27, 2023 at 05:05:49AM +0900, Tejun Heo wrote:
> Is that a problem tho? There's no execution order guarantee except for
> ordered workqueues which is not affected by this. In a later change, we
> switch to list of pending pwqs instead of work items and the issue ordering
> is lost anyway. This isn't a significant departure from previous behaviors
> either given that there has never been ordering across pwq boundaries.
Thought more about it and I was wrong. This introduces reordering within pwq
which is new and can break ordered workqueues. Will fix.
Thanks.
--
tejun
On Wed, Dec 20, 2023 at 04:24:31PM +0900, Tejun Heo wrote:
> Hello,
>
> A pool_workqueue (pwq) represents the connection between a workqueue and a
> worker_pool. One of the roles that a pwq plays is enforcement of the
> max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
> workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
> for per-cpu workqueues and per each NUMA node for unbound workqueues, which
> was a natural result of per-cpu workqueues being served by per-cpu pools and
> unbound by per-NUMA pools.
>
> In terms of max_active enforcement, this was, while not perfect, workable.
> For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
> NUMA machines would get max_active that's multiplied by the number of nodes
> but didn't cause huge problems because NUMA machines are relatively rare and
> the node count is usually pretty low.
>
> However, cache layouts are more complex now and sharing a worker pool across
> a whole node didn't really work well for unbound workqueues. Thus, a series
> of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues
> to use per-cpu pool_workqueues") implemented more flexible affinity
> mechanism for unbound workqueues which enables using e.g. last-level-cache
> aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
> workqueues to use per-cpu pool_workqueues") made unbound workqueues use
> per-cpu pwqs like per-cpu workqueues.
>
> While the change was necessary to enable more flexible affinity scopes, this
> came with the side effect of blowing up the effective max_active for unbound
> workqueues. Before, the effective max_active for unbound workqueues was
> multiplied by the number of nodes. After, by the number of CPUs.
>
> 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
> pool_workqueues") claims that this should generally be okay. It is okay for
> users which self-regulates concurrency level which are the vast majority;
> however, there are enough use cases which actually depend on max_active to
> prevent the level of concurrency from going bonkers including several IO
> handling workqueues that can issue a work item for each in-flight IO. With
> targeted benchmarks, the misbehavior can easily be exposed as reported in
> http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
>
> Unfortunately, there is no way to express what these use cases need using
> per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
> to set max_active too low but as soon as we increase max_active a bit, we
> can end up with an unreasonable number of in-flight work items when many CPUs
> issue IOs at the same time. I.e., the lowest acceptable max_active is higher
> than the highest acceptable max_active.
>
> Ideally, max_active for an unbound workqueue should be system-wide so that
> the users can regulate the total level of concurrency regardless of node and
> cache layout. The reasons workqueue hasn't implemented that yet are:
>
> - Once max_active enforcement decouples from pool boundaries, chaining
> execution after a work item finishes requires inter-pool operations,
> which entail nasty lock dancing.
>
> - Sharing a single nr_active count across the whole system can be pretty
> expensive on NUMA machines.
>
> - Per-pwq enforcement had been more or less okay while we were using
> per-node pools.
>
> It looks like we no longer can avoid decoupling max_active enforcement from
> pool boundaries. This patchset implements system-wide nr_active mechanism
> with the following design characteristics:
>
> - To avoid sharing a single counter across multiple nodes, the configured
> max_active is split across nodes according to the proportion of online
> CPUs per node. e.g. A node with twice more online CPUs will get twice
> higher portion of max_active.
>
> - Workqueue used to be able to process a chain of interdependent work items
> which is as long as max_active. We can't do this anymore as max_active is
> distributed across the nodes. Instead, a new parameter min_active is
> introduced which determines the minimum level of concurrency within a node
> regardless of how max_active distribution comes out to be.
>
> It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
> This can lead to a higher effective max_active than configured and also
> deadlocks if a workqueue was depending on being able to handle chains of
> interdependent work items that are longer than 8.
>
> I believe these should be fine given that the number of CPUs in each NUMA
> node is usually higher than 8 and work item chain longer than 8 is pretty
> unlikely. However, if these assumptions turn out to be wrong, we'll need
> to add an interface to adjust min_active.
>
> - Each unbound wq has an array of struct wq_node_nr_active which tracks
> per-node nr_active. When its pwq wants to run a work item, it has to
> obtain the matching node's nr_active. If over the node's max_active, the
> pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
> the completion path round-robins the pending pwqs activating the first
> inactive work item of each, which involves some pool lock dancing and
> kicking other pools. It's not the simplest code but doesn't look too bad.
>
> This patchset includes the following patches:
>
> 0001-workqueue-Move-pwq-max_active-to-wq-max_active.patch
> 0002-workqueue-Factor-out-pwq_is_empty.patch
> 0003-workqueue-Replace-pwq_activate_inactive_work-with-__.patch
> 0004-workqueue-Move-nr_active-handling-into-helpers.patch
> 0005-workqueue-Make-wq_adjust_max_active-round-robin-pwqs.patch
> 0006-workqueue-Add-first_possible_node-and-node_nr_cpus.patch
> 0007-workqueue-Move-pwq_dec_nr_in_flight-to-the-end-of-wo.patch
> 0008-workqueue-Introduce-struct-wq_node_nr_active.patch
> 0009-workqueue-Implement-system-wide-nr_active-enforcemen.patch
> 0010-workqueue-Reimplement-ordered-workqueue-using-shared.patch
>
> This patchset is also available in the following git branch.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git unbound-system-wide-max_active
Thank you for the series. I applied the patches on btrfs's development tree
below, and ran the benchmark.
https://gitlab.com/kdave/btrfs-devel.git misc-next
- misc-next, numa=off (baseline)
WRITE: bw=1117MiB/s (1171MB/s), 1117MiB/s-1117MiB/s (1171MB/s-1171MB/s), io=332GiB (356GB), run=304322-304322msec
- misc-next + wq patches, numa=off
WRITE: bw=1866MiB/s (1957MB/s), 1866MiB/s-1866MiB/s (1957MB/s-1957MB/s), io=684GiB (735GB), run=375472-375472msec
So, the patches surely improved the performance. However, as shown below, it
is still lower than reverting the previous workqueue patches. The reverting is
done by reverse applying output of "git diff 4cbfd3de737b
kernel/workqueue.c kernel/workqueue_internal.h include/linux/workqueue*
init/main.c"
- misc-next + wq reverted, numa=off
WRITE: bw=2472MiB/s (2592MB/s), 2472MiB/s-2472MiB/s (2592MB/s-2592MB/s), io=732GiB (786GB), run=303257-303257msec
I also tested Lai Jiangshan's patches as below.
https://lore.kernel.org/all/[email protected]/
- misc-next+wq another approach, numa=off
WRITE: bw=1697MiB/s (1780MB/s), 1697MiB/s-1697MiB/s (1780MB/s-1780MB/s), io=678GiB (728GB), run=409213-409213msec
>
> diffstat follows.
>
> include/linux/workqueue.h | 35 ++-
> kernel/workqueue.c | 669 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
> 2 files changed, 565 insertions(+), 139 deletions(-)
>
> --
> tejun
Hello,
On Fri, Jan 05, 2024 at 02:44:08AM +0000, Naohiro Aota wrote:
> Thank you for the series. I applied the patches on btrfs's development tree
> below, and ran the benchmark.
>
> https://gitlab.com/kdave/btrfs-devel.git misc-next
>
> - misc-next, numa=off (baseline)
> WRITE: bw=1117MiB/s (1171MB/s), 1117MiB/s-1117MiB/s (1171MB/s-1171MB/s), io=332GiB (356GB), run=304322-304322msec
> - misc-next + wq patches, numa=off
> WRITE: bw=1866MiB/s (1957MB/s), 1866MiB/s-1866MiB/s (1957MB/s-1957MB/s), io=684GiB (735GB), run=375472-375472msec
>
> So, the patches surely improved the performance. However, as show below, it
> is still lower than reverting previous workqueue patches. The reverting is
> done by reverse applying output of "git diff 4cbfd3de737b
> kernel/workqueue.c kernel/workqueue_internal.h include/linux/workqueue*
> init/main.c"
>
> - misc-next + wq reverted, numa=off
> WRITE: bw=2472MiB/s (2592MB/s), 2472MiB/s-2472MiB/s (2592MB/s-2592MB/s), io=732GiB (786GB), run=303257-303257msec
Can you describe the test setup in detail? What kind of machine is it? What
do you mean by `numa=off`? Can you report tools/workqueue/wq_dump.py output?
Thanks.
--
tejun
Hello,
On Thu, Jan 11, 2024 at 02:49:21PM -1000, Tejun Heo wrote:
> On Fri, Jan 05, 2024 at 02:44:08AM +0000, Naohiro Aota wrote:
> > Thank you for the series. I applied the patches on btrfs's development tree
> > below, and ran the benchmark.
> >
> > https://gitlab.com/kdave/btrfs-devel.git misc-next
> >
> > - misc-next, numa=off (baseline)
> > WRITE: bw=1117MiB/s (1171MB/s), 1117MiB/s-1117MiB/s (1171MB/s-1171MB/s), io=332GiB (356GB), run=304322-304322msec
> > - misc-next + wq patches, numa=off
> > WRITE: bw=1866MiB/s (1957MB/s), 1866MiB/s-1866MiB/s (1957MB/s-1957MB/s), io=684GiB (735GB), run=375472-375472msec
> >
> > So, the patches surely improved the performance. However, as show below, it
> > is still lower than reverting previous workqueue patches. The reverting is
> > done by reverse applying output of "git diff 4cbfd3de737b
> > kernel/workqueue.c kernel/workqueue_internal.h include/linux/workqueue*
> > init/main.c"
> >
> > - misc-next + wq reverted, numa=off
> > WRITE: bw=2472MiB/s (2592MB/s), 2472MiB/s-2472MiB/s (2592MB/s-2592MB/s), io=732GiB (786GB), run=303257-303257msec
>
> Can you describe the test setup in detail? What kind of machine is it? What
> do you mean by `numa=off`? Can you report tools/workqueue/wq_dump.py output?
So, I fixed the possible ordering bug that Lai noticed and dropped the last
patch (more on this in the reply to that patch) and did some benchmarking
with fio and dm-crypt and at least in that testing the new code seems to
perform just as well as before. The only variable seems to be what
max_active is used for the workqueue in question.
For dm-crypt, the kcryptd workqueue uses num_online_cpus(). Depending on how
that value is interpreted, it may not provide high enough concurrency as some
workers wait for IOs, showing slightly slower performance, but that's easily
fixed by bumping the max_active value so that there's some buffer, which is
the right way to configure it anyway.
It'd be great if you can share more details on the benchmarks you're
running, so that we can rule out similar issues.
Thanks.
--
tejun
On Wed, Dec 20, 2023 at 04:24:41PM +0900, Tejun Heo wrote:
> Because nr_active used to be tied to pwq, an ordered workqueue had to have a
> single pwq to guarantee strict ordering. This led to several contortions to
> avoid creating multiple pwqs.
>
> Now that nr_active can be shared across multiple pwqs, we can simplify
> ordered workqueue implementation. All that's necessary is ensuring that a
> single wq_node_nr_active is shared across all pwqs, which is achieved by
> making wq_node_nr_active() always return wq->node_nr_active[nr_node_ids] for
> ordered workqueues.
>
> The new implementation is simpler and allows ordered workqueues to share
> locality aware worker_pools with other unbound workqueues which should
> improve execution locality.
>
> Signed-off-by: Tejun Heo <[email protected]>
This patch breaks ordered workqueues as the inactive pwq RR logic doesn't
follow work item queueing order. I could reproduce severe perf degradations
and outright hangs. I'm dropping this patch.
Thanks.
--
tejun
On Thu, Jan 11, 2024 at 02:49:21PM -1000, Tejun Heo wrote:
> Hello,
>
> On Fri, Jan 05, 2024 at 02:44:08AM +0000, Naohiro Aota wrote:
> > Thank you for the series. I applied the patches on btrfs's development tree
> > below, and ran the benchmark.
> >
> > https://gitlab.com/kdave/btrfs-devel.git misc-next
> >
> > - misc-next, numa=off (baseline)
> > WRITE: bw=1117MiB/s (1171MB/s), 1117MiB/s-1117MiB/s (1171MB/s-1171MB/s), io=332GiB (356GB), run=304322-304322msec
> > - misc-next + wq patches, numa=off
> > WRITE: bw=1866MiB/s (1957MB/s), 1866MiB/s-1866MiB/s (1957MB/s-1957MB/s), io=684GiB (735GB), run=375472-375472msec
> >
> > So, the patches surely improved the performance. However, as show below, it
> > is still lower than reverting previous workqueue patches. The reverting is
> > done by reverse applying output of "git diff 4cbfd3de737b
> > kernel/workqueue.c kernel/workqueue_internal.h include/linux/workqueue*
> > init/main.c"
> >
> > - misc-next + wq reverted, numa=off
> > WRITE: bw=2472MiB/s (2592MB/s), 2472MiB/s-2472MiB/s (2592MB/s-2592MB/s), io=732GiB (786GB), run=303257-303257msec
>
> Can you describe the test setup in detail? What kind of machine is it? What
> do you mean by `numa=off`? Can you report tools/workqueue/wq_dump.py output?
The testing machines' hardware is as follows:
CPU: Intel(R) Xeon(R) Platinum 8260 CPU, 96 cores
NUMA nodes: 2
RAM: 1024 GB
However, for another benchmark experiment I'm doing, I booted the machine
with "numa=off mem=16G" kernel command-line. I admit this is an unusual
setup...
On that machine, I create a fresh btrfs with "mkfs.btrfs -d raid0 -m raid0
<devices>" with 6 SSD devices. And, I run the following command on the FS.
fio --group_reporting --eta=always --eta-interval=30s --eta-newline=30s \
--rw=write --fallocate=none \
--direct=1 --ioengine=libaio --iodepth=32 \
--filesize=100G \
--blocksize=64k \
--time_based --runtime=300s \
--end_fsync=1 \
--directory=${MNT} \
--name=writer --numjobs=32
tools/workqueue/wq_dump.py output is pasted at the
bottom. "btrfs-endio-write" is the workqueue, which had many workers on the
unpatched kernel.
FYI, without the kernel command-line (i.e, numa=on and all RAM available as
usual), as shown below, your patch series (v1) improved the performance
significantly. It is even better than the reverted case.
- misc-next, numa=on
WRITE: bw=1121MiB/s (1175MB/s), 1121MiB/s-1121MiB/s (1175MB/s-1175MB/s), io=332GiB (356GB), run=303030-303030msec
- misc-next+wq patches, numa=on
WRITE: bw=2185MiB/s (2291MB/s), 2185MiB/s-2185MiB/s (2291MB/s-2291MB/s), io=667GiB (717GB), run=312806-312806msec
- misc-next+wq reverted, numa=on
WRITE: bw=1557MiB/s (1633MB/s), 1557MiB/s-1557MiB/s (1633MB/s-1633MB/s), io=659GiB (708GB), run=433426-433426msec
> Thanks.
>
> --
> tejun
Affinity Scopes
===============
wq_unbound_cpumask=0xffffffff 0xffffffff ffffffff
CPU
nr_pods 96
pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008 [4]=00000010 [5]=00000020 [6]=00000040 [7]=00000080 [8]=00000100 [9]=00000200 [10]=00000400 [11]=00000800 [12]=00001000 [13]=00002000 [14]=00004000 [15]=00008000 [16]=00010000 [17]=00020000 [18]=00040000 [19]=00080000 [20]=00100000 [21]=00200000 [22]=00400000 [23]=00800000 [24]=01000000 [25]=02000000 [26]=04000000 [27]=08000000 [28]=10000000 [29]=20000000 [30]=40000000 [31]=80000000 [32]=0x0 00000001 [33]=0x0 00000002 [34]=0x0 00000004 [35]=0x0 00000008 [36]=0x0 00000010 [37]=0x0 00000020 [38]=0x0 00000040 [39]=0x0 00000080 [40]=0x0 00000100 [41]=0x0 00000200 [42]=0x0 00000400 [43]=0x0 00000800 [44]=0x0 00001000 [45]=0x0 00002000 [46]=0x0 00004000 [47]=0x0 00008000 [48]=0x0 00010000 [49]=0x0 00020000 [50]=0x0 00040000 [51]=0x0 00080000 [52]=0x0 00100000 [53]=0x0 00200000 [54]=0x0 00400000 [55]=0x0 00800000 [56]=0x0 01000000 [57]=0x0 02000000 [58]=0x0 04000000 [59]=0x0 08000000 [60]=0x0 10000000 [61]=0x0 20000000 [62]=0x0 40000000 [63]=0x0 80000000 [64]=0x0 0x0 00000001 [65]=0x0 0x0 00000002 [66]=0x0 0x0 00000004 [67]=0x0 0x0 00000008 [68]=0x0 0x0 00000010 [69]=0x0 0x0 00000020 [70]=0x0 0x0 00000040 [71]=0x0 0x0 00000080 [72]=0x0 0x0 00000100 [73]=0x0 0x0 00000200 [74]=0x0 0x0 00000400 [75]=0x0 0x0 00000800 [76]=0x0 0x0 00001000 [77]=0x0 0x0 00002000 [78]=0x0 0x0 00004000 [79]=0x0 0x0 00008000 [80]=0x0 0x0 00010000 [81]=0x0 0x0 00020000 [82]=0x0 0x0 00040000 [83]=0x0 0x0 00080000 [84]=0x0 0x0 00100000 [85]=0x0 0x0 00200000 [86]=0x0 0x0 00400000 [87]=0x0 0x0 00800000 [88]=0x0 0x0 01000000 [89]=0x0 0x0 02000000 [90]=0x0 0x0 04000000 [91]=0x0 0x0 08000000 [92]=0x0 0x0 10000000 [93]=0x0 0x0 20000000 [94]=0x0 0x0 40000000 [95]=0x0 0x0 80000000
pod_node [0]=0 [1]=0 [2]=0 [3]=0 [4]=0 [5]=0 [6]=0 [7]=0 [8]=0 [9]=0 [10]=0 [11]=0 [12]=0 [13]=0 [14]=0 [15]=0 [16]=0 [17]=0 [18]=0 [19]=0 [20]=0 [21]=0 [22]=0 [23]=0 [24]=0 [25]=0 [26]=0 [27]=0 [28]=0 [29]=0 [30]=0 [31]=0 [32]=0 [33]=0 [34]=0 [35]=0 [36]=0 [37]=0 [38]=0 [39]=0 [40]=0 [41]=0 [42]=0 [43]=0 [44]=0 [45]=0 [46]=0 [47]=0 [48]=0 [49]=0 [50]=0 [51]=0 [52]=0 [53]=0 [54]=0 [55]=0 [56]=0 [57]=0 [58]=0 [59]=0 [60]=0 [61]=0 [62]=0 [63]=0 [64]=0 [65]=0 [66]=0 [67]=0 [68]=0 [69]=0 [70]=0 [71]=0 [72]=0 [73]=0 [74]=0 [75]=0 [76]=0 [77]=0 [78]=0 [79]=0 [80]=0 [81]=0 [82]=0 [83]=0 [84]=0 [85]=0 [86]=0 [87]=0 [88]=0 [89]=0 [90]=0 [91]=0 [92]=0 [93]=0 [94]=0 [95]=0
cpu_pod [0]=0 [1]=1 [2]=2 [3]=3 [4]=4 [5]=5 [6]=6 [7]=7 [8]=8 [9]=9 [10]=10 [11]=11 [12]=12 [13]=13 [14]=14 [15]=15 [16]=16 [17]=17 [18]=18 [19]=19 [20]=20 [21]=21 [22]=22 [23]=23 [24]=24 [25]=25 [26]=26 [27]=27 [28]=28 [29]=29 [30]=30 [31]=31 [32]=32 [33]=33 [34]=34 [35]=35 [36]=36 [37]=37 [38]=38 [39]=39 [40]=40 [41]=41 [42]=42 [43]=43 [44]=44 [45]=45 [46]=46 [47]=47 [48]=48 [49]=49 [50]=50 [51]=51 [52]=52 [53]=53 [54]=54 [55]=55 [56]=56 [57]=57 [58]=58 [59]=59 [60]=60 [61]=61 [62]=62 [63]=63 [64]=64 [65]=65 [66]=66 [67]=67 [68]=68 [69]=69 [70]=70 [71]=71 [72]=72 [73]=73 [74]=74 [75]=75 [76]=76 [77]=77 [78]=78 [79]=79 [80]=80 [81]=81 [82]=82 [83]=83 [84]=84 [85]=85 [86]=86 [87]=87 [88]=88 [89]=89 [90]=90 [91]=91 [92]=92 [93]=93 [94]=94 [95]=95
SMT
nr_pods 48
pod_cpus [0]=0x1 00010000 [1]=0x2 00020000 [2]=0x4 00040000 [3]=0x8 00080000 [4]=0x10 00100000 [5]=0x20 00200000 [6]=0x40 00400000 [7]=0x80 00800000 [8]=0x100 01000000 [9]=0x200 02000000 [10]=0x400 04000000 [11]=0x800 08000000 [12]=0x1000 10000000 [13]=0x2000 20000000 [14]=0x4000 40000000 [15]=0x8000 80000000 [16]=0x10000 0x0 00000001 [17]=0x20000 0x0 00000002 [18]=0x40000 0x0 00000004 [19]=0x80000 0x0 00000008 [20]=0x100000 0x0 00000010 [21]=0x200000 0x0 00000020 [22]=0x400000 0x0 00000040 [23]=0x800000 0x0 00000080 [24]=0x1000000 0x0 00000100 [25]=0x2000000 0x0 00000200 [26]=0x4000000 0x0 00000400 [27]=0x8000000 0x0 00000800 [28]=0x10000000 0x0 00001000 [29]=0x20000000 0x0 00002000 [30]=0x40000000 0x0 00004000 [31]=0x80000000 0x0 00008000 [32]=0x0 0x1 00010000 [33]=0x0 0x2 00020000 [34]=0x0 0x4 00040000 [35]=0x0 0x8 00080000 [36]=0x0 0x10 00100000 [37]=0x0 0x20 00200000 [38]=0x0 0x40 00400000 [39]=0x0 0x80 00800000 [40]=0x0 0x100 01000000 [41]=0x0 0x200 02000000 [42]=0x0 0x400 04000000 [43]=0x0 0x800 08000000 [44]=0x0 0x1000 10000000 [45]=0x0 0x2000 20000000 [46]=0x0 0x4000 40000000 [47]=0x0 0x8000 80000000
pod_node [0]=0 [1]=0 [2]=0 [3]=0 [4]=0 [5]=0 [6]=0 [7]=0 [8]=0 [9]=0 [10]=0 [11]=0 [12]=0 [13]=0 [14]=0 [15]=0 [16]=0 [17]=0 [18]=0 [19]=0 [20]=0 [21]=0 [22]=0 [23]=0 [24]=0 [25]=0 [26]=0 [27]=0 [28]=0 [29]=0 [30]=0 [31]=0 [32]=0 [33]=0 [34]=0 [35]=0 [36]=0 [37]=0 [38]=0 [39]=0 [40]=0 [41]=0 [42]=0 [43]=0 [44]=0 [45]=0 [46]=0 [47]=0
cpu_pod [0]=0 [1]=1 [2]=2 [3]=3 [4]=4 [5]=5 [6]=6 [7]=7 [8]=8 [9]=9 [10]=10 [11]=11 [12]=12 [13]=13 [14]=14 [15]=15 [16]=16 [17]=17 [18]=18 [19]=19 [20]=20 [21]=21 [22]=22 [23]=23 [24]=24 [25]=25 [26]=26 [27]=27 [28]=28 [29]=29 [30]=30 [31]=31 [32]=32 [33]=33 [34]=34 [35]=35 [36]=36 [37]=37 [38]=38 [39]=39 [40]=40 [41]=41 [42]=42 [43]=43 [44]=44 [45]=45 [46]=46 [47]=47 [48]=0 [49]=1 [50]=2 [51]=3 [52]=4 [53]=5 [54]=6 [55]=7 [56]=8 [57]=9 [58]=10 [59]=11 [60]=12 [61]=13 [62]=14 [63]=15 [64]=16 [65]=17 [66]=18 [67]=19 [68]=20 [69]=21 [70]=22 [71]=23 [72]=24 [73]=25 [74]=26 [75]=27 [76]=28 [77]=29 [78]=30 [79]=31 [80]=32 [81]=33 [82]=34 [83]=35 [84]=36 [85]=37 [86]=38 [87]=39 [88]=40 [89]=41 [90]=42 [91]=43 [92]=44 [93]=45 [94]=46 [95]=47
CACHE (default)
nr_pods 2
pod_cpus [0]=0xffffff 0xffff0000 000000ff [1]=0xff000000 0xffff ffffff00
pod_node [0]=0 [1]=0
cpu_pod [0]=0 [1]=0 [2]=0 [3]=0 [4]=0 [5]=0 [6]=0 [7]=0 [8]=0 [9]=0 [10]=0 [11]=0 [12]=0 [13]=0 [14]=0 [15]=0 [16]=0 [17]=0 [18]=0 [19]=0 [20]=0 [21]=0 [22]=0 [23]=0 [24]=1 [25]=1 [26]=1 [27]=1 [28]=1 [29]=1 [30]=1 [31]=1 [32]=1 [33]=1 [34]=1 [35]=1 [36]=1 [37]=1 [38]=1 [39]=1 [40]=1 [41]=1 [42]=1 [43]=1 [44]=1 [45]=1 [46]=1 [47]=1 [48]=0 [49]=0 [50]=0 [51]=0 [52]=0 [53]=0 [54]=0 [55]=0 [56]=0 [57]=0 [58]=0 [59]=0 [60]=0 [61]=0 [62]=0 [63]=0 [64]=0 [65]=0 [66]=0 [67]=0 [68]=0 [69]=0 [70]=0 [71]=0 [72]=1 [73]=1 [74]=1 [75]=1 [76]=1 [77]=1 [78]=1 [79]=1 [80]=1 [81]=1 [82]=1 [83]=1 [84]=1 [85]=1 [86]=1 [87]=1 [88]=1 [89]=1 [90]=1 [91]=1 [92]=1 [93]=1 [94]=1 [95]=1
NUMA
nr_pods 1
pod_cpus [0]=0xffffffff 0xffffffff ffffffff
pod_node [0]=0
cpu_pod [0]=0 [1]=0 [2]=0 [3]=0 [4]=0 [5]=0 [6]=0 [7]=0 [8]=0 [9]=0 [10]=0 [11]=0 [12]=0 [13]=0 [14]=0 [15]=0 [16]=0 [17]=0 [18]=0 [19]=0 [20]=0 [21]=0 [22]=0 [23]=0 [24]=0 [25]=0 [26]=0 [27]=0 [28]=0 [29]=0 [30]=0 [31]=0 [32]=0 [33]=0 [34]=0 [35]=0 [36]=0 [37]=0 [38]=0 [39]=0 [40]=0 [41]=0 [42]=0 [43]=0 [44]=0 [45]=0 [46]=0 [47]=0 [48]=0 [49]=0 [50]=0 [51]=0 [52]=0 [53]=0 [54]=0 [55]=0 [56]=0 [57]=0 [58]=0 [59]=0 [60]=0 [61]=0 [62]=0 [63]=0 [64]=0 [65]=0 [66]=0 [67]=0 [68]=0 [69]=0 [70]=0 [71]=0 [72]=0 [73]=0 [74]=0 [75]=0 [76]=0 [77]=0 [78]=0 [79]=0 [80]=0 [81]=0 [82]=0 [83]=0 [84]=0 [85]=0 [86]=0 [87]=0 [88]=0 [89]=0 [90]=0 [91]=0 [92]=0 [93]=0 [94]=0 [95]=0
SYSTEM
nr_pods 1
pod_cpus [0]=0xffffffff 0xffffffff ffffffff
pod_node [0]=-1
cpu_pod [0]=0 [1]=0 [2]=0 [3]=0 [4]=0 [5]=0 [6]=0 [7]=0 [8]=0 [9]=0 [10]=0 [11]=0 [12]=0 [13]=0 [14]=0 [15]=0 [16]=0 [17]=0 [18]=0 [19]=0 [20]=0 [21]=0 [22]=0 [23]=0 [24]=0 [25]=0 [26]=0 [27]=0 [28]=0 [29]=0 [30]=0 [31]=0 [32]=0 [33]=0 [34]=0 [35]=0 [36]=0 [37]=0 [38]=0 [39]=0 [40]=0 [41]=0 [42]=0 [43]=0 [44]=0 [45]=0 [46]=0 [47]=0 [48]=0 [49]=0 [50]=0 [51]=0 [52]=0 [53]=0 [54]=0 [55]=0 [56]=0 [57]=0 [58]=0 [59]=0 [60]=0 [61]=0 [62]=0 [63]=0 [64]=0 [65]=0 [66]=0 [67]=0 [68]=0 [69]=0 [70]=0 [71]=0 [72]=0 [73]=0 [74]=0 [75]=0 [76]=0 [77]=0 [78]=0 [79]=0 [80]=0 [81]=0 [82]=0 [83]=0 [84]=0 [85]=0 [86]=0 [87]=0 [88]=0 [89]=0 [90]=0 [91]=0 [92]=0 [93]=0 [94]=0 [95]=0
Worker Pools
============
pool[000] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 0
pool[001] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 0
pool[002] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 1
pool[003] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 1
pool[004] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 2
pool[005] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 2
pool[006] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 3
pool[007] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 3
pool[008] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 4
pool[009] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 4
pool[010] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 5
pool[011] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 5
pool[012] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 6
pool[013] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 6
pool[014] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 7
pool[015] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 7
pool[016] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 8
pool[017] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 8
pool[018] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 9
pool[019] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 9
pool[020] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 10
pool[021] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 10
pool[022] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 11
pool[023] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 11
pool[024] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 12
pool[025] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 12
pool[026] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 13
pool[027] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 13
pool[028] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 14
pool[029] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 14
pool[030] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 15
pool[031] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 15
pool[032] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 16
pool[033] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 16
pool[034] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 17
pool[035] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 17
pool[036] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 18
pool[037] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 18
pool[038] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 19
pool[039] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 19
pool[040] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 20
pool[041] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 20
pool[042] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 21
pool[043] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 21
pool[044] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 22
pool[045] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 22
pool[046] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 23
pool[047] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 23
pool[048] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 24
pool[049] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 24
pool[050] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 25
pool[051] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 25
pool[052] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 26
pool[053] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 26
pool[054] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 27
pool[055] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 27
pool[056] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 28
pool[057] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 28
pool[058] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 29
pool[059] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 29
pool[060] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 30
pool[061] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 30
pool[062] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 31
pool[063] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 31
pool[064] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 32
pool[065] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 32
pool[066] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 33
pool[067] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 33
pool[068] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 34
pool[069] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 34
pool[070] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 35
pool[071] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 35
pool[072] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 36
pool[073] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 36
pool[074] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 37
pool[075] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 37
pool[076] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 38
pool[077] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 38
pool[078] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 39
pool[079] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 39
pool[080] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 40
pool[081] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 40
pool[082] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 41
pool[083] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 41
pool[084] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 42
pool[085] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 42
pool[086] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 43
pool[087] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 43
pool[088] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 44
pool[089] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 44
pool[090] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 45
pool[091] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 45
pool[092] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 46
pool[093] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 46
pool[094] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 47
pool[095] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 47
pool[096] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 48
pool[097] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 48
pool[098] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 49
pool[099] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 49
pool[100] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 50
pool[101] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 50
pool[102] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 51
pool[103] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 51
pool[104] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 52
pool[105] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 52
pool[106] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 53
pool[107] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 53
pool[108] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 54
pool[109] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 54
pool[110] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 55
pool[111] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 55
pool[112] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 56
pool[113] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 56
pool[114] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 57
pool[115] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 57
pool[116] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 58
pool[117] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 58
pool[118] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 59
pool[119] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 59
pool[120] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 60
pool[121] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 60
pool[122] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 61
pool[123] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 61
pool[124] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 62
pool[125] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 62
pool[126] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 63
pool[127] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 63
pool[128] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 64
pool[129] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 64
pool[130] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 65
pool[131] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 65
pool[132] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 66
pool[133] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 66
pool[134] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 67
pool[135] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 67
pool[136] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 68
pool[137] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 68
pool[138] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 69
pool[139] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 69
pool[140] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 70
pool[141] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 70
pool[142] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 71
pool[143] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 71
pool[144] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 72
pool[145] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 72
pool[146] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 73
pool[147] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 73
pool[148] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 74
pool[149] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 74
pool[150] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 75
pool[151] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 75
pool[152] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 76
pool[153] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 76
pool[154] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 77
pool[155] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 77
pool[156] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 78
pool[157] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 78
pool[158] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 79
pool[159] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 79
pool[160] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 80
pool[161] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 80
pool[162] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 81
pool[163] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 81
pool[164] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 82
pool[165] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 82
pool[166] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 83
pool[167] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 83
pool[168] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 84
pool[169] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 84
pool[170] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 85
pool[171] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 85
pool[172] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 86
pool[173] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 86
pool[174] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 87
pool[175] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 87
pool[176] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 88
pool[177] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 88
pool[178] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 89
pool[179] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 89
pool[180] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 90
pool[181] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 90
pool[182] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 91
pool[183] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 91
pool[184] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 92
pool[185] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 92
pool[186] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 93
pool[187] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 93
pool[188] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 94
pool[189] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 94
pool[190] ref= 1 nice= 0 idle/workers= 2/ 2 cpu= 95
pool[191] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 95
pool[192] ref= 78 nice= 0 idle/workers= 2/ 2 cpus=0xffffffff 0xffffffff ffffffff pod_cpus=0xffffffff 0xffffffff ffffffff
pool[193] ref=3744 nice= 0 idle/workers= 8/ 10 cpus=0xffffffff 0xffffffff ffffffff pod_cpus=0xffffff 0xffff0000 000000ff
pool[194] ref=3744 nice= 0 idle/workers= 5/ 9 cpus=0xffffffff 0xffffffff ffffffff pod_cpus=0xff000000 0xffff ffffff00
pool[195] ref= 7 nice=-20 idle/workers= 1/ 1 cpus=0xffffffff 0xffffffff ffffffff pod_cpus=0xffffffff 0xffffffff ffffffff
pool[196] ref= 336 nice=-20 idle/workers= 1/ 1 cpus=0xffffffff 0xffffffff ffffffff pod_cpus=0xffffff 0xffff0000 000000ff
pool[197] ref= 336 nice=-20 idle/workers= 2/ 2 cpus=0xffffffff 0xffffffff ffffffff pod_cpus=0xff000000 0xffff ffffff00
Workqueue CPU -> pool
=====================
[ workqueue \ type CPU 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 dfl]
events percpu 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134 136 138 140 142 144 146 148 150 152 154 156 158 160 162 164 166 168 170 172 174 176 178 180 182 184 186 188 190
events_highpri percpu 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99 101 103 105 107 109 111 113 115 117 119 121 123 125 127 129 131 133 135 137 139 141 143 145 147 149 151 153 155 157 159 161 163 165 167 169 171 173 175 177 179 181 183 185 187 189 191
events_long percpu 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134 136 138 140 142 144 146 148 150 152 154 156 158 160 162 164 166 168 170 172 174 176 178 180 182 184 186 188 190
events_unbound unbound 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 192
events_freezable percpu 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134 136 138 140 142 144 146 148 150 152 154 156 158 160 162 164 166 168 170 172 174 176 178 180 182 184 186 188 190
events_power_efficient percpu 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134 136 138 140 142 144 146 148 150 152 154 156 158 160 162 164 166 168 170 172 174 176 178 180 182 184 186 188 190
events_freezable_power_ percpu 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134 136 138 140 142 144 146 148 150 152 154 156 158 160 162 164 166 168 170 172 174 176 178 180 182 184 186 188 190
(snip)
btrfs-worker unbound 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 192
btrfs-delalloc unbound 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 192
btrfs-flush_delalloc ordered 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 192
btrfs-cache ordered 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 192
btrfs-fixup ordered 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 192
btrfs-endio unbound 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 192
btrfs-endio-meta unbound 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 192
btrfs-rmw unbound 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 192
btrfs-endio-write unbound 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 192
btrfs-compressed-write unbound 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 192
btrfs-freespace-write ordered 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 192
btrfs-delayed-meta ordered 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 192
btrfs-qgroup-rescan ordered 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 192
btrfs_discard ordered 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 192
dio/sdbj percpu 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134 136 138 140 142 144 146 148 150 152 154 156 158 160 162 164 166 168 170 172 174 176 178 180 182 184 186 188 190
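A note on reading the dump above: wq_dump.py prints each cpumask as hex words, and the CACHE pod masks suggest the low-order 32-bit word comes first (e.g. pod 0's "0xffffff 0xffff0000 000000ff" covers CPUs 0-23 and their SMT siblings 48-71). Assuming that layout, a small helper can turn a mask into a CPU list:

```python
def mask_to_cpus(mask_str):
    """Convert a wq_dump.py-style cpumask string into a list of CPU numbers.

    Assumes the printed tokens are 32-bit hex words, lowest-order word
    first -- an assumption inferred from the CACHE pod masks above.
    """
    words = [int(tok, 16) for tok in mask_str.split()]
    value = 0
    for i, word in enumerate(words):
        value |= word << (32 * i)
    # Enumerate the set bits; each set bit is one CPU in the mask.
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]

# CACHE pod 0 above: CPUs 0-23 plus their SMT siblings 48-71.
print(mask_to_cpus("0xffffff 0xffff0000 000000ff"))
```

This matches the cpu_pod table for the CACHE scope, where CPUs 0-23 and 48-71 map to pod 0 and the rest to pod 1.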
Hello,
On Mon, Jan 15, 2024 at 05:46:07AM +0000, Naohiro Aota wrote:
> CPU: Intel(R) Xeon(R) Platinum 8260 CPU, 96 cores
> NUMA nodes: 2
> RAM: 1024 GB
>
> However, for another benchmark experiment I'm doing, I booted the machine
> with "numa=off mem=16G" kernel command-line. I admit this is an unusual
> setup...
So, does that end up using only memory from one node while making the kernel
unaware of NUMA topology?
> On that machine, I create a fresh btrfs with "mkfs.btrfs -d raid0 -m raid0
> <devices>" with 6 SSD devices. And, I run the following command on the FS.
>
> fio --group_reporting --eta=always --eta-interval=30s --eta-newline=30s \
> --rw=write --fallocate=none \
> --direct=1 --ioengine=libaio --iodepth=32 \
> --filesize=100G \
> --blocksize=64k \
> --time_based --runtime=300s \
> --end_fsync=1 \
> --directory=${MNT} \
> --name=writer --numjobs=32
>
> tools/workqueue/wq_dump.py output is pasted at the
> bottom. "btrfs-endio-write" is the workqueue, which had many workers on the
> unpatched kernel.
If so, I'm not sure how meaningful the result is. e.g., the performance would
depend heavily on random factors like which threads end up on which node and
so on. Sure, if we're slow because we're creating a huge number of concurrent
workers, that's still a problem, but comparing relatively small performance
deltas might not be all that meaningful. How large is the result variance in
that setup?
> FYI, without the kernel command-line (i.e, numa=on and all RAM available as
> usual), as shown below, your patch series (v1) improved the performance
> significantly. It is even better than the reverted case.
>
> - misc-next, numa=on
> WRITE: bw=1121MiB/s (1175MB/s), 1121MiB/s-1121MiB/s (1175MB/s-1175MB/s), io=332GiB (356GB), run=303030-303030msec
> - misc-next+wq patches, numa=on
> WRITE: bw=2185MiB/s (2291MB/s), 2185MiB/s-2185MiB/s (2291MB/s-2291MB/s), io=667GiB (717GB), run=312806-312806msec
> - misc-next+wq reverted, numa=on
> WRITE: bw=1557MiB/s (1633MB/s), 1557MiB/s-1557MiB/s (1633MB/s-1633MB/s), io=659GiB (708GB), run=433426-433426msec
That looks pretty good, right?
Thanks.
--
tejun
Hello,
I'm sorry for the late response.
On Tue, Jan 16, 2024 at 11:04:53AM -1000, Tejun Heo wrote:
> Hello,
>
> On Mon, Jan 15, 2024 at 05:46:07AM +0000, Naohiro Aota wrote:
> > CPU: Intel(R) Xeon(R) Platinum 8260 CPU, 96 cores
> > NUMA nodes: 2
> > RAM: 1024 GB
> >
> > However, for another benchmark experiment I'm doing, I booted the machine
> > with "numa=off mem=16G" kernel command-line. I admit this is an unusual
> > setup...
>
> So, does that end up using only memory from one node while making the kernel
> unaware of NUMA topology?
Yes.
>
> > On that machine, I create a fresh btrfs with "mkfs.btrfs -d raid0 -m raid0
> > <devices>" with 6 SSD devices. And, I run the following command on the FS.
> >
> > fio --group_reporting --eta=always --eta-interval=30s --eta-newline=30s \
> > --rw=write --fallocate=none \
> > --direct=1 --ioengine=libaio --iodepth=32 \
> > --filesize=100G \
> > --blocksize=64k \
> > --time_based --runtime=300s \
> > --end_fsync=1 \
> > --directory=${MNT} \
> > --name=writer --numjobs=32
> >
> > tools/workqueue/wq_dump.py output is pasted at the
> > bottom. "btrfs-endio-write" is the workqueue, which had many workers on the
> > unpatched kernel.
>
> If so, I'm not sure how meaningful the result is. e.g., the performance would
> depend heavily on random factors like which threads end up on which node and
> so on. Sure, if we're slow because we're creating a huge number of concurrent
> workers, that's still a problem, but comparing relatively small performance
> deltas might not be all that meaningful. How large is the result variance in
> that setup?
Yeah, that is true. I ran the benchmark 30 times; the results ranged as
follows.
Min 1732 MiB/s - Max 2565 MiB/s
Mean: 2212.3 MiB/s Sample stddev: 320.30

For comparison, here are the results on the baseline.
Min 1113 MiB/s - Max 1498 MiB/s
Mean: 1231.85 MiB/s Sample stddev: 104.31

For reference, the results for the reverted case are as follows:
Min 2211 MiB/s - Max 2506 MiB/s
Mean: 2372.23 MiB/s Sample stddev: 82.49

So, the patched kernel is indeed better than the baseline: even the worst
patched run beats the best baseline run. And, as you suspected, the patched
version has a far larger variance than both the baseline and the reverted
case.
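As a side note, summaries in the min/max/mean/sample-stddev form above are easy to produce from raw per-run bandwidths with Python's statistics module. The numbers below are hypothetical stand-ins, not the actual 30 samples, which were not posted:

```python
import statistics

# Hypothetical per-run write bandwidths in MiB/s (made-up stand-ins for
# the real 30 samples behind the figures quoted above).
runs = [1732, 1980, 2100, 2350, 2500, 2565]

print(f"Min {min(runs)} MiB/s - Max {max(runs)} MiB/s")
# statistics.stdev() computes the *sample* standard deviation (n - 1
# denominator), matching the "Sample stddev" figures in this thread.
print(f"Mean: {statistics.mean(runs):.2f} MiB/s "
      f"Sample stddev: {statistics.stdev(runs):.2f}")
```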
>
> > FYI, without the kernel command-line (i.e, numa=on and all RAM available as
> > usual), as shown below, your patch series (v1) improved the performance
> > significantly. It is even better than the reverted case.
> >
> > - misc-next, numa=on
> > WRITE: bw=1121MiB/s (1175MB/s), 1121MiB/s-1121MiB/s (1175MB/s-1175MB/s), io=332GiB (356GB), run=303030-303030msec
> > - misc-next+wq patches, numa=on
> > WRITE: bw=2185MiB/s (2291MB/s), 2185MiB/s-2185MiB/s (2291MB/s-2291MB/s), io=667GiB (717GB), run=312806-312806msec
> > - misc-next+wq reverted, numa=on
> > WRITE: bw=1557MiB/s (1633MB/s), 1557MiB/s-1557MiB/s (1633MB/s-1633MB/s), io=659GiB (708GB), run=433426-433426msec
>
> That looks pretty good, right?
Yes, it looks quite good. Since the numa=off case is unusual and has a large
variance, I believe this patch series is a solid improvement.
>
> Thanks.
>
> --
> tejun
Hello,
On Tue, Jan 30, 2024 at 02:24:47AM +0000, Naohiro Aota wrote:
> > If so, I'm not sure how meaningful the result is. e.g., the performance would
> > depend heavily on random factors like which threads end up on which node and
> > so on. Sure, if we're slow because we're creating a huge number of concurrent
> > workers, that's still a problem, but comparing relatively small performance
> > deltas might not be all that meaningful. How large is the result variance in
> > that setup?
>
> Yeah, that is true. I ran the benchmark 30 times; the results ranged as
> follows.
> Min 1732 MiB/s - Max 2565 MiB/s
> Mean: 2212.3 MiB/s Sample stddev: 320.30
>
> For comparison, here are the results on the baseline.
> Min 1113 MiB/s - Max 1498 MiB/s
> Mean: 1231.85 MiB/s Sample stddev: 104.31
>
> For reference, the results for the reverted case are as follows:
> Min 2211 MiB/s - Max 2506 MiB/s
> Mean: 2372.23 MiB/s Sample stddev: 82.49
>
> So, the patched kernel is indeed better than the baseline: even the worst
> patched run beats the best baseline run. And, as you suspected, the patched
> version has a far larger variance than both the baseline and the reverted
> case.
Yeah, the average staying similar while the variance gets way larger makes
sense. Before the revert, work was being sprayed across the whole machine.
After, execution within each run is more sticky, so you basically end up
amplifying the variance.
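As a rough sketch of that effect (with made-up throughput numbers, purely to illustrate the shape of the argument): if each run sticks to one of two pods with different effective throughputs, the long-run mean is unchanged but the run-to-run variance grows, whereas spraying every run across both pods averages the difference away:

```python
import statistics

# Two pods with different effective throughputs for the workload
# (hypothetical MiB/s figures, not measurements from this thread).
pod_rate = [1500.0, 2500.0]

# Sticky execution: each run lands mostly on one pod or the other.
sticky = [pod_rate[run % 2] for run in range(10)]

# Sprayed execution: every run averages over both pods.
sprayed = [sum(pod_rate) / len(pod_rate)] * 10

# Same mean either way, but the sticky variant has a much larger
# run-to-run sample standard deviation.
print(statistics.mean(sticky), statistics.stdev(sticky))
print(statistics.mean(sprayed), statistics.stdev(sprayed))
```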
> > > FYI, without the kernel command-line (i.e, numa=on and all RAM available as
> > > usual), as shown below, your patch series (v1) improved the performance
> > > significantly. It is even better than the reverted case.
> > >
> > > - misc-next, numa=on
> > > WRITE: bw=1121MiB/s (1175MB/s), 1121MiB/s-1121MiB/s (1175MB/s-1175MB/s), io=332GiB (356GB), run=303030-303030msec
> > > - misc-next+wq patches, numa=on
> > > WRITE: bw=2185MiB/s (2291MB/s), 2185MiB/s-2185MiB/s (2291MB/s-2291MB/s), io=667GiB (717GB), run=312806-312806msec
> > > - misc-next+wq reverted, numa=on
> > > WRITE: bw=1557MiB/s (1633MB/s), 1557MiB/s-1557MiB/s (1633MB/s-1633MB/s), io=659GiB (708GB), run=433426-433426msec
> >
> > That looks pretty good, right?
>
> Yes, it looks quite good. Since the numa=off case is unusual and has a
> large variance, I believe this patch series is a solid improvement.
Great to hear.
Thanks.
--
tejun