2013-03-20 00:00:38

by Tejun Heo

[permalink] [raw]
Subject: [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues

Hello,

There are two types of workqueues - per-cpu and unbound. The former
is bound to each CPU and the latter isn't bound to any CPU by default.
While the recently added attrs support allows unbound workqueues to be
confined to a subset of CPUs, it is still quite cumbersome for
applications where strict CPU affinity is too constricting but NUMA
locality still matters.

This patchset tries to solve that issue by automatically making
unbound workqueues affine to NUMA nodes by default. A work item
queued to an unbound workqueue is executed on one of the CPUs allowed
by the workqueue in the same node. If no allowed CPU exists in that
node, it may be executed on any CPU allowed by the workqueue. This
doesn't require any changes on the user side; every workqueue
interface functions the same as before.
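
Purely for illustration, here's a minimal sketch of what an unbound
workqueue user already looks like - the names are made up and nothing
in it is new. With this patchset applied, the work item below is
simply preferred to run on a CPU in the same NUMA node as the CPU it
was queued from, with no change to the code:

	#include <linux/workqueue.h>

	static void my_work_fn(struct work_struct *work)
	{
		/* with NUMA affinity, runs preferentially on the issuing node */
	}

	static DECLARE_WORK(my_work, my_work_fn);
	static struct workqueue_struct *my_wq;

	static int __init my_init(void)
	{
		/* a plain unbound workqueue - no new flags or attrs needed */
		my_wq = alloc_workqueue("my_wq", WQ_UNBOUND, 0);
		if (!my_wq)
			return -ENOMEM;

		queue_work(my_wq, &my_work);
		return 0;
	}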

This would be most helpful to subsystems which use some form of async
execution to process a significant amount of data - e.g. crypto and
btrfs; however, I wanted to find out whether it would make any dent in
much less favorable use cases. The following is the total run time in
seconds of building an allmodconfig kernel w/ -j20 on a dual-socket
Opteron machine with the writeback thread pool converted to an unbound
workqueue and thus made NUMA-affine. The file system is ext4 on top
of a WD SSD.

          before conversion      after conversion
              1396.126               1394.763
              1397.621               1394.965
              1399.636               1394.738
              1397.463               1398.162
              1395.543               1393.670

 AVG          1397.278               1395.260        DIFF  2.018
 STDEV           1.585                  1.700

And, yes, it actually made things go faster by about 1.2 sigma, which
isn't completely conclusive but is a pretty good indication that it's
actually faster. Note that this is a workload which is dominated by
CPU time, and while there's writeback going on continuously, it really
isn't touching much data and isn't a dominating factor, so the gain is
understandably small, 0.14%, but hey, it's still a gain, and it should
be much more interesting for crypto and btrfs, which would actually
access the data, or for workloads which are more sensitive to NUMA
affinity.
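
For reference, the back-of-the-envelope arithmetic behind the "1.2
sigma" and "0.14%" figures above, using the numbers from the table:

	DIFF / sigma ~= 2.018 / ((1.585 + 1.700) / 2) ~= 2.018 / 1.64 ~= 1.2
	DIFF / AVG   ~= 2.018 / 1397.278 ~= 0.0014 ~= 0.14%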

The implementation is fairly simple. After the recent attrs support
changes, a lot of the differences in pwq (pool_workqueue) handling
between unbound and per-cpu workqueues are gone. An unbound workqueue
still has one "current" pwq that it uses for queueing any new work
items but can handle multiple pwqs perfectly well while they're
draining, so this patchset adds a pwq dispatch table to unbound
workqueues which is indexed by NUMA node and points to the matching
pwq. Unbound workqueues now simply have multiple "current" pwqs keyed
by NUMA node.
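
In (simplified) code, the work item issue path ends up doing roughly
the following - this is only a sketch of the idea; the real version in
patches 0007 and 0009 also deals with locking and pwq release:

	if (!(wq->flags & WQ_UNBOUND))
		/* per-cpu workqueue: use the pwq of the local CPU */
		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
	else
		/* unbound workqueue: use the pwq of the local NUMA node */
		pwq = rcu_dereference_sched(wq->numa_pwq_tbl[cpu_to_node(cpu)]);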

NUMA affinity can be turned off system-wide with the
workqueue.disable_numa kernel param or per-workqueue using the "numa"
sysfs file.

This patchset contains the following ten patches.

0001-workqueue-add-wq_numa_tbl_len-and-wq_numa_possible_c.patch
0002-workqueue-drop-H-from-kworker-names-of-unbound-worke.patch
0003-workqueue-determine-NUMA-node-of-workers-accourding-.patch
0004-workqueue-add-workqueue-unbound_attrs.patch
0005-workqueue-make-workqueue-name-fixed-len.patch
0006-workqueue-move-hot-fields-of-workqueue_struct-to-the.patch
0007-workqueue-map-an-unbound-workqueues-to-multiple-per-.patch
0008-workqueue-break-init_and_link_pwq-into-two-functions.patch
0009-workqueue-implement-NUMA-affinity-for-unbound-workqu.patch
0010-workqueue-update-sysfs-interface-to-reflect-NUMA-awa.patch

0001 adds basic NUMA topology knowledge to workqueue.

0002-0006 are prep patches.

0007-0009 implement NUMA affinity.

0010 adds control knobs and updates sysfs interface.

This patchset is on top of

wq/for-3.10 a0265a7f51 ("workqueue: define workqueue_freezing static variable iff CONFIG_FREEZER")

and also available in the following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-numa

diffstat follows.

Documentation/kernel-parameters.txt | 9
include/linux/workqueue.h | 5
kernel/workqueue.c | 393 ++++++++++++++++++++++++++++--------
3 files changed, 325 insertions(+), 82 deletions(-)

Thanks.

--
tejun


2013-03-20 00:00:41

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 01/10] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]

Unbound workqueues are going to be NUMA-affine. Add wq_numa_tbl_len
and wq_numa_possible_cpumask[] in preparation. The former is the
highest NUMA node ID + 1 and the latter is masks of possible CPUs for
each NUMA node.

This patch only introduces these. Future patches will make use of
them.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 35 ++++++++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 775c2f4..9b096e3 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -44,6 +44,7 @@
#include <linux/jhash.h>
#include <linux/hashtable.h>
#include <linux/rculist.h>
+#include <linux/nodemask.h>

#include "workqueue_internal.h"

@@ -256,6 +257,11 @@ struct workqueue_struct {

static struct kmem_cache *pwq_cache;

+static int wq_numa_tbl_len; /* highest possible NUMA node id + 1 */
+static cpumask_var_t *wq_numa_possible_cpumask;
+ /* possible CPUs of each node, may
+ be NULL if init failed */
+
static DEFINE_MUTEX(wq_mutex); /* protects workqueues and pools */
static DEFINE_SPINLOCK(pwq_lock); /* protects pool_workqueues */
static DEFINE_SPINLOCK(wq_mayday_lock); /* protects wq->maydays list */
@@ -4416,7 +4422,7 @@ out_unlock:
static int __init init_workqueues(void)
{
int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
- int i, cpu;
+ int i, node, cpu;

/* make sure we have enough bits for OFFQ pool ID */
BUILD_BUG_ON((1LU << (BITS_PER_LONG - WORK_OFFQ_POOL_SHIFT)) <
@@ -4429,6 +4435,33 @@ static int __init init_workqueues(void)
cpu_notifier(workqueue_cpu_up_callback, CPU_PRI_WORKQUEUE_UP);
hotcpu_notifier(workqueue_cpu_down_callback, CPU_PRI_WORKQUEUE_DOWN);

+ /* determine NUMA pwq table len - highest node id + 1 */
+ for_each_node(node)
+ wq_numa_tbl_len = max(wq_numa_tbl_len, node + 1);
+
+ /*
+ * We want masks of possible CPUs of each node which isn't readily
+ * available. Build one from cpu_to_node() which should have been
+ * fully initialized by now.
+ */
+ wq_numa_possible_cpumask = kzalloc(wq_numa_tbl_len *
+ sizeof(wq_numa_possible_cpumask[0]),
+ GFP_KERNEL);
+ BUG_ON(!wq_numa_possible_cpumask);
+
+ for_each_node(node)
+ BUG_ON(!alloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
+ GFP_KERNEL, node));
+ for_each_possible_cpu(cpu) {
+ node = cpu_to_node(cpu);
+ if (WARN_ON(node == NUMA_NO_NODE)) {
+ pr_err("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
+ wq_numa_possible_cpumask = NULL;
+ break;
+ }
+ cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+ }
+
/* initialize CPU pools */
for_each_possible_cpu(cpu) {
struct worker_pool *pool;
--
1.8.1.4

2013-03-20 00:00:44

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 02/10] workqueue: drop 'H' from kworker names of unbound worker pools

Currently, all workqueue workers which have a negative nice value have
'H' postfixed to their names. This is necessary for per-cpu workers
as they use the CPU number instead of pool->id to identify the pool
and the 'H' postfix is the only thing distinguishing normal and
highpri workers.

As workers for unbound pools use pool->id, the 'H' postfix is purely
informational. TASK_COMM_LEN is 16 and after the static part and
delimiters, there are only five characters left for the pool and
worker IDs. We're expecting to have more unbound pools with the
scheduled NUMA awareness support. Let's drop the non-essential 'H'
postfix from unbound kworker names.

While at it, restructure kthread_create*() invocation to help future
NUMA related changes.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9b096e3..c1a931c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1648,9 +1648,10 @@ static struct worker *alloc_worker(void)
*/
static struct worker *create_worker(struct worker_pool *pool)
{
- const char *pri = pool->attrs->nice < 0 ? "H" : "";
struct worker *worker = NULL;
+ int node = pool->cpu >= 0 ? cpu_to_node(pool->cpu) : NUMA_NO_NODE;
int id = -1;
+ char id_buf[16];

lockdep_assert_held(&pool->manager_mutex);

@@ -1676,13 +1677,13 @@ static struct worker *create_worker(struct worker_pool *pool)
worker->id = id;

if (pool->cpu >= 0)
- worker->task = kthread_create_on_node(worker_thread,
- worker, cpu_to_node(pool->cpu),
- "kworker/%d:%d%s", pool->cpu, id, pri);
+ snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
+ pool->attrs->nice < 0 ? "H" : "");
else
- worker->task = kthread_create(worker_thread, worker,
- "kworker/u%d:%d%s",
- pool->id, id, pri);
+ snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
+
+ worker->task = kthread_create_on_node(worker_thread, worker, node,
+ "kworker/%s", id_buf);
if (IS_ERR(worker->task))
goto fail;

--
1.8.1.4

2013-03-20 00:00:53

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 03/10] workqueue: determine NUMA node of workers according to the allowed cpumask

When worker tasks are created using kthread_create_on_node(),
currently only per-cpu ones have the matching NUMA node specified.
All unbound workers are always created with NUMA_NO_NODE.

Now that an unbound worker pool may have an arbitrary cpumask
associated with it, this isn't optimal. Add pool->node which is
determined by the pool's cpumask. If the pool's cpumask is contained
inside a NUMA node proper, the pool is associated with that node, and
all workers of the pool are created on that node.

This currently only makes a difference for unbound worker pools with a
cpumask contained inside a single NUMA node, but this will serve as
the foundation for making all unbound pools NUMA-affine.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c1a931c..2768ed2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -141,6 +141,7 @@ enum {
struct worker_pool {
spinlock_t lock; /* the pool lock */
int cpu; /* I: the associated cpu */
+ int node; /* I: the associated node ID */
int id; /* I: pool ID */
unsigned int flags; /* X: flags */

@@ -1649,7 +1650,6 @@ static struct worker *alloc_worker(void)
static struct worker *create_worker(struct worker_pool *pool)
{
struct worker *worker = NULL;
- int node = pool->cpu >= 0 ? cpu_to_node(pool->cpu) : NUMA_NO_NODE;
int id = -1;
char id_buf[16];

@@ -1682,7 +1682,7 @@ static struct worker *create_worker(struct worker_pool *pool)
else
snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);

- worker->task = kthread_create_on_node(worker_thread, worker, node,
+ worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
"kworker/%s", id_buf);
if (IS_ERR(worker->task))
goto fail;
@@ -3390,6 +3390,7 @@ static int init_worker_pool(struct worker_pool *pool)
spin_lock_init(&pool->lock);
pool->id = -1;
pool->cpu = -1;
+ pool->node = NUMA_NO_NODE;
pool->flags |= POOL_DISASSOCIATED;
INIT_LIST_HEAD(&pool->worklist);
INIT_LIST_HEAD(&pool->idle_list);
@@ -3496,6 +3497,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
{
u32 hash = wqattrs_hash(attrs);
struct worker_pool *pool;
+ int node;

mutex_lock(&wq_mutex);

@@ -3515,6 +3517,17 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
lockdep_set_subclass(&pool->lock, 1); /* see put_pwq() */
copy_workqueue_attrs(pool->attrs, attrs);

+ /* if cpumask is contained inside a NUMA node, we belong to that node */
+ if (wq_numa_possible_cpumask) {
+ for_each_node(node) {
+ if (cpumask_subset(pool->attrs->cpumask,
+ wq_numa_possible_cpumask[node])) {
+ pool->node = node;
+ break;
+ }
+ }
+ }
+
if (worker_pool_assign_id(pool) < 0)
goto fail;

@@ -4473,6 +4486,7 @@ static int __init init_workqueues(void)
pool->cpu = cpu;
cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
pool->attrs->nice = std_nice[i++];
+ pool->node = cpu_to_node(cpu);

/* alloc pool ID */
mutex_lock(&wq_mutex);
--
1.8.1.4

2013-03-20 00:00:59

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 06/10] workqueue: move hot fields of workqueue_struct to the end

Move wq->flags and ->cpu_pwqs to the end of workqueue_struct and align
them to the cacheline. These two fields are used in the work item
issue path and thus hot. The scheduled NUMA affinity support will add
a dispatch table at the end of workqueue_struct, and relocating these
two fields allows us to hit only a single cacheline on hot paths.

Note that wq->pwqs isn't moved although it currently is being used in
the work item issue path for unbound workqueues. The dispatch table
mentioned above will replace its use in the issue path, so it will
become cold once NUMA support is implemented.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 151ce49..25dab9d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -230,8 +230,6 @@ struct wq_device;
* the appropriate worker_pool through its pool_workqueues.
*/
struct workqueue_struct {
- unsigned int flags; /* WQ: WQ_* flags */
- struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwq's */
struct list_head pwqs; /* FR: all pwqs of this wq */
struct list_head list; /* WQ: list of all workqueues */

@@ -258,6 +256,10 @@ struct workqueue_struct {
struct lockdep_map lockdep_map;
#endif
char name[WQ_NAME_LEN]; /* I: workqueue name */
+
+ /* hot fields used during command issue, aligned to cacheline */
+ unsigned int flags ____cacheline_aligned; /* WQ: WQ_* flags */
+ struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwqs */
};

static struct kmem_cache *pwq_cache;
--
1.8.1.4

2013-03-20 00:01:11

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 07/10] workqueue: map an unbound workqueue to multiple per-node pool_workqueues

Currently, an unbound workqueue has only one "current" pool_workqueue
associated with it. It may have multiple pool_workqueues but only the
first pool_workqueue serves new work items. For NUMA affinity, we
want to change this so that there are multiple current pool_workqueues
serving different NUMA nodes.

Introduce workqueue->numa_pwq_tbl[] which is indexed by NUMA node and
points to the pool_workqueue to use for each possible node. This
replaces first_pwq() in __queue_work() and workqueue_congested().

numa_pwq_tbl[] is currently initialized to point to the same
pool_workqueue as first_pwq() so this patch doesn't make any behavior
changes.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 48 +++++++++++++++++++++++++++++++++++++-----------
1 file changed, 37 insertions(+), 11 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 25dab9d..3f820a5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -260,6 +260,7 @@ struct workqueue_struct {
/* hot fields used during command issue, aligned to cacheline */
unsigned int flags ____cacheline_aligned; /* WQ: WQ_* flags */
struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwqs */
+ struct pool_workqueue __rcu *numa_pwq_tbl[]; /* FR: unbound pwqs indexed by node */
};

static struct kmem_cache *pwq_cache;
@@ -529,6 +530,22 @@ static struct pool_workqueue *first_pwq(struct workqueue_struct *wq)
pwqs_node);
}

+/**
+ * unbound_pwq_by_node - return the unbound pool_workqueue for the given node
+ * @wq: the target workqueue
+ * @node: the node ID
+ *
+ * This must be called either with pwq_lock held or sched RCU read locked.
+ * If the pwq needs to be used beyond the locking in effect, the caller is
+ * responsible for guaranteeing that the pwq stays online.
+ */
+static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
+ int node)
+{
+ assert_rcu_or_pwq_lock();
+ return rcu_dereference_sched(wq->numa_pwq_tbl[node]);
+}
+
static unsigned int work_color_to_flags(int color)
{
return color << WORK_STRUCT_COLOR_SHIFT;
@@ -1282,14 +1299,14 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
WARN_ON_ONCE(!is_chained_work(wq)))
return;
retry:
+ if (req_cpu == WORK_CPU_UNBOUND)
+ cpu = raw_smp_processor_id();
+
/* pwq which will be used unless @work is executing elsewhere */
- if (!(wq->flags & WQ_UNBOUND)) {
- if (cpu == WORK_CPU_UNBOUND)
- cpu = raw_smp_processor_id();
+ if (!(wq->flags & WQ_UNBOUND))
pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
- } else {
- pwq = first_pwq(wq);
- }
+ else
+ pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));

/*
* If @work was previously on a different pool, it might still be
@@ -1319,8 +1336,8 @@ retry:
* pwq is determined and locked. For unbound pools, we could have
* raced with pwq release and it could already be dead. If its
* refcnt is zero, repeat pwq selection. Note that pwqs never die
- * without another pwq replacing it as the first pwq or while a
- * work item is executing on it, so the retying is guaranteed to
+ * without another pwq replacing it in the numa_pwq_tbl or while
+ * work items are executing on it, so the retrying is guaranteed to
* make forward-progress.
*/
if (unlikely(!pwq->refcnt)) {
@@ -3635,6 +3652,8 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
struct worker_pool *pool,
struct pool_workqueue **p_last_pwq)
{
+ int node;
+
BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);

pwq->pool = pool;
@@ -3662,8 +3681,11 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
/* link in @pwq */
list_add_rcu(&pwq->pwqs_node, &wq->pwqs);

- if (wq->flags & WQ_UNBOUND)
+ if (wq->flags & WQ_UNBOUND) {
copy_workqueue_attrs(wq->unbound_attrs, pool->attrs);
+ for_each_node(node)
+ rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+ }

spin_unlock_irq(&pwq_lock);
mutex_unlock(&wq->flush_mutex);
@@ -3759,12 +3781,16 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
struct lock_class_key *key,
const char *lock_name, ...)
{
+ size_t tbl_size = 0;
va_list args;
struct workqueue_struct *wq;
struct pool_workqueue *pwq;

/* allocate wq and format name */
- wq = kzalloc(sizeof(*wq), GFP_KERNEL);
+ if (flags & WQ_UNBOUND)
+ tbl_size = wq_numa_tbl_len * sizeof(wq->numa_pwq_tbl[0]);
+
+ wq = kzalloc(sizeof(*wq) + tbl_size, GFP_KERNEL);
if (!wq)
return NULL;

@@ -3991,7 +4017,7 @@ bool workqueue_congested(int cpu, struct workqueue_struct *wq)
if (!(wq->flags & WQ_UNBOUND))
pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
else
- pwq = first_pwq(wq);
+ pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));

ret = !list_empty(&pwq->delayed_works);
preempt_enable();
--
1.8.1.4

2013-03-20 00:01:18

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 10/10] workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity

Unbound workqueues are now NUMA aware. Let's add some control knobs
and update the sysfs interface accordingly.

* Add a kernel param, workqueue.disable_numa, which disables NUMA
affinity globally.

* Replace the sysfs file "pool_id" with "pool_ids" which contains
node:pool_id pairs. This change is userland-visible but "pool_id"
hasn't seen a release yet, so this is okay.

* Add a new sysfs file "numa" which can toggle NUMA affinity on
individual workqueues. This is implemented as attrs->no_numa which
is special in that it isn't part of a pool's attributes. It only
affects how apply_workqueue_attrs() picks which pools to use.

After "pool_ids" change, first_pwq() doesn't have any user left.
Removed.

Signed-off-by: Tejun Heo <[email protected]>
---
Documentation/kernel-parameters.txt | 9 +++
include/linux/workqueue.h | 5 ++
kernel/workqueue.c | 125 +++++++++++++++++++++++++-----------
3 files changed, 102 insertions(+), 37 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 4609e81..c75ea0b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3222,6 +3222,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
or other driver-specific files in the
Documentation/watchdog/ directory.

+ workqueue.disable_numa
+ By default, all work items queued to unbound
+ workqueues are affine to the NUMA nodes they're
+ issued on, which results in better behavior in
+ general. If NUMA affinity needs to be disabled for
+ whatever reason, this option can be used. Note
+ that this also can be controlled per-workqueue for
+ workqueues visible under /sys/bus/workqueue/.
+
x2apic_phys [X86-64,APIC] Use x2apic physical mode instead of
default x2apic cluster mode on platforms
supporting x2apic.
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 835d12b..7179756 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -119,10 +119,15 @@ struct delayed_work {
/*
* A struct for workqueue attributes. This can be used to change
* attributes of an unbound workqueue.
+ *
+ * Unlike other fields, ->no_numa isn't a property of a worker_pool. It
+ * only modifies how apply_workqueue_attrs() select pools and thus doesn't
+ * participate in pool hash calculations or equality comparisons.
*/
struct workqueue_attrs {
int nice; /* nice level */
cpumask_var_t cpumask; /* allowed CPUs */
+ bool no_numa; /* disable NUMA affinity */
};

static inline struct delayed_work *to_delayed_work(struct work_struct *work)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0c36327..b48373a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -45,6 +45,7 @@
#include <linux/hashtable.h>
#include <linux/rculist.h>
#include <linux/nodemask.h>
+#include <linux/moduleparam.h>

#include "workqueue_internal.h"

@@ -302,6 +303,9 @@ EXPORT_SYMBOL_GPL(system_unbound_wq);
struct workqueue_struct *system_freezable_wq __read_mostly;
EXPORT_SYMBOL_GPL(system_freezable_wq);

+static bool wq_disable_numa;
+module_param_named(disable_numa, wq_disable_numa, bool, 0444);
+
static int worker_thread(void *__worker);
static void copy_workqueue_attrs(struct workqueue_attrs *to,
const struct workqueue_attrs *from);
@@ -516,21 +520,6 @@ static int worker_pool_assign_id(struct worker_pool *pool)
}

/**
- * first_pwq - return the first pool_workqueue of the specified workqueue
- * @wq: the target workqueue
- *
- * This must be called either with pwq_lock held or sched RCU read locked.
- * If the pwq needs to be used beyond the locking in effect, the caller is
- * responsible for guaranteeing that the pwq stays online.
- */
-static struct pool_workqueue *first_pwq(struct workqueue_struct *wq)
-{
- assert_rcu_or_pwq_lock();
- return list_first_or_null_rcu(&wq->pwqs, struct pool_workqueue,
- pwqs_node);
-}
-
-/**
* unbound_pwq_by_node - return the unbound pool_workqueue for the given node
* @wq: the target workqueue
* @node: the node ID
@@ -3101,16 +3090,21 @@ static struct device_attribute wq_sysfs_attrs[] = {
__ATTR_NULL,
};

-static ssize_t wq_pool_id_show(struct device *dev,
- struct device_attribute *attr, char *buf)
+static ssize_t wq_pool_ids_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
{
struct workqueue_struct *wq = dev_to_wq(dev);
- struct worker_pool *pool;
- int written;
+ const char *delim = "";
+ int node, written = 0;

rcu_read_lock_sched();
- pool = first_pwq(wq)->pool;
- written = scnprintf(buf, PAGE_SIZE, "%d\n", pool->id);
+ for_each_node(node) {
+ written += scnprintf(buf + written, PAGE_SIZE - written,
+ "%s%d:%d", delim, node,
+ unbound_pwq_by_node(wq, node)->pool->id);
+ delim = " ";
+ }
+ written += scnprintf(buf + written, PAGE_SIZE - written, "\n");
rcu_read_unlock_sched();

return written;
@@ -3199,10 +3193,52 @@ static ssize_t wq_cpumask_store(struct device *dev,
return ret ?: count;
}

+static ssize_t wq_numa_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ int written;
+
+ mutex_lock(&wq_mutex);
+ written = scnprintf(buf, PAGE_SIZE, "%d\n",
+ !wq->unbound_attrs->no_numa &&
+ wq_numa_possible_cpumask);
+ mutex_unlock(&wq_mutex);
+
+ return written;
+}
+
+static ssize_t wq_numa_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ struct workqueue_attrs *attrs;
+ int v, ret;
+
+ attrs = wq_sysfs_prep_attrs(wq);
+ if (!attrs)
+ return -ENOMEM;
+
+ ret = -EINVAL;
+ if (sscanf(buf, "%d", &v) == 1) {
+ if (!v || wq_numa_possible_cpumask) {
+ attrs->no_numa = !v;
+ ret = apply_workqueue_attrs(wq, attrs);
+ } else {
+ printk_ratelimited(KERN_WARNING "workqueue: can't enable NUMA affinity for \"%s\", disabled system-wide\n",
+ wq->name);
+ }
+ }
+
+ free_workqueue_attrs(attrs);
+ return ret ?: count;
+}
+
static struct device_attribute wq_sysfs_unbound_attrs[] = {
- __ATTR(pool_id, 0444, wq_pool_id_show, NULL),
+ __ATTR(pool_ids, 0444, wq_pool_ids_show, NULL),
__ATTR(nice, 0644, wq_nice_show, wq_nice_store),
__ATTR(cpumask, 0644, wq_cpumask_show, wq_cpumask_store),
+ __ATTR(numa, 0644, wq_numa_show, wq_numa_store),
__ATTR_NULL,
};

@@ -3725,6 +3761,7 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
{
struct pool_workqueue **pwq_tbl = NULL, *dfl_pwq = NULL;
struct workqueue_attrs *tmp_attrs = NULL;
+ bool do_numa = !attrs->no_numa && wq_numa_possible_cpumask;
int node;

/* only unbound workqueues can change attributes */
@@ -3740,7 +3777,15 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
if (!pwq_tbl || !tmp_attrs)
goto enomem;

+ /*
+ * We'll be creating multiple pwqs with differing cpumasks. Make a
+ * copy of @attrs which will be modified and used to obtain pools.
+ * no_numa attribute is special in that it isn't a part of pool
+ * attributes but modifies how pools are selected in this function.
+ * Let's not leak no_numa to pool handling functions.
+ */
copy_workqueue_attrs(tmp_attrs, attrs);
+ tmp_attrs->no_numa = false;

/*
* We want NUMA affinity. For each node with intersecting possible
@@ -3755,7 +3800,7 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
* Just fall through if NUMA affinity isn't enabled. We'll
* end up using the default pwq which is what we want.
*/
- if (wq_numa_possible_cpumask) {
+ if (do_numa) {
cpumask_and(cpumask, wq_numa_possible_cpumask[node],
attrs->cpumask);
if (cpumask_empty(cpumask))
@@ -4588,22 +4633,28 @@ static int __init init_workqueues(void)
* available. Build one from cpu_to_node() which should have been
* fully initialized by now.
*/
- wq_numa_possible_cpumask = kzalloc(wq_numa_tbl_len *
- sizeof(wq_numa_possible_cpumask[0]),
- GFP_KERNEL);
- BUG_ON(!wq_numa_possible_cpumask);
+ if (!wq_disable_numa) {
+ static cpumask_var_t *tbl;

- for_each_node(node)
- BUG_ON(!alloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
- GFP_KERNEL, node));
- for_each_possible_cpu(cpu) {
- node = cpu_to_node(cpu);
- if (WARN_ON(node == NUMA_NO_NODE)) {
- pr_err("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
- wq_numa_possible_cpumask = NULL;
- break;
+ tbl = kzalloc(wq_numa_tbl_len * sizeof(tbl[0]), GFP_KERNEL);
+ BUG_ON(!tbl);
+
+ for_each_node(node)
+ BUG_ON(!alloc_cpumask_var_node(&tbl[node], GFP_KERNEL,
+ node));
+ for_each_possible_cpu(cpu) {
+ node = cpu_to_node(cpu);
+ if (WARN_ON(node == NUMA_NO_NODE)) {
+ pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
+ tbl = NULL;
+ break;
+ }
+ cpumask_set_cpu(cpu, tbl[node]);
}
- cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+
+ wq_numa_possible_cpumask = tbl;
+ } else {
+ pr_info("workqueue: NUMA affinity support disabled\n");
}

/* initialize CPU pools */
--
1.8.1.4

2013-03-20 00:01:09

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 08/10] workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()

Break init_and_link_pwq() into init_pwq() and link_pwq() and move
unbound-workqueue specific handling into apply_workqueue_attrs().
Also, factor out unbound pool and pool_workqueue allocation into
alloc_unbound_pwq().

This reorganization is to prepare for NUMA affinity and doesn't
introduce any functional changes.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 75 +++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 52 insertions(+), 23 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 3f820a5..bbbfc92 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3647,13 +3647,10 @@ static void pwq_adjust_max_active(struct pool_workqueue *pwq)
spin_unlock(&pwq->pool->lock);
}

-static void init_and_link_pwq(struct pool_workqueue *pwq,
- struct workqueue_struct *wq,
- struct worker_pool *pool,
- struct pool_workqueue **p_last_pwq)
+/* initialize newly zalloced @pwq which is associated with @wq and @pool */
+static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
+ struct worker_pool *pool)
{
- int node;
-
BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);

pwq->pool = pool;
@@ -3663,9 +3660,16 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
INIT_LIST_HEAD(&pwq->delayed_works);
INIT_LIST_HEAD(&pwq->mayday_node);
INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
+}

- mutex_lock(&wq->flush_mutex);
- spin_lock_irq(&pwq_lock);
+/* sync @pwq with the current state of its associated wq and link it */
+static void link_pwq(struct pool_workqueue *pwq,
+ struct pool_workqueue **p_last_pwq)
+{
+ struct workqueue_struct *wq = pwq->wq;
+
+ lockdep_assert_held(&wq->flush_mutex);
+ lockdep_assert_held(&pwq_lock);

/*
* Set the matching work_color. This is synchronized with
@@ -3680,15 +3684,27 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,

/* link in @pwq */
list_add_rcu(&pwq->pwqs_node, &wq->pwqs);
+}

- if (wq->flags & WQ_UNBOUND) {
- copy_workqueue_attrs(wq->unbound_attrs, pool->attrs);
- for_each_node(node)
- rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+/* obtain a pool matching @attr and create a pwq associating the pool and @wq */
+static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
+ const struct workqueue_attrs *attrs)
+{
+ struct worker_pool *pool;
+ struct pool_workqueue *pwq;
+
+ pool = get_unbound_pool(attrs);
+ if (!pool)
+ return NULL;
+
+ pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
+ if (!pwq) {
+ put_unbound_pool(pool);
+ return NULL;
}

- spin_unlock_irq(&pwq_lock);
- mutex_unlock(&wq->flush_mutex);
+ init_pwq(pwq, wq, pool);
+ return pwq;
}

/**
@@ -3709,7 +3725,7 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs)
{
struct pool_workqueue *pwq, *last_pwq;
- struct worker_pool *pool;
+ int node;

/* only unbound workqueues can change attributes */
if (WARN_ON(!(wq->flags & WQ_UNBOUND)))
@@ -3719,17 +3735,22 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
return -EINVAL;

- pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
+ pwq = alloc_unbound_pwq(wq, attrs);
if (!pwq)
return -ENOMEM;

- pool = get_unbound_pool(attrs);
- if (!pool) {
- kmem_cache_free(pwq_cache, pwq);
- return -ENOMEM;
- }
+ mutex_lock(&wq->flush_mutex);
+ spin_lock_irq(&pwq_lock);
+
+ link_pwq(pwq, &last_pwq);
+
+ copy_workqueue_attrs(wq->unbound_attrs, pwq->pool->attrs);
+ for_each_node(node)
+ rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+
+ spin_unlock_irq(&pwq_lock);
+ mutex_unlock(&wq->flush_mutex);

- init_and_link_pwq(pwq, wq, pool, &last_pwq);
if (last_pwq) {
spin_lock_irq(&last_pwq->pool->lock);
put_pwq(last_pwq);
@@ -3755,7 +3776,15 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
struct worker_pool *cpu_pools =
per_cpu(cpu_worker_pools, cpu);

- init_and_link_pwq(pwq, wq, &cpu_pools[highpri], NULL);
+ init_pwq(pwq, wq, &cpu_pools[highpri]);
+
+ mutex_lock(&wq->flush_mutex);
+ spin_lock_irq(&pwq_lock);
+
+ link_pwq(pwq, NULL);
+
+ spin_unlock_irq(&pwq_lock);
+ mutex_unlock(&wq->flush_mutex);
}
return 0;
} else {
--
1.8.1.4

2013-03-20 00:01:58

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 09/10] workqueue: implement NUMA affinity for unbound workqueues

Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.

This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has possible CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering the whole @attrs->cpumask.

This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.

As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control. Concurrency is
usually limited on the workqueue users' side by the number of
concurrently active work items, so this change shouldn't matter much.
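
As a rough illustration (the workqueue name is made up), with

	wq = alloc_workqueue("foo_wq", WQ_UNBOUND, 16);

the limit of 16 concurrently active work items now applies to each
NUMA node's pwq separately instead of to the workqueue as a whole,
since each node with intersecting CPUs gets its own pwq.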

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 108 ++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 86 insertions(+), 22 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bbbfc92..0c36327 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3658,13 +3658,13 @@ static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
pwq->flush_color = -1;
pwq->refcnt = 1;
INIT_LIST_HEAD(&pwq->delayed_works);
+ INIT_LIST_HEAD(&pwq->pwqs_node);
INIT_LIST_HEAD(&pwq->mayday_node);
INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
}

/* sync @pwq with the current state of its associated wq and link it */
-static void link_pwq(struct pool_workqueue *pwq,
- struct pool_workqueue **p_last_pwq)
+static void link_pwq(struct pool_workqueue *pwq)
{
struct workqueue_struct *wq = pwq->wq;

@@ -3675,8 +3675,6 @@ static void link_pwq(struct pool_workqueue *pwq,
* Set the matching work_color. This is synchronized with
* flush_mutex to avoid confusing flush_workqueue().
*/
- if (p_last_pwq)
- *p_last_pwq = first_pwq(wq);
pwq->work_color = wq->work_color;

/* sync max_active to the current setting */
@@ -3712,11 +3710,12 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
* @wq: the target workqueue
* @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
*
- * Apply @attrs to an unbound workqueue @wq. If @attrs doesn't match the
- * current attributes, a new pwq is created and made the first pwq which
- * will serve all new work items. Older pwqs are released as in-flight
- * work items finish. Note that a work item which repeatedly requeues
- * itself back-to-back will stay on its current pwq.
+ * Apply @attrs to an unbound workqueue @wq. Unless disabled, on NUMA
+ * machines, this function maps a separate pwq to each NUMA node with
+ * possibles CPUs in @attrs->cpumask so that work items are affine to the
+ * NUMA node it was issued on. Older pwqs are released as in-flight work
+ * items finish. Note that a work item which repeatedly requeues itself
+ * back-to-back will stay on its current pwq.
*
* Performs GFP_KERNEL allocations. Returns 0 on success and -errno on
* failure.
@@ -3724,7 +3723,8 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
int apply_workqueue_attrs(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs)
{
- struct pool_workqueue *pwq, *last_pwq;
+ struct pool_workqueue **pwq_tbl = NULL, *dfl_pwq = NULL;
+ struct workqueue_attrs *tmp_attrs = NULL;
int node;

/* only unbound workqueues can change attributes */
@@ -3735,29 +3735,93 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
return -EINVAL;

- pwq = alloc_unbound_pwq(wq, attrs);
- if (!pwq)
- return -ENOMEM;
+ pwq_tbl = kzalloc(wq_numa_tbl_len * sizeof(pwq_tbl[0]), GFP_KERNEL);
+ tmp_attrs = alloc_workqueue_attrs(GFP_KERNEL);
+ if (!pwq_tbl || !tmp_attrs)
+ goto enomem;
+
+ copy_workqueue_attrs(tmp_attrs, attrs);
+
+ /*
+ * We want NUMA affinity. For each node with intersecting possible
+ * CPUs with the requested cpumask, create a separate pwq covering
+ * the instersection. Nodes without intersection are covered by
+ * the default pwq covering the whole requested cpumask.
+ */
+ for_each_node(node) {
+ cpumask_t *cpumask = tmp_attrs->cpumask;
+
+ /*
+ * Just fall through if NUMA affinity isn't enabled. We'll
+ * end up using the default pwq which is what we want.
+ */
+ if (wq_numa_possible_cpumask) {
+ cpumask_and(cpumask, wq_numa_possible_cpumask[node],
+ attrs->cpumask);
+ if (cpumask_empty(cpumask))
+ cpumask_copy(cpumask, attrs->cpumask);
+ }
+
+ if (cpumask_equal(cpumask, attrs->cpumask)) {
+ if (!dfl_pwq) {
+ dfl_pwq = alloc_unbound_pwq(wq, tmp_attrs);
+ if (!dfl_pwq)
+ goto enomem;
+ } else {
+ dfl_pwq->refcnt++;
+ }
+ pwq_tbl[node] = dfl_pwq;
+ } else {
+ pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
+ if (!pwq_tbl[node])
+ goto enomem;
+ }
+ }

+ /* all pwqs have been created successfully, let's install'em */
mutex_lock(&wq->flush_mutex);
spin_lock_irq(&pwq_lock);

- link_pwq(pwq, &last_pwq);
+ /* @attrs is now current */
+ copy_workqueue_attrs(wq->unbound_attrs, attrs);

- copy_workqueue_attrs(wq->unbound_attrs, pwq->pool->attrs);
- for_each_node(node)
- rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+ for_each_node(node) {
+ struct pool_workqueue *pwq;
+
+ /* each new pwq should be linked once */
+ if (list_empty(&pwq_tbl[node]->pwqs_node))
+ link_pwq(pwq_tbl[node]);
+
+ /* save the previous pwq and install the new one */
+ pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
+ rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq_tbl[node]);
+ pwq_tbl[node] = pwq;
+ }

spin_unlock_irq(&pwq_lock);
mutex_unlock(&wq->flush_mutex);

- if (last_pwq) {
- spin_lock_irq(&last_pwq->pool->lock);
- put_pwq(last_pwq);
- spin_unlock_irq(&last_pwq->pool->lock);
+ /* put the old pwqs */
+ for_each_node(node) {
+ struct pool_workqueue *pwq = pwq_tbl[node];
+
+ if (pwq) {
+ spin_lock_irq(&pwq->pool->lock);
+ put_pwq(pwq);
+ spin_unlock_irq(&pwq->pool->lock);
+ }
}

return 0;
+
+enomem:
+ free_workqueue_attrs(tmp_attrs);
+ if (pwq_tbl) {
+ for_each_node(node)
+ kfree(pwq_tbl[node]);
+ kfree(pwq_tbl);
+ }
+ return -ENOMEM;
}

static int alloc_and_link_pwqs(struct workqueue_struct *wq)
@@ -3781,7 +3845,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
mutex_lock(&wq->flush_mutex);
spin_lock_irq(&pwq_lock);

- link_pwq(pwq, NULL);
+ link_pwq(pwq);

spin_unlock_irq(&pwq_lock);
mutex_unlock(&wq->flush_mutex);
--
1.8.1.4

2013-03-20 00:00:51

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 04/10] workqueue: add workqueue->unbound_attrs

Currently, when exposing attrs of an unbound workqueue via sysfs, the
workqueue_attrs of first_pwq() is used as that should equal the
current state of the workqueue.

The planned NUMA affinity support will make unbound workqueues use
multiple pool_workqueues for different NUMA nodes, and the above
assumption will no longer hold.
which records the current attrs in effect and use it for sysfs instead
of first_pwq()->attrs.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 36 ++++++++++++++++++++++++------------
1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2768ed2..57e6139 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -247,6 +247,8 @@ struct workqueue_struct {
int nr_drainers; /* WQ: drain in progress */
int saved_max_active; /* PW: saved pwq max_active */

+ struct workqueue_attrs *unbound_attrs; /* WQ: only for unbound wqs */
+
#ifdef CONFIG_SYSFS
struct wq_device *wq_dev; /* I: for sysfs interface */
#endif
@@ -3099,10 +3101,9 @@ static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
struct workqueue_struct *wq = dev_to_wq(dev);
int written;

- rcu_read_lock_sched();
- written = scnprintf(buf, PAGE_SIZE, "%d\n",
- first_pwq(wq)->pool->attrs->nice);
- rcu_read_unlock_sched();
+ mutex_lock(&wq_mutex);
+ written = scnprintf(buf, PAGE_SIZE, "%d\n", wq->unbound_attrs->nice);
+ mutex_unlock(&wq_mutex);

return written;
}
@@ -3116,9 +3117,9 @@ static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq)
if (!attrs)
return NULL;

- rcu_read_lock_sched();
- copy_workqueue_attrs(attrs, first_pwq(wq)->pool->attrs);
- rcu_read_unlock_sched();
+ mutex_lock(&wq_mutex);
+ copy_workqueue_attrs(attrs, wq->unbound_attrs);
+ mutex_unlock(&wq_mutex);
return attrs;
}

@@ -3149,10 +3150,9 @@ static ssize_t wq_cpumask_show(struct device *dev,
struct workqueue_struct *wq = dev_to_wq(dev);
int written;

- rcu_read_lock_sched();
- written = cpumask_scnprintf(buf, PAGE_SIZE,
- first_pwq(wq)->pool->attrs->cpumask);
- rcu_read_unlock_sched();
+ mutex_lock(&wq_mutex);
+ written = cpumask_scnprintf(buf, PAGE_SIZE, wq->unbound_attrs->cpumask);
+ mutex_unlock(&wq_mutex);

written += scnprintf(buf + written, PAGE_SIZE - written, "\n");
return written;
@@ -3585,8 +3585,10 @@ static void pwq_unbound_release_workfn(struct work_struct *work)
* If we're the last pwq going away, @wq is already dead and no one
* is gonna access it anymore. Free it.
*/
- if (list_empty(&wq->pwqs))
+ if (list_empty(&wq->pwqs)) {
+ free_workqueue_attrs(wq->unbound_attrs);
kfree(wq);
+ }
}

/**
@@ -3656,6 +3658,9 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
/* link in @pwq */
list_add_rcu(&pwq->pwqs_node, &wq->pwqs);

+ if (wq->flags & WQ_UNBOUND)
+ copy_workqueue_attrs(wq->unbound_attrs, pool->attrs);
+
spin_unlock_irq(&pwq_lock);
mutex_unlock(&wq->flush_mutex);
}
@@ -3764,6 +3769,12 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
if (!wq)
return NULL;

+ if (flags & WQ_UNBOUND) {
+ wq->unbound_attrs = alloc_workqueue_attrs(GFP_KERNEL);
+ if (!wq->unbound_attrs)
+ goto err_free_wq;
+ }
+
vsnprintf(wq->name, namelen, fmt, args1);
va_end(args);
va_end(args1);
@@ -3832,6 +3843,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
return wq;

err_free_wq:
+ free_workqueue_attrs(wq->unbound_attrs);
kfree(wq);
return NULL;
err_destroy:
--
1.8.1.4

2013-03-20 00:02:33

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 05/10] workqueue: make workqueue->name[] fixed len

Currently workqueue->name[] is of flexible length. We want to use the
flexible field for something more useful and there isn't much benefit
in allowing an arbitrary name length anyway. Make it fixed length,
capped at 24 bytes.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 57e6139..151ce49 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -101,6 +101,8 @@ enum {
*/
RESCUER_NICE_LEVEL = -20,
HIGHPRI_NICE_LEVEL = -20,
+
+ WQ_NAME_LEN = 24,
};

/*
@@ -255,7 +257,7 @@ struct workqueue_struct {
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
- char name[]; /* I: workqueue name */
+ char name[WQ_NAME_LEN]; /* I: workqueue name */
};

static struct kmem_cache *pwq_cache;
@@ -3755,17 +3757,12 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
struct lock_class_key *key,
const char *lock_name, ...)
{
- va_list args, args1;
+ va_list args;
struct workqueue_struct *wq;
struct pool_workqueue *pwq;
- size_t namelen;
-
- /* determine namelen, allocate wq and format name */
- va_start(args, lock_name);
- va_copy(args1, args);
- namelen = vsnprintf(NULL, 0, fmt, args) + 1;

- wq = kzalloc(sizeof(*wq) + namelen, GFP_KERNEL);
+ /* allocate wq and format name */
+ wq = kzalloc(sizeof(*wq), GFP_KERNEL);
if (!wq)
return NULL;

@@ -3775,9 +3772,9 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
goto err_free_wq;
}

- vsnprintf(wq->name, namelen, fmt, args1);
+ va_start(args, lock_name);
+ vsnprintf(wq->name, sizeof(wq->name), fmt, args);
va_end(args);
- va_end(args1);

max_active = max_active ?: WQ_DFL_ACTIVE;
max_active = wq_clamp_max_active(max_active, flags, wq->name);
--
1.8.1.4

2013-03-20 12:14:10

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues

(off-topic)

Hi, tj,

I think a0265a7f5161b6cb55e82b71edb236bbe0d9b3ae (tj/for-3.10) is the
wrong direction. If workqueue_freezing is used only in
freeze_workqueues_begin()/thaw_workqueues(), then either it can be
removed or there is a bug which needs to be fixed. But
a0265a7f5161b6cb55e82b71edb236bbe0d9b3ae neither removed it nor fixed
it, which is unacceptable to me. It actually is a bug, and I fixed it
in my patch 1. Any thoughts?

Thanks,
Lai

2013-03-20 14:08:35

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 01/10] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]

2013/3/20 Tejun Heo <[email protected]>:
> Unbound workqueues are going to be NUMA-affine. Add wq_numa_tbl_len
> and wq_numa_possible_cpumask[] in preparation. The former is the
> highest NUMA node ID + 1 and the latter is masks of possibles CPUs for
> each NUMA node.

It is better to move this code to topology.c or cpumask.c,
then it can be generally used.

Thanks.


2013-03-20 14:49:08

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 01/10] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]

On Wed, Mar 20, 2013 at 11:08:29PM +0900, JoonSoo Kim wrote:
> 2013/3/20 Tejun Heo <[email protected]>:
> > Unbound workqueues are going to be NUMA-affine. Add wq_numa_tbl_len
> > and wq_numa_possible_cpumask[] in preparation. The former is the
> > highest NUMA node ID + 1 and the latter is masks of possibles CPUs for
> > each NUMA node.
>
> It is better to move this code to topology.c or cpumask.c,
> then it can be generally used.

Yeah, it just isn't clear where it should go. Most NUMA
initialization happens during early boot, is arch-specific, and
different archs expect NUMA information to be valid at different
points. We could do it from one of the early initcalls by which all
NUMA information should be valid but, I don't know, having different
NUMA information coming up at different times seems error-prone. There
would be no apparent indication that certain parts are available while
others are not.
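
Just to make the idea concrete, such an early initcall could look
roughly like the sketch below - the function and variable names are
made up and it deliberately ignores the ordering problem described
above:

	static cpumask_var_t *node_possible_cpumask;	/* hypothetical */

	static int __init build_node_possible_cpumask(void)
	{
		int node, cpu;

		node_possible_cpumask = kcalloc(nr_node_ids,
						sizeof(node_possible_cpumask[0]),
						GFP_KERNEL);
		if (!node_possible_cpumask)
			return -ENOMEM;

		for_each_node(node)
			if (!zalloc_cpumask_var_node(&node_possible_cpumask[node],
						     GFP_KERNEL, node))
				return -ENOMEM;

		/* assumes cpu_to_node() is already valid for all possible CPUs */
		for_each_possible_cpu(cpu)
			cpumask_set_cpu(cpu, node_possible_cpumask[cpu_to_node(cpu)]);

		return 0;
	}
	early_initcall(build_node_possible_cpumask);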

We could solve this by unifying how NUMA information is represented
and initialized. e.g. if all NUMA archs used
CONFIG_USE_PERCPU_NUMA_NODE_ID, we could simply modify
set_cpu_numa_node() to build all data structures as NUMA nodes are
discovered. Unfortunately, this isn't true yet and it's gonna be a
bit of work to get it into a consistent state as it spans multiple
architectures (not too many though, NUMA fortunately is rare), so if
somebody can clean it up, I'll be happy to move these to topology.
Right now, I think it's best to just carry it in workqueue.c.

Thanks.

--
tejun

2013-03-20 15:03:55

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 09/10] workqueue: implement NUMA affinity for unbound workqueues

On Wed, Mar 20, 2013 at 8:00 AM, Tejun Heo <[email protected]> wrote:
> Currently, an unbound workqueue has single current, or first, pwq
> (pool_workqueue) to which all new work items are queued. This often
> isn't optimal on NUMA machines as workers may jump around across node
> boundaries and work items get assigned to workers without any regard
> to NUMA affinity.
>
> This patch implements NUMA affinity for unbound workqueues. Instead
> of mapping all entries of numa_pwq_tbl[] to the same pwq,
> apply_workqueue_attrs() now creates a separate pwq covering the
> intersecting CPUs for each NUMA node which has possible CPUs in
> @attrs->cpumask. Nodes which don't have intersecting possible CPUs
> are mapped to pwqs covering whole @attrs->cpumask.
>
> This ensures that all work items issued on a NUMA node is executed on
> the same node as long as the workqueue allows execution on the CPUs of
> the node.
>
> As this maps a workqueue to multiple pwqs and max_active is per-pwq,
> this change the behavior of max_active. The limit is now per NUMA
> node instead of global. While this is an actual change, max_active is
> already per-cpu for per-cpu workqueues and primarily used as safety
> mechanism rather than for active concurrency control. Concurrency is
> usually limited from workqueue users by the number of concurrently
> active work items and this change shouldn't matter much.
>
> Signed-off-by: Tejun Heo <[email protected]>
> ---
> kernel/workqueue.c | 108 ++++++++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 86 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index bbbfc92..0c36327 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -3658,13 +3658,13 @@ static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
> pwq->flush_color = -1;
> pwq->refcnt = 1;
> INIT_LIST_HEAD(&pwq->delayed_works);
> + INIT_LIST_HEAD(&pwq->pwqs_node);
> INIT_LIST_HEAD(&pwq->mayday_node);
> INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
> }
>
> /* sync @pwq with the current state of its associated wq and link it */
> -static void link_pwq(struct pool_workqueue *pwq,
> - struct pool_workqueue **p_last_pwq)
> +static void link_pwq(struct pool_workqueue *pwq)
> {
> struct workqueue_struct *wq = pwq->wq;
>
> @@ -3675,8 +3675,6 @@ static void link_pwq(struct pool_workqueue *pwq,
> * Set the matching work_color. This is synchronized with
> * flush_mutex to avoid confusing flush_workqueue().
> */
> - if (p_last_pwq)
> - *p_last_pwq = first_pwq(wq);
> pwq->work_color = wq->work_color;
>
> /* sync max_active to the current setting */
> @@ -3712,11 +3710,12 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
> * @wq: the target workqueue
> * @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
> *
> - * Apply @attrs to an unbound workqueue @wq. If @attrs doesn't match the
> - * current attributes, a new pwq is created and made the first pwq which
> - * will serve all new work items. Older pwqs are released as in-flight
> - * work items finish. Note that a work item which repeatedly requeues
> - * itself back-to-back will stay on its current pwq.
> + * Apply @attrs to an unbound workqueue @wq. Unless disabled, on NUMA
> + * machines, this function maps a separate pwq to each NUMA node with
> + * possibles CPUs in @attrs->cpumask so that work items are affine to the
> + * NUMA node it was issued on. Older pwqs are released as in-flight work
> + * items finish. Note that a work item which repeatedly requeues itself
> + * back-to-back will stay on its current pwq.
> *
> * Performs GFP_KERNEL allocations. Returns 0 on success and -errno on
> * failure.
> @@ -3724,7 +3723,8 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
> int apply_workqueue_attrs(struct workqueue_struct *wq,
> const struct workqueue_attrs *attrs)
> {
> - struct pool_workqueue *pwq, *last_pwq;
> + struct pool_workqueue **pwq_tbl = NULL, *dfl_pwq = NULL;
> + struct workqueue_attrs *tmp_attrs = NULL;
> int node;
>
> /* only unbound workqueues can change attributes */
> @@ -3735,29 +3735,93 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
> if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
> return -EINVAL;
>
> - pwq = alloc_unbound_pwq(wq, attrs);
> - if (!pwq)
> - return -ENOMEM;
> + pwq_tbl = kzalloc(wq_numa_tbl_len * sizeof(pwq_tbl[0]), GFP_KERNEL);
> + tmp_attrs = alloc_workqueue_attrs(GFP_KERNEL);
> + if (!pwq_tbl || !tmp_attrs)
> + goto enomem;
> +
> + copy_workqueue_attrs(tmp_attrs, attrs);
> +
> + /*
> + * We want NUMA affinity. For each node with intersecting possible
> + * CPUs with the requested cpumask, create a separate pwq covering
> + * the instersection. Nodes without intersection are covered by
> + * the default pwq covering the whole requested cpumask.
> + */
> + for_each_node(node) {
> + cpumask_t *cpumask = tmp_attrs->cpumask;
> +
> + /*
> + * Just fall through if NUMA affinity isn't enabled. We'll
> + * end up using the default pwq which is what we want.
> + */
> + if (wq_numa_possible_cpumask) {
> + cpumask_and(cpumask, wq_numa_possible_cpumask[node],
> + attrs->cpumask);
> + if (cpumask_empty(cpumask))
> + cpumask_copy(cpumask, attrs->cpumask);
> + }
> +
> + if (cpumask_equal(cpumask, attrs->cpumask)) {
> + if (!dfl_pwq) {
> + dfl_pwq = alloc_unbound_pwq(wq, tmp_attrs);
> + if (!dfl_pwq)
> + goto enomem;
> + } else {
> + dfl_pwq->refcnt++;
> + }
> + pwq_tbl[node] = dfl_pwq;
> + } else {
> + pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
> + if (!pwq_tbl[node])
> + goto enomem;
> + }
> + }
>
> + /* all pwqs have been created successfully, let's install'em */
> mutex_lock(&wq->flush_mutex);
> spin_lock_irq(&pwq_lock);
>
> - link_pwq(pwq, &last_pwq);
> + /* @attrs is now current */
> + copy_workqueue_attrs(wq->unbound_attrs, attrs);
>
> - copy_workqueue_attrs(wq->unbound_attrs, pwq->pool->attrs);
> - for_each_node(node)
> - rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
> + for_each_node(node) {
> + struct pool_workqueue *pwq;
> +
> + /* each new pwq should be linked once */
> + if (list_empty(&pwq_tbl[node]->pwqs_node))
> + link_pwq(pwq_tbl[node]);
> +
> + /* save the previous pwq and install the new one */
> + pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
> + rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq_tbl[node]);
> + pwq_tbl[node] = pwq;
> + }
>
> spin_unlock_irq(&pwq_lock);
> mutex_unlock(&wq->flush_mutex);
>
> - if (last_pwq) {
> - spin_lock_irq(&last_pwq->pool->lock);
> - put_pwq(last_pwq);
> - spin_unlock_irq(&last_pwq->pool->lock);
> + /* put the old pwqs */
> + for_each_node(node) {
> + struct pool_workqueue *pwq = pwq_tbl[node];
> +
> + if (pwq) {
> + spin_lock_irq(&pwq->pool->lock);
> + put_pwq(pwq);
> + spin_unlock_irq(&pwq->pool->lock);
> + }
> }
>
> return 0;
> +
> +enomem:
> + free_workqueue_attrs(tmp_attrs);
> + if (pwq_tbl) {
> + for_each_node(node)
> + kfree(pwq_tbl[node]);

It will free dfl_pwq multiple times.

> + kfree(pwq_tbl);
> + }
> + return -ENOMEM;
> }
>
> static int alloc_and_link_pwqs(struct workqueue_struct *wq)
> @@ -3781,7 +3845,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
> mutex_lock(&wq->flush_mutex);
> spin_lock_irq(&pwq_lock);
>
> - link_pwq(pwq, NULL);
> + link_pwq(pwq);
>
> spin_unlock_irq(&pwq_lock);
> mutex_unlock(&wq->flush_mutex);
> --
> 1.8.1.4
>

2013-03-20 15:06:03

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 09/10] workqueue: implement NUMA affinity for unbound workqueues

On Wed, Mar 20, 2013 at 11:03:53PM +0800, Lai Jiangshan wrote:
> > +enomem:
> > + free_workqueue_attrs(tmp_attrs);
> > + if (pwq_tbl) {
> > + for_each_node(node)
> > + kfree(pwq_tbl[node]);
>
> It will free dfl_pwq multiple times.

Oops, you're right. We need to do


	for_each_node(node)
		if (pwq_tbl[node] != dfl_pwq)
			kfree(pwq_tbl[node]);
	kfree(dfl_pwq);

Thanks!

--
tejun

2013-03-20 15:26:14

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 09/10] workqueue: implement NUMA affinity for unbound workqueues

On Wed, Mar 20, 2013 at 11:05 PM, Tejun Heo <[email protected]> wrote:
> On Wed, Mar 20, 2013 at 11:03:53PM +0800, Lai Jiangshan wrote:
>> > +enomem:
>> > + free_workqueue_attrs(tmp_attrs);
>> > + if (pwq_tbl) {
>> > + for_each_node(node)
>> > + kfree(pwq_tbl[node]);
>>
>> It will free dfl_pwq multiple times.
>
> Oops, you're right. We need to do
>
>
> 	for_each_node(node)
> 		if (pwq_tbl[node] != dfl_pwq)
> 			kfree(pwq_tbl[node]);
> 	kfree(dfl_pwq);

I missed something too:
we still need to call put_unbound_pool() before freeing the pwq.
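
i.e. the error path needs something roughly like this (just a sketch):

	for_each_node(node) {
		if (pwq_tbl[node] && pwq_tbl[node] != dfl_pwq) {
			put_unbound_pool(pwq_tbl[node]->pool);
			kfree(pwq_tbl[node]);
		}
	}
	if (dfl_pwq) {
		put_unbound_pool(dfl_pwq->pool);
		kfree(dfl_pwq);
	}
	kfree(pwq_tbl);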

>
> Thanks!
>
> --
> tejun

2013-03-20 15:32:23

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 09/10] workqueue: implement NUMA affinity for unbound workqueues

On Wed, Mar 20, 2013 at 8:26 AM, Lai Jiangshan <[email protected]> wrote:
>> 	for_each_node(node)
>> 		if (pwq_tbl[node] != dfl_pwq)
>> 			kfree(pwq_tbl[node]);
>> 	kfree(dfl_pwq);
>
> I missed something too:
> we still need to call put_unbound_pool() before freeing the pwq.

Yeap, we do. Wonder whether we can just use put_pwq() there. I'll see if I can.

--
tejun

2013-03-20 15:43:34

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 01/10] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]

On Wed, Mar 20, 2013 at 8:00 AM, Tejun Heo <[email protected]> wrote:
> Unbound workqueues are going to be NUMA-affine. Add wq_numa_tbl_len
> and wq_numa_possible_cpumask[] in preparation. The former is the
> highest NUMA node ID + 1 and the latter is masks of possible CPUs for
> each NUMA node.
>
> This patch only introduces these. Future patches will make use of
> them.
>
> Signed-off-by: Tejun Heo <[email protected]>
> ---
> kernel/workqueue.c | 35 ++++++++++++++++++++++++++++++++++-
> 1 file changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 775c2f4..9b096e3 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -44,6 +44,7 @@
> #include <linux/jhash.h>
> #include <linux/hashtable.h>
> #include <linux/rculist.h>
> +#include <linux/nodemask.h>
>
> #include "workqueue_internal.h"
>
> @@ -256,6 +257,11 @@ struct workqueue_struct {
>
> static struct kmem_cache *pwq_cache;
>
> +static int wq_numa_tbl_len; /* highest possible NUMA node id + 1 */
> +static cpumask_var_t *wq_numa_possible_cpumask;
> + /* possible CPUs of each node, may
> + be NULL if init failed */
> +
> static DEFINE_MUTEX(wq_mutex); /* protects workqueues and pools */
> static DEFINE_SPINLOCK(pwq_lock); /* protects pool_workqueues */
> static DEFINE_SPINLOCK(wq_mayday_lock); /* protects wq->maydays list */
> @@ -4416,7 +4422,7 @@ out_unlock:
> static int __init init_workqueues(void)
> {
> int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
> - int i, cpu;
> + int i, node, cpu;
>
> /* make sure we have enough bits for OFFQ pool ID */
> BUILD_BUG_ON((1LU << (BITS_PER_LONG - WORK_OFFQ_POOL_SHIFT)) <
> @@ -4429,6 +4435,33 @@ static int __init init_workqueues(void)
> cpu_notifier(workqueue_cpu_up_callback, CPU_PRI_WORKQUEUE_UP);
> hotcpu_notifier(workqueue_cpu_down_callback, CPU_PRI_WORKQUEUE_DOWN);
>
> + /* determine NUMA pwq table len - highest node id + 1 */
> + for_each_node(node)
> + wq_numa_tbl_len = max(wq_numa_tbl_len, node + 1);
> +
> + /*
> + * We want masks of possible CPUs of each node which isn't readily
> + * available. Build one from cpu_to_node() which should have been
> + * fully initialized by now.
> + */
> + wq_numa_possible_cpumask = kzalloc(wq_numa_tbl_len *
> + sizeof(wq_numa_possible_cpumask[0]),
> + GFP_KERNEL);
> + BUG_ON(!wq_numa_possible_cpumask);
> +
> + for_each_node(node)
> + BUG_ON(!alloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
> + GFP_KERNEL, node));
> + for_each_possible_cpu(cpu) {
> + node = cpu_to_node(cpu);
> + if (WARN_ON(node == NUMA_NO_NODE)) {
> + pr_err("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);

Since you used WARN_ON(), not BUG_ON(), I think you need to free the
allocated memory here.

> + wq_numa_possible_cpumask = NULL;
> + break;
> + }
> + cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
> + }
> +
> /* initialize CPU pools */
> for_each_possible_cpu(cpu) {
> struct worker_pool *pool;
> --
> 1.8.1.4
>

2013-03-20 15:48:23

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 01/10] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]

On Wed, Mar 20, 2013 at 11:43:30PM +0800, Lai Jiangshan wrote:
> > + for_each_node(node)
> > + BUG_ON(!alloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
> > + GFP_KERNEL, node));
> > + for_each_possible_cpu(cpu) {
> > + node = cpu_to_node(cpu);
> > + if (WARN_ON(node == NUMA_NO_NODE)) {
> > + pr_err("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
>
> Since you used WARN_ON(), not BUG_ON(), I think you need to free the
> allocated memory here.

I don't know. Is it necessary? It shouldn't happen sans bugs in arch
code and we're vomiting a warning message with a full stack trace.
The system will function but something is seriously screwed. I don't
think it matters whether we free the puny number of bytes there or
not, and I don't wanna nest deeper there. :P
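
For the record, the cleanup being asked for would amount to something
like this in that branch (a sketch, not something I'm planning to add):

	for_each_node(node)
		free_cpumask_var(wq_numa_possible_cpumask[node]);
	kfree(wq_numa_possible_cpumask);
	wq_numa_possible_cpumask = NULL;
	break;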

--
tejun

2013-03-20 15:52:07

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 08/10] workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()

On Wed, Mar 20, 2013 at 8:00 AM, Tejun Heo <[email protected]> wrote:
> Break init_and_link_pwq() into init_pwq() and link_pwq() and move
> unbound-workqueue specific handling into apply_workqueue_attrs().
> Also, factor out unbound pool and pool_workqueue allocation into
> alloc_unbound_pwq().
>
> This reorganization is to prepare for NUMA affinity and doesn't
> introduce any functional changes.
>
> Signed-off-by: Tejun Heo <[email protected]>
> ---
> kernel/workqueue.c | 75 +++++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 52 insertions(+), 23 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 3f820a5..bbbfc92 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -3647,13 +3647,10 @@ static void pwq_adjust_max_active(struct pool_workqueue *pwq)
> spin_unlock(&pwq->pool->lock);
> }
>
> -static void init_and_link_pwq(struct pool_workqueue *pwq,
> - struct workqueue_struct *wq,
> - struct worker_pool *pool,
> - struct pool_workqueue **p_last_pwq)
> +/* initialize newly zalloced @pwq which is associated with @wq and @pool */
> +static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
> + struct worker_pool *pool)
> {
> - int node;
> -
> BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
>
> pwq->pool = pool;
> @@ -3663,9 +3660,16 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
> INIT_LIST_HEAD(&pwq->delayed_works);
> INIT_LIST_HEAD(&pwq->mayday_node);
> INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
> +}
>
> - mutex_lock(&wq->flush_mutex);
> - spin_lock_irq(&pwq_lock);
> +/* sync @pwq with the current state of its associated wq and link it */
> +static void link_pwq(struct pool_workqueue *pwq,
> + struct pool_workqueue **p_last_pwq)
> +{
> + struct workqueue_struct *wq = pwq->wq;
> +
> + lockdep_assert_held(&wq->flush_mutex);
> + lockdep_assert_held(&pwq_lock);
>
> /*
> * Set the matching work_color. This is synchronized with
> @@ -3680,15 +3684,27 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
>
> /* link in @pwq */
> list_add_rcu(&pwq->pwqs_node, &wq->pwqs);
> +}
>
> - if (wq->flags & WQ_UNBOUND) {
> - copy_workqueue_attrs(wq->unbound_attrs, pool->attrs);
> - for_each_node(node)
> - rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
> +/* obtain a pool matching @attr and create a pwq associating the pool and @wq */
> +static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
> + const struct workqueue_attrs *attrs)
> +{
> + struct worker_pool *pool;
> + struct pool_workqueue *pwq;
> +
> + pool = get_unbound_pool(attrs);
> + if (!pool)
> + return NULL;
> +
> + pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);

This allocation is not NUMA-aware; you could use pool->node here.

> + if (!pwq) {
> + put_unbound_pool(pool);
> + return NULL;
> }
>
> - spin_unlock_irq(&pwq_lock);
> - mutex_unlock(&wq->flush_mutex);
> + init_pwq(pwq, wq, pool);
> + return pwq;
> }
>
> /**
> @@ -3709,7 +3725,7 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
> const struct workqueue_attrs *attrs)
> {
> struct pool_workqueue *pwq, *last_pwq;
> - struct worker_pool *pool;
> + int node;
>
> /* only unbound workqueues can change attributes */
> if (WARN_ON(!(wq->flags & WQ_UNBOUND)))
> @@ -3719,17 +3735,22 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
> if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
> return -EINVAL;
>
> - pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
> + pwq = alloc_unbound_pwq(wq, attrs);
> if (!pwq)
> return -ENOMEM;
>
> - pool = get_unbound_pool(attrs);
> - if (!pool) {
> - kmem_cache_free(pwq_cache, pwq);
> - return -ENOMEM;
> - }
> + mutex_lock(&wq->flush_mutex);
> + spin_lock_irq(&pwq_lock);
> +
> + link_pwq(pwq, &last_pwq);
> +
> + copy_workqueue_attrs(wq->unbound_attrs, pwq->pool->attrs);
> + for_each_node(node)
> + rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
> +
> + spin_unlock_irq(&pwq_lock);
> + mutex_unlock(&wq->flush_mutex);
>
> - init_and_link_pwq(pwq, wq, pool, &last_pwq);
> if (last_pwq) {
> spin_lock_irq(&last_pwq->pool->lock);
> put_pwq(last_pwq);
> @@ -3755,7 +3776,15 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
> struct worker_pool *cpu_pools =
> per_cpu(cpu_worker_pools, cpu);
>
> - init_and_link_pwq(pwq, wq, &cpu_pools[highpri], NULL);
> + init_pwq(pwq, wq, &cpu_pools[highpri]);
> +
> + mutex_lock(&wq->flush_mutex);
> + spin_lock_irq(&pwq_lock);
> +
> + link_pwq(pwq, NULL);
> +
> + spin_unlock_irq(&pwq_lock);
> + mutex_unlock(&wq->flush_mutex);
> }
> return 0;
> } else {
> --
> 1.8.1.4
>

2013-03-20 16:04:23

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 08/10] workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()

On Wed, Mar 20, 2013 at 11:52:03PM +0800, Lai Jiangshan wrote:
> > +static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
> > + const struct workqueue_attrs *attrs)
> > +{
> > + struct worker_pool *pool;
> > + struct pool_workqueue *pwq;
> > +
> > + pool = get_unbound_pool(attrs);
> > + if (!pool)
> > + return NULL;
> > +
> > + pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
>
> This allocation is not NUMA-aware; you could use pool->node here.

Nice catch. Will do.

Thanks.

--
tejun

2013-03-20 16:43:11

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 01/10] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]

On Wed, Mar 20, 2013 at 11:48 PM, Tejun Heo <[email protected]> wrote:
> On Wed, Mar 20, 2013 at 11:43:30PM +0800, Lai Jiangshan wrote:
>> > + for_each_node(node)
>> > + BUG_ON(!alloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
>> > + GFP_KERNEL, node));
>> > + for_each_possible_cpu(cpu) {
>> > + node = cpu_to_node(cpu);
>> > + if (WARN_ON(node == NUMA_NO_NODE)) {
>> > + pr_err("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
>>
>> Since you used WARN_ON(), not BUG_ON(), I think you need to free the
>> allocated memory here.
>
> I don't know. Is it necessary? It shouldn't happen sans bugs in arch
> code and we're vomiting a warning message with a full stack trace.
> The system will function but something is seriously screwed. I don't
> think it matters whether we free the puny number of bytes there or
> not, and I don't wanna nest deeper there. :P


I agree with you, but many people run semantic analysis tools over
the kernel to hunt for issues like this; they may bother you about it
again.

>
> --
> tejun

2013-03-20 17:08:10

by Tejun Heo

[permalink] [raw]
Subject: [PATCH v2 09/10] workqueue: implement NUMA affinity for unbound workqueues

Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.

This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has possible CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering the whole @attrs->cpumask.

This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.

As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control. Concurrency is
usually limited by workqueue users through the number of concurrently
active work items, so this change shouldn't matter much.

v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 120 +++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 98 insertions(+), 22 deletions(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3658,13 +3658,13 @@ static void init_pwq(struct pool_workque
pwq->flush_color = -1;
pwq->refcnt = 1;
INIT_LIST_HEAD(&pwq->delayed_works);
+ INIT_LIST_HEAD(&pwq->pwqs_node);
INIT_LIST_HEAD(&pwq->mayday_node);
INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
}

/* sync @pwq with the current state of its associated wq and link it */
-static void link_pwq(struct pool_workqueue *pwq,
- struct pool_workqueue **p_last_pwq)
+static void link_pwq(struct pool_workqueue *pwq)
{
struct workqueue_struct *wq = pwq->wq;

@@ -3675,8 +3675,6 @@ static void link_pwq(struct pool_workque
* Set the matching work_color. This is synchronized with
* flush_mutex to avoid confusing flush_workqueue().
*/
- if (p_last_pwq)
- *p_last_pwq = first_pwq(wq);
pwq->work_color = wq->work_color;

/* sync max_active to the current setting */
@@ -3707,16 +3705,26 @@ static struct pool_workqueue *alloc_unbo
return pwq;
}

+/* undo alloc_unbound_pwq(), used only in the error path */
+static void free_unbound_pwq(struct pool_workqueue *pwq)
+{
+ if (pwq) {
+ put_unbound_pool(pwq->pool);
+ kfree(pwq);
+ }
+}
+
/**
* apply_workqueue_attrs - apply new workqueue_attrs to an unbound workqueue
* @wq: the target workqueue
* @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
*
- * Apply @attrs to an unbound workqueue @wq. If @attrs doesn't match the
- * current attributes, a new pwq is created and made the first pwq which
- * will serve all new work items. Older pwqs are released as in-flight
- * work items finish. Note that a work item which repeatedly requeues
- * itself back-to-back will stay on its current pwq.
+ * Apply @attrs to an unbound workqueue @wq. Unless disabled, on NUMA
+ * machines, this function maps a separate pwq to each NUMA node with
+ * possibles CPUs in @attrs->cpumask so that work items are affine to the
+ * NUMA node it was issued on. Older pwqs are released as in-flight work
+ * items finish. Note that a work item which repeatedly requeues itself
+ * back-to-back will stay on its current pwq.
*
* Performs GFP_KERNEL allocations. Returns 0 on success and -errno on
* failure.
@@ -3724,7 +3732,8 @@ static struct pool_workqueue *alloc_unbo
int apply_workqueue_attrs(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs)
{
- struct pool_workqueue *pwq, *last_pwq;
+ struct pool_workqueue **pwq_tbl = NULL, *dfl_pwq = NULL;
+ struct workqueue_attrs *tmp_attrs = NULL;
int node;

/* only unbound workqueues can change attributes */
@@ -3735,29 +3744,96 @@ int apply_workqueue_attrs(struct workque
if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
return -EINVAL;

- pwq = alloc_unbound_pwq(wq, attrs);
- if (!pwq)
- return -ENOMEM;
+ pwq_tbl = kzalloc(wq_numa_tbl_len * sizeof(pwq_tbl[0]), GFP_KERNEL);
+ tmp_attrs = alloc_workqueue_attrs(GFP_KERNEL);
+ if (!pwq_tbl || !tmp_attrs)
+ goto enomem;
+
+ copy_workqueue_attrs(tmp_attrs, attrs);
+
+ /*
+ * We want NUMA affinity. For each node with intersecting possible
+ * CPUs with the requested cpumask, create a separate pwq covering
+ * the instersection. Nodes without intersection are covered by
+ * the default pwq covering the whole requested cpumask.
+ */
+ for_each_node(node) {
+ cpumask_t *cpumask = tmp_attrs->cpumask;
+
+ /*
+ * Just fall through if NUMA affinity isn't enabled. We'll
+ * end up using the default pwq which is what we want.
+ */
+ if (wq_numa_possible_cpumask) {
+ cpumask_and(cpumask, wq_numa_possible_cpumask[node],
+ attrs->cpumask);
+ if (cpumask_empty(cpumask))
+ cpumask_copy(cpumask, attrs->cpumask);
+ }
+
+ if (cpumask_equal(cpumask, attrs->cpumask)) {
+ if (!dfl_pwq) {
+ dfl_pwq = alloc_unbound_pwq(wq, tmp_attrs);
+ if (!dfl_pwq)
+ goto enomem;
+ } else {
+ dfl_pwq->refcnt++;
+ }
+ pwq_tbl[node] = dfl_pwq;
+ } else {
+ pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
+ if (!pwq_tbl[node])
+ goto enomem;
+ }
+ }

+ /* all pwqs have been created successfully, let's install'em */
mutex_lock(&wq->flush_mutex);
spin_lock_irq(&pwq_lock);

- link_pwq(pwq, &last_pwq);
+ /* @attrs is now current */
+ copy_workqueue_attrs(wq->unbound_attrs, attrs);

- copy_workqueue_attrs(wq->unbound_attrs, pwq->pool->attrs);
- for_each_node(node)
- rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+ for_each_node(node) {
+ struct pool_workqueue *pwq;
+
+ /* each new pwq should be linked once */
+ if (list_empty(&pwq_tbl[node]->pwqs_node))
+ link_pwq(pwq_tbl[node]);
+
+ /* save the previous pwq and install the new one */
+ pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
+ rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq_tbl[node]);
+ pwq_tbl[node] = pwq;
+ }

spin_unlock_irq(&pwq_lock);
mutex_unlock(&wq->flush_mutex);

- if (last_pwq) {
- spin_lock_irq(&last_pwq->pool->lock);
- put_pwq(last_pwq);
- spin_unlock_irq(&last_pwq->pool->lock);
+ /* put the old pwqs */
+ for_each_node(node) {
+ struct pool_workqueue *pwq = pwq_tbl[node];
+
+ if (pwq) {
+ spin_lock_irq(&pwq->pool->lock);
+ put_pwq(pwq);
+ spin_unlock_irq(&pwq->pool->lock);
+ }
}

return 0;
+
+enomem:
+ free_workqueue_attrs(tmp_attrs);
+ if (pwq_tbl) {
+ for_each_node(node) {
+ if (pwq_tbl[node] != dfl_pwq)
+ free_unbound_pwq(pwq_tbl[node]);
+ free_unbound_pwq(dfl_pwq);
+ }
+ kfree(pwq_tbl);
+ }
+ return -ENOMEM;
}

static int alloc_and_link_pwqs(struct workqueue_struct *wq)
@@ -3781,7 +3857,7 @@ static int alloc_and_link_pwqs(struct wo
mutex_lock(&wq->flush_mutex);
spin_lock_irq(&pwq_lock);

- link_pwq(pwq, NULL);
+ link_pwq(pwq);

spin_unlock_irq(&pwq_lock);
mutex_unlock(&wq->flush_mutex);

2013-03-20 17:08:50

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 11/10] workqueue: use NUMA-aware allocation for pool_workqueues workqueues

Use kmem_cache_alloc_node() with @pool->node instead of
kmem_cache_zalloc() when allocating a pool_workqueue so that it's
allocated on the same node as the associated worker_pool. As there's
no kmem_cache_zalloc_node(), move the zeroing to init_pwq().

This was suggested by Lai Jiangshan.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3683,12 +3683,14 @@ static void pwq_adjust_max_active(struct
spin_unlock(&pwq->pool->lock);
}

-/* initialize newly zalloced @pwq which is associated with @wq and @pool */
+/* initialize newly alloced @pwq which is associated with @wq and @pool */
static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
struct worker_pool *pool)
{
BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);

+ memset(pwq, 0, sizeof(*pwq));
+
pwq->pool = pool;
pwq->wq = wq;
pwq->flush_color = -1;
@@ -3731,7 +3733,7 @@ static struct pool_workqueue *alloc_unbo
if (!pool)
return NULL;

- pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
+ pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, pool->node);
if (!pwq) {
put_unbound_pool(pool);
return NULL;

2013-03-20 18:55:08

by Tejun Heo

[permalink] [raw]
Subject: [PATCH v2 UPDATED 09/10] workqueue: implement NUMA affinity for unbound workqueues

Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.

This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has possible CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering the whole @attrs->cpumask.

This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.

As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control. Concurrency is
usually limited by workqueue users through the number of concurrently
active work items, so this change shouldn't matter much.

v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Lai Jiangshan <[email protected]>
---
Please forget about the previous posting. It was freeing dfl_pwq
multiple times. This one, hopefully, is correct.

Thanks.

kernel/workqueue.c | 119 +++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 97 insertions(+), 22 deletions(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3666,13 +3666,13 @@ static void init_pwq(struct pool_workque
pwq->flush_color = -1;
pwq->refcnt = 1;
INIT_LIST_HEAD(&pwq->delayed_works);
+ INIT_LIST_HEAD(&pwq->pwqs_node);
INIT_LIST_HEAD(&pwq->mayday_node);
INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
}

/* sync @pwq with the current state of its associated wq and link it */
-static void link_pwq(struct pool_workqueue *pwq,
- struct pool_workqueue **p_last_pwq)
+static void link_pwq(struct pool_workqueue *pwq)
{
struct workqueue_struct *wq = pwq->wq;

@@ -3683,8 +3683,6 @@ static void link_pwq(struct pool_workque
* Set the matching work_color. This is synchronized with
* flush_mutex to avoid confusing flush_workqueue().
*/
- if (p_last_pwq)
- *p_last_pwq = first_pwq(wq);
pwq->work_color = wq->work_color;

/* sync max_active to the current setting */
@@ -3715,16 +3713,26 @@ static struct pool_workqueue *alloc_unbo
return pwq;
}

+/* undo alloc_unbound_pwq(), used only in the error path */
+static void free_unbound_pwq(struct pool_workqueue *pwq)
+{
+ if (pwq) {
+ put_unbound_pool(pwq->pool);
+ kfree(pwq);
+ }
+}
+
/**
* apply_workqueue_attrs - apply new workqueue_attrs to an unbound workqueue
* @wq: the target workqueue
* @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
*
- * Apply @attrs to an unbound workqueue @wq. If @attrs doesn't match the
- * current attributes, a new pwq is created and made the first pwq which
- * will serve all new work items. Older pwqs are released as in-flight
- * work items finish. Note that a work item which repeatedly requeues
- * itself back-to-back will stay on its current pwq.
+ * Apply @attrs to an unbound workqueue @wq. Unless disabled, on NUMA
+ * machines, this function maps a separate pwq to each NUMA node with
+ * possibles CPUs in @attrs->cpumask so that work items are affine to the
+ * NUMA node it was issued on. Older pwqs are released as in-flight work
+ * items finish. Note that a work item which repeatedly requeues itself
+ * back-to-back will stay on its current pwq.
*
* Performs GFP_KERNEL allocations. Returns 0 on success and -errno on
* failure.
@@ -3732,7 +3740,8 @@ static struct pool_workqueue *alloc_unbo
int apply_workqueue_attrs(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs)
{
- struct pool_workqueue *pwq, *last_pwq;
+ struct pool_workqueue **pwq_tbl = NULL, *dfl_pwq = NULL;
+ struct workqueue_attrs *tmp_attrs = NULL;
int node;

/* only unbound workqueues can change attributes */
@@ -3743,29 +3752,95 @@ int apply_workqueue_attrs(struct workque
if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
return -EINVAL;

- pwq = alloc_unbound_pwq(wq, attrs);
- if (!pwq)
- return -ENOMEM;
+ pwq_tbl = kzalloc(wq_numa_tbl_len * sizeof(pwq_tbl[0]), GFP_KERNEL);
+ tmp_attrs = alloc_workqueue_attrs(GFP_KERNEL);
+ if (!pwq_tbl || !tmp_attrs)
+ goto enomem;
+
+ copy_workqueue_attrs(tmp_attrs, attrs);
+
+ /*
+ * We want NUMA affinity. For each node with intersecting possible
+ * CPUs with the requested cpumask, create a separate pwq covering
+ * the instersection. Nodes without intersection are covered by
+ * the default pwq covering the whole requested cpumask.
+ */
+ for_each_node(node) {
+ cpumask_t *cpumask = tmp_attrs->cpumask;
+
+ /*
+ * Just fall through if NUMA affinity isn't enabled. We'll
+ * end up using the default pwq which is what we want.
+ */
+ if (wq_numa_possible_cpumask) {
+ cpumask_and(cpumask, wq_numa_possible_cpumask[node],
+ attrs->cpumask);
+ if (cpumask_empty(cpumask))
+ cpumask_copy(cpumask, attrs->cpumask);
+ }
+
+ if (cpumask_equal(cpumask, attrs->cpumask)) {
+ if (!dfl_pwq) {
+ dfl_pwq = alloc_unbound_pwq(wq, tmp_attrs);
+ if (!dfl_pwq)
+ goto enomem;
+ } else {
+ dfl_pwq->refcnt++;
+ }
+ pwq_tbl[node] = dfl_pwq;
+ } else {
+ pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
+ if (!pwq_tbl[node])
+ goto enomem;
+ }
+ }

+ /* all pwqs have been created successfully, let's install'em */
mutex_lock(&wq->flush_mutex);
spin_lock_irq(&pwq_lock);

- link_pwq(pwq, &last_pwq);
+ /* @attrs is now current */
+ copy_workqueue_attrs(wq->unbound_attrs, attrs);

- copy_workqueue_attrs(wq->unbound_attrs, pwq->pool->attrs);
- for_each_node(node)
- rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+ for_each_node(node) {
+ struct pool_workqueue *pwq;
+
+ /* each new pwq should be linked once */
+ if (list_empty(&pwq_tbl[node]->pwqs_node))
+ link_pwq(pwq_tbl[node]);
+
+ /* save the previous pwq and install the new one */
+ pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
+ rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq_tbl[node]);
+ pwq_tbl[node] = pwq;
+ }

spin_unlock_irq(&pwq_lock);
mutex_unlock(&wq->flush_mutex);

- if (last_pwq) {
- spin_lock_irq(&last_pwq->pool->lock);
- put_pwq(last_pwq);
- spin_unlock_irq(&last_pwq->pool->lock);
+ /* put the old pwqs */
+ for_each_node(node) {
+ struct pool_workqueue *pwq = pwq_tbl[node];
+
+ if (pwq) {
+ spin_lock_irq(&pwq->pool->lock);
+ put_pwq(pwq);
+ spin_unlock_irq(&pwq->pool->lock);
+ }
}

return 0;
+
+enomem:
+ free_workqueue_attrs(tmp_attrs);
+ if (pwq_tbl) {
+ for_each_node(node)
+ if (pwq_tbl[node] != dfl_pwq)
+ free_unbound_pwq(pwq_tbl[node]);
+ free_unbound_pwq(dfl_pwq);
+ kfree(pwq_tbl);
+ }
+ return -ENOMEM;
}

static int alloc_and_link_pwqs(struct workqueue_struct *wq)
@@ -3789,7 +3864,7 @@ static int alloc_and_link_pwqs(struct wo
mutex_lock(&wq->flush_mutex);
spin_lock_irq(&pwq_lock);

- link_pwq(pwq, NULL);
+ link_pwq(pwq);

spin_unlock_irq(&pwq_lock);
mutex_unlock(&wq->flush_mutex);

2013-03-20 18:57:19

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues

On Tue, Mar 19, 2013 at 05:00:19PM -0700, Tejun Heo wrote:
> and also available in the following git branch.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-numa

Branch rebased on top of the current wq/for-3.10 with updated patches.
The new commit ID is 9555fbc12d786a9eae7cf7701a6718a0c029173e.

Thanks.

--
tejun

2013-03-24 16:04:21

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues

Hi, TJ

After thinking about it for a long time (again), I believe this
patchset has the following problem: a work item may run on the wrong
CPU even when there is an online CPU in its wq's cpumask.

Example:
node0 (cpu0, cpu1), node1 (cpu2, cpu3)
wq's cpumask: 1,3
this wq's pwq for node1 has cpumask: 3
currently online CPUs: 0-2
So the cpumask of the worker tasks of node1's pwq is actually cpu_all_mask,
and work queued from cpu2 can be executed on cpu0 or cpu2,
while we expect it to be executed on cpu1 only.

It can be fixed by swapping pwqs (node's pwq <-> default pwq) on CPU
hotplug. But could you reschedule this patchset to wq/for-3.11?
The whole patchset is more complicated than my brain.

If you agree, I will rebase my patches again.

Thanks,
Lai

On Thu, Mar 21, 2013 at 2:57 AM, Tejun Heo <[email protected]> wrote:
> On Tue, Mar 19, 2013 at 05:00:19PM -0700, Tejun Heo wrote:
>> and also available in the following git branch.
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-numa
>
> Branch rebased on top of the current wq/for-3.10 with updated patches.
> The new commit ID is 9555fbc12d786a9eae7cf7701a6718a0c029173e.
>
> Thanks.
>
> --
> tejun

2013-03-24 18:55:20

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues

Hey, Lai.

On Mon, Mar 25, 2013 at 12:04:19AM +0800, Lai Jiangshan wrote:
> Example:
> node0 (cpu0, cpu1), node1 (cpu2, cpu3)
> wq's cpumask: 1,3
> this wq's pwq for node1 has cpumask: 3
> currently online CPUs: 0-2
> So the cpumask of the worker tasks of node1's pwq is actually cpu_all_mask,
> and work queued from cpu2 can be executed on cpu0 or cpu2,
> while we expect it to be executed on cpu1 only.

Ah, right. I was hoping to avoid doing pwq swaps on CPU hot[un]plugs
by doing everything based on the possible mask. Looks like we'll need
some massaging after all.
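
Something along the following lines hanging off the CPU hotplug
callbacks would probably be needed (a very rough sketch - the helper
name and the wq->dfl_pwq / wq->node_pwq[] split are hypothetical and
locking is elided):

	/* sketch: repoint @node at its own pwq or the default pwq
	 * depending on whether the node still has online CPUs usable by
	 * @wq; all names here are hypothetical */
	static void wq_update_numa_mapping(struct workqueue_struct *wq,
					   int node, bool node_online)
	{
		struct pool_workqueue *pwq;

		pwq = node_online ? wq->node_pwq[node] : wq->dfl_pwq;
		rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
	}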

> It can be fixed by swapping pwqs(node's pwq <-> default pwq)
> when cpuhotplug. But could you reschedule this patchset to wq/for-3.11?

We're still only at -rc4, meaning we have plenty of time to resolve
whatever issues come up, so I think I'm still gonna target 3.10.

> the whole patchset is more complicated than my brain.

It isn't that complex, is it? I mean, the difficult part - using
multiple pwqs on an unbound wq - already happened, and even that wasn't
too complex as it for the most part synchronized the behaviors between
per-cpu and unbound workqueues. All that NUMA support is doing is
mapping different pwqs to different issuing CPUs.
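
IOW, conceptually the issuing side just ends up doing something like
the following (a simplified sketch, not the exact code):

	/* sketch: pick the pwq based on where the work item is issued */
	if (wq->flags & WQ_UNBOUND)
		pwq = rcu_dereference(wq->numa_pwq_tbl[cpu_to_node(cpu)]);
	else
		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);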

Thanks.

--
tejun

2013-03-25 19:15:08

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues

On Sun, Mar 24, 2013 at 11:55:06AM -0700, Tejun Heo wrote:
> > the whole patchset is more complicated than my brain.
>
> It isn't that complex, is it? I mean, the difficult part - using
> multiple pwqs on an unbound wq - already happened, and even that wasn't
> too complex as it for the most part synchronized the behaviors between
> per-cpu and unbound workqueues. All that NUMA support is doing is
> mapping different pwqs to different issuing CPUs.

Oh, BTW, please feel free to rebase your patchset on top of the
current wq/for-3.10. I'll take care of the conflicts with the numa
one, if there are any.

Thanks.

--
tejun

2013-03-25 20:48:17

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues

On Mon, Mar 25, 2013 at 12:15:00PM -0700, Tejun Heo wrote:
> On Sun, Mar 24, 2013 at 11:55:06AM -0700, Tejun Heo wrote:
> > > the whole patchset is more complicated than my brain.
> >
> > It isn't that complex, is it? I mean, the difficult part - using
> > multiple pwqs on an unbound wq - already happened, and even that wasn't
> > too complex as it for the most part synchronized the behaviors between
> > per-cpu and unbound workqueues. All that NUMA support is doing is
> > mapping different pwqs to different issuing CPUs.
>
> Oh, BTW, please feel free to rebase your patchset on top of the
> current wq/for-3.10. I'll take care of the conflicts with the numa
> one, if there are any.

Sorry, never mind. I'm cherry-picking the wq->mutex patches. Will
apply the rebased versions in a couple of hours. Let's base everything
else on top of those.

Thanks.

--
tejun