2013-03-28 06:43:26

by Tejun Heo

Subject: [PATCHSET v2 wq/for-3.10] workqueue: NUMA affinity for unbound workqueues

Hello,

Changes from the last take[L] are

* Lai pointed out that the previous implementation was broken in that
if a workqueue spans multiple nodes and some of those nodes don't
have any desired online CPUs, work items queued on those nodes would
be spread across all CPUs, violating the configured cpumask.

The patchset is updated such that apply_workqueue_attrs() now only
assigns NUMA-affine pwqs to nodes with desired online CPUs and the
default pwq to all other nodes. To track CPU online state changes,
wq_update_unbound_numa_attrs() is added. The function is called for
each workqueue during CPU hot[un]plug and updates the pwq associations
accordingly.

* Rolled in updated patches from the previous posting.

* More helper routines are factored out such that
apply_workqueue_attrs() is easier to follow and can share code paths
with wq_update_unbound_numa_attrs().

* Various cleanups and fixes.

Patchset description from the original posting follows.

There are two types of workqueues - per-cpu and unbound. The former
is bound to each CPU and the latter isn't bound to any CPU by default.
While the recently added attrs support allows unbound workqueues to be
confined to a subset of CPUs, it is still quite cumbersome for
applications where explicit CPU affinity is too constricting but NUMA
locality still matters.

This patchset tries to solve that issue by automatically making
unbound workqueues affine to NUMA nodes by default. A work item
queued to an unbound workqueue is executed on one of the CPUs allowed
by the workqueue in the same NUMA node. If the node has no allowed
CPU, the work item may be executed on any CPU allowed by the
workqueue. No changes are required on the user side; every workqueue
interface functions the same as before.
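
To illustrate the "no user-side changes" point, here's a minimal
sketch of how an unbound workqueue is typically allocated and used
(the wq/work names and the handler are made up for illustration);
with this patchset the queued work item simply tends to run on a CPU
of the issuing node without any change to the calling code.

    /* hypothetical example - nothing here changes with this patchset */
    #include <linux/workqueue.h>

    static struct workqueue_struct *my_wq;

    static void my_work_fn(struct work_struct *work)
    {
            /* with NUMA affinity enabled, this preferably runs on a
             * CPU of the node the work item was queued from */
    }
    static DECLARE_WORK(my_work, my_work_fn);

    static int __init my_init(void)
    {
            my_wq = alloc_workqueue("my_wq", WQ_UNBOUND, 0);
            if (!my_wq)
                    return -ENOMEM;
            queue_work(my_wq, &my_work);    /* same call as before */
            return 0;
    }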

This would be most helpful to subsystems which use some form of async
execution to process significant amounts of data - e.g. crypto and
btrfs; however, I wanted to find out whether it would make any dent in
much less favorable use cases. The following is the total run time in
seconds of building an allmodconfig kernel w/ -j20 on a dual socket
Opteron machine with the writeback thread pool converted to an unbound
workqueue and thus made NUMA-affine. The file system is ext4 on top
of a WD SSD.

        before conversion   after conversion
             1396.126           1394.763
             1397.621           1394.965
             1399.636           1394.738
             1397.463           1398.162
             1395.543           1393.670

AVG          1397.278           1395.260      DIFF 2.018
STDEV           1.585              1.700

And, yes, it actually made things go faster by about 1.2 sigma, which
isn't completely conclusive but is a pretty good indication that it's
actually faster. Note that this workload is dominated by CPU time,
and while there's writeback going on continuously, it really isn't
touching much data or acting as a dominating factor, so the gain is
understandably small - 0.14% - but hey, it's still a gain, and it
should be much more interesting for crypto and btrfs, which would
actually access the data, or for workloads which are more sensitive to
NUMA affinity.
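
For reference, both figures fall straight out of the table above,
taking the average of the two stdevs as a rough yardstick:

    DIFF / avg(STDEV)  = 2.018 / ((1.585 + 1.700) / 2)  ~= 1.2 sigma
    DIFF / AVG(before) = 2.018 / 1397.278               ~= 0.14%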

The implementation is fairly simple. After the recent attrs support
changes, a lot of the differences in pwq (pool_workqueue) handling
between unbound and per-cpu workqueues are gone. An unbound workqueue
still has one "current" pwq that it uses for queueing any new work
items but can handle multiple pwqs perfectly well while they're
draining, so this patchset adds a pwq dispatch table to unbound
workqueues, indexed by NUMA node and pointing to the matching pwq.
Unbound workqueues now simply have multiple "current" pwqs keyed by
NUMA node.
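
In other words, the issue-path pwq selection ends up looking roughly
like the following (a simplified sketch of what patch 0008 does in
__queue_work(), not the verbatim code):

    /* pick the pwq a new work item will be queued on */
    if (req_cpu == WORK_CPU_UNBOUND)
            cpu = raw_smp_processor_id();

    if (!(wq->flags & WQ_UNBOUND))
            pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);             /* per-cpu wq */
    else
            pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));  /* NUMA-affine */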

NUMA affinity can be turned off system-wide with the
workqueue.disable_numa kernel parameter or per-workqueue via the
"numa" sysfs file.

This patchset contains the following fourteen patches.

0001-workqueue-move-pwq_pool_locking-outside-of-get-put_u.patch
0002-workqueue-add-wq_numa_tbl_len-and-wq_numa_possible_c.patch
0003-workqueue-drop-H-from-kworker-names-of-unbound-worke.patch
0004-workqueue-determine-NUMA-node-of-workers-accourding-.patch
0005-workqueue-add-workqueue-unbound_attrs.patch
0006-workqueue-make-workqueue-name-fixed-len.patch
0007-workqueue-move-hot-fields-of-workqueue_struct-to-the.patch
0008-workqueue-map-an-unbound-workqueues-to-multiple-per-.patch
0009-workqueue-break-init_and_link_pwq-into-two-functions.patch
0010-workqueue-use-NUMA-aware-allocation-for-pool_workque.patch
0011-workqueue-introduce-numa_pwq_tbl_install.patch
0012-workqueue-introduce-put_pwq_unlocked.patch
0013-workqueue-implement-NUMA-affinity-for-unbound-workqu.patch
0014-workqueue-update-sysfs-interface-to-reflect-NUMA-awa.patch

0001-0009 are prep patches.

0010-0013 implement NUMA affinity.

0014 adds control knobs and updates sysfs interface.

This patchset is on top of

wq/for-3.10 b59276054 ("workqueue: remove pwq_lock which is no longer used")
+ [1] ("workqueue: fix race condition in unbound workqueue free path") patch
+ [2] ("workqueue: fix unbound workqueue attrs hashing / comparison") patch

and also available in the following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-numa

diffstat follows.

 Documentation/kernel-parameters.txt |   9
 include/linux/workqueue.h           |   5
 kernel/workqueue.c                  | 640 ++++++++++++++++++++++++++++++------
 3 files changed, 549 insertions(+), 105 deletions(-)

Thanks.

--
tejun

[L] http://thread.gmane.org/gmane.linux.kernel.cryptoapi/8501
[1] http://article.gmane.org/gmane.linux.kernel/1465618
[2] http://article.gmane.org/gmane.linux.kernel/1465619


2013-03-28 06:43:27

by Tejun Heo

Subject: [PATCH 01/14] workqueue: move pwq_pool_locking outside of get/put_unbound_pool()

The scheduled NUMA affinity support for unbound workqueues would need
to walk the workqueues list and perform pool-related operations on
each workqueue.

Move wq_pool_mutex locking out of get/put_unbound_pool() to their
callers so that pool operations can be performed while walking the
workqueues list, which is also protected by wq_pool_mutex.
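
In other words, the calling convention becomes the following sketch
(matching the hunks below); both functions now just assert that the
mutex is held:

    mutex_lock(&wq_pool_mutex);     /* also protects the workqueues list */
    pool = get_unbound_pool(attrs);
    mutex_unlock(&wq_pool_mutex);

    ...

    mutex_lock(&wq_pool_mutex);
    put_unbound_pool(pool);
    mutex_unlock(&wq_pool_mutex);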

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 36 ++++++++++++++++++++++--------------
1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index abe1f0d..26771f4e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3395,31 +3395,28 @@ static void rcu_free_pool(struct rcu_head *rcu)
* safe manner. get_unbound_pool() calls this function on its failure path
* and this function should be able to release pools which went through,
* successfully or not, init_worker_pool().
+ *
+ * Should be called with wq_pool_mutex held.
*/
static void put_unbound_pool(struct worker_pool *pool)
{
struct worker *worker;

- mutex_lock(&wq_pool_mutex);
- if (--pool->refcnt) {
- mutex_unlock(&wq_pool_mutex);
+ lockdep_assert_held(&wq_pool_mutex);
+
+ if (--pool->refcnt)
return;
- }

/* sanity checks */
if (WARN_ON(!(pool->flags & POOL_DISASSOCIATED)) ||
- WARN_ON(!list_empty(&pool->worklist))) {
- mutex_unlock(&wq_pool_mutex);
+ WARN_ON(!list_empty(&pool->worklist)))
return;
- }

/* release id and unhash */
if (pool->id >= 0)
idr_remove(&worker_pool_idr, pool->id);
hash_del(&pool->hash_node);

- mutex_unlock(&wq_pool_mutex);
-
/*
* Become the manager and destroy all workers. Grabbing
* manager_arb prevents @pool's workers from blocking on
@@ -3453,13 +3450,15 @@ static void put_unbound_pool(struct worker_pool *pool)
* reference count and return it. If there already is a matching
* worker_pool, it will be used; otherwise, this function attempts to
* create a new one. On failure, returns NULL.
+ *
+ * Should be called with wq_pool_mutex held.
*/
static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
{
u32 hash = wqattrs_hash(attrs);
struct worker_pool *pool;

- mutex_lock(&wq_pool_mutex);
+ lockdep_assert_held(&wq_pool_mutex);

/* do we already have a matching pool? */
hash_for_each_possible(unbound_pool_hash, pool, hash_node, hash) {
@@ -3490,10 +3489,8 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
/* install */
hash_add(unbound_pool_hash, &pool->hash_node, hash);
out_unlock:
- mutex_unlock(&wq_pool_mutex);
return pool;
fail:
- mutex_unlock(&wq_pool_mutex);
if (pool)
put_unbound_pool(pool);
return NULL;
@@ -3530,7 +3527,10 @@ static void pwq_unbound_release_workfn(struct work_struct *work)
is_last = list_empty(&wq->pwqs);
mutex_unlock(&wq->mutex);

+ mutex_lock(&wq_pool_mutex);
put_unbound_pool(pool);
+ mutex_unlock(&wq_pool_mutex);
+
call_rcu_sched(&pwq->rcu, rcu_free_pwq);

/*
@@ -3653,13 +3653,21 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
copy_workqueue_attrs(new_attrs, attrs);
cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);

+ mutex_lock(&wq_pool_mutex);
+
pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
- if (!pwq)
+ if (!pwq) {
+ mutex_unlock(&wq_pool_mutex);
goto enomem;
+ }

pool = get_unbound_pool(new_attrs);
- if (!pool)
+ if (!pool) {
+ mutex_unlock(&wq_pool_mutex);
goto enomem;
+ }
+
+ mutex_unlock(&wq_pool_mutex);

init_and_link_pwq(pwq, wq, pool, &last_pwq);
if (last_pwq) {
--
1.8.1.4

2013-03-28 06:43:29

by Tejun Heo

Subject: [PATCH 03/14] workqueue: drop 'H' from kworker names of unbound worker pools

Currently, all workqueue workers which have a negative nice value have
'H' postfixed to their names. This is necessary for per-cpu workers
as they use the CPU number instead of pool->id to identify the pool
and the 'H' postfix is the only thing distinguishing normal and
highpri workers.

As workers for unbound pools use pool->id, the 'H' postfix is purely
informational. TASK_COMM_LEN is 16 and, after the static part and
delimiters, there are only five characters left for the pool and
worker IDs. We're expecting to have more unbound pools with the
scheduled NUMA awareness support. Let's drop the non-essential 'H'
postfix from unbound kworker names.

While at it, restructure kthread_create*() invocation to help future
NUMA related changes.
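
With the restructuring below, the resulting task comm strings look
like the following (the numbers are made up for illustration):

    kworker/3:0H    /* per-cpu pool on CPU 3, worker 0, highpri - keeps 'H' */
    kworker/u5:2    /* unbound pool 5, worker 2 - no 'H' even if highpri */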

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 54b5048..57ced49 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1644,9 +1644,10 @@ static struct worker *alloc_worker(void)
*/
static struct worker *create_worker(struct worker_pool *pool)
{
- const char *pri = pool->attrs->nice < 0 ? "H" : "";
struct worker *worker = NULL;
+ int node = pool->cpu >= 0 ? cpu_to_node(pool->cpu) : NUMA_NO_NODE;
int id = -1;
+ char id_buf[16];

lockdep_assert_held(&pool->manager_mutex);

@@ -1672,13 +1673,13 @@ static struct worker *create_worker(struct worker_pool *pool)
worker->id = id;

if (pool->cpu >= 0)
- worker->task = kthread_create_on_node(worker_thread,
- worker, cpu_to_node(pool->cpu),
- "kworker/%d:%d%s", pool->cpu, id, pri);
+ snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
+ pool->attrs->nice < 0 ? "H" : "");
else
- worker->task = kthread_create(worker_thread, worker,
- "kworker/u%d:%d%s",
- pool->id, id, pri);
+ snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
+
+ worker->task = kthread_create_on_node(worker_thread, worker, node,
+ "kworker/%s", id_buf);
if (IS_ERR(worker->task))
goto fail;

--
1.8.1.4

2013-03-28 06:43:30

by Tejun Heo

Subject: [PATCH 04/14] workqueue: determine NUMA node of workers according to the allowed cpumask

When worker tasks are created using kthread_create_on_node(),
currently only per-cpu ones have the matching NUMA node specified.
All unbound workers are always created with NUMA_NO_NODE.

Now that an unbound worker pool may have an arbitrary cpumask
associated with it, this isn't optimal. Add pool->node which is
determined by the pool's cpumask. If the pool's cpumask is contained
inside a NUMA node proper, the pool is associated with that node, and
all workers of the pool are created on that node.

This currently only makes a difference for unbound worker pools with
a cpumask contained inside a single NUMA node, but this will serve as
the foundation for making all unbound pools NUMA-affine.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 57ced49..fab2630 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -138,6 +138,7 @@ enum {
struct worker_pool {
spinlock_t lock; /* the pool lock */
int cpu; /* I: the associated cpu */
+ int node; /* I: the associated node ID */
int id; /* I: pool ID */
unsigned int flags; /* X: flags */

@@ -1645,7 +1646,6 @@ static struct worker *alloc_worker(void)
static struct worker *create_worker(struct worker_pool *pool)
{
struct worker *worker = NULL;
- int node = pool->cpu >= 0 ? cpu_to_node(pool->cpu) : NUMA_NO_NODE;
int id = -1;
char id_buf[16];

@@ -1678,7 +1678,7 @@ static struct worker *create_worker(struct worker_pool *pool)
else
snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);

- worker->task = kthread_create_on_node(worker_thread, worker, node,
+ worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
"kworker/%s", id_buf);
if (IS_ERR(worker->task))
goto fail;
@@ -3360,6 +3360,7 @@ static int init_worker_pool(struct worker_pool *pool)
spin_lock_init(&pool->lock);
pool->id = -1;
pool->cpu = -1;
+ pool->node = NUMA_NO_NODE;
pool->flags |= POOL_DISASSOCIATED;
INIT_LIST_HEAD(&pool->worklist);
INIT_LIST_HEAD(&pool->idle_list);
@@ -3465,6 +3466,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
{
u32 hash = wqattrs_hash(attrs);
struct worker_pool *pool;
+ int node;

lockdep_assert_held(&wq_pool_mutex);

@@ -3487,6 +3489,17 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
lockdep_set_subclass(&pool->lock, 1); /* see put_pwq() */
copy_workqueue_attrs(pool->attrs, attrs);

+ /* if cpumask is contained inside a NUMA node, we belong to that node */
+ if (wq_numa_enabled) {
+ for_each_node(node) {
+ if (cpumask_subset(pool->attrs->cpumask,
+ wq_numa_possible_cpumask[node])) {
+ pool->node = node;
+ break;
+ }
+ }
+ }
+
if (worker_pool_assign_id(pool) < 0)
goto fail;

@@ -4475,6 +4488,7 @@ static int __init init_workqueues(void)
pool->cpu = cpu;
cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
pool->attrs->nice = std_nice[i++];
+ pool->node = cpu_to_node(cpu);

/* alloc pool ID */
mutex_lock(&wq_pool_mutex);
--
1.8.1.4

2013-03-28 06:44:35

by Tejun Heo

Subject: [PATCH 07/14] workqueue: move hot fields of workqueue_struct to the end

Move wq->flags and ->cpu_pwqs to the end of workqueue_struct and align
them to the cacheline. These two fields are used in the work item
issue path and thus hot. The scheduled NUMA affinity support will add
a dispatch table at the end of workqueue_struct, and relocating these
two fields will allow hot paths to hit only a single cacheline.

Note that wq->pwqs isn't moved although it currently is being used in
the work item issue path for unbound workqueues. The dispatch table
mentioned above will replace its use in the issue path, so it will
become cold once NUMA support is implemented.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 23c9099..9297ea3 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -227,8 +227,6 @@ struct wq_device;
* the appropriate worker_pool through its pool_workqueues.
*/
struct workqueue_struct {
- unsigned int flags; /* WQ: WQ_* flags */
- struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwq's */
struct list_head pwqs; /* WR: all pwqs of this wq */
struct list_head list; /* PL: list of all workqueues */

@@ -255,6 +253,10 @@ struct workqueue_struct {
struct lockdep_map lockdep_map;
#endif
char name[WQ_NAME_LEN]; /* I: workqueue name */
+
+ /* hot fields used during command issue, aligned to cacheline */
+ unsigned int flags ____cacheline_aligned; /* WQ: WQ_* flags */
+ struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwqs */
};

static struct kmem_cache *pwq_cache;
--
1.8.1.4

2013-03-28 06:43:31

by Tejun Heo

Subject: [PATCH 05/14] workqueue: add workqueue->unbound_attrs

Currently, when exposing attrs of an unbound workqueue via sysfs, the
workqueue_attrs of first_pwq() is used as that should equal the
current state of the workqueue.

The planned NUMA affinity support will make unbound workqueues use
multiple pool_workqueues for different NUMA nodes and the above
assumption will no longer hold. Introduce workqueue->unbound_attrs
which records the current attrs in effect and use it for sysfs instead
of first_pwq()->attrs.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 36 ++++++++++++++++++++++++------------
1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index fab2630..65f200c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -244,6 +244,8 @@ struct workqueue_struct {
int nr_drainers; /* WQ: drain in progress */
int saved_max_active; /* WQ: saved pwq max_active */

+ struct workqueue_attrs *unbound_attrs; /* WQ: only for unbound wqs */
+
#ifdef CONFIG_SYSFS
struct wq_device *wq_dev; /* I: for sysfs interface */
#endif
@@ -3088,10 +3090,9 @@ static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
struct workqueue_struct *wq = dev_to_wq(dev);
int written;

- rcu_read_lock_sched();
- written = scnprintf(buf, PAGE_SIZE, "%d\n",
- first_pwq(wq)->pool->attrs->nice);
- rcu_read_unlock_sched();
+ mutex_lock(&wq->mutex);
+ written = scnprintf(buf, PAGE_SIZE, "%d\n", wq->unbound_attrs->nice);
+ mutex_unlock(&wq->mutex);

return written;
}
@@ -3105,9 +3106,9 @@ static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq)
if (!attrs)
return NULL;

- rcu_read_lock_sched();
- copy_workqueue_attrs(attrs, first_pwq(wq)->pool->attrs);
- rcu_read_unlock_sched();
+ mutex_lock(&wq->mutex);
+ copy_workqueue_attrs(attrs, wq->unbound_attrs);
+ mutex_unlock(&wq->mutex);
return attrs;
}

@@ -3138,10 +3139,9 @@ static ssize_t wq_cpumask_show(struct device *dev,
struct workqueue_struct *wq = dev_to_wq(dev);
int written;

- rcu_read_lock_sched();
- written = cpumask_scnprintf(buf, PAGE_SIZE,
- first_pwq(wq)->pool->attrs->cpumask);
- rcu_read_unlock_sched();
+ mutex_lock(&wq->mutex);
+ written = cpumask_scnprintf(buf, PAGE_SIZE, wq->unbound_attrs->cpumask);
+ mutex_unlock(&wq->mutex);

written += scnprintf(buf + written, PAGE_SIZE - written, "\n");
return written;
@@ -3558,8 +3558,10 @@ static void pwq_unbound_release_workfn(struct work_struct *work)
* If we're the last pwq going away, @wq is already dead and no one
* is gonna access it anymore. Free it.
*/
- if (is_last)
+ if (is_last) {
+ free_workqueue_attrs(wq->unbound_attrs);
kfree(wq);
+ }
}

/**
@@ -3634,6 +3636,9 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
/* link in @pwq */
list_add_rcu(&pwq->pwqs_node, &wq->pwqs);

+ if (wq->flags & WQ_UNBOUND)
+ copy_workqueue_attrs(wq->unbound_attrs, pool->attrs);
+
mutex_unlock(&wq->mutex);
}

@@ -3761,6 +3766,12 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
if (!wq)
return NULL;

+ if (flags & WQ_UNBOUND) {
+ wq->unbound_attrs = alloc_workqueue_attrs(GFP_KERNEL);
+ if (!wq->unbound_attrs)
+ goto err_free_wq;
+ }
+
vsnprintf(wq->name, namelen, fmt, args1);
va_end(args);
va_end(args1);
@@ -3830,6 +3841,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
return wq;

err_free_wq:
+ free_workqueue_attrs(wq->unbound_attrs);
kfree(wq);
return NULL;
err_destroy:
--
1.8.1.4

2013-03-28 06:43:35

by Tejun Heo

Subject: [PATCH 09/14] workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()

Break init_and_link_pwq() into init_pwq() and link_pwq() and move
unbound-workqueue specific handling into apply_workqueue_attrs().
Also, factor out unbound pool and pool_workqueue allocation into
alloc_unbound_pwq().

This reorganization is to prepare for NUMA affinity and doesn't
introduce any functional changes.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 77 ++++++++++++++++++++++++++++++++++--------------------
1 file changed, 49 insertions(+), 28 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5b53705..58c7663 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3626,13 +3626,10 @@ static void pwq_adjust_max_active(struct pool_workqueue *pwq)
spin_unlock_irq(&pwq->pool->lock);
}

-static void init_and_link_pwq(struct pool_workqueue *pwq,
- struct workqueue_struct *wq,
- struct worker_pool *pool,
- struct pool_workqueue **p_last_pwq)
+/* initialize newly zalloced @pwq which is associated with @wq and @pool */
+static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
+ struct worker_pool *pool)
{
- int node;
-
BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);

pwq->pool = pool;
@@ -3642,8 +3639,15 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
INIT_LIST_HEAD(&pwq->delayed_works);
INIT_LIST_HEAD(&pwq->mayday_node);
INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
+}

- mutex_lock(&wq->mutex);
+/* sync @pwq with the current state of its associated wq and link it */
+static void link_pwq(struct pool_workqueue *pwq,
+ struct pool_workqueue **p_last_pwq)
+{
+ struct workqueue_struct *wq = pwq->wq;
+
+ lockdep_assert_held(&wq->mutex);

/*
* Set the matching work_color. This is synchronized with
@@ -3658,14 +3662,29 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,

/* link in @pwq */
list_add_rcu(&pwq->pwqs_node, &wq->pwqs);
+}

- if (wq->flags & WQ_UNBOUND) {
- copy_workqueue_attrs(wq->unbound_attrs, pool->attrs);
- for_each_node(node)
- rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+/* obtain a pool matching @attr and create a pwq associating the pool and @wq */
+static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
+ const struct workqueue_attrs *attrs)
+{
+ struct worker_pool *pool;
+ struct pool_workqueue *pwq;
+
+ lockdep_assert_held(&wq_pool_mutex);
+
+ pool = get_unbound_pool(attrs);
+ if (!pool)
+ return NULL;
+
+ pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
+ if (!pwq) {
+ put_unbound_pool(pool);
+ return NULL;
}

- mutex_unlock(&wq->mutex);
+ init_pwq(pwq, wq, pool);
+ return pwq;
}

/**
@@ -3686,8 +3705,8 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs)
{
struct workqueue_attrs *new_attrs;
- struct pool_workqueue *pwq = NULL, *last_pwq;
- struct worker_pool *pool;
+ struct pool_workqueue *pwq, *last_pwq;
+ int node;

/* only unbound workqueues can change attributes */
if (WARN_ON(!(wq->flags & WQ_UNBOUND)))
@@ -3706,22 +3725,21 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);

mutex_lock(&wq_pool_mutex);
-
- pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
- if (!pwq) {
- mutex_unlock(&wq_pool_mutex);
+ pwq = alloc_unbound_pwq(wq, new_attrs);
+ mutex_unlock(&wq_pool_mutex);
+ if (!pwq)
goto enomem;
- }

- pool = get_unbound_pool(new_attrs);
- if (!pool) {
- mutex_unlock(&wq_pool_mutex);
- goto enomem;
- }
+ mutex_lock(&wq->mutex);

- mutex_unlock(&wq_pool_mutex);
+ link_pwq(pwq, &last_pwq);
+
+ copy_workqueue_attrs(wq->unbound_attrs, new_attrs);
+ for_each_node(node)
+ rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+
+ mutex_unlock(&wq->mutex);

- init_and_link_pwq(pwq, wq, pool, &last_pwq);
if (last_pwq) {
spin_lock_irq(&last_pwq->pool->lock);
put_pwq(last_pwq);
@@ -3731,7 +3749,6 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
return 0;

enomem:
- kmem_cache_free(pwq_cache, pwq);
free_workqueue_attrs(new_attrs);
return -ENOMEM;
}
@@ -3752,7 +3769,11 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
struct worker_pool *cpu_pools =
per_cpu(cpu_worker_pools, cpu);

- init_and_link_pwq(pwq, wq, &cpu_pools[highpri], NULL);
+ init_pwq(pwq, wq, &cpu_pools[highpri]);
+
+ mutex_lock(&wq->mutex);
+ link_pwq(pwq, NULL);
+ mutex_unlock(&wq->mutex);
}
return 0;
} else {
--
1.8.1.4

2013-03-28 06:44:53

by Tejun Heo

Subject: [PATCH 10/14] workqueue: use NUMA-aware allocation for pool_workqueues

Use kmem_cache_alloc_node() with @pool->node instead of
kmem_cache_zalloc() when allocating a pool_workqueue so that it's
allocated on the same node as the associated worker_pool. As there's
no kmem_cache_zalloc_node(), move zeroing to init_pwq().

This was suggested by Lai Jiangshan.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 58c7663..a4420be 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3626,12 +3626,14 @@ static void pwq_adjust_max_active(struct pool_workqueue *pwq)
spin_unlock_irq(&pwq->pool->lock);
}

-/* initialize newly zalloced @pwq which is associated with @wq and @pool */
+/* initialize newly alloced @pwq which is associated with @wq and @pool */
static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
struct worker_pool *pool)
{
BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);

+ memset(pwq, 0, sizeof(*pwq));
+
pwq->pool = pool;
pwq->wq = wq;
pwq->flush_color = -1;
@@ -3677,7 +3679,7 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
if (!pool)
return NULL;

- pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
+ pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, pool->node);
if (!pwq) {
put_unbound_pool(pool);
return NULL;
--
1.8.1.4

2013-03-28 06:43:34

by Tejun Heo

Subject: [PATCH 08/14] workqueue: map an unbound workqueues to multiple per-node pool_workqueues

Currently, an unbound workqueue has only one "current" pool_workqueue
associated with it. It may have multiple pool_workqueues but only the
first pool_workqueue serves new work items. For NUMA affinity, we
want to change this so that there are multiple current pool_workqueues
serving different NUMA nodes.

Introduce workqueue->numa_pwq_tbl[] which is indexed by NUMA node and
points to the pool_workqueue to use for each possible node. This
replaces first_pwq() in __queue_work() and workqueue_congested().

numa_pwq_tbl[] is currently initialized to point to the same
pool_workqueue as first_pwq() so this patch doesn't make any behavior
changes.

v2: Use rcu_dereference_raw() in unbound_pwq_by_node() as the function
may be called with only wq->mutex held.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 48 +++++++++++++++++++++++++++++++++++++-----------
1 file changed, 37 insertions(+), 11 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9297ea3..5b53705 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -257,6 +257,7 @@ struct workqueue_struct {
/* hot fields used during command issue, aligned to cacheline */
unsigned int flags ____cacheline_aligned; /* WQ: WQ_* flags */
struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwqs */
+ struct pool_workqueue __rcu *numa_pwq_tbl[]; /* FR: unbound pwqs indexed by node */
};

static struct kmem_cache *pwq_cache;
@@ -525,6 +526,22 @@ static struct pool_workqueue *first_pwq(struct workqueue_struct *wq)
pwqs_node);
}

+/**
+ * unbound_pwq_by_node - return the unbound pool_workqueue for the given node
+ * @wq: the target workqueue
+ * @node: the node ID
+ *
+ * This must be called either with pwq_lock held or sched RCU read locked.
+ * If the pwq needs to be used beyond the locking in effect, the caller is
+ * responsible for guaranteeing that the pwq stays online.
+ */
+static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
+ int node)
+{
+ assert_rcu_or_wq_mutex(wq);
+ return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
+}
+
static unsigned int work_color_to_flags(int color)
{
return color << WORK_STRUCT_COLOR_SHIFT;
@@ -1278,14 +1295,14 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
WARN_ON_ONCE(!is_chained_work(wq)))
return;
retry:
+ if (req_cpu == WORK_CPU_UNBOUND)
+ cpu = raw_smp_processor_id();
+
/* pwq which will be used unless @work is executing elsewhere */
- if (!(wq->flags & WQ_UNBOUND)) {
- if (cpu == WORK_CPU_UNBOUND)
- cpu = raw_smp_processor_id();
+ if (!(wq->flags & WQ_UNBOUND))
pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
- } else {
- pwq = first_pwq(wq);
- }
+ else
+ pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));

/*
* If @work was previously on a different pool, it might still be
@@ -1315,8 +1332,8 @@ retry:
* pwq is determined and locked. For unbound pools, we could have
* raced with pwq release and it could already be dead. If its
* refcnt is zero, repeat pwq selection. Note that pwqs never die
- * without another pwq replacing it as the first pwq or while a
- * work item is executing on it, so the retying is guaranteed to
+ * without another pwq replacing it in the numa_pwq_tbl or while
+ * work items are executing on it, so the retrying is guaranteed to
* make forward-progress.
*/
if (unlikely(!pwq->refcnt)) {
@@ -3614,6 +3631,8 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
struct worker_pool *pool,
struct pool_workqueue **p_last_pwq)
{
+ int node;
+
BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);

pwq->pool = pool;
@@ -3640,8 +3659,11 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
/* link in @pwq */
list_add_rcu(&pwq->pwqs_node, &wq->pwqs);

- if (wq->flags & WQ_UNBOUND)
+ if (wq->flags & WQ_UNBOUND) {
copy_workqueue_attrs(wq->unbound_attrs, pool->attrs);
+ for_each_node(node)
+ rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+ }

mutex_unlock(&wq->mutex);
}
@@ -3756,12 +3778,16 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
struct lock_class_key *key,
const char *lock_name, ...)
{
+ size_t tbl_size = 0;
va_list args;
struct workqueue_struct *wq;
struct pool_workqueue *pwq;

/* allocate wq and format name */
- wq = kzalloc(sizeof(*wq), GFP_KERNEL);
+ if (flags & WQ_UNBOUND)
+ tbl_size = wq_numa_tbl_len * sizeof(wq->numa_pwq_tbl[0]);
+
+ wq = kzalloc(sizeof(*wq) + tbl_size, GFP_KERNEL);
if (!wq)
return NULL;

@@ -3989,7 +4015,7 @@ bool workqueue_congested(int cpu, struct workqueue_struct *wq)
if (!(wq->flags & WQ_UNBOUND))
pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
else
- pwq = first_pwq(wq);
+ pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));

ret = !list_empty(&pwq->delayed_works);
rcu_read_unlock_sched();
--
1.8.1.4

2013-03-28 06:43:32

by Tejun Heo

Subject: [PATCH 06/14] workqueue: make workqueue->name[] fixed len

Currently workqueue->name[] is of flexible length. We want to use the
flexible field for something more useful and there isn't much benefit
in allowing arbitrary name lengths anyway. Make it fixed length,
capped at 24 bytes.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 65f200c..23c9099 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -101,6 +101,8 @@ enum {
*/
RESCUER_NICE_LEVEL = -20,
HIGHPRI_NICE_LEVEL = -20,
+
+ WQ_NAME_LEN = 24,
};

/*
@@ -252,7 +254,7 @@ struct workqueue_struct {
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
- char name[]; /* I: workqueue name */
+ char name[WQ_NAME_LEN]; /* I: workqueue name */
};

static struct kmem_cache *pwq_cache;
@@ -3752,17 +3754,12 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
struct lock_class_key *key,
const char *lock_name, ...)
{
- va_list args, args1;
+ va_list args;
struct workqueue_struct *wq;
struct pool_workqueue *pwq;
- size_t namelen;
-
- /* determine namelen, allocate wq and format name */
- va_start(args, lock_name);
- va_copy(args1, args);
- namelen = vsnprintf(NULL, 0, fmt, args) + 1;

- wq = kzalloc(sizeof(*wq) + namelen, GFP_KERNEL);
+ /* allocate wq and format name */
+ wq = kzalloc(sizeof(*wq), GFP_KERNEL);
if (!wq)
return NULL;

@@ -3772,9 +3769,9 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
goto err_free_wq;
}

- vsnprintf(wq->name, namelen, fmt, args1);
+ va_start(args, lock_name);
+ vsnprintf(wq->name, sizeof(wq->name), fmt, args);
va_end(args);
- va_end(args1);

max_active = max_active ?: WQ_DFL_ACTIVE;
max_active = wq_clamp_max_active(max_active, flags, wq->name);
--
1.8.1.4

2013-03-28 06:43:40

by Tejun Heo

Subject: [PATCH 14/14] workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity

Unbound workqueues are now NUMA aware. Let's add some control knobs
and update the sysfs interface accordingly.

* Add the kernel param workqueue.disable_numa which disables NUMA
affinity globally.

* Replace the sysfs file "pool_id" with "pool_ids" which contains
node:pool_id pairs. This change is userland-visible but "pool_id"
hasn't seen a release yet, so this is okay.

* Add a new sysfs file "numa" which can toggle NUMA affinity on
individual workqueues. This is implemented as attrs->no_numa which
is special in that it isn't part of a pool's attributes. It only
affects how apply_workqueue_attrs() picks which pools to use.

After "pool_ids" change, first_pwq() doesn't have any user left.
Removed.
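
For in-kernel users, toggling the new flag would look roughly like the
sketch below (hedged: it starts from freshly allocated default attrs,
so nice and cpumask are reset to their defaults rather than preserved):

    struct workqueue_attrs *attrs;
    int ret;

    attrs = alloc_workqueue_attrs(GFP_KERNEL);
    if (!attrs)
            return -ENOMEM;
    attrs->no_numa = true;                          /* disable NUMA affinity for @wq */
    ret = apply_workqueue_attrs(wq, attrs);         /* @wq must be WQ_UNBOUND */
    free_workqueue_attrs(attrs);
    return ret;

From userland, the same toggle is just a 0/1 write to the "numa" file
of a workqueue visible under /sys/bus/workqueue/.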

Signed-off-by: Tejun Heo <[email protected]>
---
Documentation/kernel-parameters.txt | 9 ++++
include/linux/workqueue.h | 5 +++
kernel/workqueue.c | 82 ++++++++++++++++++++++++++-----------
3 files changed, 73 insertions(+), 23 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 4609e81..c75ea0b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3222,6 +3222,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
or other driver-specific files in the
Documentation/watchdog/ directory.

+ workqueue.disable_numa
+ By default, all work items queued to unbound
+ workqueues are affine to the NUMA nodes they're
+ issued on, which results in better behavior in
+ general. If NUMA affinity needs to be disabled for
+ whatever reason, this option can be used. Note
+ that this also can be controlled per-workqueue for
+ workqueues visible under /sys/bus/workqueue/.
+
x2apic_phys [X86-64,APIC] Use x2apic physical mode instead of
default x2apic cluster mode on platforms
supporting x2apic.
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 835d12b..7179756 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -119,10 +119,15 @@ struct delayed_work {
/*
* A struct for workqueue attributes. This can be used to change
* attributes of an unbound workqueue.
+ *
+ * Unlike other fields, ->no_numa isn't a property of a worker_pool. It
+ * only modifies how apply_workqueue_attrs() select pools and thus doesn't
+ * participate in pool hash calculations or equality comparisons.
*/
struct workqueue_attrs {
int nice; /* nice level */
cpumask_var_t cpumask; /* allowed CPUs */
+ bool no_numa; /* disable NUMA affinity */
};

static inline struct delayed_work *to_delayed_work(struct work_struct *work)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 637debe..0b6a3b0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -268,6 +268,9 @@ static int wq_numa_tbl_len; /* highest possible NUMA node id + 1 */
static cpumask_var_t *wq_numa_possible_cpumask;
/* possible CPUs of each node */

+static bool wq_disable_numa;
+module_param_named(disable_numa, wq_disable_numa, bool, 0444);
+
static bool wq_numa_enabled; /* unbound NUMA affinity enabled */

/* buf for wq_update_unbound_numa_attrs(), protected by CPU hotplug exclusion */
@@ -517,21 +520,6 @@ static int worker_pool_assign_id(struct worker_pool *pool)
}

/**
- * first_pwq - return the first pool_workqueue of the specified workqueue
- * @wq: the target workqueue
- *
- * This must be called either with wq->mutex held or sched RCU read locked.
- * If the pwq needs to be used beyond the locking in effect, the caller is
- * responsible for guaranteeing that the pwq stays online.
- */
-static struct pool_workqueue *first_pwq(struct workqueue_struct *wq)
-{
- assert_rcu_or_wq_mutex(wq);
- return list_first_or_null_rcu(&wq->pwqs, struct pool_workqueue,
- pwqs_node);
-}
-
-/**
* unbound_pwq_by_node - return the unbound pool_workqueue for the given node
* @wq: the target workqueue
* @node: the node ID
@@ -3114,16 +3102,21 @@ static struct device_attribute wq_sysfs_attrs[] = {
__ATTR_NULL,
};

-static ssize_t wq_pool_id_show(struct device *dev,
- struct device_attribute *attr, char *buf)
+static ssize_t wq_pool_ids_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
{
struct workqueue_struct *wq = dev_to_wq(dev);
- struct worker_pool *pool;
- int written;
+ const char *delim = "";
+ int node, written = 0;

rcu_read_lock_sched();
- pool = first_pwq(wq)->pool;
- written = scnprintf(buf, PAGE_SIZE, "%d\n", pool->id);
+ for_each_node(node) {
+ written += scnprintf(buf + written, PAGE_SIZE - written,
+ "%s%d:%d", delim, node,
+ unbound_pwq_by_node(wq, node)->pool->id);
+ delim = " ";
+ }
+ written += scnprintf(buf + written, PAGE_SIZE - written, "\n");
rcu_read_unlock_sched();

return written;
@@ -3212,10 +3205,46 @@ static ssize_t wq_cpumask_store(struct device *dev,
return ret ?: count;
}

+static ssize_t wq_numa_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ int written;
+
+ mutex_lock(&wq->mutex);
+ written = scnprintf(buf, PAGE_SIZE, "%d\n",
+ !wq->unbound_attrs->no_numa);
+ mutex_unlock(&wq->mutex);
+
+ return written;
+}
+
+static ssize_t wq_numa_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ struct workqueue_attrs *attrs;
+ int v, ret;
+
+ attrs = wq_sysfs_prep_attrs(wq);
+ if (!attrs)
+ return -ENOMEM;
+
+ ret = -EINVAL;
+ if (sscanf(buf, "%d", &v) == 1) {
+ attrs->no_numa = !v;
+ ret = apply_workqueue_attrs(wq, attrs);
+ }
+
+ free_workqueue_attrs(attrs);
+ return ret ?: count;
+}
+
static struct device_attribute wq_sysfs_unbound_attrs[] = {
- __ATTR(pool_id, 0444, wq_pool_id_show, NULL),
+ __ATTR(pool_ids, 0444, wq_pool_ids_show, NULL),
__ATTR(nice, 0644, wq_nice_show, wq_nice_store),
__ATTR(cpumask, 0644, wq_cpumask_show, wq_cpumask_store),
+ __ATTR(numa, 0644, wq_numa_show, wq_numa_store),
__ATTR_NULL,
};

@@ -3750,7 +3779,7 @@ static void free_unbound_pwq(struct pool_workqueue *pwq)
static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
int cpu_going_down, cpumask_t *cpumask)
{
- if (!wq_numa_enabled)
+ if (!wq_numa_enabled || attrs->no_numa)
goto use_dfl;

/* does @node have any online CPUs @attrs wants? */
@@ -3940,6 +3969,8 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
cpumask = target_attrs->cpumask;
retry:
mutex_lock(&wq->mutex);
+ if (wq->unbound_attrs->no_numa)
+ goto out_unlock;

copy_workqueue_attrs(target_attrs, wq->unbound_attrs);
pwq = unbound_pwq_by_node(wq, node);
@@ -4757,6 +4788,11 @@ static void __init wq_numa_init(void)
if (num_possible_nodes() <= 1)
return;

+ if (wq_disable_numa) {
+ pr_info("workqueue: NUMA affinity support disabled\n");
+ return;
+ }
+
wq_update_unbound_numa_attrs_buf = alloc_workqueue_attrs(GFP_KERNEL);
BUG_ON(!wq_update_unbound_numa_attrs_buf);

--
1.8.1.4

2013-03-28 06:44:57

by Tejun Heo

Subject: [PATCH 11/14] workqueue: introduce numa_pwq_tbl_install()

Factor out pool_workqueue linking and installation into numa_pwq_tbl[]
from apply_workqueue_attrs() into numa_pwq_tbl_install(). link_pwq()
is made safe to call multiple times. numa_pwq_tbl_install() links the
pwq, installs it into numa_pwq_tbl[] at the specified node and returns
the old entry.

@last_pwq is removed from link_pwq() as the return value of the new
function can be used instead.

This is to prepare for NUMA affinity support for unbound workqueues.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 35 ++++++++++++++++++++++++++---------
1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a4420be..527dc418 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3639,24 +3639,26 @@ static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
pwq->flush_color = -1;
pwq->refcnt = 1;
INIT_LIST_HEAD(&pwq->delayed_works);
+ INIT_LIST_HEAD(&pwq->pwqs_node);
INIT_LIST_HEAD(&pwq->mayday_node);
INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
}

/* sync @pwq with the current state of its associated wq and link it */
-static void link_pwq(struct pool_workqueue *pwq,
- struct pool_workqueue **p_last_pwq)
+static void link_pwq(struct pool_workqueue *pwq)
{
struct workqueue_struct *wq = pwq->wq;

lockdep_assert_held(&wq->mutex);

+ /* may be called multiple times, ignore if already linked */
+ if (!list_empty(&pwq->pwqs_node))
+ return;
+
/*
* Set the matching work_color. This is synchronized with
* wq->mutex to avoid confusing flush_workqueue().
*/
- if (p_last_pwq)
- *p_last_pwq = first_pwq(wq);
pwq->work_color = wq->work_color;

/* sync max_active to the current setting */
@@ -3689,6 +3691,23 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
return pwq;
}

+/* install @pwq into @wq's numa_pwq_tbl[] for @node and return the old pwq */
+static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
+ int node,
+ struct pool_workqueue *pwq)
+{
+ struct pool_workqueue *old_pwq;
+
+ lockdep_assert_held(&wq->mutex);
+
+ /* link_pwq() can handle duplicate calls */
+ link_pwq(pwq);
+
+ old_pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
+ rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+ return old_pwq;
+}
+
/**
* apply_workqueue_attrs - apply new workqueue_attrs to an unbound workqueue
* @wq: the target workqueue
@@ -3707,7 +3726,7 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs)
{
struct workqueue_attrs *new_attrs;
- struct pool_workqueue *pwq, *last_pwq;
+ struct pool_workqueue *pwq, *last_pwq = NULL;
int node;

/* only unbound workqueues can change attributes */
@@ -3734,11 +3753,9 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,

mutex_lock(&wq->mutex);

- link_pwq(pwq, &last_pwq);
-
copy_workqueue_attrs(wq->unbound_attrs, new_attrs);
for_each_node(node)
- rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+ last_pwq = numa_pwq_tbl_install(wq, node, pwq);

mutex_unlock(&wq->mutex);

@@ -3774,7 +3791,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
init_pwq(pwq, wq, &cpu_pools[highpri]);

mutex_lock(&wq->mutex);
- link_pwq(pwq, NULL);
+ link_pwq(pwq);
mutex_unlock(&wq->mutex);
}
return 0;
--
1.8.1.4

2013-03-28 06:43:38

by Tejun Heo

Subject: [PATCH 12/14] workqueue: introduce put_pwq_unlocked()

Factor out the lock pool, put_pwq(), unlock sequence into
put_pwq_unlocked(). The two existing places are converted and there
will be more with NUMA affinity support.

This is to prepare for NUMA affinity support for unbound workqueues
and doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 36 +++++++++++++++++++++++-------------
1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 527dc418..e656931 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1057,6 +1057,25 @@ static void put_pwq(struct pool_workqueue *pwq)
schedule_work(&pwq->unbound_release_work);
}

+/**
+ * put_pwq_unlocked - put_pwq() with surrounding pool lock/unlock
+ * @pwq: pool_workqueue to put (can be %NULL)
+ *
+ * put_pwq() with locking. This function also allows %NULL @pwq.
+ */
+static void put_pwq_unlocked(struct pool_workqueue *pwq)
+{
+ if (pwq) {
+ /*
+ * As both pwqs and pools are sched-RCU protected, the
+ * following lock operations are safe.
+ */
+ spin_lock_irq(&pwq->pool->lock);
+ put_pwq(pwq);
+ spin_unlock_irq(&pwq->pool->lock);
+ }
+}
+
static void pwq_activate_delayed_work(struct work_struct *work)
{
struct pool_workqueue *pwq = get_work_pwq(work);
@@ -3759,12 +3778,7 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,

mutex_unlock(&wq->mutex);

- if (last_pwq) {
- spin_lock_irq(&last_pwq->pool->lock);
- put_pwq(last_pwq);
- spin_unlock_irq(&last_pwq->pool->lock);
- }
-
+ put_pwq_unlocked(last_pwq);
return 0;

enomem:
@@ -3975,16 +3989,12 @@ void destroy_workqueue(struct workqueue_struct *wq)
} else {
/*
* We're the sole accessor of @wq at this point. Directly
- * access the first pwq and put the base ref. As both pwqs
- * and pools are sched-RCU protected, the lock operations
- * are safe. @wq will be freed when the last pwq is
- * released.
+ * access the first pwq and put the base ref. @wq will be
+ * freed when the last pwq is released.
*/
pwq = list_first_entry(&wq->pwqs, struct pool_workqueue,
pwqs_node);
- spin_lock_irq(&pwq->pool->lock);
- put_pwq(pwq);
- spin_unlock_irq(&pwq->pool->lock);
+ put_pwq_unlocked(pwq);
}
}
EXPORT_SYMBOL_GPL(destroy_workqueue);
--
1.8.1.4

2013-03-28 06:45:06

by Tejun Heo

Subject: [PATCH 13/14] workqueue: implement NUMA affinity for unbound workqueues

Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.

This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to the default pwq covering the whole @attrs->cpumask.

As CPUs come up and go down, the pool association is changed
accordingly. Changing the pool association may involve allocating new
pools, which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers the whole attrs->cpumask and
is used as a fallback if pool creation fails during a CPU hotplug
operation.

This ensures that all work items issued on a NUMA node are executed
on the same node as long as the workqueue allows execution on the CPUs
of the node.

As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and is primarily used as a
safety mechanism rather than for active concurrency control.
Concurrency is usually limited by workqueue users through the number
of concurrently active work items, so this change shouldn't matter
much.

v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.

v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 278 +++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 258 insertions(+), 20 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e656931..637debe 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -45,6 +45,7 @@
#include <linux/hashtable.h>
#include <linux/rculist.h>
#include <linux/nodemask.h>
+#include <linux/moduleparam.h>

#include "workqueue_internal.h"

@@ -245,6 +246,7 @@ struct workqueue_struct {
int saved_max_active; /* WQ: saved pwq max_active */

struct workqueue_attrs *unbound_attrs; /* WQ: only for unbound wqs */
+ struct pool_workqueue *dfl_pwq; /* WQ: only for unbound wqs */

#ifdef CONFIG_SYSFS
struct wq_device *wq_dev; /* I: for sysfs interface */
@@ -268,6 +270,9 @@ static cpumask_var_t *wq_numa_possible_cpumask;

static bool wq_numa_enabled; /* unbound NUMA affinity enabled */

+/* buf for wq_update_unbound_numa_attrs(), protected by CPU hotplug exclusion */
+static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;
+
static DEFINE_MUTEX(wq_pool_mutex); /* protects pools and workqueues list */
static DEFINE_SPINLOCK(wq_mayday_lock); /* protects wq->maydays list */

@@ -3710,6 +3715,61 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
return pwq;
}

+/* undo alloc_unbound_pwq(), used only in the error path */
+static void free_unbound_pwq(struct pool_workqueue *pwq)
+{
+ lockdep_assert_held(&wq_pool_mutex);
+
+ if (pwq) {
+ put_unbound_pool(pwq->pool);
+ kfree(pwq);
+ }
+}
+
+/**
+ * wq_calc_node_mask - calculate a wq_attrs' cpumask for the specified node
+ * @attrs: the wq_attrs of interest
+ * @node: the target NUMA node
+ * @cpu_going_down: if >= 0, the CPU to consider as offline
+ * @cpumask: outarg, the resulting cpumask
+ *
+ * Calculate the cpumask a workqueue with @attrs should use on @node. If
+ * @cpu_going_down is >= 0, that cpu is considered offline during
+ * calculation. The result is stored in @cpumask. This function returns
+ * %true if the resulting @cpumask is different from @attrs->cpumask,
+ * %false if equal.
+ *
+ * If NUMA affinity is not enabled, @attrs->cpumask is always used. If
+ * enabled and @node has online CPUs requested by @attrs, the returned
+ * cpumask is the intersection of the possible CPUs of @node and
+ * @attrs->cpumask.
+ *
+ * The caller is responsible for ensuring that the cpumask of @node stays
+ * stable.
+ */
+static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
+ int cpu_going_down, cpumask_t *cpumask)
+{
+ if (!wq_numa_enabled)
+ goto use_dfl;
+
+ /* does @node have any online CPUs @attrs wants? */
+ cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
+ if (cpu_going_down >= 0)
+ cpumask_clear_cpu(cpu_going_down, cpumask);
+
+ if (cpumask_empty(cpumask))
+ goto use_dfl;
+
+ /* yeap, return possible CPUs in @node that @attrs wants */
+ cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
+ return !cpumask_equal(cpumask, attrs->cpumask);
+
+use_dfl:
+ cpumask_copy(cpumask, attrs->cpumask);
+ return false;
+}
+
/* install @pwq into @wq's numa_pwq_tbl[] for @node and return the old pwq */
static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
int node,
@@ -3732,11 +3792,12 @@ static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
* @wq: the target workqueue
* @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
*
- * Apply @attrs to an unbound workqueue @wq. If @attrs doesn't match the
- * current attributes, a new pwq is created and made the first pwq which
- * will serve all new work items. Older pwqs are released as in-flight
- * work items finish. Note that a work item which repeatedly requeues
- * itself back-to-back will stay on its current pwq.
+ * Apply @attrs to an unbound workqueue @wq. Unless disabled, on NUMA
+ * machines, this function maps a separate pwq to each NUMA node with
+ * possibles CPUs in @attrs->cpumask so that work items are affine to the
+ * NUMA node it was issued on. Older pwqs are released as in-flight work
+ * items finish. Note that a work item which repeatedly requeues itself
+ * back-to-back will stay on its current pwq.
*
* Performs GFP_KERNEL allocations. Returns 0 on success and -errno on
* failure.
@@ -3744,8 +3805,8 @@ static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
int apply_workqueue_attrs(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs)
{
- struct workqueue_attrs *new_attrs;
- struct pool_workqueue *pwq, *last_pwq = NULL;
+ struct workqueue_attrs *new_attrs, *tmp_attrs;
+ struct pool_workqueue **pwq_tbl, *dfl_pwq;
int node;

/* only unbound workqueues can change attributes */
@@ -3756,36 +3817,190 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
return -EINVAL;

- /* make a copy of @attrs and sanitize it */
+ pwq_tbl = kzalloc(wq_numa_tbl_len * sizeof(pwq_tbl[0]), GFP_KERNEL);
new_attrs = alloc_workqueue_attrs(GFP_KERNEL);
- if (!new_attrs)
+ tmp_attrs = alloc_workqueue_attrs(GFP_KERNEL);
+ if (!pwq_tbl || !new_attrs || !tmp_attrs)
goto enomem;

+ /* make a copy of @attrs and sanitize it */
copy_workqueue_attrs(new_attrs, attrs);
cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);

+ /*
+ * We may create multiple pwqs with differing cpumasks. Make a
+ * copy of @new_attrs which will be modified and used to obtain
+ * pools.
+ */
+ copy_workqueue_attrs(tmp_attrs, new_attrs);
+
+ /*
+ * CPUs should stay stable across pwq creations and installations.
+ * Pin CPUs, determine the target cpumask for each node and create
+ * pwqs accordingly.
+ */
+ get_online_cpus();
+
mutex_lock(&wq_pool_mutex);
- pwq = alloc_unbound_pwq(wq, new_attrs);
+
+ /*
+ * If something goes wrong during CPU up/down, we'll fall back to
+ * the default pwq covering whole @attrs->cpumask. Always create
+ * it even if we don't use it immediately.
+ */
+ dfl_pwq = alloc_unbound_pwq(wq, new_attrs);
+ if (!dfl_pwq)
+ goto enomem_pwq;
+
+ for_each_node(node) {
+ if (wq_calc_node_cpumask(attrs, node, -1, tmp_attrs->cpumask)) {
+ pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
+ if (!pwq_tbl[node])
+ goto enomem_pwq;
+ } else {
+ dfl_pwq->refcnt++;
+ pwq_tbl[node] = dfl_pwq;
+ }
+ }
+
mutex_unlock(&wq_pool_mutex);
- if (!pwq)
- goto enomem;

+ /* all pwqs have been created successfully, let's install'em */
mutex_lock(&wq->mutex);

copy_workqueue_attrs(wq->unbound_attrs, new_attrs);
+
+ /* save the previous pwq and install the new one */
for_each_node(node)
- last_pwq = numa_pwq_tbl_install(wq, node, pwq);
+ pwq_tbl[node] = numa_pwq_tbl_install(wq, node, pwq_tbl[node]);
+
+ /* @dfl_pwq might not have been used, ensure it's linked */
+ link_pwq(dfl_pwq);
+ swap(wq->dfl_pwq, dfl_pwq);

mutex_unlock(&wq->mutex);

- put_pwq_unlocked(last_pwq);
+ /* put the old pwqs */
+ for_each_node(node)
+ put_pwq_unlocked(pwq_tbl[node]);
+ put_pwq_unlocked(dfl_pwq);
+
+ put_online_cpus();
return 0;

+enomem_pwq:
+ free_unbound_pwq(dfl_pwq);
+ for_each_node(node)
+ if (pwq_tbl && pwq_tbl[node] != dfl_pwq)
+ free_unbound_pwq(pwq_tbl[node]);
+ mutex_unlock(&wq_pool_mutex);
+ put_online_cpus();
enomem:
+ free_workqueue_attrs(tmp_attrs);
free_workqueue_attrs(new_attrs);
+ kfree(pwq_tbl);
return -ENOMEM;
}

+/**
+ * wq_update_unbound_numa - update NUMA affinity of a wq for CPU hot[un]plug
+ * @wq: the target workqueue
+ * @cpu: the CPU coming up or going down
+ * @online: whether @cpu is coming up or going down
+ *
+ * This function is to be called from %CPU_DOWN_PREPARE, %CPU_ONLINE and
+ * %CPU_DOWN_FAILED. @cpu is being hot[un]plugged, update NUMA affinity of
+ * @wq accordingly.
+ *
+ * If NUMA affinity can't be adjusted due to memory allocation failure, it
+ * falls back to @wq->dfl_pwq which may not be optimal but is always
+ * correct.
+ */
+static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
+ bool online)
+{
+ int node = cpu_to_node(cpu);
+ int cpu_off = online ? -1 : cpu;
+ struct pool_workqueue *old_pwq = NULL, *new_pwq = NULL;
+ struct pool_workqueue *pwq;
+ struct workqueue_attrs *target_attrs;
+ cpumask_t *cpumask;
+
+ lockdep_assert_held(&wq_pool_mutex);
+
+ if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND))
+ return;
+
+ /*
+ * We don't wanna alloc/free wq_attrs for each wq for each CPU.
+ * Let's use a preallocated one. The following buf is protected by
+ * CPU hotplug exclusion.
+ */
+ target_attrs = wq_update_unbound_numa_attrs_buf;
+ cpumask = target_attrs->cpumask;
+retry:
+ mutex_lock(&wq->mutex);
+
+ copy_workqueue_attrs(target_attrs, wq->unbound_attrs);
+ pwq = unbound_pwq_by_node(wq, node);
+
+ /*
+ * Let's determine what needs to be done. If the target cpumask is
+ * different from wq's, we need to compare it to @pwq's and create
+ * a new one if they don't match. If the target cpumask equals
+ * wq's, the default pwq should be used. If @pwq is already the
+ * default one, nothing to do; otherwise, install the default one.
+ */
+ if (wq_calc_node_cpumask(wq->unbound_attrs, node, cpu_off, cpumask)) {
+ if (cpumask_equal(cpumask, pwq->pool->attrs->cpumask))
+ goto out_unlock;
+ } else if (pwq != wq->dfl_pwq) {
+ goto use_dfl_pwq;
+ } else {
+ goto out_unlock;
+ }
+
+ /*
+ * Have we already created a new pwq? As we could have raced with
+ * apply_workqueue_attrs(), verify that its attrs match the desired
+ * one before installing.
+ */
+ if (new_pwq && wqattrs_equal(new_pwq->pool->attrs, target_attrs)) {
+ old_pwq = numa_pwq_tbl_install(wq, node, new_pwq);
+ new_pwq = NULL;
+ goto out_unlock;
+ }
+
+ mutex_unlock(&wq->mutex);
+
+ /*
+ * Need to create a new pwq - either this is the first time or we
+ * raced with apply_workqueue_attrs() and our previous new pwq is
+ * no longer valid. Dispose of the previous one if exists and
+ * create a new one.
+ */
+ if (new_pwq)
+ free_unbound_pwq(new_pwq);
+
+ new_pwq = alloc_unbound_pwq(wq, target_attrs);
+ if (new_pwq)
+ goto retry;
+
+ pr_warning("workqueue: allocation failed while updating NUMA affinity of \"%s\"\n",
+ wq->name);
+ mutex_lock(&wq->mutex);
+ /* fall through */
+use_dfl_pwq:
+ spin_lock_irq(&wq->dfl_pwq->pool->lock);
+ get_pwq(wq->dfl_pwq);
+ spin_unlock_irq(&wq->dfl_pwq->pool->lock);
+ old_pwq = numa_pwq_tbl_install(wq, node, wq->dfl_pwq);
+out_unlock:
+ mutex_unlock(&wq->mutex);
+ put_pwq_unlocked(old_pwq);
+ free_unbound_pwq(new_pwq);
+}
+
static int alloc_and_link_pwqs(struct workqueue_struct *wq)
{
bool highpri = wq->flags & WQ_HIGHPRI;
@@ -3938,6 +4153,7 @@ EXPORT_SYMBOL_GPL(__alloc_workqueue_key);
void destroy_workqueue(struct workqueue_struct *wq)
{
struct pool_workqueue *pwq;
+ int node;

/* drain it before proceeding with destruction */
drain_workqueue(wq);
@@ -3989,12 +4205,17 @@ void destroy_workqueue(struct workqueue_struct *wq)
} else {
/*
* We're the sole accessor of @wq at this point. Directly
- * access the first pwq and put the base ref. @wq will be
- * freed when the last pwq is released.
+ * access numa_pwq_tbl[] and dfl_pwq to put the base refs.
+ * @wq will be freed when the last pwq is released.
*/
- pwq = list_first_entry(&wq->pwqs, struct pool_workqueue,
- pwqs_node);
- put_pwq_unlocked(pwq);
+ for_each_node(node) {
+ pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
+ RCU_INIT_POINTER(wq->numa_pwq_tbl[node], NULL);
+ put_pwq_unlocked(pwq);
+ }
+
+ put_pwq_unlocked(wq->dfl_pwq);
+ wq->dfl_pwq = NULL;
}
}
EXPORT_SYMBOL_GPL(destroy_workqueue);
@@ -4281,6 +4502,7 @@ static int __cpuinit workqueue_cpu_up_callback(struct notifier_block *nfb,
{
int cpu = (unsigned long)hcpu;
struct worker_pool *pool;
+ struct workqueue_struct *wq;
int pi;

switch (action & ~CPU_TASKS_FROZEN) {
@@ -4313,6 +4535,10 @@ static int __cpuinit workqueue_cpu_up_callback(struct notifier_block *nfb,
mutex_unlock(&pool->manager_mutex);
}

+ /* update NUMA affinity of unbound workqueues */
+ list_for_each_entry(wq, &workqueues, list)
+ wq_update_unbound_numa(wq, cpu, true);
+
mutex_unlock(&wq_pool_mutex);
break;
}
@@ -4329,12 +4555,21 @@ static int __cpuinit workqueue_cpu_down_callback(struct notifier_block *nfb,
{
int cpu = (unsigned long)hcpu;
struct work_struct unbind_work;
+ struct workqueue_struct *wq;

switch (action & ~CPU_TASKS_FROZEN) {
case CPU_DOWN_PREPARE:
- /* unbinding should happen on the local CPU */
+ /* unbinding per-cpu workers should happen on the local CPU */
INIT_WORK_ONSTACK(&unbind_work, wq_unbind_fn);
queue_work_on(cpu, system_highpri_wq, &unbind_work);
+
+ /* update NUMA affinity of unbound workqueues */
+ mutex_lock(&wq_pool_mutex);
+ list_for_each_entry(wq, &workqueues, list)
+ wq_update_unbound_numa(wq, cpu, false);
+ mutex_unlock(&wq_pool_mutex);
+
+ /* wait for per-cpu unbinding to finish */
flush_work(&unbind_work);
break;
}
@@ -4522,6 +4757,9 @@ static void __init wq_numa_init(void)
if (num_possible_nodes() <= 1)
return;

+ wq_update_unbound_numa_attrs_buf = alloc_workqueue_attrs(GFP_KERNEL);
+ BUG_ON(!wq_update_unbound_numa_attrs_buf);
+
/*
* We want masks of possible CPUs of each node which isn't readily
* available. Build one from cpu_to_node() which should have been
--
1.8.1.4

2013-03-28 06:44:12

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 02/14] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]

Unbound workqueues are going to be NUMA-affine. Add wq_numa_tbl_len
and wq_numa_possible_cpumask[] in preparation. The former is the
highest NUMA node ID + 1 and the latter holds the mask of possible
CPUs for each NUMA node.

This patch only introduces these. Future patches will make use of
them.
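
For illustration only, here is a hypothetical sketch (not part of this
patch; the helper name is made up) of how such a per-node table can be
consulted once later patches start using it:

#include <linux/cpumask.h>
#include <linux/topology.h>

/* hypothetical helper, for illustration only */
static const struct cpumask *node_possible_mask_of(int cpu, cpumask_var_t *tbl)
{
	int node = cpu_to_node(cpu);

	/* the mask of possible CPUs on the node @cpu belongs to */
	return tbl[node];
}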

v2: NUMA initialization moved into wq_numa_init(). Also, the possible
cpumask array is not created if there aren't multiple nodes on the
system. wq_numa_enabled bool added.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 46 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 26771f4e..54b5048 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -44,6 +44,7 @@
#include <linux/jhash.h>
#include <linux/hashtable.h>
#include <linux/rculist.h>
+#include <linux/nodemask.h>

#include "workqueue_internal.h"

@@ -253,6 +254,12 @@ struct workqueue_struct {

static struct kmem_cache *pwq_cache;

+static int wq_numa_tbl_len; /* highest possible NUMA node id + 1 */
+static cpumask_var_t *wq_numa_possible_cpumask;
+ /* possible CPUs of each node */
+
+static bool wq_numa_enabled; /* unbound NUMA affinity enabled */
+
static DEFINE_MUTEX(wq_pool_mutex); /* protects pools and workqueues list */
static DEFINE_SPINLOCK(wq_mayday_lock); /* protects wq->maydays list */

@@ -4402,6 +4409,43 @@ out_unlock:
}
#endif /* CONFIG_FREEZER */

+static void __init wq_numa_init(void)
+{
+ cpumask_var_t *tbl;
+ int node, cpu;
+
+ /* determine NUMA pwq table len - highest node id + 1 */
+ for_each_node(node)
+ wq_numa_tbl_len = max(wq_numa_tbl_len, node + 1);
+
+ if (num_possible_nodes() <= 1)
+ return;
+
+ /*
+ * We want masks of possible CPUs of each node which isn't readily
+ * available. Build one from cpu_to_node() which should have been
+ * fully initialized by now.
+ */
+ tbl = kzalloc(wq_numa_tbl_len * sizeof(tbl[0]), GFP_KERNEL);
+ BUG_ON(!tbl);
+
+ for_each_node(node)
+ BUG_ON(!alloc_cpumask_var_node(&tbl[node], GFP_KERNEL, node));
+
+ for_each_possible_cpu(cpu) {
+ node = cpu_to_node(cpu);
+ if (WARN_ON(node == NUMA_NO_NODE)) {
+ pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
+ /* happens iff arch is bonkers, let's just proceed */
+ return;
+ }
+ cpumask_set_cpu(cpu, tbl[node]);
+ }
+
+ wq_numa_possible_cpumask = tbl;
+ wq_numa_enabled = true;
+}
+
static int __init init_workqueues(void)
{
int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
@@ -4418,6 +4462,8 @@ static int __init init_workqueues(void)
cpu_notifier(workqueue_cpu_up_callback, CPU_PRI_WORKQUEUE_UP);
hotcpu_notifier(workqueue_cpu_down_callback, CPU_PRI_WORKQUEUE_DOWN);

+ wq_numa_init();
+
/* initialize CPU pools */
for_each_possible_cpu(cpu) {
struct worker_pool *pool;
--
1.8.1.4

2013-03-29 22:44:49

by Tejun Heo

[permalink] [raw]
Subject: [PATCH v4 13/14] workqueue: implement NUMA affinity for unbound workqueues

From e4bc30ced68420e89d264a26e10d450765a747ed Mon Sep 17 00:00:00 2001
From: Tejun Heo <[email protected]>
Date: Fri, 29 Mar 2013 15:42:27 -0700

Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.

This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
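
As a rough sketch of that mapping rule (illustrative names only; the
patch itself implements this as wq_calc_node_cpumask() below), the
per-node mask is the intersection of @attrs->cpumask and the node's
possible CPUs, with an empty intersection falling back to the mask of
the default pwq:

#include <linux/cpumask.h>

/* illustrative only */
static void node_pwq_mask_example(const struct cpumask *attrs_mask,
				  const struct cpumask *node_possible_mask,
				  struct cpumask *out)
{
	/* per-node pwq mask = attrs->cpumask & possible CPUs of the node */
	if (!cpumask_and(out, attrs_mask, node_possible_mask))
		/* empty intersection - this node uses the default pwq's mask */
		cpumask_copy(out, attrs_mask);
}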

As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.

This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.

As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control. Concurrency is
usually limited by workqueue users through the number of concurrently
active work items, so this change shouldn't matter much.
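
A minimal user-side sketch of that note (the workqueue and work item
names below are assumed, not from this patch): the max_active passed to
alloc_workqueue() now caps in-flight work items per NUMA node for an
unbound workqueue instead of workqueue-wide.

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/workqueue.h>

static void my_work_fn(struct work_struct *work)
{
	/* ... process data, ideally resident on the issuing node ... */
}
static DECLARE_WORK(my_work, my_work_fn);

static struct workqueue_struct *my_wq;

static int __init my_init(void)
{
	/* the max_active of 16 now applies per NUMA node, not globally */
	my_wq = alloc_workqueue("my_unbound_wq", WQ_UNBOUND, 16);
	if (!my_wq)
		return -ENOMEM;

	/* executes on a CPU of the issuing node when the cpumask allows */
	queue_work(my_wq, &my_work);
	return 0;
}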

v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.

v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.

v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Lai Jiangshan <[email protected]>
---
There was a silly bug in destroy_workqueue(). wq/review-numa branch
updated accordingly.

Thanks.

kernel/workqueue.c | 281 +++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 262 insertions(+), 19 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e656931..3b710cd 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -45,6 +45,7 @@
#include <linux/hashtable.h>
#include <linux/rculist.h>
#include <linux/nodemask.h>
+#include <linux/moduleparam.h>

#include "workqueue_internal.h"

@@ -245,6 +246,7 @@ struct workqueue_struct {
int saved_max_active; /* WQ: saved pwq max_active */

struct workqueue_attrs *unbound_attrs; /* WQ: only for unbound wqs */
+ struct pool_workqueue *dfl_pwq; /* WQ: only for unbound wqs */

#ifdef CONFIG_SYSFS
struct wq_device *wq_dev; /* I: for sysfs interface */
@@ -268,6 +270,9 @@ static cpumask_var_t *wq_numa_possible_cpumask;

static bool wq_numa_enabled; /* unbound NUMA affinity enabled */

+/* buf for wq_update_unbound_numa_attrs(), protected by CPU hotplug exclusion */
+static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;
+
static DEFINE_MUTEX(wq_pool_mutex); /* protects pools and workqueues list */
static DEFINE_SPINLOCK(wq_mayday_lock); /* protects wq->maydays list */

@@ -3710,6 +3715,61 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
return pwq;
}

+/* undo alloc_unbound_pwq(), used only in the error path */
+static void free_unbound_pwq(struct pool_workqueue *pwq)
+{
+ lockdep_assert_held(&wq_pool_mutex);
+
+ if (pwq) {
+ put_unbound_pool(pwq->pool);
+ kfree(pwq);
+ }
+}
+
+/**
+ * wq_calc_node_mask - calculate a wq_attrs' cpumask for the specified node
+ * @attrs: the wq_attrs of interest
+ * @node: the target NUMA node
+ * @cpu_going_down: if >= 0, the CPU to consider as offline
+ * @cpumask: outarg, the resulting cpumask
+ *
+ * Calculate the cpumask a workqueue with @attrs should use on @node. If
+ * @cpu_going_down is >= 0, that cpu is considered offline during
+ * calculation. The result is stored in @cpumask. This function returns
+ * %true if the resulting @cpumask is different from @attrs->cpumask,
+ * %false if equal.
+ *
+ * If NUMA affinity is not enabled, @attrs->cpumask is always used. If
+ * enabled and @node has online CPUs requested by @attrs, the returned
+ * cpumask is the intersection of the possible CPUs of @node and
+ * @attrs->cpumask.
+ *
+ * The caller is responsible for ensuring that the cpumask of @node stays
+ * stable.
+ */
+static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
+ int cpu_going_down, cpumask_t *cpumask)
+{
+ if (!wq_numa_enabled)
+ goto use_dfl;
+
+ /* does @node have any online CPUs @attrs wants? */
+ cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
+ if (cpu_going_down >= 0)
+ cpumask_clear_cpu(cpu_going_down, cpumask);
+
+ if (cpumask_empty(cpumask))
+ goto use_dfl;
+
+ /* yeap, return possible CPUs in @node that @attrs wants */
+ cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
+ return !cpumask_equal(cpumask, attrs->cpumask);
+
+use_dfl:
+ cpumask_copy(cpumask, attrs->cpumask);
+ return false;
+}
+
/* install @pwq into @wq's numa_pwq_tbl[] for @node and return the old pwq */
static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
int node,
@@ -3732,11 +3792,12 @@ static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
* @wq: the target workqueue
* @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
*
- * Apply @attrs to an unbound workqueue @wq. If @attrs doesn't match the
- * current attributes, a new pwq is created and made the first pwq which
- * will serve all new work items. Older pwqs are released as in-flight
- * work items finish. Note that a work item which repeatedly requeues
- * itself back-to-back will stay on its current pwq.
+ * Apply @attrs to an unbound workqueue @wq. Unless disabled, on NUMA
+ * machines, this function maps a separate pwq to each NUMA node with
+ * possibles CPUs in @attrs->cpumask so that work items are affine to the
+ * NUMA node it was issued on. Older pwqs are released as in-flight work
+ * items finish. Note that a work item which repeatedly requeues itself
+ * back-to-back will stay on its current pwq.
*
* Performs GFP_KERNEL allocations. Returns 0 on success and -errno on
* failure.
@@ -3744,8 +3805,8 @@ static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
int apply_workqueue_attrs(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs)
{
- struct workqueue_attrs *new_attrs;
- struct pool_workqueue *pwq, *last_pwq = NULL;
+ struct workqueue_attrs *new_attrs, *tmp_attrs;
+ struct pool_workqueue **pwq_tbl, *dfl_pwq;
int node;

/* only unbound workqueues can change attributes */
@@ -3756,36 +3817,190 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
return -EINVAL;

- /* make a copy of @attrs and sanitize it */
+ pwq_tbl = kzalloc(wq_numa_tbl_len * sizeof(pwq_tbl[0]), GFP_KERNEL);
new_attrs = alloc_workqueue_attrs(GFP_KERNEL);
- if (!new_attrs)
+ tmp_attrs = alloc_workqueue_attrs(GFP_KERNEL);
+ if (!pwq_tbl || !new_attrs || !tmp_attrs)
goto enomem;

+ /* make a copy of @attrs and sanitize it */
copy_workqueue_attrs(new_attrs, attrs);
cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);

+ /*
+ * We may create multiple pwqs with differing cpumasks. Make a
+ * copy of @new_attrs which will be modified and used to obtain
+ * pools.
+ */
+ copy_workqueue_attrs(tmp_attrs, new_attrs);
+
+ /*
+ * CPUs should stay stable across pwq creations and installations.
+ * Pin CPUs, determine the target cpumask for each node and create
+ * pwqs accordingly.
+ */
+ get_online_cpus();
+
mutex_lock(&wq_pool_mutex);
- pwq = alloc_unbound_pwq(wq, new_attrs);
+
+ /*
+ * If something goes wrong during CPU up/down, we'll fall back to
+ * the default pwq covering whole @attrs->cpumask. Always create
+ * it even if we don't use it immediately.
+ */
+ dfl_pwq = alloc_unbound_pwq(wq, new_attrs);
+ if (!dfl_pwq)
+ goto enomem_pwq;
+
+ for_each_node(node) {
+ if (wq_calc_node_cpumask(attrs, node, -1, tmp_attrs->cpumask)) {
+ pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
+ if (!pwq_tbl[node])
+ goto enomem_pwq;
+ } else {
+ dfl_pwq->refcnt++;
+ pwq_tbl[node] = dfl_pwq;
+ }
+ }
+
mutex_unlock(&wq_pool_mutex);
- if (!pwq)
- goto enomem;

+ /* all pwqs have been created successfully, let's install'em */
mutex_lock(&wq->mutex);

copy_workqueue_attrs(wq->unbound_attrs, new_attrs);
+
+ /* save the previous pwq and install the new one */
for_each_node(node)
- last_pwq = numa_pwq_tbl_install(wq, node, pwq);
+ pwq_tbl[node] = numa_pwq_tbl_install(wq, node, pwq_tbl[node]);
+
+ /* @dfl_pwq might not have been used, ensure it's linked */
+ link_pwq(dfl_pwq);
+ swap(wq->dfl_pwq, dfl_pwq);

mutex_unlock(&wq->mutex);

- put_pwq_unlocked(last_pwq);
+ /* put the old pwqs */
+ for_each_node(node)
+ put_pwq_unlocked(pwq_tbl[node]);
+ put_pwq_unlocked(dfl_pwq);
+
+ put_online_cpus();
return 0;

+enomem_pwq:
+ free_unbound_pwq(dfl_pwq);
+ for_each_node(node)
+ if (pwq_tbl && pwq_tbl[node] != dfl_pwq)
+ free_unbound_pwq(pwq_tbl[node]);
+ mutex_unlock(&wq_pool_mutex);
+ put_online_cpus();
enomem:
+ free_workqueue_attrs(tmp_attrs);
free_workqueue_attrs(new_attrs);
+ kfree(pwq_tbl);
return -ENOMEM;
}

+/**
+ * wq_update_unbound_numa - update NUMA affinity of a wq for CPU hot[un]plug
+ * @wq: the target workqueue
+ * @cpu: the CPU coming up or going down
+ * @online: whether @cpu is coming up or going down
+ *
+ * This function is to be called from %CPU_DOWN_PREPARE, %CPU_ONLINE and
+ * %CPU_DOWN_FAILED. @cpu is being hot[un]plugged, update NUMA affinity of
+ * @wq accordingly.
+ *
+ * If NUMA affinity can't be adjusted due to memory allocation failure, it
+ * falls back to @wq->dfl_pwq which may not be optimal but is always
+ * correct.
+ */
+static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
+ bool online)
+{
+ int node = cpu_to_node(cpu);
+ int cpu_off = online ? -1 : cpu;
+ struct pool_workqueue *old_pwq = NULL, *new_pwq = NULL;
+ struct pool_workqueue *pwq;
+ struct workqueue_attrs *target_attrs;
+ cpumask_t *cpumask;
+
+ lockdep_assert_held(&wq_pool_mutex);
+
+ if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND))
+ return;
+
+ /*
+ * We don't wanna alloc/free wq_attrs for each wq for each CPU.
+ * Let's use a preallocated one. The following buf is protected by
+ * CPU hotplug exclusion.
+ */
+ target_attrs = wq_update_unbound_numa_attrs_buf;
+ cpumask = target_attrs->cpumask;
+retry:
+ mutex_lock(&wq->mutex);
+
+ copy_workqueue_attrs(target_attrs, wq->unbound_attrs);
+ pwq = unbound_pwq_by_node(wq, node);
+
+ /*
+ * Let's determine what needs to be done. If the target cpumask is
+ * different from wq's, we need to compare it to @pwq's and create
+ * a new one if they don't match. If the target cpumask equals
+ * wq's, the default pwq should be used. If @pwq is already the
+ * default one, nothing to do; otherwise, install the default one.
+ */
+ if (wq_calc_node_cpumask(wq->unbound_attrs, node, cpu_off, cpumask)) {
+ if (cpumask_equal(cpumask, pwq->pool->attrs->cpumask))
+ goto out_unlock;
+ } else if (pwq != wq->dfl_pwq) {
+ goto use_dfl_pwq;
+ } else {
+ goto out_unlock;
+ }
+
+ /*
+ * Have we already created a new pwq? As we could have raced with
+ * apply_workqueue_attrs(), verify that its attrs match the desired
+ * one before installing.
+ */
+ if (new_pwq && wqattrs_equal(new_pwq->pool->attrs, target_attrs)) {
+ old_pwq = numa_pwq_tbl_install(wq, node, new_pwq);
+ new_pwq = NULL;
+ goto out_unlock;
+ }
+
+ mutex_unlock(&wq->mutex);
+
+ /*
+ * Need to create a new pwq - either this is the first time or we
+ * raced with apply_workqueue_attrs() and our previous new pwq is
+ * no longer valid. Dispose of the previous one if exists and
+ * create a new one.
+ */
+ if (new_pwq)
+ free_unbound_pwq(new_pwq);
+
+ new_pwq = alloc_unbound_pwq(wq, target_attrs);
+ if (new_pwq)
+ goto retry;
+
+ pr_warning("workqueue: allocation failed while updating NUMA affinity of \"%s\"\n",
+ wq->name);
+ mutex_lock(&wq->mutex);
+ /* fall through */
+use_dfl_pwq:
+ spin_lock_irq(&wq->dfl_pwq->pool->lock);
+ get_pwq(wq->dfl_pwq);
+ spin_unlock_irq(&wq->dfl_pwq->pool->lock);
+ old_pwq = numa_pwq_tbl_install(wq, node, wq->dfl_pwq);
+out_unlock:
+ mutex_unlock(&wq->mutex);
+ put_pwq_unlocked(old_pwq);
+ free_unbound_pwq(new_pwq);
+}
+
static int alloc_and_link_pwqs(struct workqueue_struct *wq)
{
bool highpri = wq->flags & WQ_HIGHPRI;
@@ -3938,6 +4153,7 @@ EXPORT_SYMBOL_GPL(__alloc_workqueue_key);
void destroy_workqueue(struct workqueue_struct *wq)
{
struct pool_workqueue *pwq;
+ int node;

/* drain it before proceeding with destruction */
drain_workqueue(wq);
@@ -3989,11 +4205,21 @@ void destroy_workqueue(struct workqueue_struct *wq)
} else {
/*
* We're the sole accessor of @wq at this point. Directly
- * access the first pwq and put the base ref. @wq will be
- * freed when the last pwq is released.
+ * access numa_pwq_tbl[] and dfl_pwq to put the base refs.
+ * @wq will be freed when the last pwq is released.
+ */
+ for_each_node(node) {
+ pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
+ RCU_INIT_POINTER(wq->numa_pwq_tbl[node], NULL);
+ put_pwq_unlocked(pwq);
+ }
+
+ /*
+ * Put dfl_pwq. @wq may be freed any time after dfl_pwq is
+ * put. Don't access it afterwards.
*/
- pwq = list_first_entry(&wq->pwqs, struct pool_workqueue,
- pwqs_node);
+ pwq = wq->dfl_pwq;
+ wq->dfl_pwq = NULL;
put_pwq_unlocked(pwq);
}
}
@@ -4281,6 +4507,7 @@ static int __cpuinit workqueue_cpu_up_callback(struct notifier_block *nfb,
{
int cpu = (unsigned long)hcpu;
struct worker_pool *pool;
+ struct workqueue_struct *wq;
int pi;

switch (action & ~CPU_TASKS_FROZEN) {
@@ -4313,6 +4540,10 @@ static int __cpuinit workqueue_cpu_up_callback(struct notifier_block *nfb,
mutex_unlock(&pool->manager_mutex);
}

+ /* update NUMA affinity of unbound workqueues */
+ list_for_each_entry(wq, &workqueues, list)
+ wq_update_unbound_numa(wq, cpu, true);
+
mutex_unlock(&wq_pool_mutex);
break;
}
@@ -4329,12 +4560,21 @@ static int __cpuinit workqueue_cpu_down_callback(struct notifier_block *nfb,
{
int cpu = (unsigned long)hcpu;
struct work_struct unbind_work;
+ struct workqueue_struct *wq;

switch (action & ~CPU_TASKS_FROZEN) {
case CPU_DOWN_PREPARE:
- /* unbinding should happen on the local CPU */
+ /* unbinding per-cpu workers should happen on the local CPU */
INIT_WORK_ONSTACK(&unbind_work, wq_unbind_fn);
queue_work_on(cpu, system_highpri_wq, &unbind_work);
+
+ /* update NUMA affinity of unbound workqueues */
+ mutex_lock(&wq_pool_mutex);
+ list_for_each_entry(wq, &workqueues, list)
+ wq_update_unbound_numa(wq, cpu, false);
+ mutex_unlock(&wq_pool_mutex);
+
+ /* wait for per-cpu unbinding to finish */
flush_work(&unbind_work);
break;
}
@@ -4522,6 +4762,9 @@ static void __init wq_numa_init(void)
if (num_possible_nodes() <= 1)
return;

+ wq_update_unbound_numa_attrs_buf = alloc_workqueue_attrs(GFP_KERNEL);
+ BUG_ON(!wq_update_unbound_numa_attrs_buf);
+
/*
* We want masks of possible CPUs of each node which isn't readily
* available. Build one from cpu_to_node() which should have been
--
1.8.1.4

2013-03-30 17:23:46

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH v4 13/14] workqueue: implement NUMA affinity for unbound workqueues

On 31/03/13 00:32, Tejun Heo wrote:
> Hello, Lai.
>
>
> On Sat, Mar 30, 2013 at 9:13 AM, Lai Jiangshan <[email protected]> wrote:
>
>
> + /* all pwqs have been created successfully, let's install'em */
> mutex_lock(&wq->mutex);
>
> copy_workqueue_attrs(wq->unbound_attrs, new_attrs);
> +
> + /* save the previous pwq and install the new one */
> for_each_node(node)
> - last_pwq = numa_pwq_tbl_install(wq, node, pwq);
> + pwq_tbl[node] = numa_pwq_tbl_install(wq, node, pwq_tbl[node]);
> +
> + /* @dfl_pwq might not have been used, ensure it's linked */
> + link_pwq(dfl_pwq);
> + swap(wq->dfl_pwq, dfl_pwq);
>
> mutex_unlock(&wq->mutex);
>
> - put_pwq_unlocked(last_pwq);
> + /* put the old pwqs */
> + for_each_node(node)
> + put_pwq_unlocked(pwq_tbl[node]);
> + put_pwq_unlocked(dfl_pwq);
> +
> + put_online_cpus();
> return 0;
>
>
>
> Forgot to free new_attrs in previous patch
> (workqueue: fix unbound workqueue attrs hashing / comparison).
>
> Forgot to free tmp_attrs, pwq_tbl in this patch.
>
>
> Right, will fix.
>
> +retry:
> + mutex_lock(&wq->mutex);
> +
> + copy_workqueue_attrs(target_attrs, wq->unbound_attrs);
> + pwq = unbound_pwq_by_node(wq, node);
> +
> + /*
> + * Let's determine what needs to be done. If the target cpumask is
> + * different from wq's, we need to compare it to @pwq's and create
> + * a new one if they don't match. If the target cpumask equals
> + * wq's, the default pwq should be used. If @pwq is already the
> + * default one, nothing to do; otherwise, install the default one.
> + */
> + if (wq_calc_node_cpumask(wq->unbound_attrs, node, cpu_off, cpumask)) {
> + if (cpumask_equal(cpumask, pwq->pool->attrs->cpumask))
> + goto out_unlock;
> + } else if (pwq != wq->dfl_pwq) {
> + goto use_dfl_pwq;
> + } else {
> + goto out_unlock;
> + }
> +
> + /*
> + * Have we already created a new pwq? As we could have raced with
> + * apply_workqueue_attrs(), verify that its attrs match the desired
> + * one before installing.
> + */
>
>
> I don't see any race since there is get/put_online_cpu() in apply_workqueue_attrs().
>
>
> I don't know. I kinda want wq exclusion to be self-contained, but yeah the hotplug exclusion here is *almost* explicit so maybe it would be better to depend on it. Will think about it.
>
> + mutex_unlock(&wq->mutex);
> + put_pwq_unlocked(old_pwq);
> + free_unbound_pwq(new_pwq);
> +}
>
>
> OK, your solution is what I suggested: swapping dfl_pwq <-> node pwq.
> But when the last CPU of the node (of the wq) is going offline,
> you need to handle the work items of the node pwq (old_pwq in the code).
>
> You may handle the works which are still queued by migrating them, OR by
> flushing them,
> and you may handle busy works by temporarily changing the cpumask of
> the workers, OR by flushing the busy works.
>
>
> I don't think that's necessary.

Please document it.

> It's not like we have hard guarantee on attr changes anyway.
> Self-requeueing work items can get stuck with old attributes for quite a while,

It is OK as long as it is documented.

> and even per-cpu work items get migrated to other CPUs on CPU DOWN.

It is expected.

But for an unbound wq across CPU hotplug:
w/o NUMA affinity, works always stay on the wq's cpumask as long as some CPU in that cpumask is online;
w/ NUMA affinity, that is NOT always true, even when such a CPU is online.

> Workqueue's affinity guarantee is very specific - the work item owner is
> responsible for flushing the work item during CPU DOWN if it wants
> to guarantee affinity over full execution.

Could you add the comments and add Reviewed-by: Lai Jiangshan <[email protected]>
for the patchset?

Thanks,
Lai

2013-03-31 19:06:38

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v4 13/14] workqueue: implement NUMA affinity for unbound workqueues

Hello, Lai.

On Sun, Mar 31, 2013 at 01:23:46AM +0800, Lai Jiangshan wrote:
> But for an unbound wq across CPU hotplug:
> w/o NUMA affinity, works always stay on the wq's cpumask as long as some CPU in that cpumask is online;
> w/ NUMA affinity, that is NOT always true, even when such a CPU is online.

Yeah, this is rather unfortunate, but the cpumask for unbound workqueues is
a completely new thing anyway and I think providing a guarantee similar
to the per-cpu one should be enough. Things are much simpler that
way, and requiring users which depend on hard affinity to take care of
flushing is reasonable enough and in line with how workqueue has
traditionally been working.
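
For concreteness, a hedged sketch of that user-side responsibility (the
notifier and work item names here are hypothetical): a user that needs
strict affinity over full execution flushes its work item from its own
CPU_DOWN_PREPARE callback.

#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/workqueue.h>

static struct work_struct my_numa_work;		/* assumed to exist */

static int my_cpu_callback(struct notifier_block *nb, unsigned long action,
			   void *hcpu)
{
	/* wait for the work item to finish before the CPU goes away */
	if ((action & ~CPU_TASKS_FROZEN) == CPU_DOWN_PREPARE)
		flush_work(&my_numa_work);
	return NOTIFY_OK;
}

static struct notifier_block my_cpu_nb = {
	.notifier_call	= my_cpu_callback,
};

/* somewhere in init code: register_hotcpu_notifier(&my_cpu_nb); */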

> > Workqueue's affinity guarantee is very specific - the work item owner is
> > responsible for flushing the work item during CPU DOWN if it wants
> > to guarantee affinity over full execution.
>
> Could you add the comments and add Reviewed-by: Lai Jiangshan <[email protected]>
> for the patchset?

Sure thing.

Thanks.

--
tejun

2013-04-01 18:28:07

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 0.5/14] workqueue: fix memory leak in apply_workqueue_attrs()

From 4862125b0256a946d2749a1d5003b0604bc3cb4d Mon Sep 17 00:00:00 2001
From: Tejun Heo <[email protected]>
Date: Mon, 1 Apr 2013 11:23:31 -0700

apply_workqueue_attrs() wasn't freeing temp attrs variable @new_attrs
in its success path. Fix it.
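
The fix below switches the function to a single exit path that frees
@new_attrs on success and failure alike; a generic sketch of that idiom
(names invented for illustration):

#include <linux/errno.h>
#include <linux/slab.h>

static int do_real_work(void *buf);	/* assumed helper */

static int example_apply(void)
{
	void *tmp;
	int ret;

	tmp = kzalloc(16, GFP_KERNEL);
	if (!tmp)
		goto enomem;

	ret = do_real_work(tmp);
	/* fall through */
out_free:
	kfree(tmp);		/* runs on success and on error alike */
	return ret;

enomem:
	ret = -ENOMEM;
	goto out_free;
}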

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Lai Jiangshan <[email protected]>
---
This causes some minor conflicts down the series but nothing
noteworthy. I'm not posting the whole refreshed series. If you want
it, holler.

Thanks.

kernel/workqueue.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index abe1f0d..89480fc 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3636,6 +3636,7 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
struct workqueue_attrs *new_attrs;
struct pool_workqueue *pwq = NULL, *last_pwq;
struct worker_pool *pool;
+ int ret;

/* only unbound workqueues can change attributes */
if (WARN_ON(!(wq->flags & WQ_UNBOUND)))
@@ -3668,12 +3669,16 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
spin_unlock_irq(&last_pwq->pool->lock);
}

- return 0;
+ ret = 0;
+ /* fall through */
+out_free:
+ free_workqueue_attrs(new_attrs);
+ return ret;

enomem:
kmem_cache_free(pwq_cache, pwq);
- free_workqueue_attrs(new_attrs);
- return -ENOMEM;
+ ret = -ENOMEM;
+ goto out_free;
}

static int alloc_and_link_pwqs(struct workqueue_struct *wq)
--
1.8.1.4

2013-04-01 18:28:50

by Tejun Heo

[permalink] [raw]
Subject: [PATCH v5 13/14] workqueue: implement NUMA affinity for unbound workqueues

From 4c16bd327c74d6678858706211a0c6e4e53eb3e6 Mon Sep 17 00:00:00 2001
From: Tejun Heo <[email protected]>
Date: Mon, 1 Apr 2013 11:23:36 -0700

Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.

This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.

As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.

This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.

As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control. Concurrency is
usually limited by workqueue users through the number of concurrently
active work items, so this change shouldn't matter much.

v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.

v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.

v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.

v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success. Fixed.

Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug. Removed.

Documentation on CPU affinity guarantee on CPU_DOWN added.

All changes are suggested by Lai Jiangshan.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 278 +++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 259 insertions(+), 19 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d9a4aeb..57cd77d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -45,6 +45,7 @@
#include <linux/hashtable.h>
#include <linux/rculist.h>
#include <linux/nodemask.h>
+#include <linux/moduleparam.h>

#include "workqueue_internal.h"

@@ -245,6 +246,7 @@ struct workqueue_struct {
int saved_max_active; /* WQ: saved pwq max_active */

struct workqueue_attrs *unbound_attrs; /* WQ: only for unbound wqs */
+ struct pool_workqueue *dfl_pwq; /* WQ: only for unbound wqs */

#ifdef CONFIG_SYSFS
struct wq_device *wq_dev; /* I: for sysfs interface */
@@ -268,6 +270,9 @@ static cpumask_var_t *wq_numa_possible_cpumask;

static bool wq_numa_enabled; /* unbound NUMA affinity enabled */

+/* buf for wq_update_unbound_numa_attrs(), protected by CPU hotplug exclusion */
+static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;
+
static DEFINE_MUTEX(wq_pool_mutex); /* protects pools and workqueues list */
static DEFINE_SPINLOCK(wq_mayday_lock); /* protects wq->maydays list */

@@ -3710,6 +3715,61 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
return pwq;
}

+/* undo alloc_unbound_pwq(), used only in the error path */
+static void free_unbound_pwq(struct pool_workqueue *pwq)
+{
+ lockdep_assert_held(&wq_pool_mutex);
+
+ if (pwq) {
+ put_unbound_pool(pwq->pool);
+ kfree(pwq);
+ }
+}
+
+/**
+ * wq_calc_node_mask - calculate a wq_attrs' cpumask for the specified node
+ * @attrs: the wq_attrs of interest
+ * @node: the target NUMA node
+ * @cpu_going_down: if >= 0, the CPU to consider as offline
+ * @cpumask: outarg, the resulting cpumask
+ *
+ * Calculate the cpumask a workqueue with @attrs should use on @node. If
+ * @cpu_going_down is >= 0, that cpu is considered offline during
+ * calculation. The result is stored in @cpumask. This function returns
+ * %true if the resulting @cpumask is different from @attrs->cpumask,
+ * %false if equal.
+ *
+ * If NUMA affinity is not enabled, @attrs->cpumask is always used. If
+ * enabled and @node has online CPUs requested by @attrs, the returned
+ * cpumask is the intersection of the possible CPUs of @node and
+ * @attrs->cpumask.
+ *
+ * The caller is responsible for ensuring that the cpumask of @node stays
+ * stable.
+ */
+static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
+ int cpu_going_down, cpumask_t *cpumask)
+{
+ if (!wq_numa_enabled)
+ goto use_dfl;
+
+ /* does @node have any online CPUs @attrs wants? */
+ cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
+ if (cpu_going_down >= 0)
+ cpumask_clear_cpu(cpu_going_down, cpumask);
+
+ if (cpumask_empty(cpumask))
+ goto use_dfl;
+
+ /* yeap, return possible CPUs in @node that @attrs wants */
+ cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
+ return !cpumask_equal(cpumask, attrs->cpumask);
+
+use_dfl:
+ cpumask_copy(cpumask, attrs->cpumask);
+ return false;
+}
+
/* install @pwq into @wq's numa_pwq_tbl[] for @node and return the old pwq */
static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
int node,
@@ -3732,11 +3792,12 @@ static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
* @wq: the target workqueue
* @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
*
- * Apply @attrs to an unbound workqueue @wq. If @attrs doesn't match the
- * current attributes, a new pwq is created and made the first pwq which
- * will serve all new work items. Older pwqs are released as in-flight
- * work items finish. Note that a work item which repeatedly requeues
- * itself back-to-back will stay on its current pwq.
+ * Apply @attrs to an unbound workqueue @wq. Unless disabled, on NUMA
+ * machines, this function maps a separate pwq to each NUMA node with
+ * possibles CPUs in @attrs->cpumask so that work items are affine to the
+ * NUMA node it was issued on. Older pwqs are released as in-flight work
+ * items finish. Note that a work item which repeatedly requeues itself
+ * back-to-back will stay on its current pwq.
*
* Performs GFP_KERNEL allocations. Returns 0 on success and -errno on
* failure.
@@ -3744,8 +3805,8 @@ static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
int apply_workqueue_attrs(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs)
{
- struct workqueue_attrs *new_attrs;
- struct pool_workqueue *pwq, *last_pwq = NULL;
+ struct workqueue_attrs *new_attrs, *tmp_attrs;
+ struct pool_workqueue **pwq_tbl, *dfl_pwq;
int node, ret;

/* only unbound workqueues can change attributes */
@@ -3756,40 +3817,191 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
return -EINVAL;

- /* make a copy of @attrs and sanitize it */
+ pwq_tbl = kzalloc(wq_numa_tbl_len * sizeof(pwq_tbl[0]), GFP_KERNEL);
new_attrs = alloc_workqueue_attrs(GFP_KERNEL);
- if (!new_attrs)
+ tmp_attrs = alloc_workqueue_attrs(GFP_KERNEL);
+ if (!pwq_tbl || !new_attrs || !tmp_attrs)
goto enomem;

+ /* make a copy of @attrs and sanitize it */
copy_workqueue_attrs(new_attrs, attrs);
cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);

+ /*
+ * We may create multiple pwqs with differing cpumasks. Make a
+ * copy of @new_attrs which will be modified and used to obtain
+ * pools.
+ */
+ copy_workqueue_attrs(tmp_attrs, new_attrs);
+
+ /*
+ * CPUs should stay stable across pwq creations and installations.
+ * Pin CPUs, determine the target cpumask for each node and create
+ * pwqs accordingly.
+ */
+ get_online_cpus();
+
mutex_lock(&wq_pool_mutex);
- pwq = alloc_unbound_pwq(wq, new_attrs);
+
+ /*
+ * If something goes wrong during CPU up/down, we'll fall back to
+ * the default pwq covering whole @attrs->cpumask. Always create
+ * it even if we don't use it immediately.
+ */
+ dfl_pwq = alloc_unbound_pwq(wq, new_attrs);
+ if (!dfl_pwq)
+ goto enomem_pwq;
+
+ for_each_node(node) {
+ if (wq_calc_node_cpumask(attrs, node, -1, tmp_attrs->cpumask)) {
+ pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
+ if (!pwq_tbl[node])
+ goto enomem_pwq;
+ } else {
+ dfl_pwq->refcnt++;
+ pwq_tbl[node] = dfl_pwq;
+ }
+ }
+
mutex_unlock(&wq_pool_mutex);
- if (!pwq)
- goto enomem;

+ /* all pwqs have been created successfully, let's install'em */
mutex_lock(&wq->mutex);

copy_workqueue_attrs(wq->unbound_attrs, new_attrs);
+
+ /* save the previous pwq and install the new one */
for_each_node(node)
- last_pwq = numa_pwq_tbl_install(wq, node, pwq);
+ pwq_tbl[node] = numa_pwq_tbl_install(wq, node, pwq_tbl[node]);
+
+ /* @dfl_pwq might not have been used, ensure it's linked */
+ link_pwq(dfl_pwq);
+ swap(wq->dfl_pwq, dfl_pwq);

mutex_unlock(&wq->mutex);

- put_pwq_unlocked(last_pwq);
+ /* put the old pwqs */
+ for_each_node(node)
+ put_pwq_unlocked(pwq_tbl[node]);
+ put_pwq_unlocked(dfl_pwq);
+
+ put_online_cpus();
ret = 0;
/* fall through */
out_free:
+ free_workqueue_attrs(tmp_attrs);
free_workqueue_attrs(new_attrs);
+ kfree(pwq_tbl);
return ret;

+enomem_pwq:
+ free_unbound_pwq(dfl_pwq);
+ for_each_node(node)
+ if (pwq_tbl && pwq_tbl[node] != dfl_pwq)
+ free_unbound_pwq(pwq_tbl[node]);
+ mutex_unlock(&wq_pool_mutex);
+ put_online_cpus();
enomem:
ret = -ENOMEM;
goto out_free;
}

+/**
+ * wq_update_unbound_numa - update NUMA affinity of a wq for CPU hot[un]plug
+ * @wq: the target workqueue
+ * @cpu: the CPU coming up or going down
+ * @online: whether @cpu is coming up or going down
+ *
+ * This function is to be called from %CPU_DOWN_PREPARE, %CPU_ONLINE and
+ * %CPU_DOWN_FAILED. @cpu is being hot[un]plugged, update NUMA affinity of
+ * @wq accordingly.
+ *
+ * If NUMA affinity can't be adjusted due to memory allocation failure, it
+ * falls back to @wq->dfl_pwq which may not be optimal but is always
+ * correct.
+ *
+ * Note that when the last allowed CPU of a NUMA node goes offline for a
+ * workqueue with a cpumask spanning multiple nodes, the workers which were
+ * already executing the work items for the workqueue will lose their CPU
+ * affinity and may execute on any CPU. This is similar to how per-cpu
+ * workqueues behave on CPU_DOWN. If a workqueue user wants strict
+ * affinity, it's the user's responsibility to flush the work item from
+ * CPU_DOWN_PREPARE.
+ */
+static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
+ bool online)
+{
+ int node = cpu_to_node(cpu);
+ int cpu_off = online ? -1 : cpu;
+ struct pool_workqueue *old_pwq = NULL, *pwq;
+ struct workqueue_attrs *target_attrs;
+ cpumask_t *cpumask;
+
+ lockdep_assert_held(&wq_pool_mutex);
+
+ if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND))
+ return;
+
+ /*
+ * We don't wanna alloc/free wq_attrs for each wq for each CPU.
+ * Let's use a preallocated one. The following buf is protected by
+ * CPU hotplug exclusion.
+ */
+ target_attrs = wq_update_unbound_numa_attrs_buf;
+ cpumask = target_attrs->cpumask;
+
+ mutex_lock(&wq->mutex);
+
+ copy_workqueue_attrs(target_attrs, wq->unbound_attrs);
+ pwq = unbound_pwq_by_node(wq, node);
+
+ /*
+ * Let's determine what needs to be done. If the target cpumask is
+ * different from wq's, we need to compare it to @pwq's and create
+ * a new one if they don't match. If the target cpumask equals
+ * wq's, the default pwq should be used. If @pwq is already the
+ * default one, nothing to do; otherwise, install the default one.
+ */
+ if (wq_calc_node_cpumask(wq->unbound_attrs, node, cpu_off, cpumask)) {
+ if (cpumask_equal(cpumask, pwq->pool->attrs->cpumask))
+ goto out_unlock;
+ } else {
+ if (pwq == wq->dfl_pwq)
+ goto out_unlock;
+ else
+ goto use_dfl_pwq;
+ }
+
+ mutex_unlock(&wq->mutex);
+
+ /* create a new pwq */
+ pwq = alloc_unbound_pwq(wq, target_attrs);
+ if (!pwq) {
+ pr_warning("workqueue: allocation failed while updating NUMA affinity of \"%s\"\n",
+ wq->name);
+ goto out_unlock;
+ }
+
+ /*
+ * Install the new pwq. As this function is called only from CPU
+ * hotplug callbacks and applying a new attrs is wrapped with
+ * get/put_online_cpus(), @wq->unbound_attrs couldn't have changed
+ * inbetween.
+ */
+ mutex_lock(&wq->mutex);
+ old_pwq = numa_pwq_tbl_install(wq, node, pwq);
+ goto out_unlock;
+
+use_dfl_pwq:
+ spin_lock_irq(&wq->dfl_pwq->pool->lock);
+ get_pwq(wq->dfl_pwq);
+ spin_unlock_irq(&wq->dfl_pwq->pool->lock);
+ old_pwq = numa_pwq_tbl_install(wq, node, wq->dfl_pwq);
+out_unlock:
+ mutex_unlock(&wq->mutex);
+ put_pwq_unlocked(old_pwq);
+}
+
static int alloc_and_link_pwqs(struct workqueue_struct *wq)
{
bool highpri = wq->flags & WQ_HIGHPRI;
@@ -3942,6 +4154,7 @@ EXPORT_SYMBOL_GPL(__alloc_workqueue_key);
void destroy_workqueue(struct workqueue_struct *wq)
{
struct pool_workqueue *pwq;
+ int node;

/* drain it before proceeding with destruction */
drain_workqueue(wq);
@@ -3993,11 +4206,21 @@ void destroy_workqueue(struct workqueue_struct *wq)
} else {
/*
* We're the sole accessor of @wq at this point. Directly
- * access the first pwq and put the base ref. @wq will be
- * freed when the last pwq is released.
+ * access numa_pwq_tbl[] and dfl_pwq to put the base refs.
+ * @wq will be freed when the last pwq is released.
*/
- pwq = list_first_entry(&wq->pwqs, struct pool_workqueue,
- pwqs_node);
+ for_each_node(node) {
+ pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
+ RCU_INIT_POINTER(wq->numa_pwq_tbl[node], NULL);
+ put_pwq_unlocked(pwq);
+ }
+
+ /*
+ * Put dfl_pwq. @wq may be freed any time after dfl_pwq is
+ * put. Don't access it afterwards.
+ */
+ pwq = wq->dfl_pwq;
+ wq->dfl_pwq = NULL;
put_pwq_unlocked(pwq);
}
}
@@ -4285,6 +4508,7 @@ static int __cpuinit workqueue_cpu_up_callback(struct notifier_block *nfb,
{
int cpu = (unsigned long)hcpu;
struct worker_pool *pool;
+ struct workqueue_struct *wq;
int pi;

switch (action & ~CPU_TASKS_FROZEN) {
@@ -4317,6 +4541,10 @@ static int __cpuinit workqueue_cpu_up_callback(struct notifier_block *nfb,
mutex_unlock(&pool->manager_mutex);
}

+ /* update NUMA affinity of unbound workqueues */
+ list_for_each_entry(wq, &workqueues, list)
+ wq_update_unbound_numa(wq, cpu, true);
+
mutex_unlock(&wq_pool_mutex);
break;
}
@@ -4333,12 +4561,21 @@ static int __cpuinit workqueue_cpu_down_callback(struct notifier_block *nfb,
{
int cpu = (unsigned long)hcpu;
struct work_struct unbind_work;
+ struct workqueue_struct *wq;

switch (action & ~CPU_TASKS_FROZEN) {
case CPU_DOWN_PREPARE:
- /* unbinding should happen on the local CPU */
+ /* unbinding per-cpu workers should happen on the local CPU */
INIT_WORK_ONSTACK(&unbind_work, wq_unbind_fn);
queue_work_on(cpu, system_highpri_wq, &unbind_work);
+
+ /* update NUMA affinity of unbound workqueues */
+ mutex_lock(&wq_pool_mutex);
+ list_for_each_entry(wq, &workqueues, list)
+ wq_update_unbound_numa(wq, cpu, false);
+ mutex_unlock(&wq_pool_mutex);
+
+ /* wait for per-cpu unbinding to finish */
flush_work(&unbind_work);
break;
}
@@ -4526,6 +4763,9 @@ static void __init wq_numa_init(void)
if (num_possible_nodes() <= 1)
return;

+ wq_update_unbound_numa_attrs_buf = alloc_workqueue_attrs(GFP_KERNEL);
+ BUG_ON(!wq_update_unbound_numa_attrs_buf);
+
/*
* We want masks of possible CPUs of each node which isn't readily
* available. Build one from cpu_to_node() which should have been
--
1.8.1.4

2013-04-01 18:29:28

by Tejun Heo

[permalink] [raw]
Subject: Re: Subject: [PATCHSET v2 wq/for-3.10] workqueue: NUMA affinity for unbound workqueues

Applied to wq/for-3.10 with the updated patches and Lai's
Reviewed-by's added.

Thanks.

--
tejun