2013-09-05 05:12:59

by Li Bin

Subject: [PATCH] workqueue: fix ordered workqueue in multi NUMA nodes platform

On platforms with multiple NUMA nodes, there is no ordering guarantee
for workqueues created by alloc_ordered_workqueue().

Add an ordered_pwq member to struct workqueue_struct to hold the first
pwq that gets picked, so that the ordering guarantee is not broken when
work items are queued from CPUs on different NUMA nodes.
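To illustrate the affected pattern, here is a minimal sketch (a
hypothetical test module, not part of this patch): on an ordered
workqueue the two items below are expected to run strictly one after
the other, even though they are queued from CPUs that may sit on
different NUMA nodes.

#include <linux/module.h>
#include <linux/smp.h>
#include <linux/workqueue.h>

static struct workqueue_struct *demo_wq;

static void demo_work_fn(struct work_struct *work)
{
        pr_info("work %p ran on cpu %d\n", work, raw_smp_processor_id());
}

static DECLARE_WORK(demo_work_a, demo_work_fn);
static DECLARE_WORK(demo_work_b, demo_work_fn);

static int __init demo_init(void)
{
        demo_wq = alloc_ordered_workqueue("demo_ordered", 0);
        if (!demo_wq)
                return -ENOMEM;

        /*
         * Queue from the first and the last online CPU, which on a
         * multi-node machine typically belong to different nodes
         * (assumes contiguous CPU numbering; this is only a sketch).
         * Without a fix, each node has its own pool_workqueue, so the
         * two items may run concurrently and out of order.
         */
        queue_work_on(0, demo_wq, &demo_work_a);
        queue_work_on(num_online_cpus() - 1, demo_wq, &demo_work_b);

        flush_workqueue(demo_wq);
        return 0;
}

static void __exit demo_exit(void)
{
        destroy_workqueue(demo_wq);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");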

Signed-off-by: Libin <[email protected]>
---
kernel/workqueue.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 29b7985..42c6c29 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -249,6 +249,7 @@ struct workqueue_struct {

        struct workqueue_attrs *unbound_attrs; /* WQ: only for unbound wqs */
        struct pool_workqueue *dfl_pwq;        /* WQ: only for unbound wqs */
+       struct pool_workqueue *ordered_pwq;    /* WQ: only for ordered wqs */

#ifdef CONFIG_SYSFS
        struct wq_device *wq_dev;              /* I: for sysfs interface */
@@ -1326,10 +1327,19 @@ retry:

        /* pwq which will be used unless @work is executing elsewhere */
        if (!(wq->flags & WQ_UNBOUND))
-               pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
-       else
-               pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
+               pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
+       else {
+               pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
+               if (wq->flags & __WQ_ORDERED) {
+                       mutex_lock(&wq->mutex);
+                       if (wq->ordered_pwq == NULL)
+                               wq->ordered_pwq = pwq;
+                       else
+                               pwq = wq->ordered_pwq;
+                       mutex_unlock(&wq->mutex);
+               }

+       }
/*
* If @work was previously on a different pool, it might still be
* running there, in which case the work needs to be queued on that
--
1.8.2.1


2013-09-05 16:37:37

by Tejun Heo

Subject: [PATCH] workqueue: fix ordered workqueue in NUMA setups

Hello, Libin.

Thanks a lot for the report and the possible fix. I took a different
approach, as ensuring the pwqs are set up correctly during creation is
cleaner and lower overhead. Can you please consider the following when
submitting patches in the future?

* Please use your full name in SOBs.

* Please use tab characters for indentation instead of spaces, and a
tab should be 8 columns wide. Using an editor configured to treat a
tab as four spaces is strongly discouraged.

Thanks.

----- 8< ------
An ordered workqueue implements execution ordering by using a single
pool_workqueue with max_active == 1. On a given pool_workqueue, work
items are processed in FIFO order, and limiting max_active to 1
ensures that queued work items are processed one at a time.
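For reference, alloc_ordered_workqueue() is by this point just a thin
wrapper (shown roughly as it appears in include/linux/workqueue.h), i.e.
an unbound workqueue flagged __WQ_ORDERED with max_active fixed at 1:

/* roughly, from include/linux/workqueue.h: an ordered wq is an unbound
 * wq with __WQ_ORDERED set and max_active hard-wired to 1 */
#define alloc_ordered_workqueue(fmt, flags, args...)                   \
        alloc_workqueue(fmt, WQ_UNBOUND | __WQ_ORDERED | (flags), 1, ##args)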

Unfortunately, 4c16bd327c ("workqueue: implement NUMA affinity for
unbound workqueues") accidentally broke this guarantee by applying
NUMA affinity to ordered workqueues too. On NUMA setups, an ordered
workqueue would end up with separate pool_workqueues for different
nodes. Each pool_workqueue still limits max_active to 1 but multiple
work items may be executed concurrently and out of order depending on
which node they are queued to.

Fix it by using dedicated ordered_wq_attrs[] when creating ordered
workqueues. The new attrs match the unbound ones except that no_numa
is always set, forcing all NUMA nodes to share the default
pool_workqueue.
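To see why that is sufficient: on the unbound path, __queue_work() picks
the pwq by the queueing CPU's node (via unbound_pwq_by_node(), as visible
in the hunk of the first patch above), and with NUMA turned off
apply_workqueue_attrs() installs dfl_pwq into every slot of that per-node
table. A simplified model of the lookup (paraphrased, not verbatim
kernel code; pwq_for_node is a made-up name for illustration):

/* Simplified model, not verbatim kernel code: per-node pwq lookup on
 * the unbound queueing path.  With attrs->no_numa set, every slot of
 * numa_pwq_tbl[] holds wq->dfl_pwq, so CPUs on any node reach the same
 * max_active == 1 pool_workqueue and FIFO ordering is preserved. */
static struct pool_workqueue *pwq_for_node(struct workqueue_struct *wq,
                                           int node)
{
        return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
}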

While at it, add a sanity check in the workqueue creation path which
verifies that an ordered workqueue has only the default
pool_workqueue.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Libin <[email protected]>
Cc: [email protected]
Cc: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 29b7985..9c91b1d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -305,6 +305,9 @@ static DEFINE_HASHTABLE(unbound_pool_hash, UNBOUND_POOL_HASH_ORDER);
/* I: attributes used when instantiating standard unbound pools on demand */
static struct workqueue_attrs *unbound_std_wq_attrs[NR_STD_WORKER_POOLS];

+/* I: attributes used when instantiating ordered pools on demand */
+static struct workqueue_attrs *ordered_wq_attrs[NR_STD_WORKER_POOLS];
+
struct workqueue_struct *system_wq __read_mostly;
EXPORT_SYMBOL(system_wq);
struct workqueue_struct *system_highpri_wq __read_mostly;
@@ -4084,7 +4087,7 @@ out_unlock:
static int alloc_and_link_pwqs(struct workqueue_struct *wq)
{
        bool highpri = wq->flags & WQ_HIGHPRI;
-       int cpu;
+       int cpu, ret;

        if (!(wq->flags & WQ_UNBOUND)) {
                wq->cpu_pwqs = alloc_percpu(struct pool_workqueue);
@@ -4104,6 +4107,13 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
                        mutex_unlock(&wq->mutex);
                }
                return 0;
+       } else if (wq->flags & __WQ_ORDERED) {
+               ret = apply_workqueue_attrs(wq, ordered_wq_attrs[highpri]);
+               /* there should only be single pwq for ordering guarantee */
+               WARN(!ret && (wq->pwqs.next != &wq->dfl_pwq->pwqs_node ||
+                             wq->pwqs.prev != &wq->dfl_pwq->pwqs_node),
+                    "ordering guarantee broken for workqueue %s\n", wq->name);
+               return ret;
        } else {
                return apply_workqueue_attrs(wq, unbound_std_wq_attrs[highpri]);
        }
@@ -5026,13 +5036,23 @@ static int __init init_workqueues(void)
                }
        }

-       /* create default unbound wq attrs */
+       /* create default unbound and ordered wq attrs */
        for (i = 0; i < NR_STD_WORKER_POOLS; i++) {
                struct workqueue_attrs *attrs;

                BUG_ON(!(attrs = alloc_workqueue_attrs(GFP_KERNEL)));
                attrs->nice = std_nice[i];
                unbound_std_wq_attrs[i] = attrs;
+
+               /*
+                * An ordered wq should have only one pwq as ordering is
+                * guaranteed by max_active which is enforced by pwqs.
+                * Turn off NUMA so that dfl_pwq is used for all nodes.
+                */
+               BUG_ON(!(attrs = alloc_workqueue_attrs(GFP_KERNEL)));
+               attrs->nice = std_nice[i];
+               attrs->no_numa = true;
+               ordered_wq_attrs[i] = attrs;
        }

        system_wq = alloc_workqueue("events", 0, 0);