2015-05-11 09:32:33

by Lai Jiangshan

Subject: [PATCH 0/5] workqueue: cleanup for apply_workqueue_attrs()

This patchset contains several cleanups for apply_workqueue_attrs(),
including enlarging the protection region of wq_pool_mutex,
merging similar code, changing the API ...

Patch 3 is not just a cleanup: it changes behavior and ensures that
attrs changes are applied sequentially.

Thanks,
Lai

Cc: Tejun Heo <[email protected]>

Lai Jiangshan (5):
workqueue: wq_pool_mutex protects the attrs-installation
workqueue: merge the similar code
workqueue: ensure attrs-changing be sequentially
workqueue: don't expose workqueue_attrs to users
workqueue: remove no_numa from workqueue_attrs

include/linux/workqueue.h | 18 +-
kernel/workqueue.c | 433 +++++++++++++++++++++-------------------------
2 files changed, 200 insertions(+), 251 deletions(-)

--
2.1.0


2015-05-11 09:32:35

by Lai Jiangshan

Subject: [PATCH 1/5] workqueue: wq_pool_mutex protects the attrs-installation

Currently, wq_pool_mutex doesn't protect the attrs-installation. As a result,
->unbound_attrs, ->numa_pwq_tbl[] and ->dfl_pwq can only be accessed
under wq->mutex, which causes some inconvenience. For example,
wq_update_unbound_numa() has to acquire wq->mutex before fetching
wq->unbound_attrs->no_numa and the old pwq. After this patch,
wq_update_unbound_numa() is simplified.

The attrs-installation (including apply_wqattrs_cleanup()) is a short operation,
so this change will not add any noticeable latency for other operations
that also acquire the wq_pool_mutex.

The only unprotected attrs-installation code is in apply_workqueue_attrs(),
so this patch touches less code than comments.

It is also a preparation patch for merging the similar code in
apply_workqueue_attrs() and wq_update_unbound_numa().
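
As an illustration only (a hypothetical reader sketch, not part of the
patch): once these fields are PW-protected, a path that already holds
wq_pool_mutex can read them without taking wq->mutex, which is what lets
wq_update_unbound_numa() drop its extra locking:

	/* hypothetical sketch: reading a PW-protected field */
	static void example_pw_reader(struct workqueue_struct *wq)
	{
		lockdep_assert_held(&wq_pool_mutex);

		/* either wq_pool_mutex or wq->mutex suffices for reads */
		if (wq->unbound_attrs->no_numa)
			return;
	}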

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 44 ++++++++++++++++++++++++++------------------
1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a3915ab..fa8b949 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -127,6 +127,12 @@ enum {
*
* PR: wq_pool_mutex protected for writes. Sched-RCU protected for reads.
*
+ * PW: wq_pool_mutex and wq->mutex protected for writes. Any one of them
+ * protected for reads.
+ *
+ * PWR: wq_pool_mutex and wq->mutex protected for writes. Any one of them
+ * or sched-RCU for reads.
+ *
* WQ: wq->mutex protected.
*
* WR: wq->mutex protected for writes. Sched-RCU protected for reads.
@@ -247,8 +253,8 @@ struct workqueue_struct {
int nr_drainers; /* WQ: drain in progress */
int saved_max_active; /* WQ: saved pwq max_active */

- struct workqueue_attrs *unbound_attrs; /* WQ: only for unbound wqs */
- struct pool_workqueue *dfl_pwq; /* WQ: only for unbound wqs */
+ struct workqueue_attrs *unbound_attrs; /* PW: only for unbound wqs */
+ struct pool_workqueue *dfl_pwq; /* PW: only for unbound wqs */

#ifdef CONFIG_SYSFS
struct wq_device *wq_dev; /* I: for sysfs interface */
@@ -268,7 +274,7 @@ struct workqueue_struct {
/* hot fields used during command issue, aligned to cacheline */
unsigned int flags ____cacheline_aligned; /* WQ: WQ_* flags */
struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwqs */
- struct pool_workqueue __rcu *numa_pwq_tbl[]; /* FR: unbound pwqs indexed by node */
+ struct pool_workqueue __rcu *numa_pwq_tbl[]; /* PWR: unbound pwqs indexed by node */
};

static struct kmem_cache *pwq_cache;
@@ -349,6 +355,12 @@ static void workqueue_sysfs_unregister(struct workqueue_struct *wq);
lockdep_is_held(&wq->mutex), \
"sched RCU or wq->mutex should be held")

+#define assert_rcu_or_wq_mutex_or_pool_mutex(wq) \
+ rcu_lockdep_assert(rcu_read_lock_sched_held() || \
+ lockdep_is_held(&wq->mutex) || \
+ lockdep_is_held(&wq_pool_mutex), \
+ "sched RCU, wq->mutex or wq_pool_mutex should be held")
+
#define for_each_cpu_worker_pool(pool, cpu) \
for ((pool) = &per_cpu(cpu_worker_pools, cpu)[0]; \
(pool) < &per_cpu(cpu_worker_pools, cpu)[NR_STD_WORKER_POOLS]; \
@@ -553,7 +565,7 @@ static int worker_pool_assign_id(struct worker_pool *pool)
* @wq: the target workqueue
* @node: the node ID
*
- * This must be called either with pwq_lock held or sched RCU read locked.
+ * This must be called either with wq_pool_mutex held or sched RCU read locked.
* If the pwq needs to be used beyond the locking in effect, the caller is
* responsible for guaranteeing that the pwq stays online.
*
@@ -562,7 +574,7 @@ static int worker_pool_assign_id(struct worker_pool *pool)
static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
int node)
{
- assert_rcu_or_wq_mutex(wq);
+ assert_rcu_or_wq_mutex_or_pool_mutex(wq);
return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
}

@@ -3480,6 +3492,7 @@ static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
struct pool_workqueue *old_pwq;

lockdep_assert_held(&wq->mutex);
+ lockdep_assert_held(&wq_pool_mutex);

/* link_pwq() can handle duplicate calls */
link_pwq(pwq);
@@ -3644,10 +3657,9 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
* pwqs accordingly.
*/
get_online_cpus();
-
mutex_lock(&wq_pool_mutex);
+
ctx = apply_wqattrs_prepare(wq, attrs);
- mutex_unlock(&wq_pool_mutex);

/* the ctx has been prepared successfully, let's commit it */
if (ctx) {
@@ -3655,10 +3667,11 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
ret = 0;
}

- put_online_cpus();
-
apply_wqattrs_cleanup(ctx);

+ mutex_unlock(&wq_pool_mutex);
+ put_online_cpus();
+
return ret;
}

@@ -3695,7 +3708,8 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,

lockdep_assert_held(&wq_pool_mutex);

- if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND))
+ if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND) ||
+ wq->unbound_attrs->no_numa)
return;

/*
@@ -3706,10 +3720,6 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
target_attrs = wq_update_unbound_numa_attrs_buf;
cpumask = target_attrs->cpumask;

- mutex_lock(&wq->mutex);
- if (wq->unbound_attrs->no_numa)
- goto out_unlock;
-
copy_workqueue_attrs(target_attrs, wq->unbound_attrs);
pwq = unbound_pwq_by_node(wq, node);

@@ -3721,19 +3731,16 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
*/
if (wq_calc_node_cpumask(wq->dfl_pwq->pool->attrs, node, cpu_off, cpumask)) {
if (cpumask_equal(cpumask, pwq->pool->attrs->cpumask))
- goto out_unlock;
+ return;
} else {
goto use_dfl_pwq;
}

- mutex_unlock(&wq->mutex);
-
/* create a new pwq */
pwq = alloc_unbound_pwq(wq, target_attrs);
if (!pwq) {
pr_warn("workqueue: allocation failed while updating NUMA affinity of \"%s\"\n",
wq->name);
- mutex_lock(&wq->mutex);
goto use_dfl_pwq;
}

@@ -3748,6 +3755,7 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
goto out_unlock;

use_dfl_pwq:
+ mutex_lock(&wq->mutex);
spin_lock_irq(&wq->dfl_pwq->pool->lock);
get_pwq(wq->dfl_pwq);
spin_unlock_irq(&wq->dfl_pwq->pool->lock);
--
2.1.0

2015-05-11 09:33:42

by Lai Jiangshan

Subject: [PATCH 2/5] workqueue: merge the similar code

The per-node pwq calculation and allocation code in apply_workqueue_attrs()
and wq_update_unbound_numa() is highly similar, so merge it into
alloc_node_unbound_pwq() (see the call-site sketch below).

wq_calc_node_cpumask() is only used for this calculation and allocation,
so it is folded into alloc_node_unbound_pwq() as well.

Changed behavior:
1) Always try to reuse the old pwq.
(In the old code, apply_workqueue_attrs() doesn't reuse the old pwq.)
2) Any reuse of an old pwq introduces get_pwq()/put_pwq() and the
corresponding lock overhead.
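
As a quick reference, a sketch of how both former call sites now funnel
through the new helper (illustrative only; the full definitions are in the
hunks below):

	/* apply_wqattrs_prepare(): no fallback, NULL means hard failure */
	ctx->pwq_tbl[node] = alloc_node_unbound_pwq(wq, ctx->dfl_pwq,
						    !attrs->no_numa, node, -1, false);

	/* wq_update_unbound_numa(): fall back to dfl_pwq on allocation failure */
	pwq = alloc_node_unbound_pwq(wq, wq->dfl_pwq, true, node, cpu_off, true);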

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 166 +++++++++++++++++++++++------------------------------
1 file changed, 73 insertions(+), 93 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index fa8b949..e7accc1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -296,7 +296,7 @@ module_param_named(power_efficient, wq_power_efficient, bool, 0444);

static bool wq_numa_enabled; /* unbound NUMA affinity enabled */

-/* buf for wq_update_unbound_numa_attrs(), protected by CPU hotplug exclusion */
+/* PL: buf for alloc_node_unbound_pwq() */
static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;

static DEFINE_MUTEX(wq_pool_mutex); /* protects pools and workqueues list */
@@ -3440,48 +3440,86 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
}

/**
- * wq_calc_node_mask - calculate a wq_attrs' cpumask for the specified node
- * @attrs: the wq_attrs of the default pwq of the target workqueue
+ * alloc_node_unbound_pwq - allocate a pwq for specified node
+ * @wq: the target workqueue
+ * @dfl_pwq: the allocated default pwq
+ * @numa: NUMA affinity
* @node: the target NUMA node
- * @cpu_going_down: if >= 0, the CPU to consider as offline
- * @cpumask: outarg, the resulting cpumask
+ * @cpu_off: if >= 0, the CPU to consider as offline
+ * @use_dfl_when_fail: use the @dfl_pwq in case the normal allocation failed
+ *
+ * Allocate or reuse a pwq with the cpumask that @wq should use on @node.
*
- * Calculate the cpumask a workqueue with @attrs should use on @node. If
- * @cpu_going_down is >= 0, that cpu is considered offline during
- * calculation. The result is stored in @cpumask.
+ * If NUMA affinity is not enabled, @dfl_pwq is always used. @dfl_pwq
+ * was allocated with the effetive attrs saved in @dfl_pwq->pool->attrs.
*
- * If NUMA affinity is not enabled, @attrs->cpumask is always used. If
- * enabled and @node has online CPUs requested by @attrs, the returned
- * cpumask is the intersection of the possible CPUs of @node and
- * @attrs->cpumask.
+ * If enabled and @node has online CPUs requested by the effetive attrs,
+ * the cpumask is the intersection of the possible CPUs of @node and
+ * the cpumask of the effetive attrs. If @cpu_off is >= 0,
+ * that cpu is considered offline during calculation.
*
* The caller is responsible for ensuring that the cpumask of @node stays
* stable.
*
- * Return: %true if the resulting @cpumask is different from @attrs->cpumask,
- * %false if equal.
+ * Return: valid pwq, it might be @dfl_pwq under some conditions
+ * or might be the old pwq of the @node.
+ * NULL, when the normal allocation failed.
*/
-static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
- int cpu_going_down, cpumask_t *cpumask)
+static struct pool_workqueue *
+alloc_node_unbound_pwq(struct workqueue_struct *wq,
+ struct pool_workqueue *dfl_pwq, bool numa,
+ int node, int cpu_off, bool use_dfl_when_fail)
+
{
- if (!wq_numa_enabled || attrs->no_numa)
+ struct pool_workqueue *pwq = unbound_pwq_by_node(wq, node);
+ struct workqueue_attrs *attrs = dfl_pwq->pool->attrs, *tmp_attrs;
+ cpumask_t *cpumask;
+
+ lockdep_assert_held(&wq_pool_mutex);
+
+ if (!wq_numa_enabled || !numa)
goto use_dfl;

+ /*
+ * We don't wanna alloc/free wq_attrs for each call. Let's use a
+ * preallocated one. It is protected by wq_pool_mutex.
+ * tmp_attrs->cpumask will be updated in next and tmp_attrs->no_numa
+ * is not used, so we just need to initialize tmp_attrs->nice;
+ */
+ tmp_attrs = wq_update_unbound_numa_attrs_buf;
+ tmp_attrs->nice = attrs->nice;
+ cpumask = tmp_attrs->cpumask;
+
/* does @node have any online CPUs @attrs wants? */
cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
- if (cpu_going_down >= 0)
- cpumask_clear_cpu(cpu_going_down, cpumask);
-
+ if (cpu_off >= 0)
+ cpumask_clear_cpu(cpu_off, cpumask);
if (cpumask_empty(cpumask))
goto use_dfl;

- /* yeap, return possible CPUs in @node that @attrs wants */
+ /* yeap, use possible CPUs in @node that @attrs wants */
cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
- return !cpumask_equal(cpumask, attrs->cpumask);
+ if (cpumask_equal(cpumask, attrs->cpumask))
+ goto use_dfl;
+ if (pwq && wqattrs_equal(tmp_attrs, pwq->pool->attrs))
+ goto use_existed;
+
+ /* create a new pwq */
+ pwq = alloc_unbound_pwq(wq, tmp_attrs);
+ if (!pwq && use_dfl_when_fail) {
+ pr_warn("workqueue: allocation failed while updating NUMA affinity of \"%s\"\n",
+ wq->name);
+ goto use_dfl;
+ }
+ return pwq;

use_dfl:
- cpumask_copy(cpumask, attrs->cpumask);
- return false;
+ pwq = dfl_pwq;
+use_existed:
+ spin_lock_irq(&pwq->pool->lock);
+ get_pwq(pwq);
+ spin_unlock_irq(&pwq->pool->lock);
+ return pwq;
}

/* install @pwq into @wq's numa_pwq_tbl[] for @node and return the old pwq */
@@ -3533,7 +3571,7 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs)
{
struct apply_wqattrs_ctx *ctx;
- struct workqueue_attrs *new_attrs, *tmp_attrs;
+ struct workqueue_attrs *new_attrs;
int node;

lockdep_assert_held(&wq_pool_mutex);
@@ -3542,8 +3580,7 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
GFP_KERNEL);

new_attrs = alloc_workqueue_attrs(GFP_KERNEL);
- tmp_attrs = alloc_workqueue_attrs(GFP_KERNEL);
- if (!ctx || !new_attrs || !tmp_attrs)
+ if (!ctx || !new_attrs)
goto out_free;

/*
@@ -3557,13 +3594,6 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
cpumask_copy(new_attrs->cpumask, wq_unbound_cpumask);

/*
- * We may create multiple pwqs with differing cpumasks. Make a
- * copy of @new_attrs which will be modified and used to obtain
- * pools.
- */
- copy_workqueue_attrs(tmp_attrs, new_attrs);
-
- /*
* If something goes wrong during CPU up/down, we'll fall back to
* the default pwq covering whole @attrs->cpumask. Always create
* it even if we don't use it immediately.
@@ -3573,14 +3603,11 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
goto out_free;

for_each_node(node) {
- if (wq_calc_node_cpumask(new_attrs, node, -1, tmp_attrs->cpumask)) {
- ctx->pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
- if (!ctx->pwq_tbl[node])
- goto out_free;
- } else {
- ctx->dfl_pwq->refcnt++;
- ctx->pwq_tbl[node] = ctx->dfl_pwq;
- }
+ ctx->pwq_tbl[node] = alloc_node_unbound_pwq(wq, ctx->dfl_pwq,
+ !attrs->no_numa,
+ node, -1, false);
+ if (!ctx->pwq_tbl[node])
+ goto out_free;
}

/* save the user configured attrs and sanitize it. */
@@ -3589,11 +3616,9 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
ctx->attrs = new_attrs;

ctx->wq = wq;
- free_workqueue_attrs(tmp_attrs);
return ctx;

out_free:
- free_workqueue_attrs(tmp_attrs);
free_workqueue_attrs(new_attrs);
apply_wqattrs_cleanup(ctx);
return NULL;
@@ -3702,9 +3727,7 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
{
int node = cpu_to_node(cpu);
int cpu_off = online ? -1 : cpu;
- struct pool_workqueue *old_pwq = NULL, *pwq;
- struct workqueue_attrs *target_attrs;
- cpumask_t *cpumask;
+ struct pool_workqueue *old_pwq, *pwq;

lockdep_assert_held(&wq_pool_mutex);

@@ -3712,56 +3735,13 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
wq->unbound_attrs->no_numa)
return;

- /*
- * We don't wanna alloc/free wq_attrs for each wq for each CPU.
- * Let's use a preallocated one. The following buf is protected by
- * CPU hotplug exclusion.
- */
- target_attrs = wq_update_unbound_numa_attrs_buf;
- cpumask = target_attrs->cpumask;
-
- copy_workqueue_attrs(target_attrs, wq->unbound_attrs);
- pwq = unbound_pwq_by_node(wq, node);
-
- /*
- * Let's determine what needs to be done. If the target cpumask is
- * different from the default pwq's, we need to compare it to @pwq's
- * and create a new one if they don't match. If the target cpumask
- * equals the default pwq's, the default pwq should be used.
- */
- if (wq_calc_node_cpumask(wq->dfl_pwq->pool->attrs, node, cpu_off, cpumask)) {
- if (cpumask_equal(cpumask, pwq->pool->attrs->cpumask))
- return;
- } else {
- goto use_dfl_pwq;
- }
+ pwq = alloc_node_unbound_pwq(wq, wq->dfl_pwq, true, node, cpu_off,
+ true);

- /* create a new pwq */
- pwq = alloc_unbound_pwq(wq, target_attrs);
- if (!pwq) {
- pr_warn("workqueue: allocation failed while updating NUMA affinity of \"%s\"\n",
- wq->name);
- goto use_dfl_pwq;
- }
-
- /*
- * Install the new pwq. As this function is called only from CPU
- * hotplug callbacks and applying a new attrs is wrapped with
- * get/put_online_cpus(), @wq->unbound_attrs couldn't have changed
- * inbetween.
- */
mutex_lock(&wq->mutex);
old_pwq = numa_pwq_tbl_install(wq, node, pwq);
- goto out_unlock;
-
-use_dfl_pwq:
- mutex_lock(&wq->mutex);
- spin_lock_irq(&wq->dfl_pwq->pool->lock);
- get_pwq(wq->dfl_pwq);
- spin_unlock_irq(&wq->dfl_pwq->pool->lock);
- old_pwq = numa_pwq_tbl_install(wq, node, wq->dfl_pwq);
-out_unlock:
mutex_unlock(&wq->mutex);
+
put_pwq_unlocked(old_pwq);
}

--
2.1.0

2015-05-11 09:32:38

by Lai Jiangshan

Subject: [PATCH 3/5] workqueue: ensure attrs-changing be sequentially

Current modification to attrs via sysfs is not atomically.

Process A (change cpumask) | Process B (change numa affinity)
wq_cpumask_store() |
wq_sysfs_prep_attrs() |
| apply_workqueue_attrs()
apply_workqueue_attrs() |

It results that the Process B's operation is totally reverted
without any notification.

This behavior is acceptable but it is sometimes unexpected.
Sequential model on non-performance-sensitive operations is more popular
and preferred. So this patch moves wq_sysfs_prep_attrs() into the protection
under wq_pool_mutex to ensure attrs-changing be sequentially.

This patch is also a preparation patch for next patch which change
the API of apply_workqueue_attrs().
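
For illustration, a sketch of the resulting sysfs store path (mirroring the
wq_cpumask_store() hunk below; not authoritative):

	int ret = -ENOMEM;

	apply_wqattrs_lock();

	attrs = wq_sysfs_prep_attrs(wq);	/* snapshot now taken under the lock */
	if (!attrs)
		goto out_unlock;

	ret = cpumask_parse(buf, attrs->cpumask);
	if (!ret)
		ret = apply_workqueue_attrs_locked(wq, attrs);

out_unlock:
	apply_wqattrs_unlock();
	free_workqueue_attrs(attrs);
	return ret ?: count;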

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 110 +++++++++++++++++++++++++++++++++--------------------
1 file changed, 69 insertions(+), 41 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e7accc1..efd9a3a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3646,24 +3646,25 @@ static void apply_wqattrs_commit(struct apply_wqattrs_ctx *ctx)
mutex_unlock(&ctx->wq->mutex);
}

-/**
- * apply_workqueue_attrs - apply new workqueue_attrs to an unbound workqueue
- * @wq: the target workqueue
- * @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
- *
- * Apply @attrs to an unbound workqueue @wq. Unless disabled, on NUMA
- * machines, this function maps a separate pwq to each NUMA node with
- * possibles CPUs in @attrs->cpumask so that work items are affine to the
- * NUMA node it was issued on. Older pwqs are released as in-flight work
- * items finish. Note that a work item which repeatedly requeues itself
- * back-to-back will stay on its current pwq.
- *
- * Performs GFP_KERNEL allocations.
- *
- * Return: 0 on success and -errno on failure.
- */
-int apply_workqueue_attrs(struct workqueue_struct *wq,
- const struct workqueue_attrs *attrs)
+static void apply_wqattrs_lock(void)
+{
+ /*
+ * CPUs should stay stable across pwq creations and installations.
+ * Pin CPUs, determine the target cpumask for each node and create
+ * pwqs accordingly.
+ */
+ get_online_cpus();
+ mutex_lock(&wq_pool_mutex);
+}
+
+static void apply_wqattrs_unlock(void)
+{
+ mutex_unlock(&wq_pool_mutex);
+ put_online_cpus();
+}
+
+static int apply_workqueue_attrs_locked(struct workqueue_struct *wq,
+ const struct workqueue_attrs *attrs)
{
struct apply_wqattrs_ctx *ctx;
int ret = -ENOMEM;
@@ -3676,14 +3677,6 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
return -EINVAL;

- /*
- * CPUs should stay stable across pwq creations and installations.
- * Pin CPUs, determine the target cpumask for each node and create
- * pwqs accordingly.
- */
- get_online_cpus();
- mutex_lock(&wq_pool_mutex);
-
ctx = apply_wqattrs_prepare(wq, attrs);

/* the ctx has been prepared successfully, let's commit it */
@@ -3694,8 +3687,33 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,

apply_wqattrs_cleanup(ctx);

- mutex_unlock(&wq_pool_mutex);
- put_online_cpus();
+ return ret;
+}
+
+/**
+ * apply_workqueue_attrs - apply new workqueue_attrs to an unbound workqueue
+ * @wq: the target workqueue
+ * @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
+ *
+ * Apply @attrs to an unbound workqueue @wq. Unless disabled, on NUMA
+ * machines, this function maps a separate pwq to each NUMA node with
+ * possibles CPUs in @attrs->cpumask so that work items are affine to the
+ * NUMA node it was issued on. Older pwqs are released as in-flight work
+ * items finish. Note that a work item which repeatedly requeues itself
+ * back-to-back will stay on its current pwq.
+ *
+ * Performs GFP_KERNEL allocations.
+ *
+ * Return: 0 on success and -errno on failure.
+ */
+int apply_workqueue_attrs(struct workqueue_struct *wq,
+ const struct workqueue_attrs *attrs)
+{
+ int ret;
+
+ apply_wqattrs_lock();
+ ret = apply_workqueue_attrs_locked(wq, attrs);
+ apply_wqattrs_unlock();

return ret;
}
@@ -4784,10 +4802,9 @@ int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
if (!zalloc_cpumask_var(&saved_cpumask, GFP_KERNEL))
return -ENOMEM;

- get_online_cpus();
cpumask_and(cpumask, cpumask, cpu_possible_mask);
if (!cpumask_empty(cpumask)) {
- mutex_lock(&wq_pool_mutex);
+ apply_wqattrs_lock();

/* save the old wq_unbound_cpumask. */
cpumask_copy(saved_cpumask, wq_unbound_cpumask);
@@ -4800,9 +4817,8 @@ int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
if (ret < 0)
cpumask_copy(wq_unbound_cpumask, saved_cpumask);

- mutex_unlock(&wq_pool_mutex);
+ apply_wqattrs_unlock();
}
- put_online_cpus();

free_cpumask_var(saved_cpumask);
return ret;
@@ -4927,18 +4943,22 @@ static ssize_t wq_nice_store(struct device *dev, struct device_attribute *attr,
{
struct workqueue_struct *wq = dev_to_wq(dev);
struct workqueue_attrs *attrs;
- int ret;
+ int ret = -ENOMEM;
+
+ apply_wqattrs_lock();

attrs = wq_sysfs_prep_attrs(wq);
if (!attrs)
- return -ENOMEM;
+ goto out_unlock;

if (sscanf(buf, "%d", &attrs->nice) == 1 &&
attrs->nice >= MIN_NICE && attrs->nice <= MAX_NICE)
- ret = apply_workqueue_attrs(wq, attrs);
+ ret = apply_workqueue_attrs_locked(wq, attrs);
else
ret = -EINVAL;

+out_unlock:
+ apply_wqattrs_unlock();
free_workqueue_attrs(attrs);
return ret ?: count;
}
@@ -4962,16 +4982,20 @@ static ssize_t wq_cpumask_store(struct device *dev,
{
struct workqueue_struct *wq = dev_to_wq(dev);
struct workqueue_attrs *attrs;
- int ret;
+ int ret = -ENOMEM;
+
+ apply_wqattrs_lock();

attrs = wq_sysfs_prep_attrs(wq);
if (!attrs)
- return -ENOMEM;
+ goto out_unlock;

ret = cpumask_parse(buf, attrs->cpumask);
if (!ret)
- ret = apply_workqueue_attrs(wq, attrs);
+ ret = apply_workqueue_attrs_locked(wq, attrs);

+out_unlock:
+ apply_wqattrs_unlock();
free_workqueue_attrs(attrs);
return ret ?: count;
}
@@ -4995,18 +5019,22 @@ static ssize_t wq_numa_store(struct device *dev, struct device_attribute *attr,
{
struct workqueue_struct *wq = dev_to_wq(dev);
struct workqueue_attrs *attrs;
- int v, ret;
+ int v, ret = -ENOMEM;
+
+ apply_wqattrs_lock();

attrs = wq_sysfs_prep_attrs(wq);
if (!attrs)
- return -ENOMEM;
+ goto out_unlock;

ret = -EINVAL;
if (sscanf(buf, "%d", &v) == 1) {
attrs->no_numa = !v;
- ret = apply_workqueue_attrs(wq, attrs);
+ ret = apply_workqueue_attrs_locked(wq, attrs);
}

+out_unlock:
+ apply_wqattrs_unlock();
free_workqueue_attrs(attrs);
return ret ?: count;
}
--
2.1.0

2015-05-11 09:33:02

by Lai Jiangshan

Subject: [PATCH 4/5] workqueue: don't expose workqueue_attrs to users

workqueue_attrs is an internal-like structure and is exposed with
apply_workqueue_attrs() whose user has to investigate the structure
before use.

And the apply_workqueue_attrs() API is inconvenient with the structure.
The user (although there is no user yet currently) has to assemble
several LoC to use:
attrs = alloc_workqueue_attrs();
if (!attrs)
return;
attrs->nice = ...;
copy cpumask;
attrs->no_numa = ...;
apply_workqueue_attrs();
free_workqueue_attrs();

It is too elaborate. This patch changes apply_workqueue_attrs() API,
and one-line-code is enough to be called from user:
apply_workqueue_attrs(wq, cpumask, nice, numa);

This patch also reduces the code of workqueue.c, about -50 lines.
wq_sysfs_prep_attrs() is removed, wq_[nice|cpumask|numa]_store()
directly access to the ->unbound_attrs with the protection
of apply_wqattrs_lock();

This patch is also a preparation patch of next patch which
remove no_numa out from the structure workqueue_attrs which
requires apply_workqueue_attrs() has an argument to pass numa affinity.
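
For comparison, a hypothetical caller before and after this patch (sketch
only; the nice value is made up, and the new call pattern mirrors
alloc_and_link_pwqs() below):

	/* before: assemble a workqueue_attrs by hand */
	attrs = alloc_workqueue_attrs(GFP_KERNEL);
	if (!attrs)
		return -ENOMEM;
	attrs->nice = -5;
	cpumask_copy(attrs->cpumask, cpu_possible_mask);
	attrs->no_numa = false;
	ret = apply_workqueue_attrs(wq, attrs);
	free_workqueue_attrs(attrs);

	/* after: a single call */
	ret = apply_workqueue_attrs(wq, cpu_possible_mask, -5, true);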

Signed-off-by: Lai Jiangshan <[email protected]>
---
include/linux/workqueue.h | 18 +----
kernel/workqueue.c | 173 ++++++++++++++++++----------------------------
2 files changed, 70 insertions(+), 121 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 4618dd6..32e7c4b 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -119,20 +119,6 @@ struct delayed_work {
int cpu;
};

-/*
- * A struct for workqueue attributes. This can be used to change
- * attributes of an unbound workqueue.
- *
- * Unlike other fields, ->no_numa isn't a property of a worker_pool. It
- * only modifies how apply_workqueue_attrs() select pools and thus doesn't
- * participate in pool hash calculations or equality comparisons.
- */
-struct workqueue_attrs {
- int nice; /* nice level */
- cpumask_var_t cpumask; /* allowed CPUs */
- bool no_numa; /* disable NUMA affinity */
-};
-
static inline struct delayed_work *to_delayed_work(struct work_struct *work)
{
return container_of(work, struct delayed_work, work);
@@ -420,10 +406,8 @@ __alloc_workqueue_key(const char *fmt, unsigned int flags, int max_active,

extern void destroy_workqueue(struct workqueue_struct *wq);

-struct workqueue_attrs *alloc_workqueue_attrs(gfp_t gfp_mask);
-void free_workqueue_attrs(struct workqueue_attrs *attrs);
int apply_workqueue_attrs(struct workqueue_struct *wq,
- const struct workqueue_attrs *attrs);
+ const cpumask_var_t cpumask, int nice, bool numa);
int workqueue_set_unbound_cpumask(cpumask_var_t cpumask);

extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index efd9a3a..b08df98 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -106,6 +106,20 @@ enum {
};

/*
+ * A struct for workqueue attributes. This can be used to change
+ * attributes of an unbound workqueue.
+ *
+ * Unlike other fields, ->no_numa isn't a property of a worker_pool. It
+ * only modifies how apply_workqueue_attrs() select pools and thus doesn't
+ * participate in pool hash calculations or equality comparisons.
+ */
+struct workqueue_attrs {
+ int nice; /* nice level */
+ cpumask_var_t cpumask; /* allowed CPUs */
+ bool no_numa; /* disable NUMA affinity */
+};
+
+/*
* Structure fields follow one of the following exclusion rules.
*
* I: Modifiable by initialization/destruction paths and read-only for
@@ -307,6 +321,8 @@ static bool workqueue_freezing; /* PL: have wqs started freezing? */

static cpumask_var_t wq_unbound_cpumask; /* PL: low level cpumask for all unbound wqs */

+static const int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
+
/* the per-cpu worker pools */
static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
cpu_worker_pools);
@@ -316,12 +332,6 @@ static DEFINE_IDR(worker_pool_idr); /* PR: idr of all pools */
/* PL: hash of all unbound pools keyed by pool->attrs */
static DEFINE_HASHTABLE(unbound_pool_hash, UNBOUND_POOL_HASH_ORDER);

-/* I: attributes used when instantiating standard unbound pools on demand */
-static struct workqueue_attrs *unbound_std_wq_attrs[NR_STD_WORKER_POOLS];
-
-/* I: attributes used when instantiating ordered pools on demand */
-static struct workqueue_attrs *ordered_wq_attrs[NR_STD_WORKER_POOLS];
-
struct workqueue_struct *system_wq __read_mostly;
EXPORT_SYMBOL(system_wq);
struct workqueue_struct *system_highpri_wq __read_mostly;
@@ -338,8 +348,6 @@ struct workqueue_struct *system_freezable_power_efficient_wq __read_mostly;
EXPORT_SYMBOL_GPL(system_freezable_power_efficient_wq);

static int worker_thread(void *__worker);
-static void copy_workqueue_attrs(struct workqueue_attrs *to,
- const struct workqueue_attrs *from);
static void workqueue_sysfs_unregister(struct workqueue_struct *wq);

#define CREATE_TRACE_POINTS
@@ -3022,7 +3030,7 @@ EXPORT_SYMBOL_GPL(execute_in_process_context);
*
* Undo alloc_workqueue_attrs().
*/
-void free_workqueue_attrs(struct workqueue_attrs *attrs)
+static void free_workqueue_attrs(struct workqueue_attrs *attrs)
{
if (attrs) {
free_cpumask_var(attrs->cpumask);
@@ -3039,7 +3047,7 @@ void free_workqueue_attrs(struct workqueue_attrs *attrs)
*
* Return: The allocated new workqueue_attr on success. %NULL on failure.
*/
-struct workqueue_attrs *alloc_workqueue_attrs(gfp_t gfp_mask)
+static struct workqueue_attrs *alloc_workqueue_attrs(gfp_t gfp_mask)
{
struct workqueue_attrs *attrs;

@@ -3568,7 +3576,7 @@ static void apply_wqattrs_cleanup(struct apply_wqattrs_ctx *ctx)
/* allocate the attrs and pwqs for later installation */
static struct apply_wqattrs_ctx *
apply_wqattrs_prepare(struct workqueue_struct *wq,
- const struct workqueue_attrs *attrs)
+ const cpumask_var_t cpumask, int nice, bool numa)
{
struct apply_wqattrs_ctx *ctx;
struct workqueue_attrs *new_attrs;
@@ -3588,8 +3596,9 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
* If the user configured cpumask doesn't overlap with the
* wq_unbound_cpumask, we fallback to the wq_unbound_cpumask.
*/
- copy_workqueue_attrs(new_attrs, attrs);
- cpumask_and(new_attrs->cpumask, new_attrs->cpumask, wq_unbound_cpumask);
+ cpumask_and(new_attrs->cpumask, cpumask, wq_unbound_cpumask);
+ new_attrs->nice = nice;
+ new_attrs->no_numa = !numa;
if (unlikely(cpumask_empty(new_attrs->cpumask)))
cpumask_copy(new_attrs->cpumask, wq_unbound_cpumask);

@@ -3604,15 +3613,14 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,

for_each_node(node) {
ctx->pwq_tbl[node] = alloc_node_unbound_pwq(wq, ctx->dfl_pwq,
- !attrs->no_numa,
- node, -1, false);
+ numa, node, -1,
+ false);
if (!ctx->pwq_tbl[node])
goto out_free;
}

/* save the user configured attrs and sanitize it. */
- copy_workqueue_attrs(new_attrs, attrs);
- cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);
+ cpumask_and(new_attrs->cpumask, cpumask, cpu_possible_mask);
ctx->attrs = new_attrs;

ctx->wq = wq;
@@ -3664,7 +3672,8 @@ static void apply_wqattrs_unlock(void)
}

static int apply_workqueue_attrs_locked(struct workqueue_struct *wq,
- const struct workqueue_attrs *attrs)
+ const cpumask_var_t cpumask,
+ int nice, bool numa)
{
struct apply_wqattrs_ctx *ctx;
int ret = -ENOMEM;
@@ -3677,7 +3686,7 @@ static int apply_workqueue_attrs_locked(struct workqueue_struct *wq,
if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
return -EINVAL;

- ctx = apply_wqattrs_prepare(wq, attrs);
+ ctx = apply_wqattrs_prepare(wq, cpumask, nice, numa);

/* the ctx has been prepared successfully, let's commit it */
if (ctx) {
@@ -3693,11 +3702,13 @@ static int apply_workqueue_attrs_locked(struct workqueue_struct *wq,
/**
* apply_workqueue_attrs - apply new workqueue_attrs to an unbound workqueue
* @wq: the target workqueue
- * @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
+ * @cpumask: allowed cpumask for the workers for the target workqueue
+ * @nice: the nice value for the workers for the target workqueue
+ * @numa: enable/disable per NUMA node pwqs
*
- * Apply @attrs to an unbound workqueue @wq. Unless disabled, on NUMA
+ * Apply the attrs to an unbound workqueue @wq. Unless disabled, on NUMA
* machines, this function maps a separate pwq to each NUMA node with
- * possibles CPUs in @attrs->cpumask so that work items are affine to the
+ * possibles CPUs in @cpumask so that work items are affine to the
* NUMA node it was issued on. Older pwqs are released as in-flight work
* items finish. Note that a work item which repeatedly requeues itself
* back-to-back will stay on its current pwq.
@@ -3707,12 +3718,12 @@ static int apply_workqueue_attrs_locked(struct workqueue_struct *wq,
* Return: 0 on success and -errno on failure.
*/
int apply_workqueue_attrs(struct workqueue_struct *wq,
- const struct workqueue_attrs *attrs)
+ const cpumask_var_t cpumask, int nice, bool numa)
{
int ret;

apply_wqattrs_lock();
- ret = apply_workqueue_attrs_locked(wq, attrs);
+ ret = apply_workqueue_attrs_locked(wq, cpumask, nice, numa);
apply_wqattrs_unlock();

return ret;
@@ -3787,14 +3798,16 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
}
return 0;
} else if (wq->flags & __WQ_ORDERED) {
- ret = apply_workqueue_attrs(wq, ordered_wq_attrs[highpri]);
+ ret = apply_workqueue_attrs(wq, cpu_possible_mask,
+ std_nice[highpri], false);
/* there should only be single pwq for ordering guarantee */
WARN(!ret && (wq->pwqs.next != &wq->dfl_pwq->pwqs_node ||
wq->pwqs.prev != &wq->dfl_pwq->pwqs_node),
"ordering guarantee broken for workqueue %s\n", wq->name);
return ret;
} else {
- return apply_workqueue_attrs(wq, unbound_std_wq_attrs[highpri]);
+ return apply_workqueue_attrs(wq, cpu_possible_mask,
+ std_nice[highpri], true);
}
}

@@ -4764,7 +4777,9 @@ static int workqueue_apply_unbound_cpumask(void)
if (wq->flags & __WQ_ORDERED)
continue;

- ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs);
+ ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs->cpumask,
+ wq->unbound_attrs->nice,
+ !wq->unbound_attrs->no_numa);
if (!ctx) {
ret = -ENOMEM;
break;
@@ -4923,43 +4938,22 @@ static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
return written;
}

-/* prepare workqueue_attrs for sysfs store operations */
-static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq)
-{
- struct workqueue_attrs *attrs;
-
- attrs = alloc_workqueue_attrs(GFP_KERNEL);
- if (!attrs)
- return NULL;
-
- mutex_lock(&wq->mutex);
- copy_workqueue_attrs(attrs, wq->unbound_attrs);
- mutex_unlock(&wq->mutex);
- return attrs;
-}
-
static ssize_t wq_nice_store(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count)
{
struct workqueue_struct *wq = dev_to_wq(dev);
- struct workqueue_attrs *attrs;
- int ret = -ENOMEM;
+ int nice, ret;

- apply_wqattrs_lock();
-
- attrs = wq_sysfs_prep_attrs(wq);
- if (!attrs)
- goto out_unlock;

- if (sscanf(buf, "%d", &attrs->nice) == 1 &&
- attrs->nice >= MIN_NICE && attrs->nice <= MAX_NICE)
- ret = apply_workqueue_attrs_locked(wq, attrs);
- else
- ret = -EINVAL;
+ if (!(sscanf(buf, "%d", &nice) == 1 &&
+ nice >= MIN_NICE && nice <= MAX_NICE))
+ return -EINVAL;

-out_unlock:
+ apply_wqattrs_lock();
+ ret = apply_workqueue_attrs_locked(wq, wq->unbound_attrs->cpumask,
+ nice, !wq->unbound_attrs->no_numa);
apply_wqattrs_unlock();
- free_workqueue_attrs(attrs);
+
return ret ?: count;
}

@@ -4981,22 +4975,21 @@ static ssize_t wq_cpumask_store(struct device *dev,
const char *buf, size_t count)
{
struct workqueue_struct *wq = dev_to_wq(dev);
- struct workqueue_attrs *attrs;
- int ret = -ENOMEM;
-
- apply_wqattrs_lock();
-
- attrs = wq_sysfs_prep_attrs(wq);
- if (!attrs)
- goto out_unlock;
+ cpumask_var_t cpumask;
+ int ret;

- ret = cpumask_parse(buf, attrs->cpumask);
- if (!ret)
- ret = apply_workqueue_attrs_locked(wq, attrs);
+ if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
+ return -ENOMEM;
+ ret = cpumask_parse(buf, cpumask);
+ if (ret)
+ return ret;

-out_unlock:
+ apply_wqattrs_lock();
+ ret = apply_workqueue_attrs_locked(wq, cpumask, wq->unbound_attrs->nice,
+ !wq->unbound_attrs->no_numa);
apply_wqattrs_unlock();
- free_workqueue_attrs(attrs);
+
+ free_cpumask_var(cpumask);
return ret ?: count;
}

@@ -5018,24 +5011,16 @@ static ssize_t wq_numa_store(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count)
{
struct workqueue_struct *wq = dev_to_wq(dev);
- struct workqueue_attrs *attrs;
- int v, ret = -ENOMEM;
+ int v, ret;

- apply_wqattrs_lock();
-
- attrs = wq_sysfs_prep_attrs(wq);
- if (!attrs)
- goto out_unlock;
-
- ret = -EINVAL;
- if (sscanf(buf, "%d", &v) == 1) {
- attrs->no_numa = !v;
- ret = apply_workqueue_attrs_locked(wq, attrs);
- }
+ if (sscanf(buf, "%d", &v) != 1)
+ return -EINVAL;

-out_unlock:
+ apply_wqattrs_lock();
+ ret = apply_workqueue_attrs_locked(wq, wq->unbound_attrs->cpumask,
+ wq->unbound_attrs->nice, !!v);
apply_wqattrs_unlock();
- free_workqueue_attrs(attrs);
+
return ret ?: count;
}

@@ -5237,7 +5222,6 @@ static void __init wq_numa_init(void)

static int __init init_workqueues(void)
{
- int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
int i, cpu;

WARN_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
@@ -5281,25 +5265,6 @@ static int __init init_workqueues(void)
}
}

- /* create default unbound and ordered wq attrs */
- for (i = 0; i < NR_STD_WORKER_POOLS; i++) {
- struct workqueue_attrs *attrs;
-
- BUG_ON(!(attrs = alloc_workqueue_attrs(GFP_KERNEL)));
- attrs->nice = std_nice[i];
- unbound_std_wq_attrs[i] = attrs;
-
- /*
- * An ordered wq should have only one pwq as ordering is
- * guaranteed by max_active which is enforced by pwqs.
- * Turn off NUMA so that dfl_pwq is used for all nodes.
- */
- BUG_ON(!(attrs = alloc_workqueue_attrs(GFP_KERNEL)));
- attrs->nice = std_nice[i];
- attrs->no_numa = true;
- ordered_wq_attrs[i] = attrs;
- }
-
system_wq = alloc_workqueue("events", 0, 0);
system_highpri_wq = alloc_workqueue("events_highpri", WQ_HIGHPRI, 0);
system_long_wq = alloc_workqueue("events_long", 0, 0);
--
2.1.0

2015-05-11 09:33:19

by Lai Jiangshan

Subject: [PATCH 5/5] workqueue: remove no_numa from workqueue_attrs

The workqueue_attrs structure is used for both workqueues and worker_pools,
but ->no_numa isn't a property of a worker_pool and requires special
comments or code when the structure is used for a worker_pool.

This patch simply removes ->no_numa from workqueue_attrs along with
the special comments and code. The NUMA affinity is stored in
wq->numa instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 44 ++++++++++++++------------------------------
1 file changed, 14 insertions(+), 30 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b08df98..42721a2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -108,15 +108,10 @@ enum {
/*
* A struct for workqueue attributes. This can be used to change
* attributes of an unbound workqueue.
- *
- * Unlike other fields, ->no_numa isn't a property of a worker_pool. It
- * only modifies how apply_workqueue_attrs() select pools and thus doesn't
- * participate in pool hash calculations or equality comparisons.
*/
struct workqueue_attrs {
int nice; /* nice level */
cpumask_var_t cpumask; /* allowed CPUs */
- bool no_numa; /* disable NUMA affinity */
};

/*
@@ -267,8 +262,10 @@ struct workqueue_struct {
int nr_drainers; /* WQ: drain in progress */
int saved_max_active; /* WQ: saved pwq max_active */

- struct workqueue_attrs *unbound_attrs; /* PW: only for unbound wqs */
- struct pool_workqueue *dfl_pwq; /* PW: only for unbound wqs */
+ /* The following fields are only for unbound wqs. */
+ struct workqueue_attrs *unbound_attrs; /* PW: configure attrs */
+ bool numa; /* PW: enable per-node pwqs */
+ struct pool_workqueue *dfl_pwq; /* PW: the default pwq */

#ifdef CONFIG_SYSFS
struct wq_device *wq_dev; /* I: for sysfs interface */
@@ -3069,12 +3066,6 @@ static void copy_workqueue_attrs(struct workqueue_attrs *to,
{
to->nice = from->nice;
cpumask_copy(to->cpumask, from->cpumask);
- /*
- * Unlike hash and equality test, this function doesn't ignore
- * ->no_numa as it is used for both pool and wq attrs. Instead,
- * get_unbound_pool() explicitly clears ->no_numa after copying.
- */
- to->no_numa = from->no_numa;
}

/* hash value of the content of @attr */
@@ -3265,12 +3256,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
lockdep_set_subclass(&pool->lock, 1); /* see put_pwq() */
copy_workqueue_attrs(pool->attrs, attrs);

- /*
- * no_numa isn't a worker_pool attribute, always clear it. See
- * 'struct workqueue_attrs' comments for detail.
- */
- pool->attrs->no_numa = false;
-
/* if cpumask is contained inside a NUMA node, we belong to that node */
if (wq_numa_enabled) {
for_each_node(node) {
@@ -3491,8 +3476,8 @@ alloc_node_unbound_pwq(struct workqueue_struct *wq,
/*
* We don't wanna alloc/free wq_attrs for each call. Let's use a
* preallocated one. It is protected by wq_pool_mutex.
- * tmp_attrs->cpumask will be updated in next and tmp_attrs->no_numa
- * is not used, so we just need to initialize tmp_attrs->nice;
+ * tmp_attrs->cpumask will be updated in next, so we just need
+ * to initialize tmp_attrs->nice;
*/
tmp_attrs = wq_update_unbound_numa_attrs_buf;
tmp_attrs->nice = attrs->nice;
@@ -3552,6 +3537,7 @@ static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
struct apply_wqattrs_ctx {
struct workqueue_struct *wq; /* target workqueue */
struct workqueue_attrs *attrs; /* attrs to apply */
+ bool numa; /* enable per-node pwqs */
struct list_head list; /* queued for batching commit */
struct pool_workqueue *dfl_pwq;
struct pool_workqueue *pwq_tbl[];
@@ -3598,7 +3584,6 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
*/
cpumask_and(new_attrs->cpumask, cpumask, wq_unbound_cpumask);
new_attrs->nice = nice;
- new_attrs->no_numa = !numa;
if (unlikely(cpumask_empty(new_attrs->cpumask)))
cpumask_copy(new_attrs->cpumask, wq_unbound_cpumask);

@@ -3622,6 +3607,7 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
/* save the user configured attrs and sanitize it. */
cpumask_and(new_attrs->cpumask, cpumask, cpu_possible_mask);
ctx->attrs = new_attrs;
+ ctx->numa = numa;

ctx->wq = wq;
return ctx;
@@ -3641,6 +3627,7 @@ static void apply_wqattrs_commit(struct apply_wqattrs_ctx *ctx)
mutex_lock(&ctx->wq->mutex);

copy_workqueue_attrs(ctx->wq->unbound_attrs, ctx->attrs);
+ ctx->wq->numa = ctx->numa;

/* save the previous pwq and install the new one */
for_each_node(node)
@@ -3760,8 +3747,7 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,

lockdep_assert_held(&wq_pool_mutex);

- if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND) ||
- wq->unbound_attrs->no_numa)
+ if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND) || !wq->numa)
return;

pwq = alloc_node_unbound_pwq(wq, wq->dfl_pwq, true, node, cpu_off,
@@ -4778,8 +4764,7 @@ static int workqueue_apply_unbound_cpumask(void)
continue;

ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs->cpumask,
- wq->unbound_attrs->nice,
- !wq->unbound_attrs->no_numa);
+ wq->unbound_attrs->nice, wq->numa);
if (!ctx) {
ret = -ENOMEM;
break;
@@ -4951,7 +4936,7 @@ static ssize_t wq_nice_store(struct device *dev, struct device_attribute *attr,

apply_wqattrs_lock();
ret = apply_workqueue_attrs_locked(wq, wq->unbound_attrs->cpumask,
- nice, !wq->unbound_attrs->no_numa);
+ nice, wq->numa);
apply_wqattrs_unlock();

return ret ?: count;
@@ -4986,7 +4971,7 @@ static ssize_t wq_cpumask_store(struct device *dev,

apply_wqattrs_lock();
ret = apply_workqueue_attrs_locked(wq, cpumask, wq->unbound_attrs->nice,
- !wq->unbound_attrs->no_numa);
+ wq->numa);
apply_wqattrs_unlock();

free_cpumask_var(cpumask);
@@ -5000,8 +4985,7 @@ static ssize_t wq_numa_show(struct device *dev, struct device_attribute *attr,
int written;

mutex_lock(&wq->mutex);
- written = scnprintf(buf, PAGE_SIZE, "%d\n",
- !wq->unbound_attrs->no_numa);
+ written = scnprintf(buf, PAGE_SIZE, "%d\n", wq->numa);
mutex_unlock(&wq->mutex);

return written;
--
2.1.0

2015-05-11 12:23:49

by Tejun Heo

Subject: Re: [PATCH 1/5] workqueue: wq_pool_mutex protects the attrs-installation

On Mon, May 11, 2015 at 05:35:48PM +0800, Lai Jiangshan wrote:
> @@ -127,6 +127,12 @@ enum {
> *
> * PR: wq_pool_mutex protected for writes. Sched-RCU protected for reads.
> *
> + * PW: wq_pool_mutex and wq->mutex protected for writes. Any one of them
> + * protected for reads.

Either for reads.

> + *
> + * PWR: wq_pool_mutex and wq->mutex protected for writes. Any one of them
> + * or sched-RCU for reads.

Ditto.

> + *
> * WQ: wq->mutex protected.
> *
> * WR: wq->mutex protected for writes. Sched-RCU protected for reads.
...
> @@ -553,7 +565,7 @@ static int worker_pool_assign_id(struct worker_pool *pool)
> * @wq: the target workqueue
> * @node: the node ID
> *
> - * This must be called either with pwq_lock held or sched RCU read locked.
> + * This must be called either with wq_pool_mutex held or sched RCU read locked.

The comment was outdated before too but the updated one isn't correct
either.

> * If the pwq needs to be used beyond the locking in effect, the caller is
> * responsible for guaranteeing that the pwq stays online.
> *
> @@ -562,7 +574,7 @@ static int worker_pool_assign_id(struct worker_pool *pool)
> static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
> int node)
> {
> - assert_rcu_or_wq_mutex(wq);
> + assert_rcu_or_wq_mutex_or_pool_mutex(wq);
> return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
> }
>
...
> @@ -3644,10 +3657,9 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
> * pwqs accordingly.
> */
> get_online_cpus();
> -
> mutex_lock(&wq_pool_mutex);
> +
> ctx = apply_wqattrs_prepare(wq, attrs);
> - mutex_unlock(&wq_pool_mutex);
>
> /* the ctx has been prepared successfully, let's commit it */
> if (ctx) {
> @@ -3655,10 +3667,11 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
> ret = 0;
> }
>
> - put_online_cpus();
> -
> apply_wqattrs_cleanup(ctx);

Why are we protecting cleanup?

> + mutex_unlock(&wq_pool_mutex);
> + put_online_cpus();
> +
> return ret;
> }
>

Thanks.

--
tejun

2015-05-11 14:31:59

by Tejun Heo

Subject: Re: [PATCH 2/5] workqueue: merge the similar code

Hello, Lai.

On Mon, May 11, 2015 at 05:35:49PM +0800, Lai Jiangshan wrote:
> @@ -3440,48 +3440,86 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
> }
>
> /**
> - * wq_calc_node_mask - calculate a wq_attrs' cpumask for the specified node
> - * @attrs: the wq_attrs of the default pwq of the target workqueue
> + * alloc_node_unbound_pwq - allocate a pwq for specified node

for the specified node

> + * @wq: the target workqueue
> + * @dfl_pwq: the allocated default pwq
> + * @numa: NUMA affinity
> * @node: the target NUMA node
> - * @cpu_going_down: if >= 0, the CPU to consider as offline
> - * @cpumask: outarg, the resulting cpumask
> + * @cpu_off: if >= 0, the CPU to consider as offline

@cpu_off sounds like offset into cpu array or sth. Is there a reason
to change the name?

> + * @use_dfl_when_fail: use the @dfl_pwq in case the normal allocation failed

@use_dfl_on_fail

> + *
> + * Allocate or reuse a pwq with the cpumask that @wq should use on @node.

I wonder whether a better name for the function would be sth like
get_alloc_node_unbound_pwq().

> *
> - * Calculate the cpumask a workqueue with @attrs should use on @node. If
> - * @cpu_going_down is >= 0, that cpu is considered offline during
> - * calculation. The result is stored in @cpumask.
> + * If NUMA affinity is not enabled, @dfl_pwq is always used. @dfl_pwq
> + * was allocated with the effetive attrs saved in @dfl_pwq->pool->attrs.

I'm not sure we need the second sentence.

...
> - * Return: %true if the resulting @cpumask is different from @attrs->cpumask,
> - * %false if equal.
> + * Return: valid pwq, it might be @dfl_pwq under some conditions
> + * or might be the old pwq of the @node.
> + * NULL, when the normal allocation failed.

Maybe explain how @use_dfl_on_fail affects the result?

> */
> -static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
> - int cpu_going_down, cpumask_t *cpumask)
> +static struct pool_workqueue *
> +alloc_node_unbound_pwq(struct workqueue_struct *wq,
> + struct pool_workqueue *dfl_pwq, bool numa,
> + int node, int cpu_off, bool use_dfl_when_fail)
> +
> {
> - if (!wq_numa_enabled || attrs->no_numa)
> + struct pool_workqueue *pwq = unbound_pwq_by_node(wq, node);
> + struct workqueue_attrs *attrs = dfl_pwq->pool->attrs, *tmp_attrs;
> + cpumask_t *cpumask;
> +
> + lockdep_assert_held(&wq_pool_mutex);
> +
> + if (!wq_numa_enabled || !numa)
> goto use_dfl;
>
> + /*
> + * We don't wanna alloc/free wq_attrs for each call. Let's use a
> + * preallocated one. It is protected by wq_pool_mutex.
> + * tmp_attrs->cpumask will be updated in next and tmp_attrs->no_numa

will be updated below

> + * is not used, so we just need to initialize tmp_attrs->nice;
> + */
> + tmp_attrs = wq_update_unbound_numa_attrs_buf;
> + tmp_attrs->nice = attrs->nice;
> + cpumask = tmp_attrs->cpumask;
> +
> /* does @node have any online CPUs @attrs wants? */
> cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
> - if (cpu_going_down >= 0)
> - cpumask_clear_cpu(cpu_going_down, cpumask);
> -
> + if (cpu_off >= 0)
> + cpumask_clear_cpu(cpu_off, cpumask);
> if (cpumask_empty(cpumask))
> goto use_dfl;
>
> - /* yeap, return possible CPUs in @node that @attrs wants */
> + /* yeap, use possible CPUs in @node that @attrs wants */
> cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
> - return !cpumask_equal(cpumask, attrs->cpumask);
> + if (cpumask_equal(cpumask, attrs->cpumask))
> + goto use_dfl;
> + if (pwq && wqattrs_equal(tmp_attrs, pwq->pool->attrs))
> + goto use_existed;

goto use_current;

Also, would it be difficult to put this in a separate patch? This is
mixing code refactoring with behavior change. Make both code paths
behave the same way first and then refactor?

> +
> + /* create a new pwq */
> + pwq = alloc_unbound_pwq(wq, tmp_attrs);
> + if (!pwq && use_dfl_when_fail) {
> + pr_warn("workqueue: allocation failed while updating NUMA affinity of \"%s\"\n",
> + wq->name);
> + goto use_dfl;

Does this need to be in this function? Can't we let the caller handle
the fallback instead?

Thanks.

--
tejun

2015-05-11 14:55:09

by Tejun Heo

Subject: Re: [PATCH 3/5] workqueue: ensure attrs-changing be sequentially

Hey,

Prolly a better subject is "ensure attrs changes are properly
synchronized"

On Mon, May 11, 2015 at 05:35:50PM +0800, Lai Jiangshan wrote:
> Current modification to attrs via sysfs is not atomically.

atomic.

>
> Process A (change cpumask) | Process B (change numa affinity)
> wq_cpumask_store() |
> wq_sysfs_prep_attrs() |
^
misaligned

> | apply_workqueue_attrs()
> apply_workqueue_attrs() |
>
> It results that the Process B's operation is totally reverted
> without any notification.

Yeah, right.

> This behavior is acceptable but it is sometimes unexpected.

I don't think this is an acceptable behavior.

> Sequential model on non-performance-sensitive operations is more popular
> and preferred. So this patch moves wq_sysfs_prep_attrs() into the protection

You can just say the previous behavior is buggy.

> under wq_pool_mutex to ensure attrs-changing be sequentially.
>
> This patch is also a preparation patch for next patch which change
> the API of apply_workqueue_attrs().
...
> +static void apply_wqattrs_lock(void)
> +{
> + /*
> + * CPUs should stay stable across pwq creations and installations.
> + * Pin CPUs, determine the target cpumask for each node and create
> + * pwqs accordingly.
> + */
> + get_online_cpus();
> + mutex_lock(&wq_pool_mutex);
> +}
> +
> +static void apply_wqattrs_unlock(void)
> +{
> + mutex_unlock(&wq_pool_mutex);
> + put_online_cpus();
> +}

Separate out refactoring and extending locking coverage?

Thanks.

--
tejun

2015-05-11 15:00:02

by Tejun Heo

Subject: Re: [PATCH 4/5] workqueue: don't expose workqueue_attrs to users

On Mon, May 11, 2015 at 05:35:51PM +0800, Lai Jiangshan wrote:
> workqueue_attrs is an internal-like structure and is exposed with
> apply_workqueue_attrs() whose user has to investigate the structure
> before use.
>
> And the apply_workqueue_attrs() API is inconvenient with the structure.
> The user (although there is no user yet currently) has to assemble
> several LoC to use:
> attrs = alloc_workqueue_attrs();
> if (!attrs)
> return;
> attrs->nice = ...;
> copy cpumask;
> attrs->no_numa = ...;
> apply_workqueue_attrs();
> free_workqueue_attrs();
>
> It is too elaborate. This patch changes apply_workqueue_attrs() API,
> and one-line-code is enough to be called from user:
> apply_workqueue_attrs(wq, cpumask, nice, numa);
>
> This patch also reduces the code of workqueue.c, about -50 lines.
> wq_sysfs_prep_attrs() is removed, wq_[nice|cpumask|numa]_store()
> directly access to the ->unbound_attrs with the protection
> of apply_wqattrs_lock();
>
> This patch is also a preparation patch of next patch which
> remove no_numa out from the structure workqueue_attrs which
> requires apply_workqueue_attrs() has an argument to pass numa affinity.

I'm not sure about this. Yeah, sure, it's a bit more lines of code
but at the same time this'd allow us to make the public interface
atomic too. What we prolly should do is changing the interface so
that we do

attrs = prepare_workqueue_attrs(gfp_mask); /* allocate, lock & copy */
/* modify attrs as desired */
commit_workqueue_attrs(attrs); /* apply, unlock and free */

Thanks.

--
tejun

2015-05-12 01:59:44

by Lai Jiangshan

Subject: Re: [PATCH 2/5] workqueue: merge the similar code

On 05/11/2015 10:31 PM, Tejun Heo wrote:
> Hello, Lai.

Hello, TJ

>
>> * @node: the target NUMA node
>> - * @cpu_going_down: if >= 0, the CPU to consider as offline
>> - * @cpumask: outarg, the resulting cpumask
>> + * @cpu_off: if >= 0, the CPU to consider as offline
>
> @cpu_off sounds like offset into cpu array or sth. Is there a reason
> to change the name?

@cpu_off is a local variable in wq_update_unbound_numa() and is a shorter
name.

>
>> + *
>> + * Allocate or reuse a pwq with the cpumask that @wq should use on @node.
>
> I wonder whether a better name for the function would be sth like
> get_alloc_node_unbound_pwq().
>

The length of the name alloc_node_unbound_pwq() already causes indentation
trouble at the call site. I can add a variable to ease the indentation
problem later, but IMHO get_alloc_node_unbound_pwq() is not strictly a better
name than alloc_node_unbound_pwq(). Maybe we can consider get_node_unbound_pwq()?

>> *
>> - * Calculate the cpumask a workqueue with @attrs should use on @node. If
>> - * @cpu_going_down is >= 0, that cpu is considered offline during
>> - * calculation. The result is stored in @cpumask.
>> + * If NUMA affinity is not enabled, @dfl_pwq is always used. @dfl_pwq
>> + * was allocated with the effetive attrs saved in @dfl_pwq->pool->attrs.
>
> I'm not sure we need the second sentence.

effetive -> effective

I used "the effetive attrs" twice bellow. I need help to rephrase it,
might you do me a favor? Or just use it without introducing it at first?

+ * If enabled and @node has online CPUs requested by the effetive attrs,
+ * the cpumask is the intersection of the possible CPUs of @node and
+ * the cpumask of the effetive attrs.

>> + if (cpumask_equal(cpumask, attrs->cpumask))
>> + goto use_dfl;
>> + if (pwq && wqattrs_equal(tmp_attrs, pwq->pool->attrs))
>> + goto use_existed;
>
> goto use_current;

The label use_existed is shared with use_dfl:

use_dfl:
pwq = dfl_pwq;
use_existed:
spin_lock_irq(&pwq->pool->lock);
get_pwq(pwq);
spin_unlock_irq(&pwq->pool->lock);
return pwq;

But I don't think the dfl_pwq is current.

>
> Also, would it be difficult to put this in a separate patch? This is
> mixing code refactoring with behavior change. Make both code paths
> behave the same way first and then refactor?
>
>> +
>> + /* create a new pwq */
>> + pwq = alloc_unbound_pwq(wq, tmp_attrs);
>> + if (!pwq && use_dfl_when_fail) {
>> + pr_warn("workqueue: allocation failed while updating NUMA affinity of \"%s\"\n",
>> + wq->name);
>> + goto use_dfl;
>
> Does this need to be in this function? Can't we let the caller handle
> the fallback instead?

Wouldn't that leave behind the duplicated code that this patch tries to remove?

I will try it with introducing a get_pwq_unlocked().

Thanks,
Lai

2015-05-12 02:12:31

by Lai Jiangshan

Subject: Re: [PATCH 4/5] workqueue: don't expose workqueue_attrs to users

On 05/11/2015 10:59 PM, Tejun Heo wrote:
> On Mon, May 11, 2015 at 05:35:51PM +0800, Lai Jiangshan wrote:
>> workqueue_attrs is an internal-like structure and is exposed with
>> apply_workqueue_attrs() whose user has to investigate the structure
>> before use.
>>
>> And the apply_workqueue_attrs() API is inconvenient with the structure.
>> The user (although there is no user yet currently) has to assemble
>> several LoC to use:
>> attrs = alloc_workqueue_attrs();
>> if (!attrs)
>> return;
>> attrs->nice = ...;
>> copy cpumask;
>> attrs->no_numa = ...;
>> apply_workqueue_attrs();
>> free_workqueue_attrs();
>>
>> It is too elaborate. This patch changes apply_workqueue_attrs() API,
>> and one-line-code is enough to be called from user:
>> apply_workqueue_attrs(wq, cpumask, nice, numa);
>>
>> This patch also reduces the code in workqueue.c by about 50 lines.
>> wq_sysfs_prep_attrs() is removed; wq_[nice|cpumask|numa]_store()
>> directly access ->unbound_attrs under the protection
>> of apply_wqattrs_lock().
>>
>> This patch is also a preparation for the next patch, which
>> removes no_numa from struct workqueue_attrs and therefore
>> requires apply_workqueue_attrs() to take an argument for passing NUMA affinity.
>
> I'm not sure about this. Yeah, sure, it's a bit more lines of code
> but at the same time this'd allow us to make the public interface
> atomic too. What we prolly should do is changing the interface so
> that we do
>
> attrs = prepare_workqueue_attrs(gfp_mask); /* allocate, lock & copy */
> /* modify attrs as desired */
> commit_workqueue_attrs(attrs); /* apply, unlock and free */

I think workqueue.c has too many complicated and rarely used APIs
and exposes too much in this way. No one can set the nice value
and the cpus_allowed of a task atomically.

If the user wants atomic updates, they can just disable WQ_SYSFS
on their workqueue and maintain a copy of the cpumask, nice, and numa values
under their own lock.
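
To make that suggestion concrete, here is a minimal, hypothetical sketch of a
driver that keeps its own lock and cached copy and funnels every change through
the one-call API proposed in this patch. The struct and function names are
invented for illustration, and the apply_workqueue_attrs() signature is the one
proposed in this series, not the current mainline one.

struct my_wq_ctl {
        struct mutex            lock;    /* serializes all attrs changes */
        cpumask_var_t           cpumask; /* cached copy of the current cpumask */
        int                     nice;    /* cached copy of the current nice */
        bool                    numa;    /* cached copy of the NUMA setting */
        struct workqueue_struct *wq;     /* created without WQ_SYSFS */
};

static int my_wq_set_nice(struct my_wq_ctl *ctl, int nice)
{
        int ret;

        mutex_lock(&ctl->lock);
        /* one call applies cpumask, nice and numa together */
        ret = apply_workqueue_attrs(ctl->wq, ctl->cpumask, nice, ctl->numa);
        if (!ret)
                ctl->nice = nice;       /* keep the cached copy in sync */
        mutex_unlock(&ctl->lock);
        return ret;
}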

>
> Thanks.
>

2015-05-12 05:06:24

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 3/5] workqueue: ensure attrs-changing be sequentially

On 05/11/2015 10:55 PM, Tejun Heo wrote:
> Hey,
>
> Prolly a better subject is "ensure attrs changes are properly
> synchronized"
>
> On Mon, May 11, 2015 at 05:35:50PM +0800, Lai Jiangshan wrote:
>> Current modification to attrs via sysfs is not atomically.
>
> atomic.
>
>>
>> Process A (change cpumask) | Process B (change numa affinity)
>> wq_cpumask_store() |
>> wq_sysfs_prep_attrs() |
> ^
> misaligned

It is aligned in the email, misaligned in the quoted email, misaligned
in `git log` and `git show`, and aligned in `git commit` when I wrote
the changelog.

I will just remove all the |.

>
>> | apply_workqueue_attrs()
>> apply_workqueue_attrs() |
>>
>> It results that the Process B's operation is totally reverted
>> without any notification.
>
> Yeah, right.
>
>> This behavior is acceptable but it is sometimes unexpected.
>
> I don't think this is an acceptable behavior.
>
>> Sequential model on non-performance-sensitive operations is more popular
>> and preferred. So this patch moves wq_sysfs_prep_attrs() into the protection
>
> You can just say the previous behavior is buggy.

It depends on definitions. To me, it is just a nuisance.

>
>> under wq_pool_mutex to ensure attrs-changing be sequentially.
>>
>> This patch is also a preparation for the next patch, which changes
>> the API of apply_workqueue_attrs().
> ...
>> +static void apply_wqattrs_lock(void)
>> +{
>> + /*
>> + * CPUs should stay stable across pwq creations and installations.
>> + * Pin CPUs, determine the target cpumask for each node and create
>> + * pwqs accordingly.
>> + */
>> + get_online_cpus();
>> + mutex_lock(&wq_pool_mutex);
>> +}
>> +
>> +static void apply_wqattrs_unlock(void)
>> +{
>> + mutex_unlock(&wq_pool_mutex);
>> + put_online_cpus();
>> +}
>
> Separate out refactoring and extending locking coverage?
>
> Thanks.
>
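
For context, a sketch of how apply_workqueue_attrs() would sit on top of the
two helpers quoted above. The prepare/commit/cleanup helpers are assumed from
earlier patches in the series; this illustrates the intended flow rather than
being a verbatim copy of the patch.

static int apply_workqueue_attrs_locked(struct workqueue_struct *wq,
                                        const struct workqueue_attrs *attrs)
{
        struct apply_wqattrs_ctx *ctx;

        ctx = apply_wqattrs_prepare(wq, attrs);
        if (!ctx)
                return -ENOMEM;

        /* the ctx has been prepared successfully, let's commit it */
        apply_wqattrs_commit(ctx);
        apply_wqattrs_cleanup(ctx);
        return 0;
}

int apply_workqueue_attrs(struct workqueue_struct *wq,
                          const struct workqueue_attrs *attrs)
{
        int ret;

        apply_wqattrs_lock();           /* get_online_cpus() + wq_pool_mutex */
        ret = apply_workqueue_attrs_locked(wq, attrs);
        apply_wqattrs_unlock();

        return ret;
}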

2015-05-12 13:16:40

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/5] workqueue: merge the similar code

Hello,

On Tue, May 12, 2015 at 10:03:11AM +0800, Lai Jiangshan wrote:
> > @cpu_off sounds like offset into cpu array or sth. Is there a reason
> > to change the name?
>
> @cpu_off is a local variable in wq_update_unbound_numa() and is a shorter
> name.

Let's stick with the other name.

> >
> >> + *
> >> + * Allocate or reuse a pwq with the cpumask that @wq should use on @node.
> >
> > I wonder whether a better name for the function would be sth like
> > get_alloc_node_unbound_pwq().
> >
>
> The length of the name alloc_node_unbound_pwq() has already caused indentation
> trouble at the call site. I can add a variable later to ease the indentation
> problem, but IMHO get_alloc_node_unbound_pwq() is not strictly a better name
> than alloc_node_unbound_pwq(). Maybe we can consider get_node_unbound_pwq()?

Hmmm... the thing w/ "get" is that it gets confusing w/ refcnt ops.
alloc is kinda misleading and we do use concatenations of two verbs
for things like this - find_get, lookup_create and so on. If the name
is too long (is it really tho?), do we really need "node" in the name?

> >> *
> >> - * Calculate the cpumask a workqueue with @attrs should use on @node. If
> >> - * @cpu_going_down is >= 0, that cpu is considered offline during
> >> - * calculation. The result is stored in @cpumask.
> >> + * If NUMA affinity is not enabled, @dfl_pwq is always used. @dfl_pwq
> >> + * was allocated with the effetive attrs saved in @dfl_pwq->pool->attrs.
> >
> > I'm not sure we need the second sentence.
>
> effetive -> effective
>
> I used "the effetive attrs" twice bellow. I need help to rephrase it,
> might you do me a favor? Or just use it without introducing it at first?

It's just a bit unnecessary to re-state where dfl_pwq is allocated
here. It's an invariant which is explained where it's set up. I
don't think we need extra explanation here.

> >> + if (cpumask_equal(cpumask, attrs->cpumask))
> >> + goto use_dfl;
> >> + if (pwq && wqattrs_equal(tmp_attrs, pwq->pool->attrs))
> >> + goto use_existed;
> >
> > goto use_current;
>
> The label use_existed is shared with use_dfl:

It's just bad phrasing. If you want to use "exist", you can say
use_existing as in "use (the) existing (one)".

> use_dfl:
> pwq = dfl_pwq;
> use_existed:
> spin_lock_irq(&pwq->pool->lock);
> get_pwq(pwq);
> spin_unlock_irq(&pwq->pool->lock);
> return pwq;
>
> But I don't think the dfl_pwq is current.

I don't think we generally try to make the combination of consecutive
goto labels come out as a coherent narrative.

> >> +
> >> + /* create a new pwq */
> >> + pwq = alloc_unbound_pwq(wq, tmp_attrs);
> >> + if (!pwq && use_dfl_when_fail) {
> >> + pr_warn("workqueue: allocation failed while updating NUMA affinity of \"%s\"\n",
> >> + wq->name);
> >> + goto use_dfl;
> >
> > Does this need to be in this function? Can't we let the caller handle
> > the fallback instead?
>
> Wouldn't that leave behind the duplicated code that this patch tries to remove?

Even if it ends up several more lines of code, I think that'd be
cleaner. Look at the parameters this function is taking. It looks
almost incoherent.

Thanks.

--
tejun

2015-05-12 13:19:46

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 3/5] workqueue: ensure attrs-changing be sequentially

Hello,

On Tue, May 12, 2015 at 01:09:50PM +0800, Lai Jiangshan wrote:
> On 05/11/2015 10:55 PM, Tejun Heo wrote:
> >>
> >> Process A (change cpumask) | Process B (change numa affinity)
> >> wq_cpumask_store() |
> >> wq_sysfs_prep_attrs() |
> > ^
> > misaligned
>
> It is aligned in the email, misaligned in the quoted email, misaligned
> in `git log` and `git show`, and aligned in `git commit` when I wrote
> the changelog.
>
> I will just remove all the |.

Hmmm... I wonder why that is. It looks consistently misaligned here.
If in doubt, just use spaces instead of tabs when drawing stuff.

...
> >> Sequential model on non-performance-sensitive operations is more popular
> >> and preferred. So this patch moves wq_sysfs_prep_attrs() into the protection
> >
> > You can just say the previous behavior is buggy.
>
> It depends on definitions. To me, it is just a nuisance.

I find this pretty difficult to agree with. A does an operation which
changes attribute 1. B independently tries to change attribute 2.
Depending on the sequence, we end up with three different results.
How is this not a bug?

Thanks.

--
tejun

2015-05-12 13:22:16

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 4/5] workqueue: don't expose workqueue_attrs to users

Hello, Lai.

On Tue, May 12, 2015 at 10:15:28AM +0800, Lai Jiangshan wrote:
> > I'm not sure about this. Yeah, sure, it's a bit more lines of code
> > but at the same time this'd allow us to make the public interface
> > atomic too. What we prolly should do is changing the interface so
> > that we do
> >
> > attrs = prepare_workqueue_attrs(gfp_mask); /* allocate, lock & copy */
> > /* modify attrs as desired */
> > commit_workqueue_attrs(attrs); /* apply, unlock and free */
>
> I think workqueue.c has too many complicated and rarely used APIs
> and exposes too much in this way. No one can set the nice value
> and the cpus_allowed of a task atomically.

What do you mean no one can?

> If the user wants atomic updates, they can just disable WQ_SYSFS
> on their workqueue and maintain a copy of the cpumask, nice, and numa values
> under their own lock.

So, we're now requiring workqueue users to take care of
synchronization, disabling and reinstating WQ_SYSFS (what if userland
hits those knobs at the same time?) and poking into workqueue struct
to determine the current values of the attributes that the user is not
interested in changing? This is a horrible interface.

Thanks.

--
tejun

2015-05-13 01:39:54

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 4/5] workqueue: don't expose workqueue_attrs to users

On 05/12/2015 09:22 PM, Tejun Heo wrote:
> Hello, Lai.
>
> On Tue, May 12, 2015 at 10:15:28AM +0800, Lai Jiangshan wrote:
>>> I'm not sure about this. Yeah, sure, it's a bit more lines of code
>>> but at the same time this'd allow us to make the public interface
>>> atomic too. What we prolly should do is changing the interface so
>>> that we do
>>>
>>> attrs = prepare_workqueue_attrs(gfp_mask); /* allocate, lock & copy */
>>> /* modify attrs as desired */
>>> commit_workqueue_attrs(attrs); /* apply, unlock and free */
>>
>> I think workqueue.c has too many complicated and rarely used APIs
>> and exposes too much in this way. No one can set the nice value
>> and the cpus_allowed of a task atomically.
>
> What do you mean no one can?

A normal/general task, not a kworker.

No one can set the nice value and the cpus_allowed of a normal task atomically.

The kernel doesn't have such APIs:

lock_and_get_task_cpus_allowed(task);
/* modify cpumask */
set_cpus_allowed_ptr_and_unlock();


>
>> If the user wants atomic updates, they can just disable WQ_SYSFS
>> on their workqueue and maintain a copy of the cpumask, nice, and numa values
>> under their own lock.
>
> So, we're now requiring workqueue users to take care of
> synchronization, disabling and reinstating WQ_SYSFS (what if userland
> hits those knobs at the same time?)

I think there are no userland knobs when !WQ_SYSFS.

> and poking into workqueue struct to determine the current values of the

I think the copies of the cpumask, nice, and numa values are the same as
what the workqueue struct has. No poking is required.
(Its own lock-protected region is the ONLY place that calls apply_workqueue_attrs().)

> attributes that the user is not
> interested in changing? This is a horrible interface.



2015-05-13 13:53:00

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 4/5] workqueue: don't expose workqueue_attrs to users

Hello, Lai.

On Wed, May 13, 2015 at 09:43:19AM +0800, Lai Jiangshan wrote:
> >> I think workqueue.c has too many complicated and rarely used APIs
> >> and exposes too much in this way. No one can set the nice value
> >> and the cpus_allowed of a task atomically.
> >
> > What do you mean no one can?
>
> A normal/general task, not a kworker.
>
> No one can set the nice value and the cpus_allowed of a normal task atomically.
>
> The kernel doesn't have such APIs:
>
> lock_and_get_task_cpus_allowed(task);
> /* modify cpumask */
> set_cpus_allowed_ptr_and_unlock();

I'm still not following. What are you trying to say?

> > So, we're now requiring workqueue users to take care of
> > synchronization, disabling and reinstating WQ_SYSFS (what if userland
> > hits those knobs at the same time?)
>
> I think there are no userland knobs when !WQ_SYSFS.

So, fail apply attrs calls if the workqueue is exposed to userland?
Are you serious?

> > and poking into workqueue struct to determine the current values of the
>
> I think the copies of the cpumask, nice, and numa values are the same as
> what the workqueue struct has. No poking is required.
> (Its own lock-protected region is the ONLY place that calls apply_workqueue_attrs().)

And how would the caller know the current values?

Thanks.

--
tejun
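
As an aside, a hypothetical sketch of the prepare/commit style interface Tejun
floats earlier in this subthread. Only the two function names come from his
suggestion; the signatures and bodies are invented here on top of the locking
and prepare/commit/cleanup helpers from this series.

/* take the attrs locks and hand the caller a copy of the current attrs */
struct workqueue_attrs *prepare_workqueue_attrs(struct workqueue_struct *wq,
                                                gfp_t gfp_mask)
{
        struct workqueue_attrs *attrs;

        apply_wqattrs_lock();           /* pin CPUs + wq_pool_mutex */
        attrs = alloc_workqueue_attrs(gfp_mask);
        if (!attrs) {
                apply_wqattrs_unlock();
                return NULL;
        }
        copy_workqueue_attrs(attrs, wq->unbound_attrs);
        return attrs;
}

/* apply the (possibly modified) attrs, then unlock and free them */
int commit_workqueue_attrs(struct workqueue_struct *wq,
                           struct workqueue_attrs *attrs)
{
        struct apply_wqattrs_ctx *ctx;
        int ret = -ENOMEM;

        ctx = apply_wqattrs_prepare(wq, attrs);
        if (ctx) {
                apply_wqattrs_commit(ctx);
                ret = 0;
        }
        apply_wqattrs_cleanup(ctx);

        apply_wqattrs_unlock();
        free_workqueue_attrs(attrs);
        return ret;
}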

2016-03-10 21:44:18

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 1/5] workqueue: wq_pool_mutex protects the attrs-installation

This patch was recently backported to 4.1.19, and when I merged it with -rt,
the following bug triggered:

===============================
[ INFO: suspicious RCU usage. ]
4.1.19-test-rt22+ #1 Not tainted
-------------------------------
/work/rt/stable-rt.git/kernel/workqueue.c:608 sched RCU, wq->mutex or wq_pool_mutex should be held!

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 0
2 locks held by swapper/0/1:
#0: ((pendingb_lock).lock){+.+...}, at: [<ffffffff8105e4b7>] __local_lock_irq+0x21/0x74
#1: (rcu_read_lock){......}, at: [<ffffffff8105fbdc>] rcu_read_lock+0x0/0x6c

stack backtrace:
CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.1.19-test-rt22+ #1
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
0000000000000000 ffff8802101dfd08 ffffffff816083ae ffff880210240000
0000000000000001 ffff8802101dfd38 ffffffff81087cf1 ffff88021588e800
0000000000000000 0000000000000100 ffff88021588e800 ffff8802101dfd58
Call Trace:
[<ffffffff816083ae>] dump_stack+0x67/0x90
[<ffffffff81087cf1>] lockdep_rcu_suspicious+0x107/0x110
[<ffffffff8105f9fe>] unbound_pwq_by_node+0x6c/0x93
[<ffffffff81060e62>] __queue_work+0xc8/0x2e7
[<ffffffff8106f0cc>] ? migrate_disable+0x28/0xe6
[<ffffffff81061126>] queue_work_on+0x85/0xb8
[<ffffffff81f54188>] ? acpi_battery_init+0x16/0x16
[<ffffffff8106a382>] __async_schedule+0x18b/0x19d
[<ffffffff81f54172>] ? acpi_memory_hotplug_init+0x12/0x12
[<ffffffff8106a3b9>] async_schedule+0x15/0x17
[<ffffffff81f54184>] acpi_battery_init+0x12/0x16
[<ffffffff81000415>] do_one_initcall+0xf7/0x18a
[<ffffffff8106692f>] ? parse_args+0x258/0x35f
[<ffffffff81f140be>] kernel_init_freeable+0x205/0x29e
[<ffffffff81f137d0>] ? do_early_param+0x86/0x86
[<ffffffff8160d9bc>] ? _raw_spin_unlock_irq+0x5d/0x72
[<ffffffff815fc28f>] ? rest_init+0x143/0x143
[<ffffffff815fc29d>] kernel_init+0xe/0xdf
[<ffffffff8160e712>] ret_from_fork+0x42/0x70
[<ffffffff815fc28f>] ? rest_init+0x143/0x143



On Mon, May 11, 2015 at 05:35:48PM +0800, Lai Jiangshan wrote:
> ---
> kernel/workqueue.c | 44 ++++++++++++++++++++++++++------------------
> 1 file changed, 26 insertions(+), 18 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index a3915ab..fa8b949 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -127,6 +127,12 @@ enum {
> *
> * PR: wq_pool_mutex protected for writes. Sched-RCU protected for reads.
> *
> + * PW: wq_pool_mutex and wq->mutex protected for writes. Any one of them
> + * protected for reads.
> + *
> + * PWR: wq_pool_mutex and wq->mutex protected for writes. Any one of them
> + * or sched-RCU for reads.

How exactly is sched-RCU protecting this here? The cause for the 4.1-rt issue
is that we converted the local_irq_save() in queue_work_on() into a
"local_lock_irqsave()" which when PREEMPT_RT is enabled will be a mutex that
disables migration (can not migrate). This also prevents the current CPU from
going offline.

Does this code really need to be protected from being preempted, or is
disabling migration good enough?

Thanks!

-- Steve


> + *
> * WQ: wq->mutex protected.
> *
> * WR: wq->mutex protected for writes. Sched-RCU protected for reads.
> @@ -247,8 +253,8 @@ struct workqueue_struct {
> int nr_drainers; /* WQ: drain in progress */
> int saved_max_active; /* WQ: saved pwq max_active */
>
> - struct workqueue_attrs *unbound_attrs; /* WQ: only for unbound wqs */
> - struct pool_workqueue *dfl_pwq; /* WQ: only for unbound wqs */
> + struct workqueue_attrs *unbound_attrs; /* PW: only for unbound wqs */
> + struct pool_workqueue *dfl_pwq; /* PW: only for unbound wqs */
>
> #ifdef CONFIG_SYSFS
> struct wq_device *wq_dev; /* I: for sysfs interface */
> @@ -268,7 +274,7 @@ struct workqueue_struct {
> /* hot fields used during command issue, aligned to cacheline */
> unsigned int flags ____cacheline_aligned; /* WQ: WQ_* flags */
> struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwqs */
> - struct pool_workqueue __rcu *numa_pwq_tbl[]; /* FR: unbound pwqs indexed by node */
> + struct pool_workqueue __rcu *numa_pwq_tbl[]; /* PWR: unbound pwqs indexed by node */
> };
>
> static struct kmem_cache *pwq_cache;
> @@ -349,6 +355,12 @@ static void workqueue_sysfs_unregister(struct workqueue_struct *wq);
> lockdep_is_held(&wq->mutex), \
> "sched RCU or wq->mutex should be held")
>
> +#define assert_rcu_or_wq_mutex_or_pool_mutex(wq) \
> + rcu_lockdep_assert(rcu_read_lock_sched_held() || \
> + lockdep_is_held(&wq->mutex) || \
> + lockdep_is_held(&wq_pool_mutex), \
> + "sched RCU, wq->mutex or wq_pool_mutex should be held")
> +
> #define for_each_cpu_worker_pool(pool, cpu) \
> for ((pool) = &per_cpu(cpu_worker_pools, cpu)[0]; \
> (pool) < &per_cpu(cpu_worker_pools, cpu)[NR_STD_WORKER_POOLS]; \
> @@ -553,7 +565,7 @@ static int worker_pool_assign_id(struct worker_pool *pool)
> * @wq: the target workqueue
> * @node: the node ID
> *
> - * This must be called either with pwq_lock held or sched RCU read locked.
> + * This must be called either with wq_pool_mutex held or sched RCU read locked.
> * If the pwq needs to be used beyond the locking in effect, the caller is
> * responsible for guaranteeing that the pwq stays online.
> *
> @@ -562,7 +574,7 @@ static int worker_pool_assign_id(struct worker_pool *pool)
> static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
> int node)
> {
> - assert_rcu_or_wq_mutex(wq);
> + assert_rcu_or_wq_mutex_or_pool_mutex(wq);
> return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
> }
>
> @@ -3480,6 +3492,7 @@ static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
> struct pool_workqueue *old_pwq;
>
> lockdep_assert_held(&wq->mutex);
> + lockdep_assert_held(&wq_pool_mutex);
>
> /* link_pwq() can handle duplicate calls */
> link_pwq(pwq);
> @@ -3644,10 +3657,9 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
> * pwqs accordingly.
> */
> get_online_cpus();
> -
> mutex_lock(&wq_pool_mutex);
> +
> ctx = apply_wqattrs_prepare(wq, attrs);
> - mutex_unlock(&wq_pool_mutex);
>
> /* the ctx has been prepared successfully, let's commit it */
> if (ctx) {
> @@ -3655,10 +3667,11 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
> ret = 0;
> }
>
> - put_online_cpus();
> -
> apply_wqattrs_cleanup(ctx);
>
> + mutex_unlock(&wq_pool_mutex);
> + put_online_cpus();
> +
> return ret;
> }
>
> @@ -3695,7 +3708,8 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
>
> lockdep_assert_held(&wq_pool_mutex);
>
> - if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND))
> + if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND) ||
> + wq->unbound_attrs->no_numa)
> return;
>
> /*
> @@ -3706,10 +3720,6 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
> target_attrs = wq_update_unbound_numa_attrs_buf;
> cpumask = target_attrs->cpumask;
>
> - mutex_lock(&wq->mutex);
> - if (wq->unbound_attrs->no_numa)
> - goto out_unlock;
> -
> copy_workqueue_attrs(target_attrs, wq->unbound_attrs);
> pwq = unbound_pwq_by_node(wq, node);
>
> @@ -3721,19 +3731,16 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
> */
> if (wq_calc_node_cpumask(wq->dfl_pwq->pool->attrs, node, cpu_off, cpumask)) {
> if (cpumask_equal(cpumask, pwq->pool->attrs->cpumask))
> - goto out_unlock;
> + return;
> } else {
> goto use_dfl_pwq;
> }
>
> - mutex_unlock(&wq->mutex);
> -
> /* create a new pwq */
> pwq = alloc_unbound_pwq(wq, target_attrs);
> if (!pwq) {
> pr_warn("workqueue: allocation failed while updating NUMA affinity of \"%s\"\n",
> wq->name);
> - mutex_lock(&wq->mutex);
> goto use_dfl_pwq;
> }
>
> @@ -3748,6 +3755,7 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
> goto out_unlock;
>
> use_dfl_pwq:
> + mutex_lock(&wq->mutex);
> spin_lock_irq(&wq->dfl_pwq->pool->lock);
> get_pwq(wq->dfl_pwq);
> spin_unlock_irq(&wq->dfl_pwq->pool->lock);
> --
> 2.1.0
>

2016-03-11 17:50:20

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/5] workqueue: wq_pool_mutex protects the attrs-installation

Hello, Steven.

On Thu, Mar 10, 2016 at 04:44:11PM -0500, Steven Rostedt wrote:
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index a3915ab..fa8b949 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -127,6 +127,12 @@ enum {
> > *
> > * PR: wq_pool_mutex protected for writes. Sched-RCU protected for reads.
> > *
> > + * PW: wq_pool_mutex and wq->mutex protected for writes. Any one of them
> > + * protected for reads.
> > + *
> > + * PWR: wq_pool_mutex and wq->mutex protected for writes. Any one of them
> > + * or sched-RCU for reads.
>
> How exactly is sched-RCU protecting this here? The cause for the 4.1-rt issue
> is that we converted the local_irq_save() in queue_work_on() into a
> "local_lock_irqsave()" which when PREEMPT_RT is enabled will be a mutex that
> disables migration (can not migrate). This also prevents the current CPU from
> going offline.
>
> Does this code really need to be protected from being preempted, or is
> disabling migration good enough?

That's used for workqueue->numa_pwq_tbl[] and it's dereferenced while
determining the pwq to use for the node. The queueing path already
has local irqs disabled and pwqs are sched-RCU protected. If the task
gets preempted in between, a sched-RCU grace period might complete and
the queueing path might try to dereference an already freed pwq, so
yeah, I think it does need preemption protection there.
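
A minimal sketch of the pattern described above; the function is invented for
illustration, but unbound_pwq_by_node() and the sched-RCU expectation are as in
the quoted code, and the real __queue_work() does considerably more than this.

static void queue_on_unbound_wq_sketch(struct workqueue_struct *wq, int cpu,
                                       struct work_struct *work)
{
        struct pool_workqueue *pwq;
        unsigned long flags;

        /*
         * Disabling irqs doubles as a sched-RCU read-side critical section
         * on mainline; a migrate-disabling local_lock on PREEMPT_RT does
         * not, which is the problem being discussed.
         */
        local_irq_save(flags);

        pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
        /*
         * If the task could be preempted here, a sched-RCU grace period
         * could complete and @pwq could already be freed by the time it
         * is used below.
         */
        spin_lock(&pwq->pool->lock);
        insert_work(pwq, work, &pwq->pool->worklist, 0);
        spin_unlock(&pwq->pool->lock);

        local_irq_restore(flags);
}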

Thanks.

--
tejun