2013-03-02 03:24:31

by Tejun Heo

Subject: [PATCHSET wq/for-3.10-tmp] workqueue: implement workqueue with custom worker attributes

Hello,

Finally, here's the unbound workqueue with custom worker attributes
patchset I've been talking about. The goal is simple. We want
unbound workqueues with custom worker attributes, along with a
mechanism to expose the knobs to userland.

Currently, the supported attributes are nice level and allowed
cpumask. It's likely that cgroup association will be added in the future.
Attributes are specified via struct workqueue_attrs.

struct workqueue_attrs {
        int                     nice;           /* nice level */
        cpumask_var_t           cpumask;        /* allowed CPUs */
};

which is allocated, applied and freed using the following functions.

struct workqueue_attrs *alloc_workqueue_attrs(gfp_t gfp_mask);
void free_workqueue_attrs(struct workqueue_attrs *attrs);
int apply_workqueue_attrs(struct workqueue_struct *wq,
                          const struct workqueue_attrs *attrs);

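For illustration, here's a minimal usage sketch (not taken from the
patches; "unbound_wq" stands for some existing WQ_UNBOUND workqueue and
the nice/cpumask values are arbitrary).

        struct workqueue_attrs *attrs;
        int ret;

        attrs = alloc_workqueue_attrs(GFP_KERNEL);
        if (!attrs)
                return -ENOMEM;

        attrs->nice = -5;                               /* renice the workers */
        cpumask_copy(attrs->cpumask, cpumask_of(0));    /* restrict to CPU0 */

        ret = apply_workqueue_attrs(unbound_wq, attrs);
        free_workqueue_attrs(attrs);
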
If the workqueue's knobs should be visible to userland, WQ_SYSFS can
be specified during alloc_workqueue() or workqueue_sysfs_register()
can be called. The knobs will be accessible under
/sys/bus/workqueue/devices/NAME/. max_active, nice and cpumask are
all adjustable from userland.
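
For example (a sketch, not part of the patches; the name is made up), a
workqueue that should expose its knobs from the start could be created
with:

        wq = alloc_workqueue("my_unbound_wq", WQ_UNBOUND | WQ_SYSFS, 0);

after which max_active, nice and cpumask would appear under
/sys/bus/workqueue/devices/my_unbound_wq/.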

Whenever a new set of attrs is applied, workqueue tries to find the
worker_pool with matching attributes. If there's one, its refcnt is
bumped and used; otherwise, a new one is created. A new
pool_workqueue is created to interface with the found or created
worker_pool and the old pwqs (pool_workqueues) stick around until all
in-flight work items finish. As pwqs retire, the associated
worker_pools are put too. As a result, workqueue will make all
workqueues with the same attributes share the same pool and only keep
around the pools which are in use.

The interface is simple but the implementation is quite involved
because the per-cpu assumption is still strongly entrenched in the
existing workqueue implementation, with unbound workqueue support
thrown on top as a hacky extension of the per-cpu
model. A lot of this patchset deals with decoupling per-cpu
assumptions from various parts.

After the per-cpu assumption is removed, unbound workqueue handling is
updated so that it can deal with multiple pwqs. With the pwq and pool
iterators updated to handle per-cpu and unbound ones equally, it
usually boils down to traveling the same path used by per-cpu
workqueues to deal with multiple per-cpu pwqs. For example,
non-reentrancy test while queueing and multiple pwq handling in
flush_workqueue() are now shared by both per-cpu and unbound
workqueues.

The result is pretty nice as per-cpu and unbound workqueues behave
almost the same with the only difference being per-cpu's pwqs are
per-cpu and unbound's are for different attributes. The handling
deviates only in creation and destruction paths.

This patchset doesn't introduce any uses of workqueue_attrs or
WQ_SYSFS. Writeback and btrfs IO workers are candidates for
conversion and will be done in separate patchsets.

This patchset contains the following 31 patches.

0001-workqueue-make-sanity-checks-less-punshing-using-WAR.patch
0002-workqueue-make-workqueue_lock-irq-safe.patch
0003-workqueue-introduce-kmem_cache-for-pool_workqueues.patch
0004-workqueue-add-workqueue_struct-pwqs-list.patch
0005-workqueue-replace-for_each_pwq_cpu-with-for_each_pwq.patch
0006-workqueue-introduce-for_each_pool.patch
0007-workqueue-restructure-pool-pool_workqueue-iterations.patch
0008-workqueue-add-wokrqueue_struct-maydays-list-to-repla.patch
0009-workqueue-consistently-use-int-for-cpu-variables.patch
0010-workqueue-remove-workqueue_struct-pool_wq.single.patch
0011-workqueue-replace-get_pwq-with-explicit-per_cpu_ptr-.patch
0012-workqueue-update-synchronization-rules-on-workqueue-.patch
0013-workqueue-update-synchronization-rules-on-worker_poo.patch
0014-workqueue-replace-POOL_MANAGING_WORKERS-flag-with-wo.patch
0015-workqueue-separate-out-init_worker_pool-from-init_wo.patch
0016-workqueue-introduce-workqueue_attrs.patch
0017-workqueue-implement-attribute-based-unbound-worker_p.patch
0018-workqueue-remove-unbound_std_worker_pools-and-relate.patch
0019-workqueue-drop-std-from-cpu_std_worker_pools-and-for.patch
0020-workqueue-add-pool-ID-to-the-names-of-unbound-kworke.patch
0021-workqueue-drop-WQ_RESCUER-and-test-workqueue-rescuer.patch
0022-workqueue-restructure-__alloc_workqueue_key.patch
0023-workqueue-implement-get-put_pwq.patch
0024-workqueue-prepare-flush_workqueue-for-dynamic-creati.patch
0025-workqueue-perform-non-reentrancy-test-when-queueing-.patch
0026-workqueue-implement-apply_workqueue_attrs.patch
0027-workqueue-make-it-clear-that-WQ_DRAINING-is-an-inter.patch
0028-workqueue-reject-increasing-max_active-for-ordered-w.patch
0029-cpumask-implement-cpumask_parse.patch
0030-driver-base-implement-subsys_virtual_register.patch
0031-workqueue-implement-sysfs-interface-for-workqueues.patch

0001-0003 are misc preps.

0004-0008 update various iterators such that they don't operate on cpu
number.

0009-0011 are another set of misc preps / cleanups.

0012-0014 update synchronization rules to prepare for dynamic
management of pwqs and pools.

0015-0022 introduce workqueue_attrs and prepare for dynamic management
of pwqs and pools.

0023-0026 implement dynamic application of workqueue_attrs which
involves creating and destroying unbound pwqs and pools dynamically.

0027-0028 prepare workqueue for sysfs exports.

0029-0030 make cpumask and driver core changes for workqueue sysfs
exports.

0031 implements sysfs exports for workqueues.

This patchset is on top of

[1] wq/for-3.10-tmp 7bceeff75e ("workqueue: better define synchronization rule around rescuer->pool updates")

which is scheduled to be rebased on top of v3.9-rc1 once it comes out.
The changes are also available in the following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-attrs

diffstat follows.

drivers/base/base.h | 2
drivers/base/bus.c | 73 +
drivers/base/core.c | 2
include/linux/cpumask.h | 15
include/linux/device.h | 2
include/linux/workqueue.h | 34
kernel/workqueue.c | 1716 +++++++++++++++++++++++++++++++-------------
kernel/workqueue_internal.h | 5
8 files changed, 1322 insertions(+), 527 deletions(-)

Thanks.

--
tejun

[1] git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-3.10-tmp


2013-03-02 03:24:38

by Tejun Heo

Subject: [PATCH 01/31] workqueue: make sanity checks less punishing using WARN_ON[_ONCE]()s

Workqueue has been using mostly BUG_ON()s for sanity checks, which
fail unnecessarily harshly when the assertion doesn't hold. Most
assertions can be converted to be less drastic so that things can limp
along instead of dying completely. Convert BUG_ON()s to
WARN_ON[_ONCE]()s with softer failure behaviors - e.g. if the assertion
check fails in destroy_worker(), trigger a WARN and silently ignore the
destruction request.

Most conversions are trivial. Note that sanity checks in
destroy_workqueue() are moved above removal from the workqueues list so
that it can bail out without side-effects if assertion checks fail.

This patch doesn't introduce any visible behavior changes during
normal operation.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 85 +++++++++++++++++++++++++++++-------------------------
1 file changed, 46 insertions(+), 39 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0b1e6f2..a533e77 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -530,7 +530,7 @@ static int work_next_color(int color)
static inline void set_work_data(struct work_struct *work, unsigned long data,
unsigned long flags)
{
- BUG_ON(!work_pending(work));
+ WARN_ON_ONCE(!work_pending(work));
atomic_long_set(&work->data, data | flags | work_static(work));
}

@@ -785,7 +785,8 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task,
pool = worker->pool;

/* this can only happen on the local cpu */
- BUG_ON(cpu != raw_smp_processor_id());
+ if (WARN_ON_ONCE(cpu != raw_smp_processor_id()))
+ return NULL;

/*
* The counterpart of the following dec_and_test, implied mb,
@@ -1459,9 +1460,10 @@ static void worker_enter_idle(struct worker *worker)
{
struct worker_pool *pool = worker->pool;

- BUG_ON(worker->flags & WORKER_IDLE);
- BUG_ON(!list_empty(&worker->entry) &&
- (worker->hentry.next || worker->hentry.pprev));
+ if (WARN_ON_ONCE(worker->flags & WORKER_IDLE) ||
+ WARN_ON_ONCE(!list_empty(&worker->entry) &&
+ (worker->hentry.next || worker->hentry.pprev)))
+ return;

/* can't use worker_set_flags(), also called from start_worker() */
worker->flags |= WORKER_IDLE;
@@ -1498,7 +1500,8 @@ static void worker_leave_idle(struct worker *worker)
{
struct worker_pool *pool = worker->pool;

- BUG_ON(!(worker->flags & WORKER_IDLE));
+ if (WARN_ON_ONCE(!(worker->flags & WORKER_IDLE)))
+ return;
worker_clr_flags(worker, WORKER_IDLE);
pool->nr_idle--;
list_del_init(&worker->entry);
@@ -1795,8 +1798,9 @@ static void destroy_worker(struct worker *worker)
int id = worker->id;

/* sanity check frenzy */
- BUG_ON(worker->current_work);
- BUG_ON(!list_empty(&worker->scheduled));
+ if (WARN_ON(worker->current_work) ||
+ WARN_ON(!list_empty(&worker->scheduled)))
+ return;

if (worker->flags & WORKER_STARTED)
pool->nr_workers--;
@@ -1925,7 +1929,8 @@ restart:
del_timer_sync(&pool->mayday_timer);
spin_lock_irq(&pool->lock);
start_worker(worker);
- BUG_ON(need_to_create_worker(pool));
+ if (WARN_ON_ONCE(need_to_create_worker(pool)))
+ goto restart;
return true;
}

@@ -2258,7 +2263,7 @@ recheck:
* preparing to process a work or actually processing it.
* Make sure nobody diddled with it while I was sleeping.
*/
- BUG_ON(!list_empty(&worker->scheduled));
+ WARN_ON_ONCE(!list_empty(&worker->scheduled));

/*
* When control reaches this point, we're guaranteed to have
@@ -2366,7 +2371,7 @@ repeat:
* Slurp in all works issued via this workqueue and
* process'em.
*/
- BUG_ON(!list_empty(&rescuer->scheduled));
+ WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
list_for_each_entry_safe(work, n, &pool->worklist, entry)
if (get_work_pwq(work) == pwq)
move_linked_works(work, scheduled, &n);
@@ -2501,7 +2506,7 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
unsigned int cpu;

if (flush_color >= 0) {
- BUG_ON(atomic_read(&wq->nr_pwqs_to_flush));
+ WARN_ON_ONCE(atomic_read(&wq->nr_pwqs_to_flush));
atomic_set(&wq->nr_pwqs_to_flush, 1);
}

@@ -2512,7 +2517,7 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
spin_lock_irq(&pool->lock);

if (flush_color >= 0) {
- BUG_ON(pwq->flush_color != -1);
+ WARN_ON_ONCE(pwq->flush_color != -1);

if (pwq->nr_in_flight[flush_color]) {
pwq->flush_color = flush_color;
@@ -2522,7 +2527,7 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
}

if (work_color >= 0) {
- BUG_ON(work_color != work_next_color(pwq->work_color));
+ WARN_ON_ONCE(work_color != work_next_color(pwq->work_color));
pwq->work_color = work_color;
}

@@ -2570,13 +2575,13 @@ void flush_workqueue(struct workqueue_struct *wq)
* becomes our flush_color and work_color is advanced
* by one.
*/
- BUG_ON(!list_empty(&wq->flusher_overflow));
+ WARN_ON_ONCE(!list_empty(&wq->flusher_overflow));
this_flusher.flush_color = wq->work_color;
wq->work_color = next_color;

if (!wq->first_flusher) {
/* no flush in progress, become the first flusher */
- BUG_ON(wq->flush_color != this_flusher.flush_color);
+ WARN_ON_ONCE(wq->flush_color != this_flusher.flush_color);

wq->first_flusher = &this_flusher;

@@ -2589,7 +2594,7 @@ void flush_workqueue(struct workqueue_struct *wq)
}
} else {
/* wait in queue */
- BUG_ON(wq->flush_color == this_flusher.flush_color);
+ WARN_ON_ONCE(wq->flush_color == this_flusher.flush_color);
list_add_tail(&this_flusher.list, &wq->flusher_queue);
flush_workqueue_prep_pwqs(wq, -1, wq->work_color);
}
@@ -2623,8 +2628,8 @@ void flush_workqueue(struct workqueue_struct *wq)

wq->first_flusher = NULL;

- BUG_ON(!list_empty(&this_flusher.list));
- BUG_ON(wq->flush_color != this_flusher.flush_color);
+ WARN_ON_ONCE(!list_empty(&this_flusher.list));
+ WARN_ON_ONCE(wq->flush_color != this_flusher.flush_color);

while (true) {
struct wq_flusher *next, *tmp;
@@ -2637,8 +2642,8 @@ void flush_workqueue(struct workqueue_struct *wq)
complete(&next->done);
}

- BUG_ON(!list_empty(&wq->flusher_overflow) &&
- wq->flush_color != work_next_color(wq->work_color));
+ WARN_ON_ONCE(!list_empty(&wq->flusher_overflow) &&
+ wq->flush_color != work_next_color(wq->work_color));

/* this flush_color is finished, advance by one */
wq->flush_color = work_next_color(wq->flush_color);
@@ -2662,7 +2667,7 @@ void flush_workqueue(struct workqueue_struct *wq)
}

if (list_empty(&wq->flusher_queue)) {
- BUG_ON(wq->flush_color != wq->work_color);
+ WARN_ON_ONCE(wq->flush_color != wq->work_color);
break;
}

@@ -2670,8 +2675,8 @@ void flush_workqueue(struct workqueue_struct *wq)
* Need to flush more colors. Make the next flusher
* the new first flusher and arm pwqs.
*/
- BUG_ON(wq->flush_color == wq->work_color);
- BUG_ON(wq->flush_color != next->flush_color);
+ WARN_ON_ONCE(wq->flush_color == wq->work_color);
+ WARN_ON_ONCE(wq->flush_color != next->flush_color);

list_del_init(&next->list);
wq->first_flusher = next;
@@ -3265,6 +3270,19 @@ void destroy_workqueue(struct workqueue_struct *wq)
/* drain it before proceeding with destruction */
drain_workqueue(wq);

+ /* sanity checks */
+ for_each_pwq_cpu(cpu, wq) {
+ struct pool_workqueue *pwq = get_pwq(cpu, wq);
+ int i;
+
+ for (i = 0; i < WORK_NR_COLORS; i++)
+ if (WARN_ON(pwq->nr_in_flight[i]))
+ return;
+ if (WARN_ON(pwq->nr_active) ||
+ WARN_ON(!list_empty(&pwq->delayed_works)))
+ return;
+ }
+
/*
* wq list is used to freeze wq, remove from list after
* flushing is complete in case freeze races us.
@@ -3273,17 +3291,6 @@ void destroy_workqueue(struct workqueue_struct *wq)
list_del(&wq->list);
spin_unlock(&workqueue_lock);

- /* sanity check */
- for_each_pwq_cpu(cpu, wq) {
- struct pool_workqueue *pwq = get_pwq(cpu, wq);
- int i;
-
- for (i = 0; i < WORK_NR_COLORS; i++)
- BUG_ON(pwq->nr_in_flight[i]);
- BUG_ON(pwq->nr_active);
- BUG_ON(!list_empty(&pwq->delayed_works));
- }
-
if (wq->flags & WQ_RESCUER) {
kthread_stop(wq->rescuer->task);
free_mayday_mask(wq->mayday_mask);
@@ -3427,7 +3434,7 @@ static void wq_unbind_fn(struct work_struct *work)
int i;

for_each_std_worker_pool(pool, cpu) {
- BUG_ON(cpu != smp_processor_id());
+ WARN_ON_ONCE(cpu != smp_processor_id());

mutex_lock(&pool->assoc_mutex);
spin_lock_irq(&pool->lock);
@@ -3597,7 +3604,7 @@ void freeze_workqueues_begin(void)

spin_lock(&workqueue_lock);

- BUG_ON(workqueue_freezing);
+ WARN_ON_ONCE(workqueue_freezing);
workqueue_freezing = true;

for_each_wq_cpu(cpu) {
@@ -3645,7 +3652,7 @@ bool freeze_workqueues_busy(void)

spin_lock(&workqueue_lock);

- BUG_ON(!workqueue_freezing);
+ WARN_ON_ONCE(!workqueue_freezing);

for_each_wq_cpu(cpu) {
struct workqueue_struct *wq;
@@ -3659,7 +3666,7 @@ bool freeze_workqueues_busy(void)
if (!pwq || !(wq->flags & WQ_FREEZABLE))
continue;

- BUG_ON(pwq->nr_active < 0);
+ WARN_ON_ONCE(pwq->nr_active < 0);
if (pwq->nr_active) {
busy = true;
goto out_unlock;
--
1.8.1.2

2013-03-02 03:25:08

by Tejun Heo

Subject: [PATCH 09/31] workqueue: consistently use int for @cpu variables

Workqueue is mixing unsigned int and int for @cpu variables. There's
no point in using unsigned int for cpus - many cpu-related APIs
take int anyway. Consistently use int for @cpu variables so that we
can use negative values to mark special ones.

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 6 +++---
kernel/workqueue.c | 24 +++++++++++-------------
kernel/workqueue_internal.h | 5 ++---
3 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 5bd030f..899be66 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -435,7 +435,7 @@ extern bool cancel_delayed_work_sync(struct delayed_work *dwork);

extern void workqueue_set_max_active(struct workqueue_struct *wq,
int max_active);
-extern bool workqueue_congested(unsigned int cpu, struct workqueue_struct *wq);
+extern bool workqueue_congested(int cpu, struct workqueue_struct *wq);
extern unsigned int work_busy(struct work_struct *work);

/*
@@ -466,12 +466,12 @@ static inline bool __deprecated flush_delayed_work_sync(struct delayed_work *dwo
}

#ifndef CONFIG_SMP
-static inline long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg)
+static inline long work_on_cpu(int cpu, long (*fn)(void *), void *arg)
{
return fn(arg);
}
#else
-long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg);
+long work_on_cpu(int cpu, long (*fn)(void *), void *arg);
#endif /* CONFIG_SMP */

#ifdef CONFIG_FREEZER
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8b38d1c..cbdc2ac 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -124,7 +124,7 @@ enum {

struct worker_pool {
spinlock_t lock; /* the pool lock */
- unsigned int cpu; /* I: the associated cpu */
+ int cpu; /* I: the associated cpu */
int id; /* I: pool ID */
unsigned int flags; /* X: flags */

@@ -467,8 +467,7 @@ static struct worker_pool *get_std_worker_pool(int cpu, bool highpri)
return &pools[highpri];
}

-static struct pool_workqueue *get_pwq(unsigned int cpu,
- struct workqueue_struct *wq)
+static struct pool_workqueue *get_pwq(int cpu, struct workqueue_struct *wq)
{
if (!(wq->flags & WQ_UNBOUND)) {
if (likely(cpu < nr_cpu_ids))
@@ -730,7 +729,7 @@ static void wake_up_worker(struct worker_pool *pool)
* CONTEXT:
* spin_lock_irq(rq->lock)
*/
-void wq_worker_waking_up(struct task_struct *task, unsigned int cpu)
+void wq_worker_waking_up(struct task_struct *task, int cpu)
{
struct worker *worker = kthread_data(task);

@@ -755,8 +754,7 @@ void wq_worker_waking_up(struct task_struct *task, unsigned int cpu)
* RETURNS:
* Worker task on @cpu to wake up, %NULL if none.
*/
-struct task_struct *wq_worker_sleeping(struct task_struct *task,
- unsigned int cpu)
+struct task_struct *wq_worker_sleeping(struct task_struct *task, int cpu)
{
struct worker *worker = kthread_data(task), *to_wakeup = NULL;
struct worker_pool *pool;
@@ -1160,7 +1158,7 @@ static bool is_chained_work(struct workqueue_struct *wq)
return worker && worker->current_pwq->wq == wq;
}

-static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
+static void __queue_work(int cpu, struct workqueue_struct *wq,
struct work_struct *work)
{
struct pool_workqueue *pwq;
@@ -1716,7 +1714,7 @@ static struct worker *create_worker(struct worker_pool *pool)
if (pool->cpu != WORK_CPU_UNBOUND)
worker->task = kthread_create_on_node(worker_thread,
worker, cpu_to_node(pool->cpu),
- "kworker/%u:%d%s", pool->cpu, id, pri);
+ "kworker/%d:%d%s", pool->cpu, id, pri);
else
worker->task = kthread_create(worker_thread, worker,
"kworker/u:%d%s", id, pri);
@@ -3347,7 +3345,7 @@ EXPORT_SYMBOL_GPL(workqueue_set_max_active);
* RETURNS:
* %true if congested, %false otherwise.
*/
-bool workqueue_congested(unsigned int cpu, struct workqueue_struct *wq)
+bool workqueue_congested(int cpu, struct workqueue_struct *wq)
{
struct pool_workqueue *pwq = get_pwq(cpu, wq);

@@ -3464,7 +3462,7 @@ static int __cpuinit workqueue_cpu_up_callback(struct notifier_block *nfb,
unsigned long action,
void *hcpu)
{
- unsigned int cpu = (unsigned long)hcpu;
+ int cpu = (unsigned long)hcpu;
struct worker_pool *pool;

switch (action & ~CPU_TASKS_FROZEN) {
@@ -3510,7 +3508,7 @@ static int __cpuinit workqueue_cpu_down_callback(struct notifier_block *nfb,
unsigned long action,
void *hcpu)
{
- unsigned int cpu = (unsigned long)hcpu;
+ int cpu = (unsigned long)hcpu;
struct work_struct unbind_work;

switch (action & ~CPU_TASKS_FROZEN) {
@@ -3550,7 +3548,7 @@ static void work_for_cpu_fn(struct work_struct *work)
* It is up to the caller to ensure that the cpu doesn't go offline.
* The caller must not hold any locks which would prevent @fn from completing.
*/
-long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg)
+long work_on_cpu(int cpu, long (*fn)(void *), void *arg)
{
struct work_for_cpu wfc = { .fn = fn, .arg = arg };

@@ -3708,7 +3706,7 @@ out_unlock:

static int __init init_workqueues(void)
{
- unsigned int cpu;
+ int cpu;

/* make sure we have enough bits for OFFQ pool ID */
BUILD_BUG_ON((1LU << (BITS_PER_LONG - WORK_OFFQ_POOL_SHIFT)) <
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index f9c8877..f116f07 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -59,8 +59,7 @@ static inline struct worker *current_wq_worker(void)
* Scheduler hooks for concurrency managed workqueue. Only to be used from
* sched.c and workqueue.c.
*/
-void wq_worker_waking_up(struct task_struct *task, unsigned int cpu);
-struct task_struct *wq_worker_sleeping(struct task_struct *task,
- unsigned int cpu);
+void wq_worker_waking_up(struct task_struct *task, int cpu);
+struct task_struct *wq_worker_sleeping(struct task_struct *task, int cpu);

#endif /* _KERNEL_WORKQUEUE_INTERNAL_H */
--
1.8.1.2

2013-03-02 03:25:21

by Tejun Heo

Subject: [PATCH 27/31] workqueue: make it clear that WQ_DRAINING is an internal flag

We're gonna add another internal WQ flag. Let's make the distinction
clear. Prefix WQ_DRAINING with __ and move it to bit 16.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 2 +-
kernel/workqueue.c | 8 ++++----
2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index c8c3bf4..fc7f882 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -293,7 +293,7 @@ enum {
WQ_HIGHPRI = 1 << 4, /* high priority */
WQ_CPU_INTENSIVE = 1 << 5, /* cpu instensive workqueue */

- WQ_DRAINING = 1 << 6, /* internal: workqueue is draining */
+ __WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */

WQ_MAX_ACTIVE = 512, /* I like 512, better ideas? */
WQ_MAX_UNBOUND_PER_CPU = 4, /* 4 * #cpus for unbound wq */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 36fcf9c..2016c9e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1222,7 +1222,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
debug_work_activate(work);

/* if dying, only works from the same workqueue are allowed */
- if (unlikely(wq->flags & WQ_DRAINING) &&
+ if (unlikely(wq->flags & __WQ_DRAINING) &&
WARN_ON_ONCE(!is_chained_work(wq)))
return;
retry:
@@ -2761,11 +2761,11 @@ void drain_workqueue(struct workqueue_struct *wq)
/*
* __queue_work() needs to test whether there are drainers, is much
* hotter than drain_workqueue() and already looks at @wq->flags.
- * Use WQ_DRAINING so that queue doesn't have to check nr_drainers.
+ * Use __WQ_DRAINING so that queue doesn't have to check nr_drainers.
*/
spin_lock_irq(&workqueue_lock);
if (!wq->nr_drainers++)
- wq->flags |= WQ_DRAINING;
+ wq->flags |= __WQ_DRAINING;
spin_unlock_irq(&workqueue_lock);
reflush:
flush_workqueue(wq);
@@ -2793,7 +2793,7 @@ reflush:

spin_lock(&workqueue_lock);
if (!--wq->nr_drainers)
- wq->flags &= ~WQ_DRAINING;
+ wq->flags &= ~__WQ_DRAINING;
spin_unlock(&workqueue_lock);

local_irq_enable();
--
1.8.1.2

2013-03-02 03:25:37

by Tejun Heo

Subject: [PATCH 31/31] workqueue: implement sysfs interface for workqueues

There are cases where workqueue users want to expose control knobs to
userland. For example, unbound workqueues with custom attributes are
scheduled to be used for writeback workers, and depending on
configuration it can be useful to allow admins to tinker with the
priority or allowed CPUs.

This patch implements workqueue_sysfs_register(), which makes the
workqueue visible under /sys/bus/workqueue/devices/WQ_NAME. There
currently are two attributes common to both per-cpu and unbound
workqueues and extra attributes for unbound workqueues including the
pool ID, nice level and cpumask.

If alloc_workqueue*() is called with WQ_SYSFS,
workqueue_sysfs_register() is called automatically as part of
workqueue creation. This is the preferred method unless the workqueue
user wants to apply workqueue_attrs before making the workqueue
visible to userland.
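
As a sketch of that second case (names are made up), registration would
simply be deferred until after the attributes are applied:

        wq = alloc_workqueue("wb_wq", WQ_UNBOUND, 0);   /* note: no WQ_SYSFS */
        if (!wq)
                return -ENOMEM;

        apply_workqueue_attrs(wq, attrs);       /* set nice/cpumask first */
        workqueue_sysfs_register(wq);           /* then expose the knobs */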

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 8 ++
kernel/workqueue.c | 280 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 288 insertions(+)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index e1e5748..9764841 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -292,6 +292,7 @@ enum {
WQ_MEM_RECLAIM = 1 << 3, /* may be used for memory reclaim */
WQ_HIGHPRI = 1 << 4, /* high priority */
WQ_CPU_INTENSIVE = 1 << 5, /* cpu instensive workqueue */
+ WQ_SYSFS = 1 << 6, /* visible in sysfs, see wq_sysfs_register() */

__WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */
__WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */
@@ -494,4 +495,11 @@ extern bool freeze_workqueues_busy(void);
extern void thaw_workqueues(void);
#endif /* CONFIG_FREEZER */

+#ifdef CONFIG_SYSFS
+int workqueue_sysfs_register(struct workqueue_struct *wq);
+#else /* CONFIG_SYSFS */
+static inline int workqueue_sysfs_register(struct workqueue_struct *wq)
+{ return 0; }
+#endif /* CONFIG_SYSFS */
+
#endif
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8d487f6..a618df4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -210,6 +210,8 @@ struct wq_flusher {
struct completion done; /* flush completion */
};

+struct wq_device;
+
/*
* The externally visible workqueue abstraction is an array of
* per-CPU workqueues:
@@ -233,6 +235,10 @@ struct workqueue_struct {

int nr_drainers; /* W: drain in progress */
int saved_max_active; /* W: saved pwq max_active */
+
+#ifdef CONFIG_SYSFS
+ struct wq_device *wq_dev; /* I: for sysfs interface */
+#endif
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
@@ -438,6 +444,8 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
static DEFINE_IDR(worker_pool_idr);

static int worker_thread(void *__worker);
+static void copy_workqueue_attrs(struct workqueue_attrs *to,
+ const struct workqueue_attrs *from);

/* allocate ID and assign it to @pool */
static int worker_pool_assign_id(struct worker_pool *pool)
@@ -3151,6 +3159,273 @@ int keventd_up(void)
return system_wq != NULL;
}

+#ifdef CONFIG_SYSFS
+/*
+ * Workqueues with WQ_SYSFS flag set is visible to userland via
+ * /sys/bus/workqueue/devices/WQ_NAME. All visible workqueues have the
+ * following attributes.
+ *
+ * per_cpu RO bool : whether the workqueue is per-cpu or unbound
+ * max_active RW int : maximum number of in-flight work items
+ *
+ * Unbound workqueues have the following extra attributes.
+ *
+ * id RO int : the associated pool ID
+ * nice RW int : nice value of the workers
+ * cpumask RW mask : bitmask of allowed CPUs for the workers
+ */
+struct wq_device {
+ struct workqueue_struct *wq;
+ struct device dev;
+};
+
+static struct workqueue_struct *dev_to_wq(struct device *dev)
+{
+ struct wq_device *wq_dev = container_of(dev, struct wq_device, dev);
+
+ return wq_dev->wq;
+}
+
+static ssize_t wq_per_cpu_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+
+ return scnprintf(buf, PAGE_SIZE, "%d\n", (bool)!(wq->flags & WQ_UNBOUND));
+}
+
+static ssize_t wq_max_active_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+
+ return scnprintf(buf, PAGE_SIZE, "%d\n", wq->saved_max_active);
+}
+
+static ssize_t wq_max_active_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ int val;
+
+ if (sscanf(buf, "%d", &val) != 1 || val <= 0)
+ return -EINVAL;
+
+ workqueue_set_max_active(wq, val);
+ return count;
+}
+
+static struct device_attribute wq_sysfs_attrs[] = {
+ __ATTR(per_cpu, 0444, wq_per_cpu_show, NULL),
+ __ATTR(max_active, 0644, wq_max_active_show, wq_max_active_store),
+ __ATTR_NULL,
+};
+
+static ssize_t wq_pool_id_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ struct worker_pool *pool;
+ int written;
+
+ rcu_read_lock_sched();
+ pool = first_pwq(wq)->pool;
+ written = scnprintf(buf, PAGE_SIZE, "%d\n", pool->id);
+ rcu_read_unlock_sched();
+
+ return written;
+}
+
+static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ int written;
+
+ rcu_read_lock_sched();
+ written = scnprintf(buf, PAGE_SIZE, "%d\n",
+ first_pwq(wq)->pool->attrs->nice);
+ rcu_read_unlock_sched();
+
+ return written;
+}
+
+/* prepare workqueue_attrs for sysfs store operations */
+static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq)
+{
+ struct workqueue_attrs *attrs;
+
+ attrs = alloc_workqueue_attrs(GFP_KERNEL);
+ if (!attrs)
+ return NULL;
+
+ rcu_read_lock_sched();
+ copy_workqueue_attrs(attrs, first_pwq(wq)->pool->attrs);
+ rcu_read_unlock_sched();
+ return attrs;
+}
+
+static ssize_t wq_nice_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ struct workqueue_attrs *attrs;
+ int ret;
+
+ attrs = wq_sysfs_prep_attrs(wq);
+ if (!attrs)
+ return -ENOMEM;
+
+ if (sscanf(buf, "%d", &attrs->nice) == 1 &&
+ attrs->nice >= -20 && attrs->nice <= 19)
+ ret = apply_workqueue_attrs(wq, attrs);
+ else
+ ret = -EINVAL;
+
+ free_workqueue_attrs(attrs);
+ return ret ?: count;
+}
+
+static ssize_t wq_cpumask_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ int written;
+
+ rcu_read_lock_sched();
+ written = cpumask_scnprintf(buf, PAGE_SIZE,
+ first_pwq(wq)->pool->attrs->cpumask);
+ rcu_read_unlock_sched();
+
+ written += scnprintf(buf + written, PAGE_SIZE - written, "\n");
+ return written;
+}
+
+static ssize_t wq_cpumask_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ struct workqueue_attrs *attrs;
+ int ret;
+
+ attrs = wq_sysfs_prep_attrs(wq);
+ if (!attrs)
+ return -ENOMEM;
+
+ ret = cpumask_parse(buf, attrs->cpumask);
+ if (!ret)
+ ret = apply_workqueue_attrs(wq, attrs);
+
+ free_workqueue_attrs(attrs);
+ return ret ?: count;
+}
+
+static struct device_attribute wq_sysfs_unbound_attrs[] = {
+ __ATTR(pool_id, 0444, wq_pool_id_show, NULL),
+ __ATTR(nice, 0644, wq_nice_show, wq_nice_store),
+ __ATTR(cpumask, 0644, wq_cpumask_show, wq_cpumask_store),
+ __ATTR_NULL,
+};
+
+static struct bus_type wq_subsys = {
+ .name = "workqueue",
+ .dev_attrs = wq_sysfs_attrs,
+};
+
+static int __init wq_sysfs_init(void)
+{
+ return subsys_virtual_register(&wq_subsys, NULL);
+}
+core_initcall(wq_sysfs_init);
+
+static void wq_device_release(struct device *dev)
+{
+ struct wq_device *wq_dev = container_of(dev, struct wq_device, dev);
+
+ kfree(wq_dev);
+}
+
+/**
+ * workqueue_sysfs_register - make a workqueue visible in sysfs
+ * @wq: the workqueue to register
+ *
+ * Expose @wq in sysfs under /sys/bus/workqueue/devices.
+ * alloc_workqueue*() automatically calls this function if WQ_SYSFS is set
+ * which is the preferred method.
+ *
+ * Workqueue user should use this function directly iff it wants to apply
+ * workqueue_attrs before making the workqueue visible in sysfs; otherwise,
+ * apply_workqueue_attrs() may race against userland updating the
+ * attributes.
+ *
+ * Returns 0 on success, -errno on failure.
+ */
+int workqueue_sysfs_register(struct workqueue_struct *wq)
+{
+ struct wq_device *wq_dev;
+ int ret;
+
+ wq->wq_dev = wq_dev = kzalloc(sizeof(*wq_dev), GFP_KERNEL);
+ if (!wq_dev)
+ return -ENOMEM;
+
+ wq_dev->wq = wq;
+ wq_dev->dev.bus = &wq_subsys;
+ wq_dev->dev.init_name = wq->name;
+ wq_dev->dev.release = wq_device_release;
+
+ /*
+ * unbound_attrs are created separately. Suppress uevent until
+ * everything is ready.
+ */
+ dev_set_uevent_suppress(&wq_dev->dev, true);
+
+ ret = device_register(&wq_dev->dev);
+ if (ret) {
+ kfree(wq_dev);
+ wq->wq_dev = NULL;
+ return ret;
+ }
+
+ if (wq->flags & WQ_UNBOUND) {
+ struct device_attribute *attr;
+
+ for (attr = wq_sysfs_unbound_attrs; attr->attr.name; attr++) {
+ ret = device_create_file(&wq_dev->dev, attr);
+ if (ret) {
+ device_unregister(&wq_dev->dev);
+ wq->wq_dev = NULL;
+ return ret;
+ }
+ }
+ }
+
+ kobject_uevent(&wq_dev->dev.kobj, KOBJ_ADD);
+ return 0;
+}
+
+/**
+ * workqueue_sysfs_unregister - undo workqueue_sysfs_register()
+ * @wq: the workqueue to unregister
+ *
+ * If @wq is registered to sysfs by workqueue_sysfs_register(), unregister.
+ */
+static void workqueue_sysfs_unregister(struct workqueue_struct *wq)
+{
+ struct wq_device *wq_dev = wq->wq_dev;
+
+ if (!wq->wq_dev)
+ return;
+
+ wq->wq_dev = NULL;
+ device_unregister(&wq_dev->dev);
+}
+#else /* CONFIG_SYSFS */
+static void workqueue_sysfs_unregister(struct workqueue_struct *wq) { }
+#endif /* CONFIG_SYSFS */
+
/**
* free_workqueue_attrs - free a workqueue_attrs
* @attrs: workqueue_attrs to free
@@ -3622,6 +3897,9 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
wake_up_process(rescuer->task);
}

+ if ((wq->flags & WQ_SYSFS) && workqueue_sysfs_register(wq))
+ goto err_destroy;
+
/*
* workqueue_lock protects global freeze state and workqueues
* list. Grab it, set max_active accordingly and add the new
@@ -3690,6 +3968,8 @@ void destroy_workqueue(struct workqueue_struct *wq)

spin_unlock_irq(&workqueue_lock);

+ workqueue_sysfs_unregister(wq);
+
if (wq->rescuer) {
kthread_stop(wq->rescuer->task);
kfree(wq->rescuer);
--
1.8.1.2

2013-03-02 03:25:57

by Tejun Heo

Subject: [PATCH 30/31] driver/base: implement subsys_virtual_register()

Kay tells me the most appropriate place to expose workqueues to
userland would be /sys/devices/virtual/workqueues/WQ_NAME which is
symlinked to /sys/bus/workqueue/devices/WQ_NAME and that we're lacking
a way to do that outside of driver core as virtual_device_parent()
isn't exported and there's no interface to conveniently create a
virtual subsystem.

This patch implements subsys_virtual_register() by factoring out
subsys_register() from subsys_system_register() and using it with
virtual_device_parent() as the origin directory. It's identical to
subsys_system_register() other than the origin directory but we aren't
gonna restrict the device names which should be used under it.

This will be used to expose workqueue attributes to userland.
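
A minimal usage sketch (the subsystem name is made up; the workqueue
patch later in this series does essentially the same thing in
wq_sysfs_init()):

        static struct bus_type example_subsys = {
                .name   = "example",
        };

        static int __init example_subsys_init(void)
        {
                return subsys_virtual_register(&example_subsys, NULL);
        }
        core_initcall(example_subsys_init);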

Signed-off-by: Tejun Heo <[email protected]>
Cc: Kay Sievers <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
---
Kay, does this look okay? If so, how should this be routed?

Thanks.

drivers/base/base.h | 2 ++
drivers/base/bus.c | 73 +++++++++++++++++++++++++++++++++++---------------
drivers/base/core.c | 2 +-
include/linux/device.h | 2 ++
4 files changed, 57 insertions(+), 22 deletions(-)

diff --git a/drivers/base/base.h b/drivers/base/base.h
index 6ee17bb..b8bdfe6 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -101,6 +101,8 @@ static inline int hypervisor_init(void) { return 0; }
extern int platform_bus_init(void);
extern void cpu_dev_init(void);

+struct kobject *virtual_device_parent(struct device *dev);
+
extern int bus_add_device(struct device *dev);
extern void bus_probe_device(struct device *dev);
extern void bus_remove_device(struct device *dev);
diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index 24eb078..d229858 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -1205,26 +1205,10 @@ static void system_root_device_release(struct device *dev)
{
kfree(dev);
}
-/**
- * subsys_system_register - register a subsystem at /sys/devices/system/
- * @subsys: system subsystem
- * @groups: default attributes for the root device
- *
- * All 'system' subsystems have a /sys/devices/system/<name> root device
- * with the name of the subsystem. The root device can carry subsystem-
- * wide attributes. All registered devices are below this single root
- * device and are named after the subsystem with a simple enumeration
- * number appended. The registered devices are not explicitely named;
- * only 'id' in the device needs to be set.
- *
- * Do not use this interface for anything new, it exists for compatibility
- * with bad ideas only. New subsystems should use plain subsystems; and
- * add the subsystem-wide attributes should be added to the subsystem
- * directory itself and not some create fake root-device placed in
- * /sys/devices/system/<name>.
- */
-int subsys_system_register(struct bus_type *subsys,
- const struct attribute_group **groups)
+
+static int subsys_register(struct bus_type *subsys,
+ const struct attribute_group **groups,
+ struct kobject *parent_of_root)
{
struct device *dev;
int err;
@@ -1243,7 +1227,7 @@ int subsys_system_register(struct bus_type *subsys,
if (err < 0)
goto err_name;

- dev->kobj.parent = &system_kset->kobj;
+ dev->kobj.parent = parent_of_root;
dev->groups = groups;
dev->release = system_root_device_release;

@@ -1263,8 +1247,55 @@ err_dev:
bus_unregister(subsys);
return err;
}
+
+/**
+ * subsys_system_register - register a subsystem at /sys/devices/system/
+ * @subsys: system subsystem
+ * @groups: default attributes for the root device
+ *
+ * All 'system' subsystems have a /sys/devices/system/<name> root device
+ * with the name of the subsystem. The root device can carry subsystem-
+ * wide attributes. All registered devices are below this single root
+ * device and are named after the subsystem with a simple enumeration
+ * number appended. The registered devices are not explicitely named;
+ * only 'id' in the device needs to be set.
+ *
+ * Do not use this interface for anything new, it exists for compatibility
+ * with bad ideas only. New subsystems should use plain subsystems; and
+ * add the subsystem-wide attributes should be added to the subsystem
+ * directory itself and not some create fake root-device placed in
+ * /sys/devices/system/<name>.
+ */
+int subsys_system_register(struct bus_type *subsys,
+ const struct attribute_group **groups)
+{
+ return subsys_register(subsys, groups, &system_kset->kobj);
+}
EXPORT_SYMBOL_GPL(subsys_system_register);

+/**
+ * subsys_virtual_register - register a subsystem at /sys/devices/virtual/
+ * @subsys: virtual subsystem
+ * @groups: default attributes for the root device
+ *
+ * All 'virtual' subsystems have a /sys/devices/virtual/<name> root device
+ * with the name of the subsystem. The root device can carry subsystem-wide
+ * attributes. All registered devices are below this single root device.
+ * There's no restriction on device naming. This is for kernel software
+ * constructs which need sysfs interface.
+ */
+int subsys_virtual_register(struct bus_type *subsys,
+ const struct attribute_group **groups)
+{
+ struct kobject *virtual_dir;
+
+ virtual_dir = virtual_device_parent(NULL);
+ if (!virtual_dir)
+ return -ENOMEM;
+
+ return subsys_register(subsys, groups, virtual_dir);
+}
+
int __init buses_init(void)
{
bus_kset = kset_create_and_add("bus", &bus_uevent_ops, NULL);
diff --git a/drivers/base/core.c b/drivers/base/core.c
index a235085..90868e2 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -690,7 +690,7 @@ void device_initialize(struct device *dev)
set_dev_node(dev, -1);
}

-static struct kobject *virtual_device_parent(struct device *dev)
+struct kobject *virtual_device_parent(struct device *dev)
{
static struct kobject *virtual_dir = NULL;

diff --git a/include/linux/device.h b/include/linux/device.h
index 43dcda9..9765b38 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -301,6 +301,8 @@ void subsys_interface_unregister(struct subsys_interface *sif);

int subsys_system_register(struct bus_type *subsys,
const struct attribute_group **groups);
+int subsys_virtual_register(struct bus_type *subsys,
+ const struct attribute_group **groups);

/**
* struct class - device classes
--
1.8.1.2

2013-03-02 03:25:19

by Tejun Heo

Subject: [PATCH 15/31] workqueue: separate out init_worker_pool() from init_workqueues()

This will be used to implement unbound pools with custom attributes.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 37 +++++++++++++++++++++----------------
1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 68b3443..f97539b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3121,6 +3121,26 @@ int keventd_up(void)
return system_wq != NULL;
}

+static void init_worker_pool(struct worker_pool *pool)
+{
+ spin_lock_init(&pool->lock);
+ pool->flags |= POOL_DISASSOCIATED;
+ INIT_LIST_HEAD(&pool->worklist);
+ INIT_LIST_HEAD(&pool->idle_list);
+ hash_init(pool->busy_hash);
+
+ init_timer_deferrable(&pool->idle_timer);
+ pool->idle_timer.function = idle_worker_timeout;
+ pool->idle_timer.data = (unsigned long)pool;
+
+ setup_timer(&pool->mayday_timer, pool_mayday_timeout,
+ (unsigned long)pool);
+
+ mutex_init(&pool->manager_mutex);
+ mutex_init(&pool->assoc_mutex);
+ ida_init(&pool->worker_ida);
+}
+
static int alloc_and_link_pwqs(struct workqueue_struct *wq)
{
bool highpri = wq->flags & WQ_HIGHPRI;
@@ -3789,23 +3809,8 @@ static int __init init_workqueues(void)
struct worker_pool *pool;

for_each_std_worker_pool(pool, cpu) {
- spin_lock_init(&pool->lock);
+ init_worker_pool(pool);
pool->cpu = cpu;
- pool->flags |= POOL_DISASSOCIATED;
- INIT_LIST_HEAD(&pool->worklist);
- INIT_LIST_HEAD(&pool->idle_list);
- hash_init(pool->busy_hash);
-
- init_timer_deferrable(&pool->idle_timer);
- pool->idle_timer.function = idle_worker_timeout;
- pool->idle_timer.data = (unsigned long)pool;
-
- setup_timer(&pool->mayday_timer, pool_mayday_timeout,
- (unsigned long)pool);
-
- mutex_init(&pool->manager_mutex);
- mutex_init(&pool->assoc_mutex);
- ida_init(&pool->worker_ida);

/* alloc pool ID */
BUG_ON(worker_pool_assign_id(pool));
--
1.8.1.2

2013-03-02 03:26:24

by Tejun Heo

Subject: [PATCH 29/31] cpumask: implement cpumask_parse()

We have cpulist_parse() but not cpumask_parse(). Implement it using
bitmap_parse().

bitmap_parse() is weird in that it takes @len for a string in kernel
memory, which is also inconsistent with bitmap_parselist().
Make cpumask_parse() calculate the length and don't expose the
inconsistency to cpumask users. Maybe we can fix up bitmap_parse()
later.

This will be used to expose workqueue cpumask knobs to userland via
sysfs.
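
A usage sketch (the mask string and the consumer are arbitrary
examples):

        cpumask_var_t mask;
        int ret;

        if (!alloc_cpumask_var(&mask, GFP_KERNEL))
                return -ENOMEM;

        ret = cpumask_parse("f\n", mask);       /* CPUs 0-3, trailing \n is fine */
        if (!ret)
                use_cpumask(mask);              /* hypothetical consumer */
        free_cpumask_var(mask);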

Signed-off-by: Tejun Heo <[email protected]>
Cc: Rusty Russell <[email protected]>
---
Rusty, if this looks okay to you, would it be okay for me to route it
together with the rest of workqueue changes?

Thanks.

include/linux/cpumask.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 0325602..d08e4d2 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -591,6 +591,21 @@ static inline int cpulist_scnprintf(char *buf, int len,
}

/**
+ * cpumask_parse - extract a cpumask from a string
+ * @buf: the buffer to extract from
+ * @dstp: the cpumask to set.
+ *
+ * Returns -errno, or 0 for success.
+ */
+static inline int cpumask_parse(const char *buf, struct cpumask *dstp)
+{
+ char *nl = strchr(buf, '\n');
+ int len = nl ? nl - buf : strlen(buf);
+
+ return bitmap_parse(buf, len, cpumask_bits(dstp), nr_cpumask_bits);
+}
+
+/**
* cpulist_parse - extract a cpumask from a user string of ranges
* @buf: the buffer to extract from
* @dstp: the cpumask to set.
--
1.8.1.2

2013-03-02 03:26:39

by Tejun Heo

Subject: [PATCH 28/31] workqueue: reject increasing max_active for ordered workqueues

Workqueue will soon allow exposing control knobs to userland via
sysfs. Increasing max_active for an ordered workqueue breaks
correctness. Tag ordered workqueues with __WQ_ORDERED and always
limit max_active at 1.
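
In other words (a sketch; the workqueue name is made up), an ordered
workqueue keeps max_active pinned at 1 no matter what is requested:

        wq = alloc_ordered_workqueue("my_ordered_wq", 0);       /* WQ_UNBOUND | __WQ_ORDERED */

        workqueue_set_max_active(wq, 16);       /* clamped back to 1 with a warning */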

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 3 ++-
kernel/workqueue.c | 11 ++++++++++-
2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index fc7f882..e1e5748 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -294,6 +294,7 @@ enum {
WQ_CPU_INTENSIVE = 1 << 5, /* cpu instensive workqueue */

__WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */
+ __WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */

WQ_MAX_ACTIVE = 512, /* I like 512, better ideas? */
WQ_MAX_UNBOUND_PER_CPU = 4, /* 4 * #cpus for unbound wq */
@@ -396,7 +397,7 @@ __alloc_workqueue_key(const char *fmt, unsigned int flags, int max_active,
* Pointer to the allocated workqueue on success, %NULL on failure.
*/
#define alloc_ordered_workqueue(fmt, flags, args...) \
- alloc_workqueue(fmt, WQ_UNBOUND | (flags), 1, ##args)
+ alloc_workqueue(fmt, WQ_UNBOUND | __WQ_ORDERED | (flags), 1, ##args)

#define create_workqueue(name) \
alloc_workqueue((name), WQ_MEM_RECLAIM, 1)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2016c9e..8d487f6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3537,7 +3537,16 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
static int wq_clamp_max_active(int max_active, unsigned int flags,
const char *name)
{
- int lim = flags & WQ_UNBOUND ? WQ_UNBOUND_MAX_ACTIVE : WQ_MAX_ACTIVE;
+ int lim;
+
+ if (flags & WQ_UNBOUND) {
+ if (flags & __WQ_ORDERED)
+ lim = 1;
+ else
+ lim = WQ_UNBOUND_MAX_ACTIVE;
+ } else {
+ lim = WQ_MAX_ACTIVE;
+ }

if (max_active < 1 || max_active > lim)
pr_warn("workqueue: max_active %d requested for %s is out of range, clamping between %d and %d\n",
--
1.8.1.2

2013-03-02 03:25:15

by Tejun Heo

Subject: [PATCH 12/31] workqueue: update synchronization rules on workqueue->pwqs

Make workqueue->pwqs protected by workqueue_lock for writes and
sched-RCU protected for reads. Lockdep assertions are added to
for_each_pwq() and first_pwq() and all their users are converted to
either hold workqueue_lock or disable preemption/irq.
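
For example, a read-side walk of wq->pwqs now looks like the following
sketch (do_something() is just a stand-in):

        rcu_read_lock_sched();
        for_each_pwq(pwq, wq)
                do_something(pwq);      /* pwq must not be used past the unlock */
        rcu_read_unlock_sched();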

alloc_and_link_pwqs() is updated to use list_add_tail_rcu() for
consistency which isn't strictly necessary as the workqueue isn't
visible. destroy_workqueue() isn't updated to sched-RCU release pwqs.
This is okay as the workqueue should have no users left by that point.

The locking is superfluous at this point. It is there to help the
upcoming implementation of unbound pools/pwqs with custom attributes.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 85 +++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 68 insertions(+), 17 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 02f51b8..ff51c59 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -42,6 +42,7 @@
#include <linux/lockdep.h>
#include <linux/idr.h>
#include <linux/hashtable.h>
+#include <linux/rculist.h>

#include "workqueue_internal.h"

@@ -118,6 +119,8 @@ enum {
* F: wq->flush_mutex protected.
*
* W: workqueue_lock protected.
+ *
+ * R: workqueue_lock protected for writes. Sched-RCU protected for reads.
*/

/* struct worker is defined in workqueue_internal.h */
@@ -169,7 +172,7 @@ struct pool_workqueue {
int nr_active; /* L: nr of active works */
int max_active; /* L: max active works */
struct list_head delayed_works; /* L: delayed works */
- struct list_head pwqs_node; /* I: node on wq->pwqs */
+ struct list_head pwqs_node; /* R: node on wq->pwqs */
struct list_head mayday_node; /* W: node on wq->maydays */
} __aligned(1 << WORK_STRUCT_FLAG_BITS);

@@ -189,7 +192,7 @@ struct wq_flusher {
struct workqueue_struct {
unsigned int flags; /* W: WQ_* flags */
struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwq's */
- struct list_head pwqs; /* I: all pwqs of this wq */
+ struct list_head pwqs; /* R: all pwqs of this wq */
struct list_head list; /* W: list of all workqueues */

struct mutex flush_mutex; /* protects wq flushing */
@@ -227,6 +230,11 @@ EXPORT_SYMBOL_GPL(system_freezable_wq);
#define CREATE_TRACE_POINTS
#include <trace/events/workqueue.h>

+#define assert_rcu_or_wq_lock() \
+ rcu_lockdep_assert(rcu_read_lock_sched_held() || \
+ lockdep_is_held(&workqueue_lock), \
+ "sched RCU or workqueue lock should be held")
+
#define for_each_std_worker_pool(pool, cpu) \
for ((pool) = &std_worker_pools(cpu)[0]; \
(pool) < &std_worker_pools(cpu)[NR_STD_WORKER_POOLS]; (pool)++)
@@ -282,9 +290,16 @@ static inline int __next_wq_cpu(int cpu, const struct cpumask *mask,
* for_each_pwq - iterate through all pool_workqueues of the specified workqueue
* @pwq: iteration cursor
* @wq: the target workqueue
+ *
+ * This must be called either with workqueue_lock held or sched RCU read
+ * locked. If the pwq needs to be used beyond the locking in effect, the
+ * caller is responsible for guaranteeing that the pwq stays online.
+ *
+ * The if clause exists only for the lockdep assertion and can be ignored.
*/
#define for_each_pwq(pwq, wq) \
- list_for_each_entry((pwq), &(wq)->pwqs, pwqs_node)
+ list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node) \
+ if (({ assert_rcu_or_wq_lock(); true; }))

#ifdef CONFIG_DEBUG_OBJECTS_WORK

@@ -463,9 +478,19 @@ static struct worker_pool *get_std_worker_pool(int cpu, bool highpri)
return &pools[highpri];
}

+/**
+ * first_pwq - return the first pool_workqueue of the specified workqueue
+ * @wq: the target workqueue
+ *
+ * This must be called either with workqueue_lock held or sched RCU read
+ * locked. If the pwq needs to be used beyond the locking in effect, the
+ * caller is responsible for guaranteeing that the pwq stays online.
+ */
static struct pool_workqueue *first_pwq(struct workqueue_struct *wq)
{
- return list_first_entry(&wq->pwqs, struct pool_workqueue, pwqs_node);
+ assert_rcu_or_wq_lock();
+ return list_first_or_null_rcu(&wq->pwqs, struct pool_workqueue,
+ pwqs_node);
}

static unsigned int work_color_to_flags(int color)
@@ -2488,10 +2513,12 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
atomic_set(&wq->nr_pwqs_to_flush, 1);
}

+ local_irq_disable();
+
for_each_pwq(pwq, wq) {
struct worker_pool *pool = pwq->pool;

- spin_lock_irq(&pool->lock);
+ spin_lock(&pool->lock);

if (flush_color >= 0) {
WARN_ON_ONCE(pwq->flush_color != -1);
@@ -2508,9 +2535,11 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
pwq->work_color = work_color;
}

- spin_unlock_irq(&pool->lock);
+ spin_unlock(&pool->lock);
}

+ local_irq_enable();
+
if (flush_color >= 0 && atomic_dec_and_test(&wq->nr_pwqs_to_flush))
complete(&wq->first_flusher->done);

@@ -2701,12 +2730,14 @@ void drain_workqueue(struct workqueue_struct *wq)
reflush:
flush_workqueue(wq);

+ local_irq_disable();
+
for_each_pwq(pwq, wq) {
bool drained;

- spin_lock_irq(&pwq->pool->lock);
+ spin_lock(&pwq->pool->lock);
drained = !pwq->nr_active && list_empty(&pwq->delayed_works);
- spin_unlock_irq(&pwq->pool->lock);
+ spin_unlock(&pwq->pool->lock);

if (drained)
continue;
@@ -2715,13 +2746,17 @@ reflush:
(flush_cnt % 100 == 0 && flush_cnt <= 1000))
pr_warn("workqueue %s: flush on destruction isn't complete after %u tries\n",
wq->name, flush_cnt);
+
+ local_irq_enable();
goto reflush;
}

- spin_lock_irq(&workqueue_lock);
+ spin_lock(&workqueue_lock);
if (!--wq->nr_drainers)
wq->flags &= ~WQ_DRAINING;
- spin_unlock_irq(&workqueue_lock);
+ spin_unlock(&workqueue_lock);
+
+ local_irq_enable();
}
EXPORT_SYMBOL_GPL(drain_workqueue);

@@ -3087,7 +3122,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
per_cpu_ptr(wq->cpu_pwqs, cpu);

pwq->pool = get_std_worker_pool(cpu, highpri);
- list_add_tail(&pwq->pwqs_node, &wq->pwqs);
+ list_add_tail_rcu(&pwq->pwqs_node, &wq->pwqs);
}
} else {
struct pool_workqueue *pwq;
@@ -3097,7 +3132,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
return -ENOMEM;

pwq->pool = get_std_worker_pool(WORK_CPU_UNBOUND, highpri);
- list_add_tail(&pwq->pwqs_node, &wq->pwqs);
+ list_add_tail_rcu(&pwq->pwqs_node, &wq->pwqs);
}

return 0;
@@ -3174,6 +3209,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
if (alloc_and_link_pwqs(wq) < 0)
goto err;

+ local_irq_disable();
for_each_pwq(pwq, wq) {
BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
pwq->wq = wq;
@@ -3182,6 +3218,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
INIT_LIST_HEAD(&pwq->delayed_works);
INIT_LIST_HEAD(&pwq->mayday_node);
}
+ local_irq_enable();

if (flags & WQ_RESCUER) {
struct worker *rescuer;
@@ -3239,24 +3276,32 @@ void destroy_workqueue(struct workqueue_struct *wq)
/* drain it before proceeding with destruction */
drain_workqueue(wq);

+ spin_lock_irq(&workqueue_lock);
+
/* sanity checks */
for_each_pwq(pwq, wq) {
int i;

- for (i = 0; i < WORK_NR_COLORS; i++)
- if (WARN_ON(pwq->nr_in_flight[i]))
+ for (i = 0; i < WORK_NR_COLORS; i++) {
+ if (WARN_ON(pwq->nr_in_flight[i])) {
+ spin_unlock_irq(&workqueue_lock);
return;
+ }
+ }
+
if (WARN_ON(pwq->nr_active) ||
- WARN_ON(!list_empty(&pwq->delayed_works)))
+ WARN_ON(!list_empty(&pwq->delayed_works))) {
+ spin_unlock_irq(&workqueue_lock);
return;
+ }
}

/*
* wq list is used to freeze wq, remove from list after
* flushing is complete in case freeze races us.
*/
- spin_lock_irq(&workqueue_lock);
list_del(&wq->list);
+
spin_unlock_irq(&workqueue_lock);

if (wq->flags & WQ_RESCUER) {
@@ -3340,13 +3385,19 @@ EXPORT_SYMBOL_GPL(workqueue_set_max_active);
bool workqueue_congested(int cpu, struct workqueue_struct *wq)
{
struct pool_workqueue *pwq;
+ bool ret;
+
+ preempt_disable();

if (!(wq->flags & WQ_UNBOUND))
pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
else
pwq = first_pwq(wq);

- return !list_empty(&pwq->delayed_works);
+ ret = !list_empty(&pwq->delayed_works);
+ preempt_enable();
+
+ return ret;
}
EXPORT_SYMBOL_GPL(workqueue_congested);

--
1.8.1.2

2013-03-02 03:25:13

by Tejun Heo

Subject: [PATCH 11/31] workqueue: replace get_pwq() with explicit per_cpu_ptr() accesses and first_pwq()

get_pwq() takes @cpu, which can also be WORK_CPU_UNBOUND, and @wq and
returns the matching pwq (pool_workqueue). We want to move away from
using @cpu for identifying pools and pwqs for unbound pools with
custom attributes and there is only one user - workqueue_congested() -
which makes use of the WQ_UNBOUND conditional in get_pwq(). All other
users already know whether they're dealing with a per-cpu or unbound
workqueue.

Replace get_pwq() with explicit per_cpu_ptr(wq->cpu_pwqs, cpu) for
per-cpu workqueues and first_pwq() for unbound ones, and open-code
WQ_UNBOUND conditional in workqueue_congested().

Note that this makes workqueue_congested() behave slightly differently
when a @cpu other than WORK_CPU_UNBOUND is specified. It ignores @cpu
for unbound workqueues and always uses the first pwq instead of
oopsing.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 29 ++++++++++++++---------------
1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 79840b9..02f51b8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -463,16 +463,9 @@ static struct worker_pool *get_std_worker_pool(int cpu, bool highpri)
return &pools[highpri];
}

-static struct pool_workqueue *get_pwq(int cpu, struct workqueue_struct *wq)
+static struct pool_workqueue *first_pwq(struct workqueue_struct *wq)
{
- if (!(wq->flags & WQ_UNBOUND)) {
- if (likely(cpu < nr_cpu_ids))
- return per_cpu_ptr(wq->cpu_pwqs, cpu);
- } else if (likely(cpu == WORK_CPU_UNBOUND)) {
- return list_first_entry(&wq->pwqs, struct pool_workqueue,
- pwqs_node);
- }
- return NULL;
+ return list_first_entry(&wq->pwqs, struct pool_workqueue, pwqs_node);
}

static unsigned int work_color_to_flags(int color)
@@ -1192,7 +1185,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
* work needs to be queued on that cpu to guarantee
* non-reentrancy.
*/
- pwq = get_pwq(cpu, wq);
+ pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
last_pool = get_work_pool(work);

if (last_pool && last_pool != pwq->pool) {
@@ -1203,7 +1196,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
worker = find_worker_executing_work(last_pool, work);

if (worker && worker->current_pwq->wq == wq) {
- pwq = get_pwq(last_pool->cpu, wq);
+ pwq = per_cpu_ptr(wq->cpu_pwqs, last_pool->cpu);
} else {
/* meh... not running there, queue here */
spin_unlock(&last_pool->lock);
@@ -1213,7 +1206,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
spin_lock(&pwq->pool->lock);
}
} else {
- pwq = get_pwq(WORK_CPU_UNBOUND, wq);
+ pwq = first_pwq(wq);
spin_lock(&pwq->pool->lock);
}

@@ -1652,7 +1645,7 @@ static void rebind_workers(struct worker_pool *pool)
else
wq = system_wq;

- insert_work(get_pwq(pool->cpu, wq), rebind_work,
+ insert_work(per_cpu_ptr(wq->cpu_pwqs, pool->cpu), rebind_work,
worker->scheduled.next,
work_color_to_flags(WORK_NO_COLOR));
}
@@ -3090,7 +3083,8 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
return -ENOMEM;

for_each_possible_cpu(cpu) {
- struct pool_workqueue *pwq = get_pwq(cpu, wq);
+ struct pool_workqueue *pwq =
+ per_cpu_ptr(wq->cpu_pwqs, cpu);

pwq->pool = get_std_worker_pool(cpu, highpri);
list_add_tail(&pwq->pwqs_node, &wq->pwqs);
@@ -3345,7 +3339,12 @@ EXPORT_SYMBOL_GPL(workqueue_set_max_active);
*/
bool workqueue_congested(int cpu, struct workqueue_struct *wq)
{
- struct pool_workqueue *pwq = get_pwq(cpu, wq);
+ struct pool_workqueue *pwq;
+
+ if (!(wq->flags & WQ_UNBOUND))
+ pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
+ else
+ pwq = first_pwq(wq);

return !list_empty(&pwq->delayed_works);
}
--
1.8.1.2

2013-03-02 03:27:08

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 26/31] workqueue: implement apply_workqueue_attrs()

Implement apply_workqueue_attrs() which applies workqueue_attrs to the
specified unbound workqueue by creating a new pwq (pool_workqueue)
linked to worker_pool with the specified attributes.

A new pwq is linked at the head of wq->pwqs instead of tail and
__queue_work() verifies that the first unbound pwq has positive refcnt
before choosing it for the actual queueing. This is to cover the case
where creation of a new pwq races with queueing. As the base ref on a
pwq won't be dropped without making another pwq the first one,
__queue_work() is guaranteed to make progress and not add a work item
to a dead pwq.

init_and_link_pwq() is updated to return the previous first pwq, the
one the new pwq replaces, which is then put by apply_workqueue_attrs().

Note that apply_workqueue_attrs() is almost identical to the unbound
pwq part of alloc_and_link_pwqs(); the only difference is that the
latter has no previous first pwq to put. apply_workqueue_attrs() is
implemented to handle that case too and replaces the unbound pwq
handling in alloc_and_link_pwqs().
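
As a usage sketch (not part of this patch; "wq" is assumed to be an
already allocated WQ_UNBOUND workqueue), a caller could do:

  struct workqueue_attrs *attrs;
  int ret;

  attrs = alloc_workqueue_attrs(GFP_KERNEL);
  if (!attrs)
          return -ENOMEM;

  attrs->nice = -5;                             /* run workers at nice -5 */
  cpumask_copy(attrs->cpumask, cpumask_of(0));  /* restrict workers to CPU 0 */

  ret = apply_workqueue_attrs(wq, attrs);
  free_workqueue_attrs(attrs);                  /* attrs are copied internally */
  return ret;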

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 2 ++
kernel/workqueue.c | 91 ++++++++++++++++++++++++++++++++++++-----------
2 files changed, 73 insertions(+), 20 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 0341403..c8c3bf4 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -409,6 +409,8 @@ extern void destroy_workqueue(struct workqueue_struct *wq);

struct workqueue_attrs *alloc_workqueue_attrs(gfp_t gfp_mask);
void free_workqueue_attrs(struct workqueue_attrs *attrs);
+int apply_workqueue_attrs(struct workqueue_struct *wq,
+ const struct workqueue_attrs *attrs);

extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
struct work_struct *work);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4c67967..36fcf9c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1225,7 +1225,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
if (unlikely(wq->flags & WQ_DRAINING) &&
WARN_ON_ONCE(!is_chained_work(wq)))
return;
-
+retry:
/* pwq which will be used unless @work is executing elsewhere */
if (!(wq->flags & WQ_UNBOUND)) {
if (cpu == WORK_CPU_UNBOUND)
@@ -1259,6 +1259,25 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
spin_lock(&pwq->pool->lock);
}

+ /*
+ * pwq is determined and locked. For unbound pools, we could have
+ * raced with pwq release and it could already be dead. If its
+ * refcnt is zero, repeat pwq selection. Note that pwqs never die
+ * without another pwq replacing it as the first pwq or while a
+ * work item is executing on it, so the retrying is guaranteed to
+ * make forward-progress.
+ */
+ if (unlikely(!pwq->refcnt)) {
+ if (wq->flags & WQ_UNBOUND) {
+ spin_unlock(&pwq->pool->lock);
+ cpu_relax();
+ goto retry;
+ }
+ /* oops */
+ WARN_ONCE(true, "workqueue: per-cpu pwq for %s on cpu%d has 0 refcnt",
+ wq->name, cpu);
+ }
+
/* pwq determined, queue */
trace_workqueue_queue_work(req_cpu, pwq, work);

@@ -3418,7 +3437,8 @@ static void pwq_unbound_release_workfn(struct work_struct *work)

static void init_and_link_pwq(struct pool_workqueue *pwq,
struct workqueue_struct *wq,
- struct worker_pool *pool)
+ struct worker_pool *pool,
+ struct pool_workqueue **p_last_pwq)
{
BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);

@@ -3438,13 +3458,58 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
mutex_lock(&wq->flush_mutex);
spin_lock_irq(&workqueue_lock);

+ if (p_last_pwq)
+ *p_last_pwq = first_pwq(wq);
pwq->work_color = wq->work_color;
- list_add_tail_rcu(&pwq->pwqs_node, &wq->pwqs);
+ list_add_rcu(&pwq->pwqs_node, &wq->pwqs);

spin_unlock_irq(&workqueue_lock);
mutex_unlock(&wq->flush_mutex);
}

+/**
+ * apply_workqueue_attrs - apply new workqueue_attrs to an unbound workqueue
+ * @wq: the target workqueue
+ * @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
+ *
+ * Apply @attrs to an unbound workqueue @wq. If @attrs doesn't match the
+ * current attributes, a new pwq is created and made the first pwq which
+ * will serve all new work items. Older pwqs are released as in-flight
+ * work items finish. Note that a work item which repeatedly requeues
+ * itself back-to-back will stay on its current pwq.
+ *
+ * Performs GFP_KERNEL allocations. Returns 0 on success and -errno on
+ * failure.
+ */
+int apply_workqueue_attrs(struct workqueue_struct *wq,
+ const struct workqueue_attrs *attrs)
+{
+ struct pool_workqueue *pwq, *last_pwq;
+ struct worker_pool *pool;
+
+ if (WARN_ON(!(wq->flags & WQ_UNBOUND)))
+ return -EINVAL;
+
+ pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
+ if (!pwq)
+ return -ENOMEM;
+
+ pool = get_unbound_pool(attrs);
+ if (!pool) {
+ kmem_cache_free(pwq_cache, pwq);
+ return -ENOMEM;
+ }
+
+ init_and_link_pwq(pwq, wq, pool, &last_pwq);
+ if (last_pwq) {
+ spin_lock_irq(&last_pwq->pool->lock);
+ put_pwq(last_pwq);
+ spin_unlock_irq(&last_pwq->pool->lock);
+ }
+
+ return 0;
+}
+
static int alloc_and_link_pwqs(struct workqueue_struct *wq)
{
bool highpri = wq->flags & WQ_HIGHPRI;
@@ -3461,26 +3526,12 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
struct worker_pool *cpu_pools =
per_cpu(cpu_worker_pools, cpu);

- init_and_link_pwq(pwq, wq, &cpu_pools[highpri]);
+ init_and_link_pwq(pwq, wq, &cpu_pools[highpri], NULL);
}
+ return 0;
} else {
- struct pool_workqueue *pwq;
- struct worker_pool *pool;
-
- pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
- if (!pwq)
- return -ENOMEM;
-
- pool = get_unbound_pool(unbound_std_wq_attrs[highpri]);
- if (!pool) {
- kmem_cache_free(pwq_cache, pwq);
- return -ENOMEM;
- }
-
- init_and_link_pwq(pwq, wq, pool);
+ return apply_workqueue_attrs(wq, unbound_std_wq_attrs[highpri]);
}
-
- return 0;
}

static int wq_clamp_max_active(int max_active, unsigned int flags,
--
1.8.1.2

2013-03-02 03:27:34

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 25/31] workqueue: perform non-reentrancy test when queueing to unbound workqueues too

Because per-cpu workqueues have multiple pwqs (pool_workqueues) to
serve the CPUs, to guarantee that a single work item isn't queued on
one pwq while still executing on another, __queue_work() takes a look
at the last pool the target work item was on and, if it's still
executing there, queues the work item on that pool.

To support changing workqueue_attrs on the fly, unbound workqueues too
will have multiple pwqs and thus need the non-reentrancy test when
queueing. This patch modifies __queue_work() such that the reentrancy
test is performed regardless of the workqueue type.

per_cpu_ptr(wq->cpu_pwqs, cpu) was used to determine the matching pwq
for the last pool. This can't work for unbound workqueues and is
replaced with worker->current_pwq, which also happens to be simpler.
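
Condensed, the resulting pwq selection reads as below (a sketch of the
diff that follows, with locking elided):

  if (!(wq->flags & WQ_UNBOUND))
          pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
  else
          pwq = first_pwq(wq);

  last_pool = get_work_pool(work);
  if (last_pool && last_pool != pwq->pool) {
          worker = find_worker_executing_work(last_pool, work);
          if (worker && worker->current_pwq->wq == wq)
                  pwq = worker->current_pwq;    /* still running there */
  }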

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 42 +++++++++++++++++++-----------------------
1 file changed, 19 insertions(+), 23 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0f0da59..4c67967 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1206,6 +1206,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
struct work_struct *work)
{
struct pool_workqueue *pwq;
+ struct worker_pool *last_pool;
struct list_head *worklist;
unsigned int work_flags;
unsigned int req_cpu = cpu;
@@ -1225,41 +1226,36 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
WARN_ON_ONCE(!is_chained_work(wq)))
return;

- /* determine the pwq to use */
+ /* pwq which will be used unless @work is executing elsewhere */
if (!(wq->flags & WQ_UNBOUND)) {
- struct worker_pool *last_pool;
-
if (cpu == WORK_CPU_UNBOUND)
cpu = raw_smp_processor_id();
-
- /*
- * It's multi cpu. If @work was previously on a different
- * cpu, it might still be running there, in which case the
- * work needs to be queued on that cpu to guarantee
- * non-reentrancy.
- */
pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
- last_pool = get_work_pool(work);
+ } else {
+ pwq = first_pwq(wq);
+ }

- if (last_pool && last_pool != pwq->pool) {
- struct worker *worker;
+ /*
+ * If @work was previously on a different pool, it might still be
+ * running there, in which case the work needs to be queued on that
+ * pool to guarantee non-reentrancy.
+ */
+ last_pool = get_work_pool(work);
+ if (last_pool && last_pool != pwq->pool) {
+ struct worker *worker;

- spin_lock(&last_pool->lock);
+ spin_lock(&last_pool->lock);

- worker = find_worker_executing_work(last_pool, work);
+ worker = find_worker_executing_work(last_pool, work);

- if (worker && worker->current_pwq->wq == wq) {
- pwq = per_cpu_ptr(wq->cpu_pwqs, last_pool->cpu);
- } else {
- /* meh... not running there, queue here */
- spin_unlock(&last_pool->lock);
- spin_lock(&pwq->pool->lock);
- }
+ if (worker && worker->current_pwq->wq == wq) {
+ pwq = worker->current_pwq;
} else {
+ /* meh... not running there, queue here */
+ spin_unlock(&last_pool->lock);
spin_lock(&pwq->pool->lock);
}
} else {
- pwq = first_pwq(wq);
spin_lock(&pwq->pool->lock);
}

--
1.8.1.2

2013-03-02 03:25:11

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 10/31] workqueue: remove workqueue_struct->pool_wq.single

workqueue->pool_wq union is used to point either to percpu pwqs
(pool_workqueues) or single unbound pwq. As the first pwq can be
accessed via workqueue->pwqs list, there's no reason for the single
pointer anymore.

Use list_first_entry(workqueue->pwqs) to access the unbound pwq and
drop workqueue->pool_wq.single pointer and the pool_wq union. It
simplifies the code and eases implementing multiple unbound pools w/
custom attributes.

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 26 ++++++++++++--------------
1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index cbdc2ac..79840b9 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -188,11 +188,7 @@ struct wq_flusher {
*/
struct workqueue_struct {
unsigned int flags; /* W: WQ_* flags */
- union {
- struct pool_workqueue __percpu *pcpu;
- struct pool_workqueue *single;
- unsigned long v;
- } pool_wq; /* I: pwq's */
+ struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwq's */
struct list_head pwqs; /* I: all pwqs of this wq */
struct list_head list; /* W: list of all workqueues */

@@ -471,9 +467,11 @@ static struct pool_workqueue *get_pwq(int cpu, struct workqueue_struct *wq)
{
if (!(wq->flags & WQ_UNBOUND)) {
if (likely(cpu < nr_cpu_ids))
- return per_cpu_ptr(wq->pool_wq.pcpu, cpu);
- } else if (likely(cpu == WORK_CPU_UNBOUND))
- return wq->pool_wq.single;
+ return per_cpu_ptr(wq->cpu_pwqs, cpu);
+ } else if (likely(cpu == WORK_CPU_UNBOUND)) {
+ return list_first_entry(&wq->pwqs, struct pool_workqueue,
+ pwqs_node);
+ }
return NULL;
}

@@ -3087,8 +3085,8 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
int cpu;

if (!(wq->flags & WQ_UNBOUND)) {
- wq->pool_wq.pcpu = alloc_percpu(struct pool_workqueue);
- if (!wq->pool_wq.pcpu)
+ wq->cpu_pwqs = alloc_percpu(struct pool_workqueue);
+ if (!wq->cpu_pwqs)
return -ENOMEM;

for_each_possible_cpu(cpu) {
@@ -3104,7 +3102,6 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
if (!pwq)
return -ENOMEM;

- wq->pool_wq.single = pwq;
pwq->pool = get_std_worker_pool(WORK_CPU_UNBOUND, highpri);
list_add_tail(&pwq->pwqs_node, &wq->pwqs);
}
@@ -3115,9 +3112,10 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
static void free_pwqs(struct workqueue_struct *wq)
{
if (!(wq->flags & WQ_UNBOUND))
- free_percpu(wq->pool_wq.pcpu);
- else
- kmem_cache_free(pwq_cache, wq->pool_wq.single);
+ free_percpu(wq->cpu_pwqs);
+ else if (!list_empty(&wq->pwqs))
+ kmem_cache_free(pwq_cache, list_first_entry(&wq->pwqs,
+ struct pool_workqueue, pwqs_node));
}

static int wq_clamp_max_active(int max_active, unsigned int flags,
--
1.8.1.2

2013-03-02 03:27:52

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 24/31] workqueue: prepare flush_workqueue() for dynamic creation and destruction of unbound pool_workqueues

Unbound pwqs (pool_workqueues) will be dynamically created and
destroyed with the scheduled unbound workqueue w/ custom attributes
support. This patch synchronizes pwq linking and unlinking against
flush_workqueue() so that its operation isn't disturbed by pwqs coming
and going.

Linking and unlinking a pwq on wq->pwqs are now also protected by
wq->flush_mutex, and a new pwq's work_color is initialized to
wq->work_color during linking. This ensures that pwq changes don't
disturb a flush_workqueue() in progress and that the new pwq's work
coloring stays in sync with the rest of the workqueue.

Taking flush_mutex during unlinking isn't strictly necessary but it's
simpler to do it anyway.
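
Condensed, both the link and unlink paths now nest the locks the same
way (a sketch of the hunks below):

  mutex_lock(&wq->flush_mutex);
  spin_lock_irq(&workqueue_lock);

  /* link: pwq->work_color = wq->work_color; add pwq to wq->pwqs */
  /* unlink: list_del_rcu(&pwq->pwqs_node) */

  spin_unlock_irq(&workqueue_lock);
  mutex_unlock(&wq->flush_mutex);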

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 25 +++++++++++++++++++++++--
1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e092cd5..0f0da59 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -122,6 +122,9 @@ enum {
* W: workqueue_lock protected.
*
* R: workqueue_lock protected for writes. Sched-RCU protected for reads.
+ *
+ * FR: wq->flush_mutex and workqueue_lock protected for writes. Sched-RCU
+ * protected for reads.
*/

/* struct worker is defined in workqueue_internal.h */
@@ -185,7 +188,7 @@ struct pool_workqueue {
int nr_active; /* L: nr of active works */
int max_active; /* L: max active works */
struct list_head delayed_works; /* L: delayed works */
- struct list_head pwqs_node; /* R: node on wq->pwqs */
+ struct list_head pwqs_node; /* FR: node on wq->pwqs */
struct list_head mayday_node; /* W: node on wq->maydays */

/*
@@ -214,7 +217,7 @@ struct wq_flusher {
struct workqueue_struct {
unsigned int flags; /* W: WQ_* flags */
struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwq's */
- struct list_head pwqs; /* R: all pwqs of this wq */
+ struct list_head pwqs; /* FR: all pwqs of this wq */
struct list_head list; /* W: list of all workqueues */

struct mutex flush_mutex; /* protects wq flushing */
@@ -3395,9 +3398,16 @@ static void pwq_unbound_release_workfn(struct work_struct *work)
if (WARN_ON_ONCE(!(wq->flags & WQ_UNBOUND)))
return;

+ /*
+ * Unlink @pwq. Synchronization against flush_mutex isn't strictly
+ * necessary on release but do it anyway. It's easier to verify
+ * and consistent with the linking path.
+ */
+ mutex_lock(&wq->flush_mutex);
spin_lock_irq(&workqueue_lock);
list_del_rcu(&pwq->pwqs_node);
spin_unlock_irq(&workqueue_lock);
+ mutex_unlock(&wq->flush_mutex);

put_unbound_pool(pool);
call_rcu_sched(&pwq->rcu, rcu_free_pwq);
@@ -3425,7 +3435,18 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
INIT_LIST_HEAD(&pwq->mayday_node);
INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);

+ /*
+ * Link @pwq and set the matching work_color. This is synchronized
+ * with flush_mutex to avoid confusing flush_workqueue().
+ */
+ mutex_lock(&wq->flush_mutex);
+ spin_lock_irq(&workqueue_lock);
+
+ pwq->work_color = wq->work_color;
list_add_tail_rcu(&pwq->pwqs_node, &wq->pwqs);
+
+ spin_unlock_irq(&workqueue_lock);
+ mutex_unlock(&wq->flush_mutex);
}

static int alloc_and_link_pwqs(struct workqueue_struct *wq)
--
1.8.1.2

2013-03-02 03:28:12

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 23/31] workqueue: implement get/put_pwq()

Add pool_workqueue->refcnt along with get/put_pwq(). Both per-cpu and
unbound pwqs have refcnts and any work item inserted on a pwq
increments the refcnt which is dropped when the work item finishes.

For per-cpu pwqs the base ref is never dropped and destroy_workqueue()
frees the pwqs as before. For unbound ones, destroy_workqueue()
simply drops the base ref on the first pwq. When the refcnt reaches
zero, pwq_unbound_release_workfn() is scheduled on system_wq, which
unlinks the pwq, puts the associated pool and frees the pwq and wq as
necessary. This needs to be done from a work item as put_pwq() needs
to be protected by pool->lock but release can't happen with the lock
held - e.g. put_unbound_pool() involves blocking operations.

Unbound pool->locks are marked with lockdep subclass 1 because
put_pwq() schedules the release work item on system_wq while holding
the unbound pool's lock, which would otherwise spuriously trigger a
recursive locking warning.

This will be used to implement dynamic creation and destruction of
unbound pwqs.
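
In condensed form, the refcounting introduced here works as below (a
sketch; all of this happens inside the workqueue core under the
matching pool->lock):

  insert_work(pwq, work, head, flags);    /* get_pwq(pwq) for the item */
  /* ... work item executes ... */
  pwq_dec_nr_in_flight(pwq, color);       /* put_pwq(pwq) when it's done */

  /* for unbound wqs, destroy_workqueue() drops the base ref; the last
   * put_pwq() punts to pwq_unbound_release_workfn() on system_wq */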

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 137 ++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 114 insertions(+), 23 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d0604ee..e092cd5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -179,6 +179,7 @@ struct pool_workqueue {
struct workqueue_struct *wq; /* I: the owning workqueue */
int work_color; /* L: current color */
int flush_color; /* L: flushing color */
+ int refcnt; /* L: reference count */
int nr_in_flight[WORK_NR_COLORS];
/* L: nr of in_flight works */
int nr_active; /* L: nr of active works */
@@ -186,6 +187,15 @@ struct pool_workqueue {
struct list_head delayed_works; /* L: delayed works */
struct list_head pwqs_node; /* R: node on wq->pwqs */
struct list_head mayday_node; /* W: node on wq->maydays */
+
+ /*
+ * Release of unbound pwq is punted to system_wq. See put_pwq()
+ * and pwq_unbound_release_workfn() for details. pool_workqueue
+ * itself is also sched-RCU protected so that the first pwq can be
+ * determined without grabbing workqueue_lock.
+ */
+ struct work_struct unbound_release_work;
+ struct rcu_head rcu;
} __aligned(1 << WORK_STRUCT_FLAG_BITS);

/*
@@ -936,6 +946,45 @@ static void move_linked_works(struct work_struct *work, struct list_head *head,
*nextp = n;
}

+/**
+ * get_pwq - get an extra reference on the specified pool_workqueue
+ * @pwq: pool_workqueue to get
+ *
+ * Obtain an extra reference on @pwq. The caller should guarantee that
+ * @pwq has positive refcnt and be holding the matching pool->lock.
+ */
+static void get_pwq(struct pool_workqueue *pwq)
+{
+ lockdep_assert_held(&pwq->pool->lock);
+ WARN_ON_ONCE(pwq->refcnt <= 0);
+ pwq->refcnt++;
+}
+
+/**
+ * put_pwq - put a pool_workqueue reference
+ * @pwq: pool_workqueue to put
+ *
+ * Drop a reference of @pwq. If its refcnt reaches zero, schedule its
+ * destruction. The caller should be holding the matching pool->lock.
+ */
+static void put_pwq(struct pool_workqueue *pwq)
+{
+ lockdep_assert_held(&pwq->pool->lock);
+ if (likely(--pwq->refcnt))
+ return;
+ if (WARN_ON_ONCE(!(pwq->wq->flags & WQ_UNBOUND)))
+ return;
+ /*
+ * @pwq can't be released under pool->lock, bounce to
+ * pwq_unbound_release_workfn(). This never recurses on the same
+ * pool->lock as this path is taken only for unbound workqueues and
+ * the release work item is scheduled on a per-cpu workqueue. To
+ * avoid lockdep warning, unbound pool->locks are given lockdep
+ * subclass of 1 in get_unbound_pool().
+ */
+ schedule_work(&pwq->unbound_release_work);
+}
+
static void pwq_activate_delayed_work(struct work_struct *work)
{
struct pool_workqueue *pwq = get_work_pwq(work);
@@ -967,9 +1016,9 @@ static void pwq_activate_first_delayed(struct pool_workqueue *pwq)
*/
static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, int color)
{
- /* ignore uncolored works */
+ /* uncolored work items don't participate in flushing or nr_active */
if (color == WORK_NO_COLOR)
- return;
+ goto out_put;

pwq->nr_in_flight[color]--;

@@ -982,11 +1031,11 @@ static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, int color)

/* is flush in progress and are we at the flushing tip? */
if (likely(pwq->flush_color != color))
- return;
+ goto out_put;

/* are there still in-flight works? */
if (pwq->nr_in_flight[color])
- return;
+ goto out_put;

/* this pwq is done, clear flush_color */
pwq->flush_color = -1;
@@ -997,6 +1046,8 @@ static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, int color)
*/
if (atomic_dec_and_test(&pwq->wq->nr_pwqs_to_flush))
complete(&pwq->wq->first_flusher->done);
+out_put:
+ put_pwq(pwq);
}

/**
@@ -1119,6 +1170,7 @@ static void insert_work(struct pool_workqueue *pwq, struct work_struct *work,
/* we own @work, set data and link */
set_work_pwq(work, pwq, extra_flags);
list_add_tail(&work->entry, head);
+ get_pwq(pwq);

/*
* Ensure either worker_sched_deactivated() sees the above
@@ -3294,6 +3346,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
if (!pool || init_worker_pool(pool) < 0)
goto fail;

+ lockdep_set_subclass(&pool->lock, 1); /* see put_pwq() */
copy_workqueue_attrs(pool->attrs, attrs);

if (worker_pool_assign_id(pool) < 0)
@@ -3322,7 +3375,41 @@ fail:
return NULL;
}

-/* initialize @pwq which interfaces with @pool for @wq and link it in */
+static void rcu_free_pwq(struct rcu_head *rcu)
+{
+ kmem_cache_free(pwq_cache,
+ container_of(rcu, struct pool_workqueue, rcu));
+}
+
+/*
+ * Scheduled on system_wq by put_pwq() when an unbound pwq hits zero refcnt
+ * and needs to be destroyed.
+ */
+static void pwq_unbound_release_workfn(struct work_struct *work)
+{
+ struct pool_workqueue *pwq = container_of(work, struct pool_workqueue,
+ unbound_release_work);
+ struct workqueue_struct *wq = pwq->wq;
+ struct worker_pool *pool = pwq->pool;
+
+ if (WARN_ON_ONCE(!(wq->flags & WQ_UNBOUND)))
+ return;
+
+ spin_lock_irq(&workqueue_lock);
+ list_del_rcu(&pwq->pwqs_node);
+ spin_unlock_irq(&workqueue_lock);
+
+ put_unbound_pool(pool);
+ call_rcu_sched(&pwq->rcu, rcu_free_pwq);
+
+ /*
+ * If we're the last pwq going away, @wq is already dead and no one
+ * is gonna access it anymore. Free it.
+ */
+ if (list_empty(&wq->pwqs))
+ kfree(wq);
+}
+
static void init_and_link_pwq(struct pool_workqueue *pwq,
struct workqueue_struct *wq,
struct worker_pool *pool)
@@ -3332,9 +3419,11 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
pwq->pool = pool;
pwq->wq = wq;
pwq->flush_color = -1;
+ pwq->refcnt = 1;
pwq->max_active = wq->saved_max_active;
INIT_LIST_HEAD(&pwq->delayed_works);
INIT_LIST_HEAD(&pwq->mayday_node);
+ INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);

list_add_tail_rcu(&pwq->pwqs_node, &wq->pwqs);
}
@@ -3377,15 +3466,6 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
return 0;
}

-static void free_pwqs(struct workqueue_struct *wq)
-{
- if (!(wq->flags & WQ_UNBOUND))
- free_percpu(wq->cpu_pwqs);
- else if (!list_empty(&wq->pwqs))
- kmem_cache_free(pwq_cache, list_first_entry(&wq->pwqs,
- struct pool_workqueue, pwqs_node));
-}
-
static int wq_clamp_max_active(int max_active, unsigned int flags,
const char *name)
{
@@ -3517,7 +3597,8 @@ void destroy_workqueue(struct workqueue_struct *wq)
}
}

- if (WARN_ON(pwq->nr_active) ||
+ if (WARN_ON(pwq->refcnt > 1) ||
+ WARN_ON(pwq->nr_active) ||
WARN_ON(!list_empty(&pwq->delayed_works))) {
spin_unlock_irq(&workqueue_lock);
return;
@@ -3538,17 +3619,27 @@ void destroy_workqueue(struct workqueue_struct *wq)
wq->rescuer = NULL;
}

- /*
- * We're the sole accessor of @wq at this point. Directly access
- * the first pwq and put its pool.
- */
- if (wq->flags & WQ_UNBOUND) {
+ if (!(wq->flags & WQ_UNBOUND)) {
+ /*
+ * The base ref is never dropped on per-cpu pwqs. Directly
+ * free the pwqs and wq.
+ */
+ free_percpu(wq->cpu_pwqs);
+ kfree(wq);
+ } else {
+ /*
+ * We're the sole accessor of @wq at this point. Directly
+ * access the first pwq and put the base ref. As both pwqs
+ * and pools are sched-RCU protected, the lock operations
+ * are safe. @wq will be freed when the last pwq is
+ * released.
+ */
pwq = list_first_entry(&wq->pwqs, struct pool_workqueue,
pwqs_node);
- put_unbound_pool(pwq->pool);
+ spin_lock_irq(&pwq->pool->lock);
+ put_pwq(pwq);
+ spin_unlock_irq(&pwq->pool->lock);
}
- free_pwqs(wq);
- kfree(wq);
}
EXPORT_SYMBOL_GPL(destroy_workqueue);

--
1.8.1.2

2013-03-02 03:28:38

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 22/31] workqueue: restructure __alloc_workqueue_key()

* Move initialization and linking of pool_workqueues into
init_and_link_pwq().

* Make the failure path use destroy_workqueue() once pool_workqueue
initialization succeeds.

These changes are to prepare for dynamic management of pool_workqueues
and don't introduce any functional changes.

While at it, convert list_del(&wq->list) to list_del_init() as a
precaution, since scheduled changes will make destruction more complex.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 67 +++++++++++++++++++++++++++++++-----------------------
1 file changed, 38 insertions(+), 29 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bcc02bb..d0604ee 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3322,6 +3322,23 @@ fail:
return NULL;
}

+/* initialize @pwq which interfaces with @pool for @wq and link it in */
+static void init_and_link_pwq(struct pool_workqueue *pwq,
+ struct workqueue_struct *wq,
+ struct worker_pool *pool)
+{
+ BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
+
+ pwq->pool = pool;
+ pwq->wq = wq;
+ pwq->flush_color = -1;
+ pwq->max_active = wq->saved_max_active;
+ INIT_LIST_HEAD(&pwq->delayed_works);
+ INIT_LIST_HEAD(&pwq->mayday_node);
+
+ list_add_tail_rcu(&pwq->pwqs_node, &wq->pwqs);
+}
+
static int alloc_and_link_pwqs(struct workqueue_struct *wq)
{
bool highpri = wq->flags & WQ_HIGHPRI;
@@ -3338,23 +3355,23 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
struct worker_pool *cpu_pools =
per_cpu(cpu_worker_pools, cpu);

- pwq->pool = &cpu_pools[highpri];
- list_add_tail_rcu(&pwq->pwqs_node, &wq->pwqs);
+ init_and_link_pwq(pwq, wq, &cpu_pools[highpri]);
}
} else {
struct pool_workqueue *pwq;
+ struct worker_pool *pool;

pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
if (!pwq)
return -ENOMEM;

- pwq->pool = get_unbound_pool(unbound_std_wq_attrs[highpri]);
- if (!pwq->pool) {
+ pool = get_unbound_pool(unbound_std_wq_attrs[highpri]);
+ if (!pool) {
kmem_cache_free(pwq_cache, pwq);
return -ENOMEM;
}

- list_add_tail_rcu(&pwq->pwqs_node, &wq->pwqs);
+ init_and_link_pwq(pwq, wq, pool);
}

return 0;
@@ -3399,7 +3416,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,

wq = kzalloc(sizeof(*wq) + namelen, GFP_KERNEL);
if (!wq)
- goto err;
+ return NULL;

vsnprintf(wq->name, namelen, fmt, args1);
va_end(args);
@@ -3422,18 +3439,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
INIT_LIST_HEAD(&wq->list);

if (alloc_and_link_pwqs(wq) < 0)
- goto err;
-
- local_irq_disable();
- for_each_pwq(pwq, wq) {
- BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
- pwq->wq = wq;
- pwq->flush_color = -1;
- pwq->max_active = max_active;
- INIT_LIST_HEAD(&pwq->delayed_works);
- INIT_LIST_HEAD(&pwq->mayday_node);
- }
- local_irq_enable();
+ goto err_free_wq;

/*
* Workqueues which may be used during memory reclaim should
@@ -3442,16 +3448,19 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
if (flags & WQ_MEM_RECLAIM) {
struct worker *rescuer;

- wq->rescuer = rescuer = alloc_worker();
+ rescuer = alloc_worker();
if (!rescuer)
- goto err;
+ goto err_destroy;

rescuer->rescue_wq = wq;
rescuer->task = kthread_create(rescuer_thread, rescuer, "%s",
wq->name);
- if (IS_ERR(rescuer->task))
- goto err;
+ if (IS_ERR(rescuer->task)) {
+ kfree(rescuer);
+ goto err_destroy;
+ }

+ wq->rescuer = rescuer;
rescuer->task->flags |= PF_THREAD_BOUND;
wake_up_process(rescuer->task);
}
@@ -3472,12 +3481,12 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
spin_unlock_irq(&workqueue_lock);

return wq;
-err:
- if (wq) {
- free_pwqs(wq);
- kfree(wq->rescuer);
- kfree(wq);
- }
+
+err_free_wq:
+ kfree(wq);
+ return NULL;
+err_destroy:
+ destroy_workqueue(wq);
return NULL;
}
EXPORT_SYMBOL_GPL(__alloc_workqueue_key);
@@ -3519,7 +3528,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
* wq list is used to freeze wq, remove from list after
* flushing is complete in case freeze races us.
*/
- list_del(&wq->list);
+ list_del_init(&wq->list);

spin_unlock_irq(&workqueue_lock);

--
1.8.1.2

2013-03-02 03:25:06

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 08/31] workqueue: add workqueue_struct->maydays list to replace mayday cpu iterators

Similar to how pool_workqueue iteration used to be, raising and
servicing mayday requests is based on CPU numbers. It's hairy because
cpumask_t may not be able to handle WORK_CPU_UNBOUND and cpumasks are
assumed to be always set on UP. This is ugly and can't handle the
multiple unbound pools which will be added for unbound workqueues w/
custom attributes.

Add workqueue_struct->maydays. When a pool_workqueue needs rescuing,
it gets chained on the list through pool_workqueue->mayday_node and
rescuer_thread() consumes the list until it's empty.

This patch doesn't introduce any visible behavior changes.
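
In condensed form, raising a mayday now boils down to the following
(as in the send_mayday() hunk below, with workqueue_lock held):

  if (list_empty(&pwq->mayday_node)) {
          list_add_tail(&pwq->mayday_node, &wq->maydays);
          wake_up_process(wq->rescuer->task);
  }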

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 77 ++++++++++++++++++++----------------------------------
1 file changed, 28 insertions(+), 49 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9f195aa..8b38d1c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -170,6 +170,7 @@ struct pool_workqueue {
int max_active; /* L: max active works */
struct list_head delayed_works; /* L: delayed works */
struct list_head pwqs_node; /* I: node on wq->pwqs */
+ struct list_head mayday_node; /* W: node on wq->maydays */
} __aligned(1 << WORK_STRUCT_FLAG_BITS);

/*
@@ -182,27 +183,6 @@ struct wq_flusher {
};

/*
- * All cpumasks are assumed to be always set on UP and thus can't be
- * used to determine whether there's something to be done.
- */
-#ifdef CONFIG_SMP
-typedef cpumask_var_t mayday_mask_t;
-#define mayday_test_and_set_cpu(cpu, mask) \
- cpumask_test_and_set_cpu((cpu), (mask))
-#define mayday_clear_cpu(cpu, mask) cpumask_clear_cpu((cpu), (mask))
-#define for_each_mayday_cpu(cpu, mask) for_each_cpu((cpu), (mask))
-#define alloc_mayday_mask(maskp, gfp) zalloc_cpumask_var((maskp), (gfp))
-#define free_mayday_mask(mask) free_cpumask_var((mask))
-#else
-typedef unsigned long mayday_mask_t;
-#define mayday_test_and_set_cpu(cpu, mask) test_and_set_bit(0, &(mask))
-#define mayday_clear_cpu(cpu, mask) clear_bit(0, &(mask))
-#define for_each_mayday_cpu(cpu, mask) if ((cpu) = 0, (mask))
-#define alloc_mayday_mask(maskp, gfp) true
-#define free_mayday_mask(mask) do { } while (0)
-#endif
-
-/*
* The externally visible workqueue abstraction is an array of
* per-CPU workqueues:
*/
@@ -224,7 +204,7 @@ struct workqueue_struct {
struct list_head flusher_queue; /* F: flush waiters */
struct list_head flusher_overflow; /* F: flush overflow list */

- mayday_mask_t mayday_mask; /* cpus requesting rescue */
+ struct list_head maydays; /* W: pwqs requesting rescue */
struct worker *rescuer; /* I: rescue worker */

int nr_drainers; /* W: drain in progress */
@@ -1852,23 +1832,21 @@ static void idle_worker_timeout(unsigned long __pool)
spin_unlock_irq(&pool->lock);
}

-static bool send_mayday(struct work_struct *work)
+static void send_mayday(struct work_struct *work)
{
struct pool_workqueue *pwq = get_work_pwq(work);
struct workqueue_struct *wq = pwq->wq;
- unsigned int cpu;
+
+ lockdep_assert_held(&workqueue_lock);

if (!(wq->flags & WQ_RESCUER))
- return false;
+ return;

/* mayday mayday mayday */
- cpu = pwq->pool->cpu;
- /* WORK_CPU_UNBOUND can't be set in cpumask, use cpu 0 instead */
- if (cpu == WORK_CPU_UNBOUND)
- cpu = 0;
- if (!mayday_test_and_set_cpu(cpu, wq->mayday_mask))
+ if (list_empty(&pwq->mayday_node)) {
+ list_add_tail(&pwq->mayday_node, &wq->maydays);
wake_up_process(wq->rescuer->task);
- return true;
+ }
}

static void pool_mayday_timeout(unsigned long __pool)
@@ -1876,7 +1854,8 @@ static void pool_mayday_timeout(unsigned long __pool)
struct worker_pool *pool = (void *)__pool;
struct work_struct *work;

- spin_lock_irq(&pool->lock);
+ spin_lock_irq(&workqueue_lock); /* for wq->maydays */
+ spin_lock(&pool->lock);

if (need_to_create_worker(pool)) {
/*
@@ -1889,7 +1868,8 @@ static void pool_mayday_timeout(unsigned long __pool)
send_mayday(work);
}

- spin_unlock_irq(&pool->lock);
+ spin_unlock(&pool->lock);
+ spin_unlock_irq(&workqueue_lock);

mod_timer(&pool->mayday_timer, jiffies + MAYDAY_INTERVAL);
}
@@ -2338,8 +2318,6 @@ static int rescuer_thread(void *__rescuer)
struct worker *rescuer = __rescuer;
struct workqueue_struct *wq = rescuer->rescue_wq;
struct list_head *scheduled = &rescuer->scheduled;
- bool is_unbound = wq->flags & WQ_UNBOUND;
- unsigned int cpu;

set_user_nice(current, RESCUER_NICE_LEVEL);

@@ -2357,18 +2335,19 @@ repeat:
return 0;
}

- /*
- * See whether any cpu is asking for help. Unbounded
- * workqueues use cpu 0 in mayday_mask for CPU_UNBOUND.
- */
- for_each_mayday_cpu(cpu, wq->mayday_mask) {
- unsigned int tcpu = is_unbound ? WORK_CPU_UNBOUND : cpu;
- struct pool_workqueue *pwq = get_pwq(tcpu, wq);
+ /* see whether any pwq is asking for help */
+ spin_lock_irq(&workqueue_lock);
+
+ while (!list_empty(&wq->maydays)) {
+ struct pool_workqueue *pwq = list_first_entry(&wq->maydays,
+ struct pool_workqueue, mayday_node);
struct worker_pool *pool = pwq->pool;
struct work_struct *work, *n;

__set_current_state(TASK_RUNNING);
- mayday_clear_cpu(cpu, wq->mayday_mask);
+ list_del_init(&pwq->mayday_node);
+
+ spin_unlock_irq(&workqueue_lock);

/* migrate to the target cpu if possible */
worker_maybe_bind_and_lock(pool);
@@ -2394,9 +2373,12 @@ repeat:
wake_up_worker(pool);

rescuer->pool = NULL;
- spin_unlock_irq(&pool->lock);
+ spin_unlock(&pool->lock);
+ spin_lock(&workqueue_lock);
}

+ spin_unlock_irq(&workqueue_lock);
+
/* rescuers should never participate in concurrency management */
WARN_ON_ONCE(!(rescuer->flags & WORKER_NOT_RUNNING));
schedule();
@@ -3194,6 +3176,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
INIT_LIST_HEAD(&wq->pwqs);
INIT_LIST_HEAD(&wq->flusher_queue);
INIT_LIST_HEAD(&wq->flusher_overflow);
+ INIT_LIST_HEAD(&wq->maydays);

lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
INIT_LIST_HEAD(&wq->list);
@@ -3207,14 +3190,12 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
pwq->flush_color = -1;
pwq->max_active = max_active;
INIT_LIST_HEAD(&pwq->delayed_works);
+ INIT_LIST_HEAD(&pwq->mayday_node);
}

if (flags & WQ_RESCUER) {
struct worker *rescuer;

- if (!alloc_mayday_mask(&wq->mayday_mask, GFP_KERNEL))
- goto err;
-
wq->rescuer = rescuer = alloc_worker();
if (!rescuer)
goto err;
@@ -3248,7 +3229,6 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
err:
if (wq) {
free_pwqs(wq);
- free_mayday_mask(wq->mayday_mask);
kfree(wq->rescuer);
kfree(wq);
}
@@ -3291,7 +3271,6 @@ void destroy_workqueue(struct workqueue_struct *wq)

if (wq->flags & WQ_RESCUER) {
kthread_stop(wq->rescuer->task);
- free_mayday_mask(wq->mayday_mask);
kfree(wq->rescuer);
}

--
1.8.1.2

2013-03-02 03:28:59

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 21/31] workqueue: drop WQ_RESCUER and test workqueue->rescuer for NULL instead

WQ_RESCUER is superfluous. WQ_MEM_RECLAIM indicates that the user
wants a rescuer and testing wq->rescuer for NULL can answer whether a
given workqueue has a rescuer or not. Drop WQ_RESCUER and test
wq->rescuer directly.

This will help simplifying __alloc_workqueue_key() failure path by
allowing it to use destroy_workqueue() on a partially constructed
workqueue, which in turn will help implementing dynamic management of
pool_workqueues.

While at it, clear wq->rescuer after freeing it in
destroy_workqueue(). This is a precaution as scheduled changes will
make destruction more complex.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 1 -
kernel/workqueue.c | 22 ++++++++++------------
2 files changed, 10 insertions(+), 13 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 2683e8e..0341403 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -294,7 +294,6 @@ enum {
WQ_CPU_INTENSIVE = 1 << 5, /* cpu instensive workqueue */

WQ_DRAINING = 1 << 6, /* internal: workqueue is draining */
- WQ_RESCUER = 1 << 7, /* internal: workqueue has rescuer */

WQ_MAX_ACTIVE = 512, /* I like 512, better ideas? */
WQ_MAX_UNBOUND_PER_CPU = 4, /* 4 * #cpus for unbound wq */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1695bd6..bcc02bb 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1825,7 +1825,7 @@ static void send_mayday(struct work_struct *work)

lockdep_assert_held(&workqueue_lock);

- if (!(wq->flags & WQ_RESCUER))
+ if (!wq->rescuer)
return;

/* mayday mayday mayday */
@@ -2283,7 +2283,7 @@ sleep:
* @__rescuer: self
*
* Workqueue rescuer thread function. There's one rescuer for each
- * workqueue which has WQ_RESCUER set.
+ * workqueue which has WQ_MEM_RECLAIM set.
*
* Regular work processing on a pool may block trying to create a new
* worker which uses GFP_KERNEL allocation which has slight chance of
@@ -2767,7 +2767,7 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr)
* flusher is not running on the same workqueue by verifying write
* access.
*/
- if (pwq->wq->saved_max_active == 1 || pwq->wq->flags & WQ_RESCUER)
+ if (pwq->wq->saved_max_active == 1 || pwq->wq->rescuer)
lock_map_acquire(&pwq->wq->lockdep_map);
else
lock_map_acquire_read(&pwq->wq->lockdep_map);
@@ -3405,13 +3405,6 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
va_end(args);
va_end(args1);

- /*
- * Workqueues which may be used during memory reclaim should
- * have a rescuer to guarantee forward progress.
- */
- if (flags & WQ_MEM_RECLAIM)
- flags |= WQ_RESCUER;
-
max_active = max_active ?: WQ_DFL_ACTIVE;
max_active = wq_clamp_max_active(max_active, flags, wq->name);

@@ -3442,7 +3435,11 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
}
local_irq_enable();

- if (flags & WQ_RESCUER) {
+ /*
+ * Workqueues which may be used during memory reclaim should
+ * have a rescuer to guarantee forward progress.
+ */
+ if (flags & WQ_MEM_RECLAIM) {
struct worker *rescuer;

wq->rescuer = rescuer = alloc_worker();
@@ -3526,9 +3523,10 @@ void destroy_workqueue(struct workqueue_struct *wq)

spin_unlock_irq(&workqueue_lock);

- if (wq->flags & WQ_RESCUER) {
+ if (wq->rescuer) {
kthread_stop(wq->rescuer->task);
kfree(wq->rescuer);
+ wq->rescuer = NULL;
}

/*
--
1.8.1.2

2013-03-02 03:25:04

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 07/31] workqueue: restructure pool / pool_workqueue iterations in freeze/thaw functions

The three freeze/thaw related functions - freeze_workqueues_begin(),
freeze_workqueues_busy() and thaw_workqueues() - need to iterate
through all pool_workqueues of all freezable workqueues. They did it
by first iterating pools and then visiting all pwqs (pool_workqueues)
of all workqueues, processing each one whose pwq->pool matches the
current pool. This is rather backwards and was done this way partly
because workqueue didn't have fitting iteration helpers and partly to
reduce the number of lock operations on pool->lock.

Workqueue now has fitting iterators and the locking operation overhead
isn't anything to worry about - those locks are unlikely to be
contended and the same CPU visiting the same set of locks multiple
times isn't expensive.

Restructure the three functions such that the flow better matches the
logical steps and pwq iteration is done using for_each_pwq() inside
workqueue iteration.

* freeze_workqueues_begin(): Setting of FREEZING is moved into a
separate for_each_pool() iteration. pwq iteration for clearing
max_active is updated as described above.

* freeze_workqueues_busy(): pwq iteration updated as described above.

* thaw_workqueues(): The single for_each_wq_cpu() iteration is broken
into three discrete steps - clearing FREEZING, restoring max_active,
and kicking workers. The first and last steps use for_each_pool()
and the second step uses pwq iteration described above.

This makes the code easier to understand and removes the use of
for_each_wq_cpu() for walking pwqs, which can't support multiple
unbound pwqs which will be needed to implement unbound workqueues with
custom attributes.

This patch doesn't introduce any visible behavior changes.
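
Condensed, the per-workqueue part of the new flow looks like this (a
sketch of the freeze/thaw hunks below):

  list_for_each_entry(wq, &workqueues, list) {
          if (!(wq->flags & WQ_FREEZABLE))
                  continue;

          for_each_pwq(pwq, wq) {
                  spin_lock(&pwq->pool->lock);
                  /* adjust pwq->max_active */
                  spin_unlock(&pwq->pool->lock);
          }
  }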

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 87 ++++++++++++++++++++++++++++--------------------------
1 file changed, 45 insertions(+), 42 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 869dbcc..9f195aa 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3598,6 +3598,8 @@ EXPORT_SYMBOL_GPL(work_on_cpu);
void freeze_workqueues_begin(void)
{
struct worker_pool *pool;
+ struct workqueue_struct *wq;
+ struct pool_workqueue *pwq;
int id;

spin_lock_irq(&workqueue_lock);
@@ -3605,23 +3607,24 @@ void freeze_workqueues_begin(void)
WARN_ON_ONCE(workqueue_freezing);
workqueue_freezing = true;

+ /* set FREEZING */
for_each_pool(pool, id) {
- struct workqueue_struct *wq;
-
spin_lock(&pool->lock);
-
WARN_ON_ONCE(pool->flags & POOL_FREEZING);
pool->flags |= POOL_FREEZING;
+ spin_unlock(&pool->lock);
+ }

- list_for_each_entry(wq, &workqueues, list) {
- struct pool_workqueue *pwq = get_pwq(pool->cpu, wq);
+ /* suppress further executions by setting max_active to zero */
+ list_for_each_entry(wq, &workqueues, list) {
+ if (!(wq->flags & WQ_FREEZABLE))
+ continue;

- if (pwq && pwq->pool == pool &&
- (wq->flags & WQ_FREEZABLE))
- pwq->max_active = 0;
+ for_each_pwq(pwq, wq) {
+ spin_lock(&pwq->pool->lock);
+ pwq->max_active = 0;
+ spin_unlock(&pwq->pool->lock);
}
-
- spin_unlock(&pool->lock);
}

spin_unlock_irq(&workqueue_lock);
@@ -3642,25 +3645,22 @@ void freeze_workqueues_begin(void)
*/
bool freeze_workqueues_busy(void)
{
- unsigned int cpu;
bool busy = false;
+ struct workqueue_struct *wq;
+ struct pool_workqueue *pwq;

spin_lock_irq(&workqueue_lock);

WARN_ON_ONCE(!workqueue_freezing);

- for_each_wq_cpu(cpu) {
- struct workqueue_struct *wq;
+ list_for_each_entry(wq, &workqueues, list) {
+ if (!(wq->flags & WQ_FREEZABLE))
+ continue;
/*
* nr_active is monotonically decreasing. It's safe
* to peek without lock.
*/
- list_for_each_entry(wq, &workqueues, list) {
- struct pool_workqueue *pwq = get_pwq(cpu, wq);
-
- if (!pwq || !(wq->flags & WQ_FREEZABLE))
- continue;
-
+ for_each_pwq(pwq, wq) {
WARN_ON_ONCE(pwq->nr_active < 0);
if (pwq->nr_active) {
busy = true;
@@ -3684,40 +3684,43 @@ out_unlock:
*/
void thaw_workqueues(void)
{
- unsigned int cpu;
+ struct workqueue_struct *wq;
+ struct pool_workqueue *pwq;
+ struct worker_pool *pool;
+ int id;

spin_lock_irq(&workqueue_lock);

if (!workqueue_freezing)
goto out_unlock;

- for_each_wq_cpu(cpu) {
- struct worker_pool *pool;
- struct workqueue_struct *wq;
-
- for_each_std_worker_pool(pool, cpu) {
- spin_lock(&pool->lock);
-
- WARN_ON_ONCE(!(pool->flags & POOL_FREEZING));
- pool->flags &= ~POOL_FREEZING;
-
- list_for_each_entry(wq, &workqueues, list) {
- struct pool_workqueue *pwq = get_pwq(cpu, wq);
-
- if (!pwq || pwq->pool != pool ||
- !(wq->flags & WQ_FREEZABLE))
- continue;
-
- /* restore max_active and repopulate worklist */
- pwq_set_max_active(pwq, wq->saved_max_active);
- }
+ /* clear FREEZING */
+ for_each_pool(pool, id) {
+ spin_lock(&pool->lock);
+ WARN_ON_ONCE(!(pool->flags & POOL_FREEZING));
+ pool->flags &= ~POOL_FREEZING;
+ spin_unlock(&pool->lock);
+ }

- wake_up_worker(pool);
+ /* restore max_active and repopulate worklist */
+ list_for_each_entry(wq, &workqueues, list) {
+ if (!(wq->flags & WQ_FREEZABLE))
+ continue;

- spin_unlock(&pool->lock);
+ for_each_pwq(pwq, wq) {
+ spin_lock(&pwq->pool->lock);
+ pwq_set_max_active(pwq, wq->saved_max_active);
+ spin_unlock(&pwq->pool->lock);
}
}

+ /* kick workers */
+ for_each_pool(pool, id) {
+ spin_lock(&pool->lock);
+ wake_up_worker(pool);
+ spin_unlock(&pool->lock);
+ }
+
workqueue_freezing = false;
out_unlock:
spin_unlock_irq(&workqueue_lock);
--
1.8.1.2

2013-03-02 03:29:29

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 20/31] workqueue: add pool ID to the names of unbound kworkers

There are gonna be multiple unbound pools. Include pool ID in the
name of unbound kworkers.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 95a3dcc..1695bd6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1705,7 +1705,8 @@ static struct worker *create_worker(struct worker_pool *pool)
"kworker/%d:%d%s", pool->cpu, id, pri);
else
worker->task = kthread_create(worker_thread, worker,
- "kworker/u:%d%s", id, pri);
+ "kworker/%du:%d%s",
+ pool->id, id, pri);
if (IS_ERR(worker->task))
goto fail;

--
1.8.1.2

2013-03-02 03:25:00

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 06/31] workqueue: introduce for_each_pool()

With the scheduled unbound pools with custom attributes, there will be
multiple unbound pools, so it will no longer be possible to use
for_each_wq_cpu() + for_each_std_worker_pool() to iterate through all
pools.

Introduce for_each_pool() which iterates through all pools using
worker_pool_idr and use it instead of for_each_wq_cpu() +
for_each_std_worker_pool() combination in freeze_workqueues_begin().
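
For reference, the new iterator is used as below (condensed from the
freeze_workqueues_begin() hunk that follows):

  struct worker_pool *pool;
  int id;

  for_each_pool(pool, id) {
          spin_lock(&pool->lock);
          /* e.g. set POOL_FREEZING on @pool */
          spin_unlock(&pool->lock);
  }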

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 36 +++++++++++++++++++++---------------
1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0055a31..869dbcc 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -295,6 +295,14 @@ static inline int __next_wq_cpu(int cpu, const struct cpumask *mask,
(cpu) = __next_wq_cpu((cpu), cpu_online_mask, 3))

/**
+ * for_each_pool - iterate through all worker_pools in the system
+ * @pool: iteration cursor
+ * @id: integer used for iteration
+ */
+#define for_each_pool(pool, id) \
+ idr_for_each_entry(&worker_pool_idr, pool, id)
+
+/**
* for_each_pwq - iterate through all pool_workqueues of the specified workqueue
* @pwq: iteration cursor
* @wq: the target workqueue
@@ -3589,33 +3597,31 @@ EXPORT_SYMBOL_GPL(work_on_cpu);
*/
void freeze_workqueues_begin(void)
{
- unsigned int cpu;
+ struct worker_pool *pool;
+ int id;

spin_lock_irq(&workqueue_lock);

WARN_ON_ONCE(workqueue_freezing);
workqueue_freezing = true;

- for_each_wq_cpu(cpu) {
- struct worker_pool *pool;
+ for_each_pool(pool, id) {
struct workqueue_struct *wq;

- for_each_std_worker_pool(pool, cpu) {
- spin_lock(&pool->lock);
-
- WARN_ON_ONCE(pool->flags & POOL_FREEZING);
- pool->flags |= POOL_FREEZING;
+ spin_lock(&pool->lock);

- list_for_each_entry(wq, &workqueues, list) {
- struct pool_workqueue *pwq = get_pwq(cpu, wq);
+ WARN_ON_ONCE(pool->flags & POOL_FREEZING);
+ pool->flags |= POOL_FREEZING;

- if (pwq && pwq->pool == pool &&
- (wq->flags & WQ_FREEZABLE))
- pwq->max_active = 0;
- }
+ list_for_each_entry(wq, &workqueues, list) {
+ struct pool_workqueue *pwq = get_pwq(pool->cpu, wq);

- spin_unlock(&pool->lock);
+ if (pwq && pwq->pool == pool &&
+ (wq->flags & WQ_FREEZABLE))
+ pwq->max_active = 0;
}
+
+ spin_unlock(&pool->lock);
}

spin_unlock_irq(&workqueue_lock);
--
1.8.1.2

2013-03-02 03:29:44

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 19/31] workqueue: drop "std" from cpu_std_worker_pools and for_each_std_worker_pool()

All per-cpu pools are standard, so there's no need to use both "cpu"
and "std" and for_each_std_worker_pool() is confusing in that it can
be used only for per-cpu pools.

* s/cpu_std_worker_pools/cpu_worker_pools/

* s/for_each_std_worker_pool()/for_each_cpu_worker_pool()/

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f7f627c..95a3dcc 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -252,9 +252,9 @@ EXPORT_SYMBOL_GPL(system_freezable_wq);
lockdep_is_held(&workqueue_lock), \
"sched RCU or workqueue lock should be held")

-#define for_each_std_worker_pool(pool, cpu) \
- for ((pool) = &per_cpu(cpu_std_worker_pools, cpu)[0]; \
- (pool) < &per_cpu(cpu_std_worker_pools, cpu)[NR_STD_WORKER_POOLS]; \
+#define for_each_cpu_worker_pool(pool, cpu) \
+ for ((pool) = &per_cpu(cpu_worker_pools, cpu)[0]; \
+ (pool) < &per_cpu(cpu_worker_pools, cpu)[NR_STD_WORKER_POOLS]; \
(pool)++)

#define for_each_busy_worker(worker, i, pos, pool) \
@@ -416,7 +416,7 @@ static bool workqueue_freezing; /* W: have wqs started freezing? */
* POOL_DISASSOCIATED set, and their workers have WORKER_UNBOUND set.
*/
static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
- cpu_std_worker_pools);
+ cpu_worker_pools);

/*
* idr of all pools. Modifications are protected by workqueue_lock. Read
@@ -3335,7 +3335,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
struct pool_workqueue *pwq =
per_cpu_ptr(wq->cpu_pwqs, cpu);
struct worker_pool *cpu_pools =
- per_cpu(cpu_std_worker_pools, cpu);
+ per_cpu(cpu_worker_pools, cpu);

pwq->pool = &cpu_pools[highpri];
list_add_tail_rcu(&pwq->pwqs_node, &wq->pwqs);
@@ -3688,7 +3688,7 @@ static void wq_unbind_fn(struct work_struct *work)
struct hlist_node *pos;
int i;

- for_each_std_worker_pool(pool, cpu) {
+ for_each_cpu_worker_pool(pool, cpu) {
WARN_ON_ONCE(cpu != smp_processor_id());

mutex_lock(&pool->assoc_mutex);
@@ -3731,7 +3731,7 @@ static void wq_unbind_fn(struct work_struct *work)
* unbound chain execution of pending work items if other workers
* didn't already.
*/
- for_each_std_worker_pool(pool, cpu)
+ for_each_cpu_worker_pool(pool, cpu)
atomic_set(&pool->nr_running, 0);
}

@@ -3748,7 +3748,7 @@ static int __cpuinit workqueue_cpu_up_callback(struct notifier_block *nfb,

switch (action & ~CPU_TASKS_FROZEN) {
case CPU_UP_PREPARE:
- for_each_std_worker_pool(pool, cpu) {
+ for_each_cpu_worker_pool(pool, cpu) {
struct worker *worker;

if (pool->nr_workers)
@@ -3766,7 +3766,7 @@ static int __cpuinit workqueue_cpu_up_callback(struct notifier_block *nfb,

case CPU_DOWN_FAILED:
case CPU_ONLINE:
- for_each_std_worker_pool(pool, cpu) {
+ for_each_cpu_worker_pool(pool, cpu) {
mutex_lock(&pool->assoc_mutex);
spin_lock_irq(&pool->lock);

@@ -4006,7 +4006,7 @@ static int __init init_workqueues(void)
struct worker_pool *pool;

i = 0;
- for_each_std_worker_pool(pool, cpu) {
+ for_each_cpu_worker_pool(pool, cpu) {
BUG_ON(init_worker_pool(pool));
pool->cpu = cpu;
cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
@@ -4021,7 +4021,7 @@ static int __init init_workqueues(void)
for_each_online_cpu(cpu) {
struct worker_pool *pool;

- for_each_std_worker_pool(pool, cpu) {
+ for_each_cpu_worker_pool(pool, cpu) {
struct worker *worker;

pool->flags &= ~POOL_DISASSOCIATED;
--
1.8.1.2

2013-03-02 03:24:59

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 05/31] workqueue: replace for_each_pwq_cpu() with for_each_pwq()

Introduce for_each_pwq() which iterates all pool_workqueues of a
workqueue using the recently added workqueue->pwqs list and replace
for_each_pwq_cpu() usages with it.

This is primarily to remove the single unbound CPU assumption from pwq
iteration for the scheduled unbound pools with custom attributes
support which would introduce multiple unbound pwqs per workqueue;
however, it also simplifies iterator users.
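
For reference, the new iterator reads as below (condensed from the
__alloc_workqueue_key() hunk further down):

  struct pool_workqueue *pwq;

  for_each_pwq(pwq, wq)
          pwq->max_active = 0;    /* e.g. while the wq is freezing */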

Note that pwq->pool initialization is moved to alloc_and_link_pwqs()
as that is now the only place which explicitly handles the two pwq
types.

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 53 ++++++++++++++++++++++-------------------------------
1 file changed, 22 insertions(+), 31 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d493293..0055a31 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -273,12 +273,6 @@ static inline int __next_wq_cpu(int cpu, const struct cpumask *mask,
return WORK_CPU_END;
}

-static inline int __next_pwq_cpu(int cpu, const struct cpumask *mask,
- struct workqueue_struct *wq)
-{
- return __next_wq_cpu(cpu, mask, !(wq->flags & WQ_UNBOUND) ? 1 : 2);
-}
-
/*
* CPU iterators
*
@@ -289,8 +283,6 @@ static inline int __next_pwq_cpu(int cpu, const struct cpumask *mask,
*
* for_each_wq_cpu() : possible CPUs + WORK_CPU_UNBOUND
* for_each_online_wq_cpu() : online CPUs + WORK_CPU_UNBOUND
- * for_each_pwq_cpu() : possible CPUs for bound workqueues,
- * WORK_CPU_UNBOUND for unbound workqueues
*/
#define for_each_wq_cpu(cpu) \
for ((cpu) = __next_wq_cpu(-1, cpu_possible_mask, 3); \
@@ -302,10 +294,13 @@ static inline int __next_pwq_cpu(int cpu, const struct cpumask *mask,
(cpu) < WORK_CPU_END; \
(cpu) = __next_wq_cpu((cpu), cpu_online_mask, 3))

-#define for_each_pwq_cpu(cpu, wq) \
- for ((cpu) = __next_pwq_cpu(-1, cpu_possible_mask, (wq)); \
- (cpu) < WORK_CPU_END; \
- (cpu) = __next_pwq_cpu((cpu), cpu_possible_mask, (wq)))
+/**
+ * for_each_pwq - iterate through all pool_workqueues of the specified workqueue
+ * @pwq: iteration cursor
+ * @wq: the target workqueue
+ */
+#define for_each_pwq(pwq, wq) \
+ list_for_each_entry((pwq), &(wq)->pwqs, pwqs_node)

#ifdef CONFIG_DEBUG_OBJECTS_WORK

@@ -2507,15 +2502,14 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
int flush_color, int work_color)
{
bool wait = false;
- unsigned int cpu;
+ struct pool_workqueue *pwq;

if (flush_color >= 0) {
WARN_ON_ONCE(atomic_read(&wq->nr_pwqs_to_flush));
atomic_set(&wq->nr_pwqs_to_flush, 1);
}

- for_each_pwq_cpu(cpu, wq) {
- struct pool_workqueue *pwq = get_pwq(cpu, wq);
+ for_each_pwq(pwq, wq) {
struct worker_pool *pool = pwq->pool;

spin_lock_irq(&pool->lock);
@@ -2714,7 +2708,7 @@ EXPORT_SYMBOL_GPL(flush_workqueue);
void drain_workqueue(struct workqueue_struct *wq)
{
unsigned int flush_cnt = 0;
- unsigned int cpu;
+ struct pool_workqueue *pwq;

/*
* __queue_work() needs to test whether there are drainers, is much
@@ -2728,8 +2722,7 @@ void drain_workqueue(struct workqueue_struct *wq)
reflush:
flush_workqueue(wq);

- for_each_pwq_cpu(cpu, wq) {
- struct pool_workqueue *pwq = get_pwq(cpu, wq);
+ for_each_pwq(pwq, wq) {
bool drained;

spin_lock_irq(&pwq->pool->lock);
@@ -3102,6 +3095,7 @@ int keventd_up(void)

static int alloc_and_link_pwqs(struct workqueue_struct *wq)
{
+ bool highpri = wq->flags & WQ_HIGHPRI;
int cpu;

if (!(wq->flags & WQ_UNBOUND)) {
@@ -3112,6 +3106,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
for_each_possible_cpu(cpu) {
struct pool_workqueue *pwq = get_pwq(cpu, wq);

+ pwq->pool = get_std_worker_pool(cpu, highpri);
list_add_tail(&pwq->pwqs_node, &wq->pwqs);
}
} else {
@@ -3122,6 +3117,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
return -ENOMEM;

wq->pool_wq.single = pwq;
+ pwq->pool = get_std_worker_pool(WORK_CPU_UNBOUND, highpri);
list_add_tail(&pwq->pwqs_node, &wq->pwqs);
}

@@ -3156,7 +3152,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
{
va_list args, args1;
struct workqueue_struct *wq;
- unsigned int cpu;
+ struct pool_workqueue *pwq;
size_t namelen;

/* determine namelen, allocate wq and format name */
@@ -3197,11 +3193,8 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
if (alloc_and_link_pwqs(wq) < 0)
goto err;

- for_each_pwq_cpu(cpu, wq) {
- struct pool_workqueue *pwq = get_pwq(cpu, wq);
-
+ for_each_pwq(pwq, wq) {
BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
- pwq->pool = get_std_worker_pool(cpu, flags & WQ_HIGHPRI);
pwq->wq = wq;
pwq->flush_color = -1;
pwq->max_active = max_active;
@@ -3236,8 +3229,8 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
spin_lock_irq(&workqueue_lock);

if (workqueue_freezing && wq->flags & WQ_FREEZABLE)
- for_each_pwq_cpu(cpu, wq)
- get_pwq(cpu, wq)->max_active = 0;
+ for_each_pwq(pwq, wq)
+ pwq->max_active = 0;

list_add(&wq->list, &workqueues);

@@ -3263,14 +3256,13 @@ EXPORT_SYMBOL_GPL(__alloc_workqueue_key);
*/
void destroy_workqueue(struct workqueue_struct *wq)
{
- unsigned int cpu;
+ struct pool_workqueue *pwq;

/* drain it before proceeding with destruction */
drain_workqueue(wq);

/* sanity checks */
- for_each_pwq_cpu(cpu, wq) {
- struct pool_workqueue *pwq = get_pwq(cpu, wq);
+ for_each_pwq(pwq, wq) {
int i;

for (i = 0; i < WORK_NR_COLORS; i++)
@@ -3332,7 +3324,7 @@ static void pwq_set_max_active(struct pool_workqueue *pwq, int max_active)
*/
void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
{
- unsigned int cpu;
+ struct pool_workqueue *pwq;

max_active = wq_clamp_max_active(max_active, wq->flags, wq->name);

@@ -3340,8 +3332,7 @@ void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)

wq->saved_max_active = max_active;

- for_each_pwq_cpu(cpu, wq) {
- struct pool_workqueue *pwq = get_pwq(cpu, wq);
+ for_each_pwq(pwq, wq) {
struct worker_pool *pool = pwq->pool;

spin_lock(&pool->lock);
--
1.8.1.2

2013-03-02 03:30:09

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 18/31] workqueue: remove unbound_std_worker_pools[] and related helpers

Workqueue no longer makes use of unbound_std_worker_pools[]. All
unbound worker_pools are created dynamically and there's nothing
special about the standard ones. With unbound_std_worker_pools[]
unused, workqueue no longer has places where it needs to treat the
per-cpu pools and unbound pools together.

Remove unbound_std_worker_pools[] and the helpers wrapping it to
present unified per-cpu and unbound standard worker_pools.

* for_each_std_worker_pool() now only walks through per-cpu pools.

* for_each[_online]_wq_cpu(), which no longer have any users, are
removed.

* std_worker_pools() and std_worker_pool_pri() are unused and removed.

* get_std_worker_pool() is removed. Its only user -
alloc_and_link_pwqs() - only used it for per-cpu pools anyway. Open
code per_cpu access in alloc_and_link_pwqs() instead.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 66 +++++-------------------------------------------------
1 file changed, 6 insertions(+), 60 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index fb91b67..f7f627c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -253,48 +253,13 @@ EXPORT_SYMBOL_GPL(system_freezable_wq);
"sched RCU or workqueue lock should be held")

#define for_each_std_worker_pool(pool, cpu) \
- for ((pool) = &std_worker_pools(cpu)[0]; \
- (pool) < &std_worker_pools(cpu)[NR_STD_WORKER_POOLS]; (pool)++)
+ for ((pool) = &per_cpu(cpu_std_worker_pools, cpu)[0]; \
+ (pool) < &per_cpu(cpu_std_worker_pools, cpu)[NR_STD_WORKER_POOLS]; \
+ (pool)++)

#define for_each_busy_worker(worker, i, pos, pool) \
hash_for_each(pool->busy_hash, i, pos, worker, hentry)

-static inline int __next_wq_cpu(int cpu, const struct cpumask *mask,
- unsigned int sw)
-{
- if (cpu < nr_cpu_ids) {
- if (sw & 1) {
- cpu = cpumask_next(cpu, mask);
- if (cpu < nr_cpu_ids)
- return cpu;
- }
- if (sw & 2)
- return WORK_CPU_UNBOUND;
- }
- return WORK_CPU_END;
-}
-
-/*
- * CPU iterators
- *
- * An extra cpu number is defined using an invalid cpu number
- * (WORK_CPU_UNBOUND) to host workqueues which are not bound to any
- * specific CPU. The following iterators are similar to for_each_*_cpu()
- * iterators but also considers the unbound CPU.
- *
- * for_each_wq_cpu() : possible CPUs + WORK_CPU_UNBOUND
- * for_each_online_wq_cpu() : online CPUs + WORK_CPU_UNBOUND
- */
-#define for_each_wq_cpu(cpu) \
- for ((cpu) = __next_wq_cpu(-1, cpu_possible_mask, 3); \
- (cpu) < WORK_CPU_END; \
- (cpu) = __next_wq_cpu((cpu), cpu_possible_mask, 3))
-
-#define for_each_online_wq_cpu(cpu) \
- for ((cpu) = __next_wq_cpu(-1, cpu_online_mask, 3); \
- (cpu) < WORK_CPU_END; \
- (cpu) = __next_wq_cpu((cpu), cpu_online_mask, 3))
-
/**
* for_each_pool - iterate through all worker_pools in the system
* @pool: iteration cursor
@@ -452,7 +417,6 @@ static bool workqueue_freezing; /* W: have wqs started freezing? */
*/
static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
cpu_std_worker_pools);
-static struct worker_pool unbound_std_worker_pools[NR_STD_WORKER_POOLS];

/*
* idr of all pools. Modifications are protected by workqueue_lock. Read
@@ -462,19 +426,6 @@ static DEFINE_IDR(worker_pool_idr);

static int worker_thread(void *__worker);

-static struct worker_pool *std_worker_pools(int cpu)
-{
- if (cpu != WORK_CPU_UNBOUND)
- return per_cpu(cpu_std_worker_pools, cpu);
- else
- return unbound_std_worker_pools;
-}
-
-static int std_worker_pool_pri(struct worker_pool *pool)
-{
- return pool - std_worker_pools(pool->cpu);
-}
-
/* allocate ID and assign it to @pool */
static int worker_pool_assign_id(struct worker_pool *pool)
{
@@ -492,13 +443,6 @@ static int worker_pool_assign_id(struct worker_pool *pool)
return ret;
}

-static struct worker_pool *get_std_worker_pool(int cpu, bool highpri)
-{
- struct worker_pool *pools = std_worker_pools(cpu);
-
- return &pools[highpri];
-}
-
/**
* first_pwq - return the first pool_workqueue of the specified workqueue
* @wq: the target workqueue
@@ -3390,8 +3334,10 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
for_each_possible_cpu(cpu) {
struct pool_workqueue *pwq =
per_cpu_ptr(wq->cpu_pwqs, cpu);
+ struct worker_pool *cpu_pools =
+ per_cpu(cpu_std_worker_pools, cpu);

- pwq->pool = get_std_worker_pool(cpu, highpri);
+ pwq->pool = &cpu_pools[highpri];
list_add_tail_rcu(&pwq->pwqs_node, &wq->pwqs);
}
} else {
--
1.8.1.2

2013-03-02 03:30:35

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 17/31] workqueue: implement attribute-based unbound worker_pool management

This patch makes unbound worker_pools reference counted and
dynamically created and destroyed as workqueues needing them come and
go. All unbound worker_pools are hashed on unbound_pool_hash which is
keyed by the content of worker_pool->attrs.

When an unbound workqueue is allocated, get_unbound_pool() is called
with the attributes of the workqueue. If there already is a matching
worker_pool, the reference count is bumped and the pool is returned.
If not, a new worker_pool with matching attributes is created and
returned.

When an unbound workqueue is destroyed, put_unbound_pool() is called
which decrements the reference count of the associated worker_pool.
If the refcnt reaches zero, the worker_pool is destroyed in sched-RCU
safe way.

Note that the standard unbound worker_pools - normal and highpri ones
with no specific cpumask affinity - are no longer created explicitly
during init_workqueues(). init_workqueues() only initializes
workqueue_attrs to be used for standard unbound pools -
unbound_std_wq_attrs[]. The pools are spawned on demand as workqueues
are created.
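
Condensed, the lifecycle implemented by the diff below boils down to
the following pairing (illustrative sketch only; the real callers are
alloc_and_link_pwqs() and destroy_workqueue()):

	/* setup: find a pool whose attrs match, or create one; refcnt++ */
	pwq->pool = get_unbound_pool(unbound_std_wq_attrs[highpri]);
	if (!pwq->pool)
		return -ENOMEM;

	/* teardown: refcnt--; at zero the pool is destroyed sched-RCU safely */
	put_unbound_pool(pwq->pool);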

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 230 ++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 218 insertions(+), 12 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7eba824..fb91b67 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -41,6 +41,7 @@
#include <linux/debug_locks.h>
#include <linux/lockdep.h>
#include <linux/idr.h>
+#include <linux/jhash.h>
#include <linux/hashtable.h>
#include <linux/rculist.h>

@@ -80,6 +81,7 @@ enum {

NR_STD_WORKER_POOLS = 2, /* # standard pools per cpu */

+ UNBOUND_POOL_HASH_ORDER = 6, /* hashed by pool->attrs */
BUSY_WORKER_HASH_ORDER = 6, /* 64 pointers */

MAX_IDLE_WORKERS_RATIO = 4, /* 1/4 of busy can be idle */
@@ -149,6 +151,8 @@ struct worker_pool {
struct ida worker_ida; /* L: for worker IDs */

struct workqueue_attrs *attrs; /* I: worker attributes */
+ struct hlist_node hash_node; /* R: unbound_pool_hash node */
+ atomic_t refcnt; /* refcnt for unbound pools */

/*
* The current concurrency level. As it's likely to be accessed
@@ -156,6 +160,12 @@ struct worker_pool {
* cacheline.
*/
atomic_t nr_running ____cacheline_aligned_in_smp;
+
+ /*
+ * Destruction of pool is sched-RCU protected to allow dereferences
+ * from get_work_pool().
+ */
+ struct rcu_head rcu;
} ____cacheline_aligned_in_smp;

/*
@@ -218,6 +228,11 @@ struct workqueue_struct {

static struct kmem_cache *pwq_cache;

+/* hash of all unbound pools keyed by pool->attrs */
+static DEFINE_HASHTABLE(unbound_pool_hash, UNBOUND_POOL_HASH_ORDER);
+
+static struct workqueue_attrs *unbound_std_wq_attrs[NR_STD_WORKER_POOLS];
+
struct workqueue_struct *system_wq __read_mostly;
EXPORT_SYMBOL_GPL(system_wq);
struct workqueue_struct *system_highpri_wq __read_mostly;
@@ -1740,7 +1755,7 @@ static struct worker *create_worker(struct worker_pool *pool)
worker->pool = pool;
worker->id = id;

- if (pool->cpu != WORK_CPU_UNBOUND)
+ if (pool->cpu >= 0)
worker->task = kthread_create_on_node(worker_thread,
worker, cpu_to_node(pool->cpu),
"kworker/%d:%d%s", pool->cpu, id, pri);
@@ -3159,6 +3174,54 @@ fail:
return NULL;
}

+static void copy_workqueue_attrs(struct workqueue_attrs *to,
+ const struct workqueue_attrs *from)
+{
+ to->nice = from->nice;
+ cpumask_copy(to->cpumask, from->cpumask);
+}
+
+/*
+ * Hacky implementation of jhash of bitmaps which only considers the
+ * specified number of bits. We probably want a proper implementation in
+ * include/linux/jhash.h.
+ */
+static u32 jhash_bitmap(const unsigned long *bitmap, int bits, u32 hash)
+{
+ int nr_longs = bits / BITS_PER_LONG;
+ int nr_leftover = bits % BITS_PER_LONG;
+ unsigned long leftover = 0;
+
+ if (nr_longs)
+ hash = jhash(bitmap, nr_longs * sizeof(long), hash);
+ if (nr_leftover) {
+ bitmap_copy(&leftover, bitmap + nr_longs, nr_leftover);
+ hash = jhash(&leftover, sizeof(long), hash);
+ }
+ return hash;
+}
+
+/* hash value of the content of @attr */
+static u32 wqattrs_hash(const struct workqueue_attrs *attrs)
+{
+ u32 hash = 0;
+
+ hash = jhash_1word(attrs->nice, hash);
+ hash = jhash_bitmap(cpumask_bits(attrs->cpumask), nr_cpu_ids, hash);
+ return hash;
+}
+
+/* content equality test */
+static bool wqattrs_equal(const struct workqueue_attrs *a,
+ const struct workqueue_attrs *b)
+{
+ if (a->nice != b->nice)
+ return false;
+ if (!cpumask_equal(a->cpumask, b->cpumask))
+ return false;
+ return true;
+}
+
/**
* init_worker_pool - initialize a newly zalloc'd worker_pool
* @pool: worker_pool to initialize
@@ -3169,6 +3232,8 @@ fail:
static int init_worker_pool(struct worker_pool *pool)
{
spin_lock_init(&pool->lock);
+ pool->id = -1;
+ pool->cpu = -1;
pool->flags |= POOL_DISASSOCIATED;
INIT_LIST_HEAD(&pool->worklist);
INIT_LIST_HEAD(&pool->idle_list);
@@ -3185,12 +3250,133 @@ static int init_worker_pool(struct worker_pool *pool)
mutex_init(&pool->assoc_mutex);
ida_init(&pool->worker_ida);

+ INIT_HLIST_NODE(&pool->hash_node);
+ atomic_set(&pool->refcnt, 1);
pool->attrs = alloc_workqueue_attrs(GFP_KERNEL);
if (!pool->attrs)
return -ENOMEM;
return 0;
}

+static void rcu_free_pool(struct rcu_head *rcu)
+{
+ struct worker_pool *pool = container_of(rcu, struct worker_pool, rcu);
+
+ ida_destroy(&pool->worker_ida);
+ free_workqueue_attrs(pool->attrs);
+ kfree(pool);
+}
+
+/**
+ * put_unbound_pool - put a worker_pool
+ * @pool: worker_pool to put
+ *
+ * Put @pool. If its refcnt reaches zero, it gets destroyed in sched-RCU
+ * safe manner.
+ */
+static void put_unbound_pool(struct worker_pool *pool)
+{
+ struct worker *worker;
+
+ if (!atomic_dec_and_test(&pool->refcnt))
+ return;
+
+ /* sanity checks */
+ if (WARN_ON(!(pool->flags & POOL_DISASSOCIATED)))
+ return;
+ if (WARN_ON(pool->nr_workers != pool->nr_idle))
+ return;
+ if (WARN_ON(!list_empty(&pool->worklist)))
+ return;
+
+ /* release id and unhash */
+ spin_lock_irq(&workqueue_lock);
+ if (pool->id >= 0)
+ idr_remove(&worker_pool_idr, pool->id);
+ hash_del(&pool->hash_node);
+ spin_unlock_irq(&workqueue_lock);
+
+ /* lock out manager and destroy all workers */
+ mutex_lock(&pool->manager_mutex);
+ spin_lock_irq(&pool->lock);
+
+ while ((worker = first_worker(pool)))
+ destroy_worker(worker);
+ WARN_ON(pool->nr_workers || pool->nr_idle);
+
+ spin_unlock_irq(&pool->lock);
+ mutex_unlock(&pool->manager_mutex);
+
+ /* shut down the timers */
+ del_timer_sync(&pool->idle_timer);
+ del_timer_sync(&pool->mayday_timer);
+
+ /* sched-RCU protected to allow dereferences from get_work_pool() */
+ call_rcu_sched(&pool->rcu, rcu_free_pool);
+}
+
+/**
+ * get_unbound_pool - get a worker_pool with the specified attributes
+ * @attrs: the attributes of the worker_pool to get
+ *
+ * Obtain a worker_pool which has the same attributes as @attrs, bump the
+ * reference count and return it. If there already is a matching
+ * worker_pool, it will be used; otherwise, this function attempts to
+ * create a new one. On failure, returns NULL.
+ */
+static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
+{
+ static DEFINE_MUTEX(create_mutex);
+ u32 hash = wqattrs_hash(attrs);
+ struct worker_pool *pool;
+ struct hlist_node *tmp;
+ struct worker *worker;
+
+ mutex_lock(&create_mutex);
+
+ /* do we already have a matching pool? */
+ spin_lock_irq(&workqueue_lock);
+ hash_for_each_possible(unbound_pool_hash, pool, tmp, hash_node, hash) {
+ if (wqattrs_equal(pool->attrs, attrs)) {
+ atomic_inc(&pool->refcnt);
+ goto out_unlock;
+ }
+ }
+ spin_unlock_irq(&workqueue_lock);
+
+ /* nope, create a new one */
+ pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+ if (!pool || init_worker_pool(pool) < 0)
+ goto fail;
+
+ copy_workqueue_attrs(pool->attrs, attrs);
+
+ if (worker_pool_assign_id(pool) < 0)
+ goto fail;
+
+ /* create and start the initial worker */
+ worker = create_worker(pool);
+ if (!worker)
+ goto fail;
+
+ spin_lock_irq(&pool->lock);
+ start_worker(worker);
+ spin_unlock_irq(&pool->lock);
+
+ /* install */
+ spin_lock_irq(&workqueue_lock);
+ hash_add(unbound_pool_hash, &pool->hash_node, hash);
+out_unlock:
+ spin_unlock_irq(&workqueue_lock);
+ mutex_unlock(&create_mutex);
+ return pool;
+fail:
+ mutex_unlock(&create_mutex);
+ if (pool)
+ put_unbound_pool(pool);
+ return NULL;
+}
+
static int alloc_and_link_pwqs(struct workqueue_struct *wq)
{
bool highpri = wq->flags & WQ_HIGHPRI;
@@ -3215,7 +3401,12 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
if (!pwq)
return -ENOMEM;

- pwq->pool = get_std_worker_pool(WORK_CPU_UNBOUND, highpri);
+ pwq->pool = get_unbound_pool(unbound_std_wq_attrs[highpri]);
+ if (!pwq->pool) {
+ kmem_cache_free(pwq_cache, pwq);
+ return -ENOMEM;
+ }
+
list_add_tail_rcu(&pwq->pwqs_node, &wq->pwqs);
}

@@ -3393,6 +3584,15 @@ void destroy_workqueue(struct workqueue_struct *wq)
kfree(wq->rescuer);
}

+ /*
+ * We're the sole accessor of @wq at this point. Directly access
+ * the first pwq and put its pool.
+ */
+ if (wq->flags & WQ_UNBOUND) {
+ pwq = list_first_entry(&wq->pwqs, struct pool_workqueue,
+ pwqs_node);
+ put_unbound_pool(pwq->pool);
+ }
free_pwqs(wq);
kfree(wq);
}
@@ -3856,19 +4056,14 @@ static int __init init_workqueues(void)
hotcpu_notifier(workqueue_cpu_down_callback, CPU_PRI_WORKQUEUE_DOWN);

/* initialize CPU pools */
- for_each_wq_cpu(cpu) {
+ for_each_possible_cpu(cpu) {
struct worker_pool *pool;

i = 0;
for_each_std_worker_pool(pool, cpu) {
BUG_ON(init_worker_pool(pool));
pool->cpu = cpu;
-
- if (cpu != WORK_CPU_UNBOUND)
- cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
- else
- cpumask_setall(pool->attrs->cpumask);
-
+ cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
pool->attrs->nice = std_nice[i++];

/* alloc pool ID */
@@ -3877,14 +4072,13 @@ static int __init init_workqueues(void)
}

/* create the initial worker */
- for_each_online_wq_cpu(cpu) {
+ for_each_online_cpu(cpu) {
struct worker_pool *pool;

for_each_std_worker_pool(pool, cpu) {
struct worker *worker;

- if (cpu != WORK_CPU_UNBOUND)
- pool->flags &= ~POOL_DISASSOCIATED;
+ pool->flags &= ~POOL_DISASSOCIATED;

worker = create_worker(pool);
BUG_ON(!worker);
@@ -3894,6 +4088,18 @@ static int __init init_workqueues(void)
}
}

+ /* create default unbound wq attrs */
+ for (i = 0; i < NR_STD_WORKER_POOLS; i++) {
+ struct workqueue_attrs *attrs;
+
+ BUG_ON(!(attrs = alloc_workqueue_attrs(GFP_KERNEL)));
+
+ attrs->nice = std_nice[i];
+ cpumask_setall(attrs->cpumask);
+
+ unbound_std_wq_attrs[i] = attrs;
+ }
+
system_wq = alloc_workqueue("events", 0, 0);
system_highpri_wq = alloc_workqueue("events_highpri", WQ_HIGHPRI, 0);
system_long_wq = alloc_workqueue("events_long", 0, 0);
--
1.8.1.2

2013-03-02 03:24:57

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 04/31] workqueue: add workqueue_struct->pwqs list

Add workqueue_struct->pwqs list and chain all pool_workqueues
belonging to a workqueue there. This will be used to implement
generic pool_workqueue iteration and handle multiple pool_workqueues
for the scheduled unbound pools with custom attributes.

This patch doesn't introduce any visible behavior changes.
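
As a rough sketch of the kind of traversal this list enables (the
dedicated for_each_pwq() iterator is only added by a later patch in
the series), walking every pool_workqueue of a workqueue becomes a
plain list walk:

	struct pool_workqueue *pwq;

	/* covers per-cpu and unbound pwqs alike */
	list_for_each_entry(pwq, &wq->pwqs, pwqs_node)
		pr_info("pwq %p max_active=%d\n", pwq, pwq->max_active);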

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 33 +++++++++++++++++++++++++++------
1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 69f1268..d493293 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -169,6 +169,7 @@ struct pool_workqueue {
int nr_active; /* L: nr of active works */
int max_active; /* L: max active works */
struct list_head delayed_works; /* L: delayed works */
+ struct list_head pwqs_node; /* I: node on wq->pwqs */
} __aligned(1 << WORK_STRUCT_FLAG_BITS);

/*
@@ -212,6 +213,7 @@ struct workqueue_struct {
struct pool_workqueue *single;
unsigned long v;
} pool_wq; /* I: pwq's */
+ struct list_head pwqs; /* I: all pwqs of this wq */
struct list_head list; /* W: list of all workqueues */

struct mutex flush_mutex; /* protects wq flushing */
@@ -3098,14 +3100,32 @@ int keventd_up(void)
return system_wq != NULL;
}

-static int alloc_pwqs(struct workqueue_struct *wq)
+static int alloc_and_link_pwqs(struct workqueue_struct *wq)
{
- if (!(wq->flags & WQ_UNBOUND))
+ int cpu;
+
+ if (!(wq->flags & WQ_UNBOUND)) {
wq->pool_wq.pcpu = alloc_percpu(struct pool_workqueue);
- else
- wq->pool_wq.single = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
+ if (!wq->pool_wq.pcpu)
+ return -ENOMEM;
+
+ for_each_possible_cpu(cpu) {
+ struct pool_workqueue *pwq = get_pwq(cpu, wq);

- return wq->pool_wq.v ? 0 : -ENOMEM;
+ list_add_tail(&pwq->pwqs_node, &wq->pwqs);
+ }
+ } else {
+ struct pool_workqueue *pwq;
+
+ pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
+ if (!pwq)
+ return -ENOMEM;
+
+ wq->pool_wq.single = pwq;
+ list_add_tail(&pwq->pwqs_node, &wq->pwqs);
+ }
+
+ return 0;
}

static void free_pwqs(struct workqueue_struct *wq)
@@ -3167,13 +3187,14 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
wq->saved_max_active = max_active;
mutex_init(&wq->flush_mutex);
atomic_set(&wq->nr_pwqs_to_flush, 0);
+ INIT_LIST_HEAD(&wq->pwqs);
INIT_LIST_HEAD(&wq->flusher_queue);
INIT_LIST_HEAD(&wq->flusher_overflow);

lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
INIT_LIST_HEAD(&wq->list);

- if (alloc_pwqs(wq) < 0)
+ if (alloc_and_link_pwqs(wq) < 0)
goto err;

for_each_pwq_cpu(cpu, wq) {
--
1.8.1.2

2013-03-02 03:31:04

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 16/31] workqueue: introduce workqueue_attrs

Introduce struct workqueue_attrs which carries worker attributes -
currently the nice level and allowed cpumask along with helper
routines alloc_workqueue_attrs() and free_workqueue_attrs().

Each worker_pool now carries ->attrs describing the attributes of its
workers. All functions dealing with cpumask and nice level of workers
are updated to follow worker_pool->attrs instead of determining them
from other characteristics of the worker_pool, and init_workqueues()
is updated to set worker_pool->attrs appropriately for all standard
pools.

Note that create_worker() is updated to always perform set_user_nice()
and use set_cpus_allowed_ptr() combined with manual assertion of
PF_THREAD_BOUND instead of kthread_bind(). This simplifies handling
arbitrary attributes without affecting the outcome.

This patch doesn't introduce any behavior changes.
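
A minimal usage sketch of the new helpers; apply_workqueue_attrs()
here assumes the later patch in this series which introduces it, and
the attributes are copied by the workqueue code so the caller may free
them right away:

	struct workqueue_attrs *attrs;
	int ret;

	attrs = alloc_workqueue_attrs(GFP_KERNEL);
	if (!attrs)
		return -ENOMEM;

	attrs->nice = -10;				/* higher priority workers */
	cpumask_copy(attrs->cpumask, cpumask_of(0));	/* restrict to CPU 0 */

	ret = apply_workqueue_attrs(wq, attrs);		/* @wq must be WQ_UNBOUND */
	free_workqueue_attrs(attrs);
	return ret;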

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 12 ++++++
kernel/workqueue.c | 103 ++++++++++++++++++++++++++++++++++++----------
2 files changed, 93 insertions(+), 22 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 899be66..2683e8e 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -115,6 +115,15 @@ struct delayed_work {
int cpu;
};

+/*
+ * A struct for workqueue attributes. This can be used to change
+ * attributes of an unbound workqueue.
+ */
+struct workqueue_attrs {
+ int nice; /* nice level */
+ cpumask_var_t cpumask; /* allowed CPUs */
+};
+
static inline struct delayed_work *to_delayed_work(struct work_struct *work)
{
return container_of(work, struct delayed_work, work);
@@ -399,6 +408,9 @@ __alloc_workqueue_key(const char *fmt, unsigned int flags, int max_active,

extern void destroy_workqueue(struct workqueue_struct *wq);

+struct workqueue_attrs *alloc_workqueue_attrs(gfp_t gfp_mask);
+void free_workqueue_attrs(struct workqueue_attrs *attrs);
+
extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
struct work_struct *work);
extern bool queue_work(struct workqueue_struct *wq, struct work_struct *work);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f97539b..7eba824 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -148,6 +148,8 @@ struct worker_pool {
struct mutex assoc_mutex; /* protect POOL_DISASSOCIATED */
struct ida worker_ida; /* L: for worker IDs */

+ struct workqueue_attrs *attrs; /* I: worker attributes */
+
/*
* The current concurrency level. As it's likely to be accessed
* from other CPUs during try_to_wake_up(), put it in a separate
@@ -1563,14 +1565,13 @@ __acquires(&pool->lock)
* against POOL_DISASSOCIATED.
*/
if (!(pool->flags & POOL_DISASSOCIATED))
- set_cpus_allowed_ptr(current, get_cpu_mask(pool->cpu));
+ set_cpus_allowed_ptr(current, pool->attrs->cpumask);

spin_lock_irq(&pool->lock);
if (pool->flags & POOL_DISASSOCIATED)
return false;
if (task_cpu(current) == pool->cpu &&
- cpumask_equal(&current->cpus_allowed,
- get_cpu_mask(pool->cpu)))
+ cpumask_equal(&current->cpus_allowed, pool->attrs->cpumask))
return true;
spin_unlock_irq(&pool->lock);

@@ -1677,7 +1678,7 @@ static void rebind_workers(struct worker_pool *pool)
* wq doesn't really matter but let's keep @worker->pool
* and @pwq->pool consistent for sanity.
*/
- if (std_worker_pool_pri(worker->pool))
+ if (worker->pool->attrs->nice < 0)
wq = system_highpri_wq;
else
wq = system_wq;
@@ -1719,7 +1720,7 @@ static struct worker *alloc_worker(void)
*/
static struct worker *create_worker(struct worker_pool *pool)
{
- const char *pri = std_worker_pool_pri(pool) ? "H" : "";
+ const char *pri = pool->attrs->nice < 0 ? "H" : "";
struct worker *worker = NULL;
int id = -1;

@@ -1749,24 +1750,23 @@ static struct worker *create_worker(struct worker_pool *pool)
if (IS_ERR(worker->task))
goto fail;

- if (std_worker_pool_pri(pool))
- set_user_nice(worker->task, HIGHPRI_NICE_LEVEL);
+ set_user_nice(worker->task, pool->attrs->nice);
+ set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);

/*
- * Determine CPU binding of the new worker depending on
- * %POOL_DISASSOCIATED. The caller is responsible for ensuring the
- * flag remains stable across this function. See the comments
- * above the flag definition for details.
- *
- * As an unbound worker may later become a regular one if CPU comes
- * online, make sure every worker has %PF_THREAD_BOUND set.
+ * %PF_THREAD_BOUND is used to prevent userland from meddling with
+ * cpumask of workqueue workers. This is an abuse. We need
+ * %PF_KERNEL_CPUMASK.
*/
- if (!(pool->flags & POOL_DISASSOCIATED)) {
- kthread_bind(worker->task, pool->cpu);
- } else {
- worker->task->flags |= PF_THREAD_BOUND;
+ worker->task->flags |= PF_THREAD_BOUND;
+
+ /*
+ * The caller is responsible for ensuring %POOL_DISASSOCIATED
+ * remains stable across this function. See the comments above the
+ * flag definition for details.
+ */
+ if (pool->flags & POOL_DISASSOCIATED)
worker->flags |= WORKER_UNBOUND;
- }

return worker;
fail:
@@ -3121,7 +3121,52 @@ int keventd_up(void)
return system_wq != NULL;
}

-static void init_worker_pool(struct worker_pool *pool)
+/**
+ * free_workqueue_attrs - free a workqueue_attrs
+ * @attrs: workqueue_attrs to free
+ *
+ * Undo alloc_workqueue_attrs().
+ */
+void free_workqueue_attrs(struct workqueue_attrs *attrs)
+{
+ if (attrs) {
+ free_cpumask_var(attrs->cpumask);
+ kfree(attrs);
+ }
+}
+
+/**
+ * alloc_workqueue_attrs - allocate a workqueue_attrs
+ * @gfp_mask: allocation mask to use
+ *
+ * Allocate a new workqueue_attrs, initialize with default settings and
+ * return it. Returns NULL on failure.
+ */
+struct workqueue_attrs *alloc_workqueue_attrs(gfp_t gfp_mask)
+{
+ struct workqueue_attrs *attrs;
+
+ attrs = kzalloc(sizeof(*attrs), gfp_mask);
+ if (!attrs)
+ goto fail;
+ if (!alloc_cpumask_var(&attrs->cpumask, gfp_mask))
+ goto fail;
+
+ cpumask_setall(attrs->cpumask);
+ return attrs;
+fail:
+ free_workqueue_attrs(attrs);
+ return NULL;
+}
+
+/**
+ * init_worker_pool - initialize a newly zalloc'd worker_pool
+ * @pool: worker_pool to initialize
+ *
+ * Initiailize a newly zalloc'd @pool. It also allocates @pool->attrs.
+ * Returns 0 on success, -errno on failure.
+ */
+static int init_worker_pool(struct worker_pool *pool)
{
spin_lock_init(&pool->lock);
pool->flags |= POOL_DISASSOCIATED;
@@ -3139,6 +3184,11 @@ static void init_worker_pool(struct worker_pool *pool)
mutex_init(&pool->manager_mutex);
mutex_init(&pool->assoc_mutex);
ida_init(&pool->worker_ida);
+
+ pool->attrs = alloc_workqueue_attrs(GFP_KERNEL);
+ if (!pool->attrs)
+ return -ENOMEM;
+ return 0;
}

static int alloc_and_link_pwqs(struct workqueue_struct *wq)
@@ -3791,7 +3841,8 @@ out_unlock:

static int __init init_workqueues(void)
{
- int cpu;
+ int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
+ int i, cpu;

/* make sure we have enough bits for OFFQ pool ID */
BUILD_BUG_ON((1LU << (BITS_PER_LONG - WORK_OFFQ_POOL_SHIFT)) <
@@ -3808,10 +3859,18 @@ static int __init init_workqueues(void)
for_each_wq_cpu(cpu) {
struct worker_pool *pool;

+ i = 0;
for_each_std_worker_pool(pool, cpu) {
- init_worker_pool(pool);
+ BUG_ON(init_worker_pool(pool));
pool->cpu = cpu;

+ if (cpu != WORK_CPU_UNBOUND)
+ cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
+ else
+ cpumask_setall(pool->attrs->cpumask);
+
+ pool->attrs->nice = std_nice[i++];
+
/* alloc pool ID */
BUG_ON(worker_pool_assign_id(pool));
}
--
1.8.1.2

2013-03-02 03:24:55

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 03/31] workqueue: introduce kmem_cache for pool_workqueues

pool_workqueues need to be aligned to 1 << WORK_STRUCT_FLAG_BITS as
the lower bits of work->data are used for flags when they're pointing
to pool_workqueues.

Due to historical reasons, unbound pool_workqueues are allocated using
kzalloc() with sufficient buffer area for alignment and aligned
manually. The original pointer is stored at the end and free_pwqs()
retrieves it when freeing.

There's no reason for this hackery anymore. Set alignment of struct
pool_workqueue to 1 << WORK_STRUCT_FLAG_BITS, add kmem_cache for
pool_workqueues with proper alignment and replace the hacky alloc and
free implementation with plain kmem_cache_zalloc/free().

In case WORK_STRUCT_FLAG_BITS gets shrunk too much and makes fields of
pool_workqueues misaligned, trigger WARN if the alignment of struct
pool_workqueue becomes smaller than that of long long.

Note that the assertion on IS_ALIGNED() is removed from alloc_pwqs().
We already have another one in the pwq init loop in
__alloc_workqueue_key().

This patch doesn't introduce any visible behavior changes.
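
The alignment matters because work->data packs a pool_workqueue
pointer together with flag bits in the low WORK_STRUCT_FLAG_BITS; a
sketch using the same macros get_work_pool() relies on later in the
series:

	/* low bits of an aligned pwq pointer are zero and can carry flags */
	unsigned long data = (unsigned long)pwq | WORK_STRUCT_PWQ;
	struct pool_workqueue *p =
		(struct pool_workqueue *)(data & WORK_STRUCT_WQ_DATA_MASK);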

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 43 ++++++++++++-------------------------------
1 file changed, 12 insertions(+), 31 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 61f78ef..69f1268 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -169,7 +169,7 @@ struct pool_workqueue {
int nr_active; /* L: nr of active works */
int max_active; /* L: max active works */
struct list_head delayed_works; /* L: delayed works */
-};
+} __aligned(1 << WORK_STRUCT_FLAG_BITS);

/*
* Structure used to wait for workqueue flush.
@@ -233,6 +233,8 @@ struct workqueue_struct {
char name[]; /* I: workqueue name */
};

+static struct kmem_cache *pwq_cache;
+
struct workqueue_struct *system_wq __read_mostly;
EXPORT_SYMBOL_GPL(system_wq);
struct workqueue_struct *system_highpri_wq __read_mostly;
@@ -3098,34 +3100,11 @@ int keventd_up(void)

static int alloc_pwqs(struct workqueue_struct *wq)
{
- /*
- * pwqs are forced aligned according to WORK_STRUCT_FLAG_BITS.
- * Make sure that the alignment isn't lower than that of
- * unsigned long long.
- */
- const size_t size = sizeof(struct pool_workqueue);
- const size_t align = max_t(size_t, 1 << WORK_STRUCT_FLAG_BITS,
- __alignof__(unsigned long long));
-
if (!(wq->flags & WQ_UNBOUND))
- wq->pool_wq.pcpu = __alloc_percpu(size, align);
- else {
- void *ptr;
-
- /*
- * Allocate enough room to align pwq and put an extra
- * pointer at the end pointing back to the originally
- * allocated pointer which will be used for free.
- */
- ptr = kzalloc(size + align + sizeof(void *), GFP_KERNEL);
- if (ptr) {
- wq->pool_wq.single = PTR_ALIGN(ptr, align);
- *(void **)(wq->pool_wq.single + 1) = ptr;
- }
- }
+ wq->pool_wq.pcpu = alloc_percpu(struct pool_workqueue);
+ else
+ wq->pool_wq.single = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);

- /* just in case, make sure it's actually aligned */
- BUG_ON(!IS_ALIGNED(wq->pool_wq.v, align));
return wq->pool_wq.v ? 0 : -ENOMEM;
}

@@ -3133,10 +3112,8 @@ static void free_pwqs(struct workqueue_struct *wq)
{
if (!(wq->flags & WQ_UNBOUND))
free_percpu(wq->pool_wq.pcpu);
- else if (wq->pool_wq.single) {
- /* the pointer to free is stored right after the pwq */
- kfree(*(void **)(wq->pool_wq.single + 1));
- }
+ else
+ kmem_cache_free(pwq_cache, wq->pool_wq.single);
}

static int wq_clamp_max_active(int max_active, unsigned int flags,
@@ -3737,6 +3714,10 @@ static int __init init_workqueues(void)
BUILD_BUG_ON((1LU << (BITS_PER_LONG - WORK_OFFQ_POOL_SHIFT)) <
WORK_CPU_END * NR_STD_WORKER_POOLS);

+ WARN_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
+
+ pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
+
cpu_notifier(workqueue_cpu_up_callback, CPU_PRI_WORKQUEUE_UP);
hotcpu_notifier(workqueue_cpu_down_callback, CPU_PRI_WORKQUEUE_DOWN);

--
1.8.1.2

2013-03-02 03:24:53

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 02/31] workqueue: make workqueue_lock irq-safe

workqueue_lock will be used to synchronize areas which require
irq-safety and there isn't much benefit in keeping it not irq-safe.
Make it irq-safe.

This patch doesn't introduce any visible behavior changes.
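
With workqueue_lock irq-safe, the nested pool locks in the converted
paths no longer disable interrupts themselves; the resulting
convention (a sketch of the pattern used in the diff below) is:

	spin_lock_irq(&workqueue_lock);		/* irqs off from here */

	for_each_pwq_cpu(cpu, wq) {
		struct worker_pool *pool = get_pwq(cpu, wq)->pool;

		spin_lock(&pool->lock);		/* plain lock, irqs already off */
		/* ... adjust pwq/pool state ... */
		spin_unlock(&pool->lock);
	}

	spin_unlock_irq(&workqueue_lock);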

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 44 ++++++++++++++++++++++----------------------
1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a533e77..61f78ef 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2717,10 +2717,10 @@ void drain_workqueue(struct workqueue_struct *wq)
* hotter than drain_workqueue() and already looks at @wq->flags.
* Use WQ_DRAINING so that queue doesn't have to check nr_drainers.
*/
- spin_lock(&workqueue_lock);
+ spin_lock_irq(&workqueue_lock);
if (!wq->nr_drainers++)
wq->flags |= WQ_DRAINING;
- spin_unlock(&workqueue_lock);
+ spin_unlock_irq(&workqueue_lock);
reflush:
flush_workqueue(wq);

@@ -2742,10 +2742,10 @@ reflush:
goto reflush;
}

- spin_lock(&workqueue_lock);
+ spin_lock_irq(&workqueue_lock);
if (!--wq->nr_drainers)
wq->flags &= ~WQ_DRAINING;
- spin_unlock(&workqueue_lock);
+ spin_unlock_irq(&workqueue_lock);
}
EXPORT_SYMBOL_GPL(drain_workqueue);

@@ -3235,7 +3235,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
* list. Grab it, set max_active accordingly and add the new
* workqueue to workqueues list.
*/
- spin_lock(&workqueue_lock);
+ spin_lock_irq(&workqueue_lock);

if (workqueue_freezing && wq->flags & WQ_FREEZABLE)
for_each_pwq_cpu(cpu, wq)
@@ -3243,7 +3243,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,

list_add(&wq->list, &workqueues);

- spin_unlock(&workqueue_lock);
+ spin_unlock_irq(&workqueue_lock);

return wq;
err:
@@ -3287,9 +3287,9 @@ void destroy_workqueue(struct workqueue_struct *wq)
* wq list is used to freeze wq, remove from list after
* flushing is complete in case freeze races us.
*/
- spin_lock(&workqueue_lock);
+ spin_lock_irq(&workqueue_lock);
list_del(&wq->list);
- spin_unlock(&workqueue_lock);
+ spin_unlock_irq(&workqueue_lock);

if (wq->flags & WQ_RESCUER) {
kthread_stop(wq->rescuer->task);
@@ -3338,7 +3338,7 @@ void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)

max_active = wq_clamp_max_active(max_active, wq->flags, wq->name);

- spin_lock(&workqueue_lock);
+ spin_lock_irq(&workqueue_lock);

wq->saved_max_active = max_active;

@@ -3346,16 +3346,16 @@ void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
struct pool_workqueue *pwq = get_pwq(cpu, wq);
struct worker_pool *pool = pwq->pool;

- spin_lock_irq(&pool->lock);
+ spin_lock(&pool->lock);

if (!(wq->flags & WQ_FREEZABLE) ||
!(pool->flags & POOL_FREEZING))
pwq_set_max_active(pwq, max_active);

- spin_unlock_irq(&pool->lock);
+ spin_unlock(&pool->lock);
}

- spin_unlock(&workqueue_lock);
+ spin_unlock_irq(&workqueue_lock);
}
EXPORT_SYMBOL_GPL(workqueue_set_max_active);

@@ -3602,7 +3602,7 @@ void freeze_workqueues_begin(void)
{
unsigned int cpu;

- spin_lock(&workqueue_lock);
+ spin_lock_irq(&workqueue_lock);

WARN_ON_ONCE(workqueue_freezing);
workqueue_freezing = true;
@@ -3612,7 +3612,7 @@ void freeze_workqueues_begin(void)
struct workqueue_struct *wq;

for_each_std_worker_pool(pool, cpu) {
- spin_lock_irq(&pool->lock);
+ spin_lock(&pool->lock);

WARN_ON_ONCE(pool->flags & POOL_FREEZING);
pool->flags |= POOL_FREEZING;
@@ -3625,11 +3625,11 @@ void freeze_workqueues_begin(void)
pwq->max_active = 0;
}

- spin_unlock_irq(&pool->lock);
+ spin_unlock(&pool->lock);
}
}

- spin_unlock(&workqueue_lock);
+ spin_unlock_irq(&workqueue_lock);
}

/**
@@ -3650,7 +3650,7 @@ bool freeze_workqueues_busy(void)
unsigned int cpu;
bool busy = false;

- spin_lock(&workqueue_lock);
+ spin_lock_irq(&workqueue_lock);

WARN_ON_ONCE(!workqueue_freezing);

@@ -3674,7 +3674,7 @@ bool freeze_workqueues_busy(void)
}
}
out_unlock:
- spin_unlock(&workqueue_lock);
+ spin_unlock_irq(&workqueue_lock);
return busy;
}

@@ -3691,7 +3691,7 @@ void thaw_workqueues(void)
{
unsigned int cpu;

- spin_lock(&workqueue_lock);
+ spin_lock_irq(&workqueue_lock);

if (!workqueue_freezing)
goto out_unlock;
@@ -3701,7 +3701,7 @@ void thaw_workqueues(void)
struct workqueue_struct *wq;

for_each_std_worker_pool(pool, cpu) {
- spin_lock_irq(&pool->lock);
+ spin_lock(&pool->lock);

WARN_ON_ONCE(!(pool->flags & POOL_FREEZING));
pool->flags &= ~POOL_FREEZING;
@@ -3719,13 +3719,13 @@ void thaw_workqueues(void)

wake_up_worker(pool);

- spin_unlock_irq(&pool->lock);
+ spin_unlock(&pool->lock);
}
}

workqueue_freezing = false;
out_unlock:
- spin_unlock(&workqueue_lock);
+ spin_unlock_irq(&workqueue_lock);
}
#endif /* CONFIG_FREEZER */

--
1.8.1.2

2013-03-02 03:31:40

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 14/31] workqueue: replace POOL_MANAGING_WORKERS flag with worker_pool->manager_mutex

POOL_MANAGING_WORKERS is used to synchronize the manager role.
Synchronizing among workers doesn't need blocking and that's why it's
implemented as a flag.

It got converted to a mutex a while back to add blocking wait from CPU
hotplug path - 6037315269 ("workqueue: use mutex for global_cwq
manager exclusion"). Later it turned out that synchronization among
workers and cpu hotplug need to be done separately. Eventually,
POOL_MANAGING_WORKERS is restored and workqueue->manager_mutex got
morphed into workqueue->assoc_mutex - 552a37e936 ("workqueue: restore
POOL_MANAGING_WORKERS") and b2eb83d123 ("workqueue: rename
manager_mutex to assoc_mutex").

Now, we're gonna need to be able to lock out managers from
destroy_workqueue() to support multiple unbound pools with custom
attributes, making it again necessary to be able to block on the
manager role. This patch replaces POOL_MANAGING_WORKERS with
worker_pool->manager_mutex.

This patch doesn't introduce any behavior changes.
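
The flag test becomes a trylock on the worker side, while a future
destruction path can simply block on the mutex; a condensed sketch of
both sides (the blocking caller only appears later in the series):

	/* worker side: only one worker wins the manager role */
	if (!mutex_trylock(&pool->manager_mutex))
		return false;		/* someone else is already managing */
	/* ... create/destroy workers as needed ... */
	mutex_unlock(&pool->manager_mutex);

	/* destruction side: wait until no one holds the manager role */
	mutex_lock(&pool->manager_mutex);
	/* ... tear down all workers ... */
	mutex_unlock(&pool->manager_mutex);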

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2645218..68b3443 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -64,7 +64,6 @@ enum {
* create_worker() is in progress.
*/
POOL_MANAGE_WORKERS = 1 << 0, /* need to manage workers */
- POOL_MANAGING_WORKERS = 1 << 1, /* managing workers */
POOL_DISASSOCIATED = 1 << 2, /* cpu can't serve workers */
POOL_FREEZING = 1 << 3, /* freeze in progress */

@@ -145,6 +144,7 @@ struct worker_pool {
DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
/* L: hash of busy workers */

+ struct mutex manager_mutex; /* the holder is the manager */
struct mutex assoc_mutex; /* protect POOL_DISASSOCIATED */
struct ida worker_ida; /* L: for worker IDs */

@@ -702,7 +702,7 @@ static bool need_to_manage_workers(struct worker_pool *pool)
/* Do we have too many workers and should some go away? */
static bool too_many_workers(struct worker_pool *pool)
{
- bool managing = pool->flags & POOL_MANAGING_WORKERS;
+ bool managing = mutex_is_locked(&pool->manager_mutex);
int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
int nr_busy = pool->nr_workers - nr_idle;

@@ -2027,15 +2027,13 @@ static bool manage_workers(struct worker *worker)
struct worker_pool *pool = worker->pool;
bool ret = false;

- if (pool->flags & POOL_MANAGING_WORKERS)
+ if (!mutex_trylock(&pool->manager_mutex))
return ret;

- pool->flags |= POOL_MANAGING_WORKERS;
-
/*
* To simplify both worker management and CPU hotplug, hold off
* management while hotplug is in progress. CPU hotplug path can't
- * grab %POOL_MANAGING_WORKERS to achieve this because that can
+ * grab @pool->manager_mutex to achieve this because that can
* lead to idle worker depletion (all become busy thinking someone
* else is managing) which in turn can result in deadlock under
* extreme circumstances. Use @pool->assoc_mutex to synchronize
@@ -2075,8 +2073,8 @@ static bool manage_workers(struct worker *worker)
ret |= maybe_destroy_workers(pool);
ret |= maybe_create_worker(pool);

- pool->flags &= ~POOL_MANAGING_WORKERS;
mutex_unlock(&pool->assoc_mutex);
+ mutex_unlock(&pool->manager_mutex);
return ret;
}

@@ -3805,6 +3803,7 @@ static int __init init_workqueues(void)
setup_timer(&pool->mayday_timer, pool_mayday_timeout,
(unsigned long)pool);

+ mutex_init(&pool->manager_mutex);
mutex_init(&pool->assoc_mutex);
ida_init(&pool->worker_ida);

--
1.8.1.2

2013-03-02 03:31:59

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 13/31] workqueue: update synchronization rules on worker_pool_idr

Make worker_pool_idr protected by workqueue_lock for writes and
sched-RCU protected for reads. Lockdep assertions are added to
for_each_pool() and get_work_pool() and all their users are converted
to either hold workqueue_lock or disable preemption/irq.

worker_pool_assign_id() is updated to hold workqueue_lock when
allocating a pool ID. As idr_get_new() always performs RCU-safe
assignment, this is enough on the writer side.

As standard pools are never destroyed, there's nothing to do on that
side.

The locking is superfluous at this point. It is added to help the
implementation of unbound pools/pwqs with custom attributes.

This patch doesn't introduce any behavior changes.
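
A sketch of the resulting read-side convention (mirroring the
converted work_busy() in the diff below); disabling irqs or preemption
is what makes the sched-RCU protected idr_find() in get_work_pool()
safe:

	struct worker_pool *pool;
	unsigned long flags;

	local_irq_save(flags);			/* sched-RCU read-side */
	pool = get_work_pool(work);
	if (pool) {
		spin_lock(&pool->lock);
		/* ... inspect pool state under pool->lock ... */
		spin_unlock(&pool->lock);
	}
	local_irq_restore(flags);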

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 69 ++++++++++++++++++++++++++++++++++--------------------
1 file changed, 44 insertions(+), 25 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ff51c59..2645218 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -282,9 +282,16 @@ static inline int __next_wq_cpu(int cpu, const struct cpumask *mask,
* for_each_pool - iterate through all worker_pools in the system
* @pool: iteration cursor
* @id: integer used for iteration
+ *
+ * This must be called either with workqueue_lock held or sched RCU read
+ * locked. If the pool needs to be used beyond the locking in effect, the
+ * caller is responsible for guaranteeing that the pool stays online.
+ *
+ * The if clause exists only for the lockdep assertion and can be ignored.
*/
#define for_each_pool(pool, id) \
- idr_for_each_entry(&worker_pool_idr, pool, id)
+ idr_for_each_entry(&worker_pool_idr, pool, id) \
+ if (({ assert_rcu_or_wq_lock(); true; }))

/**
* for_each_pwq - iterate through all pool_workqueues of the specified workqueue
@@ -430,8 +437,10 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
cpu_std_worker_pools);
static struct worker_pool unbound_std_worker_pools[NR_STD_WORKER_POOLS];

-/* idr of all pools */
-static DEFINE_MUTEX(worker_pool_idr_mutex);
+/*
+ * idr of all pools. Modifications are protected by workqueue_lock. Read
+ * accesses are protected by sched-RCU protected.
+ */
static DEFINE_IDR(worker_pool_idr);

static int worker_thread(void *__worker);
@@ -454,23 +463,18 @@ static int worker_pool_assign_id(struct worker_pool *pool)
{
int ret;

- mutex_lock(&worker_pool_idr_mutex);
- idr_pre_get(&worker_pool_idr, GFP_KERNEL);
- ret = idr_get_new(&worker_pool_idr, pool, &pool->id);
- mutex_unlock(&worker_pool_idr_mutex);
+ do {
+ if (!idr_pre_get(&worker_pool_idr, GFP_KERNEL))
+ return -ENOMEM;
+
+ spin_lock_irq(&workqueue_lock);
+ ret = idr_get_new(&worker_pool_idr, pool, &pool->id);
+ spin_unlock_irq(&workqueue_lock);
+ } while (ret == -EAGAIN);

return ret;
}

-/*
- * Lookup worker_pool by id. The idr currently is built during boot and
- * never modified. Don't worry about locking for now.
- */
-static struct worker_pool *worker_pool_by_id(int pool_id)
-{
- return idr_find(&worker_pool_idr, pool_id);
-}
-
static struct worker_pool *get_std_worker_pool(int cpu, bool highpri)
{
struct worker_pool *pools = std_worker_pools(cpu);
@@ -584,13 +588,23 @@ static struct pool_workqueue *get_work_pwq(struct work_struct *work)
* @work: the work item of interest
*
* Return the worker_pool @work was last associated with. %NULL if none.
+ *
+ * Pools are created and destroyed under workqueue_lock, and allows read
+ * access under sched-RCU read lock. As such, this function should be
+ * called under workqueue_lock or with preemption disabled.
+ *
+ * All fields of the returned pool are accessible as long as the above
+ * mentioned locking is in effect. If the returned pool needs to be used
+ * beyond the critical section, the caller is responsible for ensuring the
+ * returned pool is and stays online.
*/
static struct worker_pool *get_work_pool(struct work_struct *work)
{
unsigned long data = atomic_long_read(&work->data);
- struct worker_pool *pool;
int pool_id;

+ assert_rcu_or_wq_lock();
+
if (data & WORK_STRUCT_PWQ)
return ((struct pool_workqueue *)
(data & WORK_STRUCT_WQ_DATA_MASK))->pool;
@@ -599,9 +613,7 @@ static struct worker_pool *get_work_pool(struct work_struct *work)
if (pool_id == WORK_OFFQ_POOL_NONE)
return NULL;

- pool = worker_pool_by_id(pool_id);
- WARN_ON_ONCE(!pool);
- return pool;
+ return idr_find(&worker_pool_idr, pool_id);
}

/**
@@ -2767,11 +2779,15 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr)
struct pool_workqueue *pwq;

might_sleep();
+
+ local_irq_disable();
pool = get_work_pool(work);
- if (!pool)
+ if (!pool) {
+ local_irq_enable();
return false;
+ }

- spin_lock_irq(&pool->lock);
+ spin_lock(&pool->lock);
/* see the comment in try_to_grab_pending() with the same code */
pwq = get_work_pwq(work);
if (pwq) {
@@ -3414,19 +3430,22 @@ EXPORT_SYMBOL_GPL(workqueue_congested);
*/
unsigned int work_busy(struct work_struct *work)
{
- struct worker_pool *pool = get_work_pool(work);
+ struct worker_pool *pool;
unsigned long flags;
unsigned int ret = 0;

if (work_pending(work))
ret |= WORK_BUSY_PENDING;

+ local_irq_save(flags);
+ pool = get_work_pool(work);
if (pool) {
- spin_lock_irqsave(&pool->lock, flags);
+ spin_lock(&pool->lock);
if (find_worker_executing_work(pool, work))
ret |= WORK_BUSY_RUNNING;
- spin_unlock_irqrestore(&pool->lock, flags);
+ spin_unlock(&pool->lock);
}
+ local_irq_restore(flags);

return ret;
}
--
1.8.1.2

2013-03-02 18:17:15

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 30/31] driver/base: implement subsys_virtual_register()

On Fri, Mar 01, 2013 at 07:24:21PM -0800, Tejun Heo wrote:
> Kay tells me the most appropriate place to expose workqueues to
> userland would be /sys/devices/virtual/workqueues/WQ_NAME which is
> symlinked to /sys/bus/workqueue/devices/WQ_NAME and that we're lacking
> a way to do that outside of driver core as virtual_device_parent()
> isn't exported and there's no inteface to conveniently create a
> virtual subsystem.

I'm almost afraid to ask what you want to export to userspace for a
workqueue that userspace would care about...

If you create a subsystem, the devices will show up under the virtual
"bus" if you don't give them a parent, so this patch shouldn't be
needed, unless you are abusing the driver model. What am I missing
here?

thanks,

greg k-h

2013-03-02 20:26:21

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 30/31] driver/base: implement subsys_virtual_register()

Hello, Greg.

On Sat, Mar 02, 2013 at 10:17:27AM -0800, Greg Kroah-Hartman wrote:
> I'm almost afraid to ask what you want to export to userspace for a
> workqueue that userspace would care about...

Workqueue is being extended to support worker pools with custom
attributes so that it can replace private worker pool implementations
in writeback, btrfs and other places. They want to expose
per-workqueue tunables - nice level, cpu affinity and cgroup
association of those IO threads - to userland, so that's where the
sysfs interface comes in. The export is opt-in and those workqueues
should have well defined names.

> If you create a subsystem, the devices will show up under the virtual
> "bus" if you don't give them a parent, so this patch shouldn't be
> needed, unless you are abusing the driver model. What am I missing
> here?

If you don't give it a parent, it ends up in /sys/devices/ with
symlinks appearing under /sys/bus/. I didn't know where to put it.
Right under /sys/devices/ doesn't seem right to me. We already have
one like that, /sys/devices/software/, which actually is for perf
software events, and it just feels weird. I was wondering where to
put it and Kay told me that /sys/devices/virtual/ would be the best
fit although we don't yet have a proper interface for it, hence the
patch. I don't really care where it shows up, and it apparently
shouldn't matter to userland as long as the /sys/bus/ part is there,
but /sys/devices/virtual/ seems to be the best fit at the moment.

Thanks.

--
tejun

2013-03-03 06:42:54

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH 30/31] driver/base: implement subsys_virtual_register()

On Sat, Mar 2, 2013 at 7:17 PM, Greg Kroah-Hartman
<[email protected]> wrote:
> On Fri, Mar 01, 2013 at 07:24:21PM -0800, Tejun Heo wrote:
>> Kay tells me the most appropriate place to expose workqueues to
>> userland would be /sys/devices/virtual/workqueues/WQ_NAME which is
>> symlinked to /sys/bus/workqueue/devices/WQ_NAME and that we're lacking
>> a way to do that outside of driver core as virtual_device_parent()
>> isn't exported and there's no interface to conveniently create a
>> virtual subsystem.
>
> I'm almost afraid to ask what you want to export to userspace for a
> workqueue that userspace would care about...
>
> If you create a subsystem, the devices will show up under the virtual
> "bus" if you don't give them a parent, so this patch shouldn't be
> needed, unless you are abusing the driver model. What am I missing
> here?

Unfortunately, the parent == NULL --> /sys/devices/virtual/<subsys>/
mapping has only been implemented for classes, not for buses. We
should fix that.

Kay

2013-03-04 18:30:10

by Tejun Heo

[permalink] [raw]
Subject: [PATCH UPDATED 28/31] workqueue: reject adjusting max_active or applying attrs to ordered workqueues

Adjusting max_active of or applying new workqueue_attrs to an ordered
workqueue breaks its ordering guarantee. The former is obvious. The
latter is because applying attrs creates a new pwq (pool_workqueue)
and there is no ordering constraint between the old and new pwqs.

Make apply_workqueue_attrs() and workqueue_set_max_active() trigger
WARN_ON() if those operations are requested on an ordered workqueue
and fail / ignore respectively.
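
In other words, with an ordered workqueue the two operations now
behave as follows (sketch; "example_ordered" is just an illustrative
name and @attrs is assumed to be prepared by the caller):

	wq = alloc_ordered_workqueue("example_ordered", 0);

	ret = apply_workqueue_attrs(wq, attrs);	/* WARNs and returns -EINVAL */
	workqueue_set_max_active(wq, 16);	/* WARNs and is ignored */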

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 3 ++-
kernel/workqueue.c | 9 +++++++++
2 files changed, 11 insertions(+), 1 deletion(-)

--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -294,6 +294,7 @@ enum {
WQ_CPU_INTENSIVE = 1 << 5, /* cpu instensive workqueue */

__WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */
+ __WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */

WQ_MAX_ACTIVE = 512, /* I like 512, better ideas? */
WQ_MAX_UNBOUND_PER_CPU = 4, /* 4 * #cpus for unbound wq */
@@ -396,7 +397,7 @@ __alloc_workqueue_key(const char *fmt, u
* Pointer to the allocated workqueue on success, %NULL on failure.
*/
#define alloc_ordered_workqueue(fmt, flags, args...) \
- alloc_workqueue(fmt, WQ_UNBOUND | (flags), 1, ##args)
+ alloc_workqueue(fmt, WQ_UNBOUND | __WQ_ORDERED | (flags), 1, ##args)

#define create_workqueue(name) \
alloc_workqueue((name), WQ_MEM_RECLAIM, 1)
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3487,9 +3487,14 @@ int apply_workqueue_attrs(struct workque
struct pool_workqueue *pwq, *last_pwq;
struct worker_pool *pool;

+ /* only unbound workqueues can change attributes */
if (WARN_ON(!(wq->flags & WQ_UNBOUND)))
return -EINVAL;

+ /* creating multiple pwqs breaks ordering guarantee */
+ if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
+ return -EINVAL;
+
pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
if (!pwq)
return -ENOMEM;
@@ -3745,6 +3750,10 @@ void workqueue_set_max_active(struct wor
{
struct pool_workqueue *pwq;

+ /* disallow meddling with max_active for ordered workqueues */
+ if (WARN_ON(wq->flags & __WQ_ORDERED))
+ return;
+
max_active = wq_clamp_max_active(max_active, wq->flags, wq->name);

spin_lock_irq(&workqueue_lock);

2013-03-04 18:30:35

by Tejun Heo

[permalink] [raw]
Subject: [PATCH v2 31/31] workqueue: implement sysfs interface for workqueues

There are cases where workqueue users want to expose control knobs to
userland. e.g. Unbound workqueues with custom attributes are
scheduled to be used for writeback workers and depending on
configuration it can be useful to allow admins to tinker with the
priority or allowed CPUs.

This patch implements workqueue_sysfs_register(), which makes the
workqueue visible under /sys/bus/workqueue/devices/WQ_NAME. There
currently are two attributes common to both per-cpu and unbound pools
and extra attributes for unbound pools including nice level and
cpumask.

If alloc_workqueue*() is called with WQ_SYSFS,
workqueue_sysfs_register() is called automatically as part of
workqueue creation. This is the preferred method unless the workqueue
user wants to apply workqueue_attrs before making the workqueue
visible to userland.

v2: Disallow exposing ordered workqueues as ordered workqueues can't
be tuned in any way.
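
A usage sketch of the two registration paths described above (the
workqueue name is illustrative and @attrs is assumed to be prepared by
the caller):

	/* 1) automatic: WQ_SYSFS makes the wq visible at creation time */
	wq = alloc_workqueue("example_wb", WQ_UNBOUND | WQ_SYSFS, 0);

	/* 2) manual: apply attrs first, then expose the knobs */
	wq = alloc_workqueue("example_wb", WQ_UNBOUND, 0);
	if (wq) {
		apply_workqueue_attrs(wq, attrs);
		workqueue_sysfs_register(wq);
		/* tunables now under /sys/bus/workqueue/devices/example_wb/ */
	}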

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 8 +
kernel/workqueue.c | 288 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 296 insertions(+)

--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -292,6 +292,7 @@ enum {
WQ_MEM_RECLAIM = 1 << 3, /* may be used for memory reclaim */
WQ_HIGHPRI = 1 << 4, /* high priority */
WQ_CPU_INTENSIVE = 1 << 5, /* cpu instensive workqueue */
+ WQ_SYSFS = 1 << 6, /* visible in sysfs, see wq_sysfs_register() */

__WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */
__WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */
@@ -494,4 +495,11 @@ extern bool freeze_workqueues_busy(void)
extern void thaw_workqueues(void);
#endif /* CONFIG_FREEZER */

+#ifdef CONFIG_SYSFS
+int workqueue_sysfs_register(struct workqueue_struct *wq);
+#else /* CONFIG_SYSFS */
+static inline int workqueue_sysfs_register(struct workqueue_struct *wq)
+{ return 0; }
+#endif /* CONFIG_SYSFS */
+
#endif
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -210,6 +210,8 @@ struct wq_flusher {
struct completion done; /* flush completion */
};

+struct wq_device;
+
/*
* The externally visible workqueue abstraction is an array of
* per-CPU workqueues:
@@ -233,6 +235,10 @@ struct workqueue_struct {

int nr_drainers; /* W: drain in progress */
int saved_max_active; /* W: saved pwq max_active */
+
+#ifdef CONFIG_SYSFS
+ struct wq_device *wq_dev; /* I: for sysfs interface */
+#endif
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
@@ -438,6 +444,8 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(str
static DEFINE_IDR(worker_pool_idr);

static int worker_thread(void *__worker);
+static void copy_workqueue_attrs(struct workqueue_attrs *to,
+ const struct workqueue_attrs *from);

/* allocate ID and assign it to @pool */
static int worker_pool_assign_id(struct worker_pool *pool)
@@ -3151,6 +3159,281 @@ int keventd_up(void)
return system_wq != NULL;
}

+#ifdef CONFIG_SYSFS
+/*
+ * Workqueues with WQ_SYSFS flag set is visible to userland via
+ * /sys/bus/workqueue/devices/WQ_NAME. All visible workqueues have the
+ * following attributes.
+ *
+ * per_cpu RO bool : whether the workqueue is per-cpu or unbound
+ * max_active RW int : maximum number of in-flight work items
+ *
+ * Unbound workqueues have the following extra attributes.
+ *
+ * id RO int : the associated pool ID
+ * nice RW int : nice value of the workers
+ * cpumask RW mask : bitmask of allowed CPUs for the workers
+ */
+struct wq_device {
+ struct workqueue_struct *wq;
+ struct device dev;
+};
+
+static struct workqueue_struct *dev_to_wq(struct device *dev)
+{
+ struct wq_device *wq_dev = container_of(dev, struct wq_device, dev);
+
+ return wq_dev->wq;
+}
+
+static ssize_t wq_per_cpu_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+
+ return scnprintf(buf, PAGE_SIZE, "%d\n", (bool)!(wq->flags & WQ_UNBOUND));
+}
+
+static ssize_t wq_max_active_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+
+ return scnprintf(buf, PAGE_SIZE, "%d\n", wq->saved_max_active);
+}
+
+static ssize_t wq_max_active_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ int val;
+
+ if (sscanf(buf, "%d", &val) != 1 || val <= 0)
+ return -EINVAL;
+
+ workqueue_set_max_active(wq, val);
+ return count;
+}
+
+static struct device_attribute wq_sysfs_attrs[] = {
+ __ATTR(per_cpu, 0444, wq_per_cpu_show, NULL),
+ __ATTR(max_active, 0644, wq_max_active_show, wq_max_active_store),
+ __ATTR_NULL,
+};
+
+static ssize_t wq_pool_id_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ struct worker_pool *pool;
+ int written;
+
+ rcu_read_lock_sched();
+ pool = first_pwq(wq)->pool;
+ written = scnprintf(buf, PAGE_SIZE, "%d\n", pool->id);
+ rcu_read_unlock_sched();
+
+ return written;
+}
+
+static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ int written;
+
+ rcu_read_lock_sched();
+ written = scnprintf(buf, PAGE_SIZE, "%d\n",
+ first_pwq(wq)->pool->attrs->nice);
+ rcu_read_unlock_sched();
+
+ return written;
+}
+
+/* prepare workqueue_attrs for sysfs store operations */
+static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq)
+{
+ struct workqueue_attrs *attrs;
+
+ attrs = alloc_workqueue_attrs(GFP_KERNEL);
+ if (!attrs)
+ return NULL;
+
+ rcu_read_lock_sched();
+ copy_workqueue_attrs(attrs, first_pwq(wq)->pool->attrs);
+ rcu_read_unlock_sched();
+ return attrs;
+}
+
+static ssize_t wq_nice_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ struct workqueue_attrs *attrs;
+ int ret;
+
+ attrs = wq_sysfs_prep_attrs(wq);
+ if (!attrs)
+ return -ENOMEM;
+
+ if (sscanf(buf, "%d", &attrs->nice) == 1 &&
+ attrs->nice >= -20 && attrs->nice <= 19)
+ ret = apply_workqueue_attrs(wq, attrs);
+ else
+ ret = -EINVAL;
+
+ free_workqueue_attrs(attrs);
+ return ret ?: count;
+}
+
+static ssize_t wq_cpumask_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ int written;
+
+ rcu_read_lock_sched();
+ written = cpumask_scnprintf(buf, PAGE_SIZE,
+ first_pwq(wq)->pool->attrs->cpumask);
+ rcu_read_unlock_sched();
+
+ written += scnprintf(buf + written, PAGE_SIZE - written, "\n");
+ return written;
+}
+
+static ssize_t wq_cpumask_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ struct workqueue_attrs *attrs;
+ int ret;
+
+ attrs = wq_sysfs_prep_attrs(wq);
+ if (!attrs)
+ return -ENOMEM;
+
+ ret = cpumask_parse(buf, attrs->cpumask);
+ if (!ret)
+ ret = apply_workqueue_attrs(wq, attrs);
+
+ free_workqueue_attrs(attrs);
+ return ret ?: count;
+}
+
+static struct device_attribute wq_sysfs_unbound_attrs[] = {
+ __ATTR(pool_id, 0444, wq_pool_id_show, NULL),
+ __ATTR(nice, 0644, wq_nice_show, wq_nice_store),
+ __ATTR(cpumask, 0644, wq_cpumask_show, wq_cpumask_store),
+ __ATTR_NULL,
+};
+
+static struct bus_type wq_subsys = {
+ .name = "workqueue",
+ .dev_attrs = wq_sysfs_attrs,
+};
+
+static int __init wq_sysfs_init(void)
+{
+ return subsys_virtual_register(&wq_subsys, NULL);
+}
+core_initcall(wq_sysfs_init);
+
+static void wq_device_release(struct device *dev)
+{
+ struct wq_device *wq_dev = container_of(dev, struct wq_device, dev);
+
+ kfree(wq_dev);
+}
+
+/**
+ * workqueue_sysfs_register - make a workqueue visible in sysfs
+ * @wq: the workqueue to register
+ *
+ * Expose @wq in sysfs under /sys/bus/workqueue/devices.
+ * alloc_workqueue*() automatically calls this function if WQ_SYSFS is set
+ * which is the preferred method.
+ *
+ * Workqueue user should use this function directly iff it wants to apply
+ * workqueue_attrs before making the workqueue visible in sysfs; otherwise,
+ * apply_workqueue_attrs() may race against userland updating the
+ * attributes.
+ *
+ * Returns 0 on success, -errno on failure.
+ */
+int workqueue_sysfs_register(struct workqueue_struct *wq)
+{
+ struct wq_device *wq_dev;
+ int ret;
+
+ /*
+ * Adjusting max_active or creating new pwqs by applying
+ * attributes breaks ordering guarantee. Disallow exposing ordered
+ * workqueues.
+ */
+ if (WARN_ON(wq->flags & __WQ_ORDERED))
+ return -EINVAL;
+
+ wq->wq_dev = wq_dev = kzalloc(sizeof(*wq_dev), GFP_KERNEL);
+ if (!wq_dev)
+ return -ENOMEM;
+
+ wq_dev->wq = wq;
+ wq_dev->dev.bus = &wq_subsys;
+ wq_dev->dev.init_name = wq->name;
+ wq_dev->dev.release = wq_device_release;
+
+ /*
+ * unbound_attrs are created separately. Suppress uevent until
+ * everything is ready.
+ */
+ dev_set_uevent_suppress(&wq_dev->dev, true);
+
+ ret = device_register(&wq_dev->dev);
+ if (ret) {
+ kfree(wq_dev);
+ wq->wq_dev = NULL;
+ return ret;
+ }
+
+ if (wq->flags & WQ_UNBOUND) {
+ struct device_attribute *attr;
+
+ for (attr = wq_sysfs_unbound_attrs; attr->attr.name; attr++) {
+ ret = device_create_file(&wq_dev->dev, attr);
+ if (ret) {
+ device_unregister(&wq_dev->dev);
+ wq->wq_dev = NULL;
+ return ret;
+ }
+ }
+ }
+
+ kobject_uevent(&wq_dev->dev.kobj, KOBJ_ADD);
+ return 0;
+}
+
+/**
+ * workqueue_sysfs_unregister - undo workqueue_sysfs_register()
+ * @wq: the workqueue to unregister
+ *
+ * If @wq is registered to sysfs by workqueue_sysfs_register(), unregister.
+ */
+static void workqueue_sysfs_unregister(struct workqueue_struct *wq)
+{
+ struct wq_device *wq_dev = wq->wq_dev;
+
+ if (!wq->wq_dev)
+ return;
+
+ wq->wq_dev = NULL;
+ device_unregister(&wq_dev->dev);
+}
+#else /* CONFIG_SYSFS */
+static void workqueue_sysfs_unregister(struct workqueue_struct *wq) { }
+#endif /* CONFIG_SYSFS */
+
/**
* free_workqueue_attrs - free a workqueue_attrs
* @attrs: workqueue_attrs to free
@@ -3618,6 +3901,9 @@ struct workqueue_struct *__alloc_workque
wake_up_process(rescuer->task);
}

+ if ((wq->flags & WQ_SYSFS) && workqueue_sysfs_register(wq))
+ goto err_destroy;
+
/*
* workqueue_lock protects global freeze state and workqueues
* list. Grab it, set max_active accordingly and add the new
@@ -3686,6 +3972,8 @@ void destroy_workqueue(struct workqueue_

spin_unlock_irq(&workqueue_lock);

+ workqueue_sysfs_unregister(wq);
+
if (wq->rescuer) {
kthread_stop(wq->rescuer->task);
kfree(wq->rescuer);
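
To make the intended usage concrete, here is a rough driver-side sketch of the
ordering described in the workqueue_sysfs_register() comment above (not part of
the patch; the names example_wq/example_init and the attribute values are made
up): apply the initial attributes first, then register the workqueue in sysfs
so userland cannot race against apply_workqueue_attrs().

static struct workqueue_struct *example_wq;	/* hypothetical */

static int __init example_init(void)
{
	struct workqueue_attrs *attrs;
	int ret;

	/* no WQ_SYSFS here, we register explicitly after applying attrs */
	example_wq = alloc_workqueue("example", WQ_UNBOUND, 0);
	if (!example_wq)
		return -ENOMEM;

	attrs = alloc_workqueue_attrs(GFP_KERNEL);
	if (!attrs) {
		ret = -ENOMEM;
		goto err;
	}

	attrs->nice = -5;				/* made-up nice level */
	cpumask_copy(attrs->cpumask, cpumask_of(0));	/* made-up cpumask */

	ret = apply_workqueue_attrs(example_wq, attrs);
	free_workqueue_attrs(attrs);
	if (ret)
		goto err;

	/* now expose the knobs under /sys/bus/workqueue/devices/example/ */
	ret = workqueue_sysfs_register(example_wq);
	if (ret)
		goto err;
	return 0;

err:
	destroy_workqueue(example_wq);
	return ret;
}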

2013-03-04 18:37:48

by Tejun Heo

[permalink] [raw]
Subject: [PATCH v2 16/31] workqueue: introduce workqueue_attrs

Introduce struct workqueue_attrs which carries worker attributes -
currently the nice level and allowed cpumask along with helper
routines alloc_workqueue_attrs() and free_workqueue_attrs().

Each worker_pool now carries ->attrs describing the attributes of its
workers. All functions dealing with cpumask and nice level of workers
are updated to follow worker_pool->attrs instead of determining them
from other characteristics of the worker_pool, and init_workqueues()
is updated to set worker_pool->attrs appropriately for all standard
pools.

Note that create_worker() is updated to always perform set_user_nice()
and use set_cpus_allowed_ptr() combined with manual assertion of
PF_THREAD_BOUND instead of kthread_bind(). This simplifies handling
random attributes without affecting the outcome.

This patch doesn't introduce any behavior changes.

v2: Missing cpumask_var_t definition caused build failure on some
archs. linux/cpumask.h included.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: kbuild test robot <[email protected]>
---
include/linux/workqueue.h | 13 +++++
kernel/workqueue.c | 103 ++++++++++++++++++++++++++++++++++++----------
2 files changed, 94 insertions(+), 22 deletions(-)

--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -11,6 +11,7 @@
#include <linux/lockdep.h>
#include <linux/threads.h>
#include <linux/atomic.h>
+#include <linux/cpumask.h>

struct workqueue_struct;

@@ -115,6 +116,15 @@ struct delayed_work {
int cpu;
};

+/*
+ * A struct for workqueue attributes. This can be used to change
+ * attributes of an unbound workqueue.
+ */
+struct workqueue_attrs {
+ int nice; /* nice level */
+ cpumask_var_t cpumask; /* allowed CPUs */
+};
+
static inline struct delayed_work *to_delayed_work(struct work_struct *work)
{
return container_of(work, struct delayed_work, work);
@@ -399,6 +409,9 @@ __alloc_workqueue_key(const char *fmt, u

extern void destroy_workqueue(struct workqueue_struct *wq);

+struct workqueue_attrs *alloc_workqueue_attrs(gfp_t gfp_mask);
+void free_workqueue_attrs(struct workqueue_attrs *attrs);
+
extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
struct work_struct *work);
extern bool queue_work(struct workqueue_struct *wq, struct work_struct *work);
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -148,6 +148,8 @@ struct worker_pool {
struct mutex assoc_mutex; /* protect POOL_DISASSOCIATED */
struct ida worker_ida; /* L: for worker IDs */

+ struct workqueue_attrs *attrs; /* I: worker attributes */
+
/*
* The current concurrency level. As it's likely to be accessed
* from other CPUs during try_to_wake_up(), put it in a separate
@@ -1563,14 +1565,13 @@ __acquires(&pool->lock)
* against POOL_DISASSOCIATED.
*/
if (!(pool->flags & POOL_DISASSOCIATED))
- set_cpus_allowed_ptr(current, get_cpu_mask(pool->cpu));
+ set_cpus_allowed_ptr(current, pool->attrs->cpumask);

spin_lock_irq(&pool->lock);
if (pool->flags & POOL_DISASSOCIATED)
return false;
if (task_cpu(current) == pool->cpu &&
- cpumask_equal(&current->cpus_allowed,
- get_cpu_mask(pool->cpu)))
+ cpumask_equal(&current->cpus_allowed, pool->attrs->cpumask))
return true;
spin_unlock_irq(&pool->lock);

@@ -1677,7 +1678,7 @@ static void rebind_workers(struct worker
* wq doesn't really matter but let's keep @worker->pool
* and @pwq->pool consistent for sanity.
*/
- if (std_worker_pool_pri(worker->pool))
+ if (worker->pool->attrs->nice < 0)
wq = system_highpri_wq;
else
wq = system_wq;
@@ -1719,7 +1720,7 @@ static struct worker *alloc_worker(void)
*/
static struct worker *create_worker(struct worker_pool *pool)
{
- const char *pri = std_worker_pool_pri(pool) ? "H" : "";
+ const char *pri = pool->attrs->nice < 0 ? "H" : "";
struct worker *worker = NULL;
int id = -1;

@@ -1749,24 +1750,23 @@ static struct worker *create_worker(stru
if (IS_ERR(worker->task))
goto fail;

- if (std_worker_pool_pri(pool))
- set_user_nice(worker->task, HIGHPRI_NICE_LEVEL);
+ set_user_nice(worker->task, pool->attrs->nice);
+ set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);

/*
- * Determine CPU binding of the new worker depending on
- * %POOL_DISASSOCIATED. The caller is responsible for ensuring the
- * flag remains stable across this function. See the comments
- * above the flag definition for details.
- *
- * As an unbound worker may later become a regular one if CPU comes
- * online, make sure every worker has %PF_THREAD_BOUND set.
+ * %PF_THREAD_BOUND is used to prevent userland from meddling with
+ * cpumask of workqueue workers. This is an abuse. We need
+ * %PF_KERNEL_CPUMASK.
*/
- if (!(pool->flags & POOL_DISASSOCIATED)) {
- kthread_bind(worker->task, pool->cpu);
- } else {
- worker->task->flags |= PF_THREAD_BOUND;
+ worker->task->flags |= PF_THREAD_BOUND;
+
+ /*
+ * The caller is responsible for ensuring %POOL_DISASSOCIATED
+ * remains stable across this function. See the comments above the
+ * flag definition for details.
+ */
+ if (pool->flags & POOL_DISASSOCIATED)
worker->flags |= WORKER_UNBOUND;
- }

return worker;
fail:
@@ -3121,7 +3121,52 @@ int keventd_up(void)
return system_wq != NULL;
}

-static void init_worker_pool(struct worker_pool *pool)
+/**
+ * free_workqueue_attrs - free a workqueue_attrs
+ * @attrs: workqueue_attrs to free
+ *
+ * Undo alloc_workqueue_attrs().
+ */
+void free_workqueue_attrs(struct workqueue_attrs *attrs)
+{
+ if (attrs) {
+ free_cpumask_var(attrs->cpumask);
+ kfree(attrs);
+ }
+}
+
+/**
+ * alloc_workqueue_attrs - allocate a workqueue_attrs
+ * @gfp_mask: allocation mask to use
+ *
+ * Allocate a new workqueue_attrs, initialize with default settings and
+ * return it. Returns NULL on failure.
+ */
+struct workqueue_attrs *alloc_workqueue_attrs(gfp_t gfp_mask)
+{
+ struct workqueue_attrs *attrs;
+
+ attrs = kzalloc(sizeof(*attrs), gfp_mask);
+ if (!attrs)
+ goto fail;
+ if (!alloc_cpumask_var(&attrs->cpumask, gfp_mask))
+ goto fail;
+
+ cpumask_setall(attrs->cpumask);
+ return attrs;
+fail:
+ free_workqueue_attrs(attrs);
+ return NULL;
+}
+
+/**
+ * init_worker_pool - initialize a newly zalloc'd worker_pool
+ * @pool: worker_pool to initialize
+ *
+ * Initialize a newly zalloc'd @pool. It also allocates @pool->attrs.
+ * Returns 0 on success, -errno on failure.
+ */
+static int init_worker_pool(struct worker_pool *pool)
{
spin_lock_init(&pool->lock);
pool->flags |= POOL_DISASSOCIATED;
@@ -3139,6 +3184,11 @@ static void init_worker_pool(struct work
mutex_init(&pool->manager_mutex);
mutex_init(&pool->assoc_mutex);
ida_init(&pool->worker_ida);
+
+ pool->attrs = alloc_workqueue_attrs(GFP_KERNEL);
+ if (!pool->attrs)
+ return -ENOMEM;
+ return 0;
}

static int alloc_and_link_pwqs(struct workqueue_struct *wq)
@@ -3791,7 +3841,8 @@ out_unlock:

static int __init init_workqueues(void)
{
- int cpu;
+ int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
+ int i, cpu;

/* make sure we have enough bits for OFFQ pool ID */
BUILD_BUG_ON((1LU << (BITS_PER_LONG - WORK_OFFQ_POOL_SHIFT)) <
@@ -3808,10 +3859,18 @@ static int __init init_workqueues(void)
for_each_wq_cpu(cpu) {
struct worker_pool *pool;

+ i = 0;
for_each_std_worker_pool(pool, cpu) {
- init_worker_pool(pool);
+ BUG_ON(init_worker_pool(pool));
pool->cpu = cpu;

+ if (cpu != WORK_CPU_UNBOUND)
+ cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
+ else
+ cpumask_setall(pool->attrs->cpumask);
+
+ pool->attrs->nice = std_nice[i++];
+
/* alloc pool ID */
BUG_ON(worker_pool_assign_id(pool));
}
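
As a small illustration of the new helpers (a hypothetical caller, not code
from this patch; example_make_attrs is a made-up name): alloc_workqueue_attrs()
hands back attrs with nice 0 and an all-set cpumask, the fields can then be
adjusted, and the result is released with free_workqueue_attrs().

static int example_make_attrs(void)
{
	struct workqueue_attrs *attrs;

	attrs = alloc_workqueue_attrs(GFP_KERNEL);
	if (!attrs)
		return -ENOMEM;

	/* defaults are nice == 0 and an all-set cpumask; adjust as needed */
	attrs->nice = HIGHPRI_NICE_LEVEL;
	cpumask_copy(attrs->cpumask, cpu_online_mask);

	/*
	 * ... hand attrs to whatever consumes them; apply_workqueue_attrs()
	 * only shows up later in the series ...
	 */

	free_workqueue_attrs(attrs);
	return 0;
}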

2013-03-05 20:41:54

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-3.10-tmp] workqueue: implement workqueue with custom worker attributes

On Fri, Mar 01, 2013 at 07:23:51PM -0800, Tejun Heo wrote:
> which is scheduled to be rebased on top of v3.9-rc1 once it comes out.
> The changes are also available in the following git branch.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-attrs

Branch rebased on top of wq/for-3.10 with the updated patches.

Lai, I'd really appreciate if you can go over the patches.

Thanks.

--
tejun

2013-03-05 20:43:34

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 30/31] driver/base: implement subsys_virtual_register()

On Sun, Mar 03, 2013 at 07:42:31AM +0100, Kay Sievers wrote:
> On Sat, Mar 2, 2013 at 7:17 PM, Greg Kroah-Hartman
> <[email protected]> wrote:
> > On Fri, Mar 01, 2013 at 07:24:21PM -0800, Tejun Heo wrote:
> >> Kay tells me the most appropriate place to expose workqueues to
> >> userland would be /sys/devices/virtual/workqueues/WQ_NAME which is
> >> symlinked to /sys/bus/workqueue/devices/WQ_NAME and that we're lacking
> >> a way to do that outside of driver core as virtual_device_parent()
> >> isn't exported and there's no inteface to conveniently create a
> >> virtual subsystem.
> >
> > I'm almost afraid to ask what you want to export to userspace for a
> > workqueue that userspace would care about...
> >
> > If you create a subsystem, the devices will show up under the virtual
> > "bus" if you don't give them a parent, so this patch shouldn't be
> > needed, unless you are abusing the driver model. What am I missing
> > here?
>
> Unfortunately, the parent == NULL --> /sys/devices/virtual/<subsys>/
> we have only implemented for classes, and not for buses. We should fix
> that.

Greg, how should I proceed on this? As I wrote before, I don't really
care about where or how. As long as I can make workqueues visible to
userland, I'm happy.

Thanks.

--
tejun

2013-03-05 22:29:50

by Ryan Mallon

[permalink] [raw]
Subject: Re: [PATCH v2 16/31] workqueue: introduce workqueue_attrs

On 05/03/13 05:37, Tejun Heo wrote:
> Introduce struct workqueue_attrs which carries worker attributes -
> currently the nice level and allowed cpumask along with helper
> routines alloc_workqueue_attrs() and free_workqueue_attrs().
>
> Each worker_pool now carries ->attrs describing the attributes of its
> workers. All functions dealing with cpumask and nice level of workers
> are updated to follow worker_pool->attrs instead of determining them
> from other characteristics of the worker_pool, and init_workqueues()
> is updated to set worker_pool->attrs appropriately for all standard
> pools.
>
> Note that create_worker() is updated to always perform set_user_nice()
> and use set_cpus_allowed_ptr() combined with manual assertion of
> PF_THREAD_BOUND instead of kthread_bind(). This simplifies handling
> random attributes without affecting the outcome.
>
> This patch doesn't introduce any behavior changes.
>
> v2: Missing cpumask_var_t definition caused build failure on some
> archs. linux/cpumask.h included.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Reported-by: kbuild test robot <[email protected]>

> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -148,6 +148,8 @@ struct worker_pool {
> struct mutex assoc_mutex; /* protect POOL_DISASSOCIATED */
> struct ida worker_ida; /* L: for worker IDs */
>
> + struct workqueue_attrs *attrs; /* I: worker attributes */

If attrs always exists, why not just embed the struct and avoid the
need to alloc/free it?

~Ryan

2013-03-05 22:33:33

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 16/31] workqueue: introduce workqueue_attrs

Hello, Ryan.

On Wed, Mar 06, 2013 at 09:29:35AM +1100, Ryan Mallon wrote:
> > @@ -148,6 +148,8 @@ struct worker_pool {
> > struct mutex assoc_mutex; /* protect POOL_DISASSOCIATED */
> > struct ida worker_ida; /* L: for worker IDs */
> >
> > + struct workqueue_attrs *attrs; /* I: worker attributes */
>
> If attrs always exists, why not just embed the struct and avoid the
> need to alloc/free it?

Because then it'll need a separate init path for embedded ones. If
the field were in any way hot, I'd have embedded it, but it isn't, and
it's just less code to share the alloc path.

Thanks.

--
tejun

2013-03-05 22:34:40

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 16/31] workqueue: introduce workqueue_attrs

On Tue, Mar 05, 2013 at 02:33:27PM -0800, Tejun Heo wrote:
> Hello, Ryan.
>
> On Wed, Mar 06, 2013 at 09:29:35AM +1100, Ryan Mallon wrote:
> > > @@ -148,6 +148,8 @@ struct worker_pool {
> > > struct mutex assoc_mutex; /* protect POOL_DISASSOCIATED */
> > > struct ida worker_ida; /* L: for worker IDs */
> > >
> > > + struct workqueue_attrs *attrs; /* I: worker attributes */
> >
> > If attrs always exists, why not just embed the struct and avoid the
> > need to alloc/free it?
>
> Because then it'll need a separate init paths for embedded ones. If
> the field was in any way hot, I'd have embedded it but it isn't and
> it's just less code to share the alloc path.

Ooh, right, and that cpumask_t is going away and you can't statically
allocate cpumask_var_t, so it needs an allocation and an error check
anyway.

--
tejun

2013-03-05 22:40:57

by Ryan Mallon

[permalink] [raw]
Subject: Re: [PATCH v2 16/31] workqueue: introduce workqueue_attrs

On 06/03/13 09:34, Tejun Heo wrote:
> On Tue, Mar 05, 2013 at 02:33:27PM -0800, Tejun Heo wrote:
>> Hello, Ryan.
>>
>> On Wed, Mar 06, 2013 at 09:29:35AM +1100, Ryan Mallon wrote:
>>>> @@ -148,6 +148,8 @@ struct worker_pool {
>>>> struct mutex assoc_mutex; /* protect POOL_DISASSOCIATED */
>>>> struct ida worker_ida; /* L: for worker IDs */
>>>>
>>>> + struct workqueue_attrs *attrs; /* I: worker attributes */
>>>
>>> If attrs always exists, why not just embed the struct and avoid the
>>> need to alloc/free it?
>>
>> Because then it'll need a separate init paths for embedded ones. If
>> the field was in any way hot, I'd have embedded it but it isn't and
>> it's just less code to share the alloc path.
>
> Ooh, right, and that cpumask_t is going away and you can't statically
> allocate cpumask_var_t, so it needs an allocation and error check from
> it anyway.

Not sure I follow. I mean drop the pointer, eg:

struct workqueue_attrs attrs;

Since, at least in this patch, struct worker_pool appears to always
alloc the attrs field. You do still of course need the cpumask_t
initialisation. Am I missing something?

~Ryan

2013-03-05 22:44:51

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 16/31] workqueue: introduce workqueue_attrs

Hello,

On Wed, Mar 06, 2013 at 09:40:48AM +1100, Ryan Mallon wrote:
> > Ooh, right, and that cpumask_t is going away and you can't statically
> > allocate cpumask_var_t, so it needs an allocation and error check from
> > it anyway.
>
> Not sure I follow. I mean drop the pointer, eg:
>
> struct workqueue_attr attrs;
>
> Since, at least in this patch, struct worker_pool appears to always
> alloc the attrs field. You do still of course need the cpumask_t
> initialisation. Am I missing something?

So, new usages of cpumask_t are frowned upon and we have to use
cpumask_var_t, which needs alloc_cpumask_var() which may fail, so we
have to try-to-alloc-and-check-for-failure no matter what. Now, if we
want to embed workqueue_attrs, we have to separate out initialization
of allocated attrs from the actual allocation, i.e. we'll need both
init_workqueue_attrs() and alloc_workqueue_attrs(), and as the former
may fail too, it doesn't really simplify the pool initialization path.
So, we end up with more code. The added code is minor but it also
doesn't buy anything.

Thanks.

--
tejun

2013-03-05 23:20:38

by Ryan Mallon

[permalink] [raw]
Subject: Re: [PATCH v2 16/31] workqueue: introduce workqueue_attrs

On 06/03/13 09:44, Tejun Heo wrote:
> Hello,
>
> On Wed, Mar 06, 2013 at 09:40:48AM +1100, Ryan Mallon wrote:
>>> Ooh, right, and that cpumask_t is going away and you can't statically
>>> allocate cpumask_var_t, so it needs an allocation and error check from
>>> it anyway.
>>
>> Not sure I follow. I mean drop the pointer, eg:
>>
>> struct workqueue_attr attrs;
>>
>> Since, at least in this patch, struct worker_pool appears to always
>> alloc the attrs field. You do still of course need the cpumask_t
>> initialisation. Am I missing something?
>
> So, new usages of cpumask_t is frowned upon and we gotta use
> cpumask_var_t which needs alloc_cpumask_var() which may fail, so we
> have try-to-alloc-and-check-for-failure no matter what. Now, if we
> want to embed workqueue_attrs, we have to separate out initialization
> of allocated attrs from the actaul allocation. ie. we'll need
> init_workqueue_attrs() and alloc_workqueue_attrs() and as the former
> may fail too, it doesn't really simplify pool initilaization path.
> So, we end up with more code. The added code is minor but it also
> doesn't buy anything.

I don't get why you would need to separate init/alloc. Nothing in the
patch series appears to have optional attrs (e.g. a case where attrs
might be NULL), so allocing isn't necessary, which is my point. The init
function can fail due to the cpumask_t, as you point out, but at least
you can remove one alloc/free per attrs struct:

static int workqueue_init_attrs(struct workqueue_attrs *attrs,
                                gfp_t gfp_mask)
{
        memset(attrs, 0, sizeof(*attrs));
        if (!alloc_cpumask_var(&attrs->cpumask, gfp_mask))
                return -ENOMEM;
        cpumask_setall(attrs->cpumask);
        return 0;
}

static void workqueue_deinit_attrs(struct workqueue_attrs *attrs)
{
        free_cpumask_var(attrs->cpumask);
}

In patch 17 unbound_std_wq_attrs can easily be changed to a non-pointer
type, and in patch 31 you remove the need to alloc/free the attrs
structure in wq_nice_store, so you would have something like:

        struct workqueue_attrs attrs;
        int err;

        err = workqueue_init_attrs(&attrs, GFP_KERNEL);
        if (err)
                return err;

        rcu_read_lock_sched();
        copy_workqueue_attrs(&attrs, first_pwq(wq)->pool->attrs);
        rcu_read_unlock_sched();

        apply_workqueue_attrs(wq, &attrs);

        /* Needed to free the temp cpumask */
        workqueue_deinit_attrs(&attrs);

If there are cases where the attrs need to be a pointer (e.g. it can
optionally be NULL, which needs to be tested against), then you could
just leave the responsibility of allocation to the caller.

~Ryan

2013-03-05 23:28:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 16/31] workqueue: introduce workqueue_attrs

On Wed, Mar 06, 2013 at 10:20:20AM +1100, Ryan Mallon wrote:
> I don't get why you would need to separate init/alloc. Nothing in the
> patch series appears to have optional attrs (e.g. a case where attrs

Because workqueue users will want to use workqueue_attrs to specify
attributes, and thanks to cpumask_var_t we need an alloc interface
anyway; by allocating the whole thing dynamically we also allow
workqueue_attrs to grow beyond a size which is appropriate to allocate
on the stack. Plus, users are less likely to make mistakes with a plain
alloc/free interface than with an init interface which takes a pointer
to an existing object but may fail, which is rather unusual.

Whether the struct itself is dynamic or not just doesn't matter and
it's just easier to have plain alloc/free if there has to be an init
step which may fail.

--
tejun

2013-03-07 23:31:14

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 30/31] driver/base: implement subsys_virtual_register()

On Tue, Mar 05, 2013 at 12:43:27PM -0800, Tejun Heo wrote:
> On Sun, Mar 03, 2013 at 07:42:31AM +0100, Kay Sievers wrote:
> > On Sat, Mar 2, 2013 at 7:17 PM, Greg Kroah-Hartman
> > <[email protected]> wrote:
> > > On Fri, Mar 01, 2013 at 07:24:21PM -0800, Tejun Heo wrote:
> > >> Kay tells me the most appropriate place to expose workqueues to
> > >> userland would be /sys/devices/virtual/workqueues/WQ_NAME which is
> > >> symlinked to /sys/bus/workqueue/devices/WQ_NAME and that we're lacking
> > >> a way to do that outside of driver core as virtual_device_parent()
> > >> isn't exported and there's no inteface to conveniently create a
> > >> virtual subsystem.
> > >
> > > I'm almost afraid to ask what you want to export to userspace for a
> > > workqueue that userspace would care about...
> > >
> > > If you create a subsystem, the devices will show up under the virtual
> > > "bus" if you don't give them a parent, so this patch shouldn't be
> > > needed, unless you are abusing the driver model. What am I missing
> > > here?
> >
> > Unfortunately, the parent == NULL --> /sys/devices/virtual/<subsys>/
> > we have only implemented for classes, and not for buses. We should fix
> > that.
>
> Greg, how should I proceed on this? As I wrote before, I don't really
> care about where or how. As long as I can make workqueues visible to
> userland, I'm happy.

Sorry for the delay, I'm at a conference all this week, and haven't had
much time to think about this.

If Kay says this is ok for now, that's good enough for me.

thanks,

greg k-h

2013-03-08 00:04:48

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH 30/31] driver/base: implement subsys_virtual_register()

On Fri, Mar 8, 2013 at 12:31 AM, Greg Kroah-Hartman
<[email protected]> wrote:
> On Tue, Mar 05, 2013 at 12:43:27PM -0800, Tejun Heo wrote:
>> On Sun, Mar 03, 2013 at 07:42:31AM +0100, Kay Sievers wrote:
>> > On Sat, Mar 2, 2013 at 7:17 PM, Greg Kroah-Hartman
>> > <[email protected]> wrote:
>> > > On Fri, Mar 01, 2013 at 07:24:21PM -0800, Tejun Heo wrote:
>> > >> Kay tells me the most appropriate place to expose workqueues to
>> > >> userland would be /sys/devices/virtual/workqueues/WQ_NAME which is
>> > >> symlinked to /sys/bus/workqueue/devices/WQ_NAME and that we're lacking
>> > >> a way to do that outside of driver core as virtual_device_parent()
>> > >> isn't exported and there's no inteface to conveniently create a
>> > >> virtual subsystem.
>> > >
>> > > I'm almost afraid to ask what you want to export to userspace for a
>> > > workqueue that userspace would care about...
>> > >
>> > > If you create a subsystem, the devices will show up under the virtual
>> > > "bus" if you don't give them a parent, so this patch shouldn't be
>> > > needed, unless you are abusing the driver model. What am I missing
>> > > here?
>> >
>> > Unfortunately, the parent == NULL --> /sys/devices/virtual/<subsys>/
>> > we have only implemented for classes, and not for buses. We should fix
>> > that.
>>
>> Greg, how should I proceed on this? As I wrote before, I don't really
>> care about where or how. As long as I can make workqueues visible to
>> userland, I'm happy.
>
> Sorry for the delay, I'm at a conference all this week, and haven't had
> much time to think about this.
>
> If Kay says this is ok for now, that's good enough for me.

Yes, it looks fine to me. If we provide unified handling of
classes and buses some day, this can probably go away, but until then
it looks fine and is straightforward to do it that way.

Thanks,
Kay

2013-03-10 10:06:48

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 17/31] workqueue: implement attribute-based unbound worker_pool management

On 02/03/13 11:24, Tejun Heo wrote:
> This patch makes unbound worker_pools reference counted and
> dynamically created and destroyed as workqueues needing them come and
> go. All unbound worker_pools are hashed on unbound_pool_hash which is
> keyed by the content of worker_pool->attrs.
>
> When an unbound workqueue is allocated, get_unbound_pool() is called
> with the attributes of the workqueue. If there already is a matching
> worker_pool, the reference count is bumped and the pool is returned.
> If not, a new worker_pool with matching attributes is created and
> returned.
>
> When an unbound workqueue is destroyed, put_unbound_pool() is called
> which decrements the reference count of the associated worker_pool.
> If the refcnt reaches zero, the worker_pool is destroyed in sched-RCU
> safe way.
>
> Note that the standard unbound worker_pools - normal and highpri ones
> with no specific cpumask affinity - are no longer created explicitly
> during init_workqueues(). init_workqueues() only initializes
> workqueue_attrs to be used for standard unbound pools -
> unbound_std_wq_attrs[]. The pools are spawned on demand as workqueues
> are created.
>
> Signed-off-by: Tejun Heo <[email protected]>
> ---
> kernel/workqueue.c | 230 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 218 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 7eba824..fb91b67 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -41,6 +41,7 @@
> #include <linux/debug_locks.h>
> #include <linux/lockdep.h>
> #include <linux/idr.h>
> +#include <linux/jhash.h>
> #include <linux/hashtable.h>
> #include <linux/rculist.h>
>
> @@ -80,6 +81,7 @@ enum {
>
> NR_STD_WORKER_POOLS = 2, /* # standard pools per cpu */
>
> + UNBOUND_POOL_HASH_ORDER = 6, /* hashed by pool->attrs */
> BUSY_WORKER_HASH_ORDER = 6, /* 64 pointers */
>
> MAX_IDLE_WORKERS_RATIO = 4, /* 1/4 of busy can be idle */
> @@ -149,6 +151,8 @@ struct worker_pool {
> struct ida worker_ida; /* L: for worker IDs */
>
> struct workqueue_attrs *attrs; /* I: worker attributes */
> + struct hlist_node hash_node; /* R: unbound_pool_hash node */
> + atomic_t refcnt; /* refcnt for unbound pools */
>
> /*
> * The current concurrency level. As it's likely to be accessed
> @@ -156,6 +160,12 @@ struct worker_pool {
> * cacheline.
> */
> atomic_t nr_running ____cacheline_aligned_in_smp;
> +
> + /*
> + * Destruction of pool is sched-RCU protected to allow dereferences
> + * from get_work_pool().
> + */
> + struct rcu_head rcu;
> } ____cacheline_aligned_in_smp;
>
> /*
> @@ -218,6 +228,11 @@ struct workqueue_struct {
>
> static struct kmem_cache *pwq_cache;
>
> +/* hash of all unbound pools keyed by pool->attrs */
> +static DEFINE_HASHTABLE(unbound_pool_hash, UNBOUND_POOL_HASH_ORDER);
> +
> +static struct workqueue_attrs *unbound_std_wq_attrs[NR_STD_WORKER_POOLS];
> +
> struct workqueue_struct *system_wq __read_mostly;
> EXPORT_SYMBOL_GPL(system_wq);
> struct workqueue_struct *system_highpri_wq __read_mostly;
> @@ -1740,7 +1755,7 @@ static struct worker *create_worker(struct worker_pool *pool)
> worker->pool = pool;
> worker->id = id;
>
> - if (pool->cpu != WORK_CPU_UNBOUND)
> + if (pool->cpu >= 0)
> worker->task = kthread_create_on_node(worker_thread,
> worker, cpu_to_node(pool->cpu),
> "kworker/%d:%d%s", pool->cpu, id, pri);
> @@ -3159,6 +3174,54 @@ fail:
> return NULL;
> }
>
> +static void copy_workqueue_attrs(struct workqueue_attrs *to,
> + const struct workqueue_attrs *from)
> +{
> + to->nice = from->nice;
> + cpumask_copy(to->cpumask, from->cpumask);
> +}
> +
> +/*
> + * Hacky implementation of jhash of bitmaps which only considers the
> + * specified number of bits. We probably want a proper implementation in
> + * include/linux/jhash.h.
> + */
> +static u32 jhash_bitmap(const unsigned long *bitmap, int bits, u32 hash)
> +{
> + int nr_longs = bits / BITS_PER_LONG;
> + int nr_leftover = bits % BITS_PER_LONG;
> + unsigned long leftover = 0;
> +
> + if (nr_longs)
> + hash = jhash(bitmap, nr_longs * sizeof(long), hash);
> + if (nr_leftover) {
> + bitmap_copy(&leftover, bitmap + nr_longs, nr_leftover);
> + hash = jhash(&leftover, sizeof(long), hash);
> + }
> + return hash;
> +}
> +
> +/* hash value of the content of @attr */
> +static u32 wqattrs_hash(const struct workqueue_attrs *attrs)
> +{
> + u32 hash = 0;
> +
> + hash = jhash_1word(attrs->nice, hash);
> + hash = jhash_bitmap(cpumask_bits(attrs->cpumask), nr_cpu_ids, hash);
> + return hash;
> +}
> +
> +/* content equality test */
> +static bool wqattrs_equal(const struct workqueue_attrs *a,
> + const struct workqueue_attrs *b)
> +{
> + if (a->nice != b->nice)
> + return false;
> + if (!cpumask_equal(a->cpumask, b->cpumask))
> + return false;
> + return true;
> +}
> +
> /**
> * init_worker_pool - initialize a newly zalloc'd worker_pool
> * @pool: worker_pool to initialize
> @@ -3169,6 +3232,8 @@ fail:
> static int init_worker_pool(struct worker_pool *pool)
> {
> spin_lock_init(&pool->lock);
> + pool->id = -1;
> + pool->cpu = -1;
> pool->flags |= POOL_DISASSOCIATED;
> INIT_LIST_HEAD(&pool->worklist);
> INIT_LIST_HEAD(&pool->idle_list);
> @@ -3185,12 +3250,133 @@ static int init_worker_pool(struct worker_pool *pool)
> mutex_init(&pool->assoc_mutex);
> ida_init(&pool->worker_ida);
>
> + INIT_HLIST_NODE(&pool->hash_node);
> + atomic_set(&pool->refcnt, 1);

We should document that the code before "atomic_set(&pool->refcnt, 1);" must
not fail (in case failable code gets added before it in the future and this
requirement is forgotten). Reason: when get_unbound_pool() fails, it calls
put_unbound_pool(), which expects ->refcnt == 1.
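
Something like this above the atomic_set() would do (just a sketch of the
suggested comment, not from the patch):

	/*
	 * Nothing that can fail should be added above this point: when
	 * get_unbound_pool() fails it calls put_unbound_pool(), which
	 * expects ->refcnt to already be 1.
	 */
	atomic_set(&pool->refcnt, 1);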

> pool->attrs = alloc_workqueue_attrs(GFP_KERNEL);
> if (!pool->attrs)
> return -ENOMEM;
> return 0;
> }
>
> +static void rcu_free_pool(struct rcu_head *rcu)
> +{
> + struct worker_pool *pool = container_of(rcu, struct worker_pool, rcu);
> +
> + ida_destroy(&pool->worker_ida);
> + free_workqueue_attrs(pool->attrs);
> + kfree(pool);
> +}
> +
> +/**
> + * put_unbound_pool - put a worker_pool
> + * @pool: worker_pool to put
> + *
> + * Put @pool. If its refcnt reaches zero, it gets destroyed in sched-RCU
> + * safe manner.
> + */
> +static void put_unbound_pool(struct worker_pool *pool)
> +{
> + struct worker *worker;
> +
> + if (!atomic_dec_and_test(&pool->refcnt))
> + return;

If get_unbound_pool() runs at this point, it can find and take a reference on
a pool that is already being destroyed. We need to move the
"spin_lock_irq(&workqueue_lock);" above this statement (and once ->refcnt is
only modified under the lock, it no longer needs to be atomic).
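
Something like this (just a sketch, assuming ->refcnt becomes a plain int once
it is only modified under workqueue_lock):

static void put_unbound_pool(struct worker_pool *pool)
{
	spin_lock_irq(&workqueue_lock);
	if (--pool->refcnt) {
		spin_unlock_irq(&workqueue_lock);
		return;
	}

	/* sanity checks ... */

	/*
	 * Release the id and unhash while still holding workqueue_lock so
	 * a concurrent get_unbound_pool() can no longer find this pool.
	 */
	if (pool->id >= 0)
		idr_remove(&worker_pool_idr, pool->id);
	hash_del(&pool->hash_node);
	spin_unlock_irq(&workqueue_lock);

	/* destroy workers, shut down timers and call_rcu_sched() as before */
}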


> +
> + /* sanity checks */
> + if (WARN_ON(!(pool->flags & POOL_DISASSOCIATED)))
> + return;


> + if (WARN_ON(pool->nr_workers != pool->nr_idle))
> + return;

This check can give a false negative; we should remove this WARN_ON().

> + if (WARN_ON(!list_empty(&pool->worklist)))
> + return;
> +
> + /* release id and unhash */
> + spin_lock_irq(&workqueue_lock);
> + if (pool->id >= 0)
> + idr_remove(&worker_pool_idr, pool->id);
> + hash_del(&pool->hash_node);
> + spin_unlock_irq(&workqueue_lock);
> +
> + /* lock out manager and destroy all workers */
> + mutex_lock(&pool->manager_mutex);
> + spin_lock_irq(&pool->lock);
> +
> + while ((worker = first_worker(pool)))
> + destroy_worker(worker);
> + WARN_ON(pool->nr_workers || pool->nr_idle);
> +
> + spin_unlock_irq(&pool->lock);
> + mutex_unlock(&pool->manager_mutex);
> +
> + /* shut down the timers */
> + del_timer_sync(&pool->idle_timer);
> + del_timer_sync(&pool->mayday_timer);
> +
> + /* sched-RCU protected to allow dereferences from get_work_pool() */
> + call_rcu_sched(&pool->rcu, rcu_free_pool);
> +}
> +
> +/**
> + * get_unbound_pool - get a worker_pool with the specified attributes
> + * @attrs: the attributes of the worker_pool to get
> + *
> + * Obtain a worker_pool which has the same attributes as @attrs, bump the
> + * reference count and return it. If there already is a matching
> + * worker_pool, it will be used; otherwise, this function attempts to
> + * create a new one. On failure, returns NULL.
> + */
> +static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
> +{
> + static DEFINE_MUTEX(create_mutex);
> + u32 hash = wqattrs_hash(attrs);
> + struct worker_pool *pool;
> + struct hlist_node *tmp;
> + struct worker *worker;
> +
> + mutex_lock(&create_mutex);
> +
> + /* do we already have a matching pool? */
> + spin_lock_irq(&workqueue_lock);
> + hash_for_each_possible(unbound_pool_hash, pool, tmp, hash_node, hash) {
> + if (wqattrs_equal(pool->attrs, attrs)) {
> + atomic_inc(&pool->refcnt);
> + goto out_unlock;
> + }
> + }
> + spin_unlock_irq(&workqueue_lock);
> +
> + /* nope, create a new one */
> + pool = kzalloc(sizeof(*pool), GFP_KERNEL);
> + if (!pool || init_worker_pool(pool) < 0)
> + goto fail;
> +
> + copy_workqueue_attrs(pool->attrs, attrs);
> +
> + if (worker_pool_assign_id(pool) < 0)
> + goto fail;
> +
> + /* create and start the initial worker */
> + worker = create_worker(pool);
> + if (!worker)
> + goto fail;
> +
> + spin_lock_irq(&pool->lock);
> + start_worker(worker);
> + spin_unlock_irq(&pool->lock);
> +
> + /* install */
> + spin_lock_irq(&workqueue_lock);
> + hash_add(unbound_pool_hash, &pool->hash_node, hash);
> +out_unlock:
> + spin_unlock_irq(&workqueue_lock);
> + mutex_unlock(&create_mutex);
> + return pool;
> +fail:
> + mutex_unlock(&create_mutex);
> + if (pool)
> + put_unbound_pool(pool);
> + return NULL;
> +}
> +
> static int alloc_and_link_pwqs(struct workqueue_struct *wq)
> {
> bool highpri = wq->flags & WQ_HIGHPRI;
> @@ -3215,7 +3401,12 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
> if (!pwq)
> return -ENOMEM;
>
> - pwq->pool = get_std_worker_pool(WORK_CPU_UNBOUND, highpri);
> + pwq->pool = get_unbound_pool(unbound_std_wq_attrs[highpri]);
> + if (!pwq->pool) {
> + kmem_cache_free(pwq_cache, pwq);
> + return -ENOMEM;
> + }
> +
> list_add_tail_rcu(&pwq->pwqs_node, &wq->pwqs);
> }
>
> @@ -3393,6 +3584,15 @@ void destroy_workqueue(struct workqueue_struct *wq)
> kfree(wq->rescuer);
> }
>
> + /*
> + * We're the sole accessor of @wq at this point. Directly access
> + * the first pwq and put its pool.
> + */
> + if (wq->flags & WQ_UNBOUND) {
> + pwq = list_first_entry(&wq->pwqs, struct pool_workqueue,
> + pwqs_node);
> + put_unbound_pool(pwq->pool);
> + }
> free_pwqs(wq);
> kfree(wq);
> }
> @@ -3856,19 +4056,14 @@ static int __init init_workqueues(void)
> hotcpu_notifier(workqueue_cpu_down_callback, CPU_PRI_WORKQUEUE_DOWN);
>
> /* initialize CPU pools */
> - for_each_wq_cpu(cpu) {
> + for_each_possible_cpu(cpu) {
> struct worker_pool *pool;
>
> i = 0;
> for_each_std_worker_pool(pool, cpu) {
> BUG_ON(init_worker_pool(pool));
> pool->cpu = cpu;
> -
> - if (cpu != WORK_CPU_UNBOUND)
> - cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
> - else
> - cpumask_setall(pool->attrs->cpumask);
> -
> + cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
> pool->attrs->nice = std_nice[i++];
>
> /* alloc pool ID */
> @@ -3877,14 +4072,13 @@ static int __init init_workqueues(void)
> }
>
> /* create the initial worker */
> - for_each_online_wq_cpu(cpu) {
> + for_each_online_cpu(cpu) {
> struct worker_pool *pool;
>
> for_each_std_worker_pool(pool, cpu) {
> struct worker *worker;
>
> - if (cpu != WORK_CPU_UNBOUND)
> - pool->flags &= ~POOL_DISASSOCIATED;
> + pool->flags &= ~POOL_DISASSOCIATED;
>
> worker = create_worker(pool);
> BUG_ON(!worker);
> @@ -3894,6 +4088,18 @@ static int __init init_workqueues(void)
> }
> }
>
> + /* create default unbound wq attrs */
> + for (i = 0; i < NR_STD_WORKER_POOLS; i++) {
> + struct workqueue_attrs *attrs;
> +
> + BUG_ON(!(attrs = alloc_workqueue_attrs(GFP_KERNEL)));
> +
> + attrs->nice = std_nice[i];
> + cpumask_setall(attrs->cpumask);
> +
> + unbound_std_wq_attrs[i] = attrs;
> + }
> +
> system_wq = alloc_workqueue("events", 0, 0);
> system_highpri_wq = alloc_workqueue("events_highpri", WQ_HIGHPRI, 0);
> system_long_wq = alloc_workqueue("events_long", 0, 0);

2013-03-10 10:07:09

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 07/31] workqueue: restructure pool / pool_workqueue iterations in freeze/thaw functions

On 02/03/13 11:23, Tejun Heo wrote:
> The three freeze/thaw related functions - freeze_workqueues_begin(),
> freeze_workqueues_busy() and thaw_workqueues() - need to iterate
> through all pool_workqueues of all freezable workqueues. They did it
> by first iterating pools and then visiting all pwqs (pool_workqueues)
> of all workqueues and process it if its pwq->pool matches the current
> pool. This is rather backwards and done this way partly because
> workqueue didn't have fitting iteration helpers and partly to avoid
> the number of lock operations on pool->lock.
>
> Workqueue now has fitting iterators and the locking operation overhead
> isn't anything to worry about - those locks are unlikely to be
> contended and the same CPU visiting the same set of locks multiple
> times isn't expensive.
>
> Restructure the three functions such that the flow better matches the
> logical steps and pwq iteration is done using for_each_pwq() inside
> workqueue iteration.
>
> * freeze_workqueues_begin(): Setting of FREEZING is moved into a
> separate for_each_pool() iteration. pwq iteration for clearing
> max_active is updated as described above.
>
> * freeze_workqueues_busy(): pwq iteration updated as described above.
>
> * thaw_workqueues(): The single for_each_wq_cpu() iteration is broken
> into three discrete steps - clearing FREEZING, restoring max_active,
> and kicking workers. The first and last steps use for_each_pool()
> and the second step uses pwq iteration described above.
>
> This makes the code easier to understand and removes the use of
> for_each_wq_cpu() for walking pwqs, which can't support multiple
> unbound pwqs which will be needed to implement unbound workqueues with
> custom attributes.
>
> This patch doesn't introduce any visible behavior changes.
>
> Signed-off-by: Tejun Heo <[email protected]>
> ---
> kernel/workqueue.c | 87 ++++++++++++++++++++++++++++--------------------------
> 1 file changed, 45 insertions(+), 42 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 869dbcc..9f195aa 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -3598,6 +3598,8 @@ EXPORT_SYMBOL_GPL(work_on_cpu);
> void freeze_workqueues_begin(void)
> {
> struct worker_pool *pool;
> + struct workqueue_struct *wq;
> + struct pool_workqueue *pwq;
> int id;
>
> spin_lock_irq(&workqueue_lock);
> @@ -3605,23 +3607,24 @@ void freeze_workqueues_begin(void)
> WARN_ON_ONCE(workqueue_freezing);
> workqueue_freezing = true;
>
> + /* set FREEZING */
> for_each_pool(pool, id) {
> - struct workqueue_struct *wq;
> -
> spin_lock(&pool->lock);
> -
> WARN_ON_ONCE(pool->flags & POOL_FREEZING);
> pool->flags |= POOL_FREEZING;
> + spin_unlock(&pool->lock);
> + }
>
> - list_for_each_entry(wq, &workqueues, list) {
> - struct pool_workqueue *pwq = get_pwq(pool->cpu, wq);
> + /* suppress further executions by setting max_active to zero */
> + list_for_each_entry(wq, &workqueues, list) {
> + if (!(wq->flags & WQ_FREEZABLE))
> + continue;
>
> - if (pwq && pwq->pool == pool &&
> - (wq->flags & WQ_FREEZABLE))
> - pwq->max_active = 0;
> + for_each_pwq(pwq, wq) {
> + spin_lock(&pwq->pool->lock);
> + pwq->max_active = 0;
> + spin_unlock(&pwq->pool->lock);
> }
> -
> - spin_unlock(&pool->lock);
> }
>
> spin_unlock_irq(&workqueue_lock);
> @@ -3642,25 +3645,22 @@ void freeze_workqueues_begin(void)
> */
> bool freeze_workqueues_busy(void)
> {
> - unsigned int cpu;
> bool busy = false;
> + struct workqueue_struct *wq;
> + struct pool_workqueue *pwq;
>
> spin_lock_irq(&workqueue_lock);
>
> WARN_ON_ONCE(!workqueue_freezing);
>
> - for_each_wq_cpu(cpu) {
> - struct workqueue_struct *wq;
> + list_for_each_entry(wq, &workqueues, list) {
> + if (!(wq->flags & WQ_FREEZABLE))
> + continue;
> /*
> * nr_active is monotonically decreasing. It's safe
> * to peek without lock.
> */
> - list_for_each_entry(wq, &workqueues, list) {
> - struct pool_workqueue *pwq = get_pwq(cpu, wq);
> -
> - if (!pwq || !(wq->flags & WQ_FREEZABLE))
> - continue;
> -
> + for_each_pwq(pwq, wq) {
> WARN_ON_ONCE(pwq->nr_active < 0);
> if (pwq->nr_active) {
> busy = true;
> @@ -3684,40 +3684,43 @@ out_unlock:
> */
> void thaw_workqueues(void)
> {
> - unsigned int cpu;
> + struct workqueue_struct *wq;
> + struct pool_workqueue *pwq;
> + struct worker_pool *pool;
> + int id;
>
> spin_lock_irq(&workqueue_lock);
>
> if (!workqueue_freezing)
> goto out_unlock;
>
> - for_each_wq_cpu(cpu) {
> - struct worker_pool *pool;
> - struct workqueue_struct *wq;
> -
> - for_each_std_worker_pool(pool, cpu) {
> - spin_lock(&pool->lock);
> -
> - WARN_ON_ONCE(!(pool->flags & POOL_FREEZING));
> - pool->flags &= ~POOL_FREEZING;
> -
> - list_for_each_entry(wq, &workqueues, list) {
> - struct pool_workqueue *pwq = get_pwq(cpu, wq);
> -
> - if (!pwq || pwq->pool != pool ||
> - !(wq->flags & WQ_FREEZABLE))
> - continue;
> -
> - /* restore max_active and repopulate worklist */
> - pwq_set_max_active(pwq, wq->saved_max_active);
> - }
> + /* clear FREEZING */
> + for_each_pool(pool, id) {
> + spin_lock(&pool->lock);
> + WARN_ON_ONCE(!(pool->flags & POOL_FREEZING));
> + pool->flags &= ~POOL_FREEZING;
> + spin_unlock(&pool->lock);
> + }


I think it would be better if we move this code to ...

>
> - wake_up_worker(pool);
> + /* restore max_active and repopulate worklist */
> + list_for_each_entry(wq, &workqueues, list) {
> + if (!(wq->flags & WQ_FREEZABLE))
> + continue;
>
> - spin_unlock(&pool->lock);
> + for_each_pwq(pwq, wq) {
> + spin_lock(&pwq->pool->lock);
> + pwq_set_max_active(pwq, wq->saved_max_active);
> + spin_unlock(&pwq->pool->lock);
> }
> }
>
> + /* kick workers */
> + for_each_pool(pool, id) {
> + spin_lock(&pool->lock);
> + wake_up_worker(pool);
> + spin_unlock(&pool->lock);
> + }


... to here.

I.e. clear FREEZING and then kick the workers.
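
Something like this (just a sketch of the suggested ordering, reusing the code
from the hunk above):

	/* restore max_active and repopulate worklist */
	list_for_each_entry(wq, &workqueues, list) {
		if (!(wq->flags & WQ_FREEZABLE))
			continue;

		for_each_pwq(pwq, wq) {
			spin_lock(&pwq->pool->lock);
			pwq_set_max_active(pwq, wq->saved_max_active);
			spin_unlock(&pwq->pool->lock);
		}
	}

	/* clear FREEZING and kick workers */
	for_each_pool(pool, id) {
		spin_lock(&pool->lock);
		WARN_ON_ONCE(!(pool->flags & POOL_FREEZING));
		pool->flags &= ~POOL_FREEZING;
		wake_up_worker(pool);
		spin_unlock(&pool->lock);
	}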


> +
> workqueue_freezing = false;
> out_unlock:
> spin_unlock_irq(&workqueue_lock);

2013-03-10 10:07:22

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 12/31] workqueue: update synchronization rules on workqueue->pwqs

On 02/03/13 11:24, Tejun Heo wrote:
> Make workqueue->pwqs protected by workqueue_lock for writes and
> sched-RCU protected for reads. Lockdep assertions are added to
> for_each_pwq() and first_pwq() and all their users are converted to
> either hold workqueue_lock or disable preemption/irq.
>
> alloc_and_link_pwqs() is updated to use list_add_tail_rcu() for
> consistency which isn't strictly necessary as the workqueue isn't
> visible. destroy_workqueue() isn't updated to sched-RCU release pwqs.
> This is okay as the workqueue should have no users left by that point.
>
> The locking is superfluous at this point. This is to help
> implementation of unbound pools/pwqs with custom attributes.
>
> This patch doesn't introduce any behavior changes.
>
> Signed-off-by: Tejun Heo <[email protected]>
> ---
> kernel/workqueue.c | 85 +++++++++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 68 insertions(+), 17 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 02f51b8..ff51c59 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -42,6 +42,7 @@
> #include <linux/lockdep.h>
> #include <linux/idr.h>
> #include <linux/hashtable.h>
> +#include <linux/rculist.h>
>
> #include "workqueue_internal.h"
>
> @@ -118,6 +119,8 @@ enum {
> * F: wq->flush_mutex protected.
> *
> * W: workqueue_lock protected.
> + *
> + * R: workqueue_lock protected for writes. Sched-RCU protected for reads.
> */
>
> /* struct worker is defined in workqueue_internal.h */
> @@ -169,7 +172,7 @@ struct pool_workqueue {
> int nr_active; /* L: nr of active works */
> int max_active; /* L: max active works */
> struct list_head delayed_works; /* L: delayed works */
> - struct list_head pwqs_node; /* I: node on wq->pwqs */
> + struct list_head pwqs_node; /* R: node on wq->pwqs */
> struct list_head mayday_node; /* W: node on wq->maydays */
> } __aligned(1 << WORK_STRUCT_FLAG_BITS);
>
> @@ -189,7 +192,7 @@ struct wq_flusher {
> struct workqueue_struct {
> unsigned int flags; /* W: WQ_* flags */
> struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwq's */
> - struct list_head pwqs; /* I: all pwqs of this wq */
> + struct list_head pwqs; /* R: all pwqs of this wq */
> struct list_head list; /* W: list of all workqueues */
>
> struct mutex flush_mutex; /* protects wq flushing */
> @@ -227,6 +230,11 @@ EXPORT_SYMBOL_GPL(system_freezable_wq);
> #define CREATE_TRACE_POINTS
> #include <trace/events/workqueue.h>
>
> +#define assert_rcu_or_wq_lock() \
> + rcu_lockdep_assert(rcu_read_lock_sched_held() || \
> + lockdep_is_held(&workqueue_lock), \
> + "sched RCU or workqueue lock should be held")
> +
> #define for_each_std_worker_pool(pool, cpu) \
> for ((pool) = &std_worker_pools(cpu)[0]; \
> (pool) < &std_worker_pools(cpu)[NR_STD_WORKER_POOLS]; (pool)++)
> @@ -282,9 +290,16 @@ static inline int __next_wq_cpu(int cpu, const struct cpumask *mask,
> * for_each_pwq - iterate through all pool_workqueues of the specified workqueue
> * @pwq: iteration cursor
> * @wq: the target workqueue
> + *
> + * This must be called either with workqueue_lock held or sched RCU read
> + * locked. If the pwq needs to be used beyond the locking in effect, the
> + * caller is responsible for guaranteeing that the pwq stays online.
> + *
> + * The if clause exists only for the lockdep assertion and can be ignored.
> */
> #define for_each_pwq(pwq, wq) \
> - list_for_each_entry((pwq), &(wq)->pwqs, pwqs_node)
> + list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node) \
> + if (({ assert_rcu_or_wq_lock(); true; }))

Be aware of this:

	if (somecondition)
		for_each_pwq(pwq, wq)
			one_statement;
	else
		xxxxx;


for_each_pwq() will eat the else.

To avoid this, you can use:

#define for_each_pwq(pwq, wq)						\
	list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node)		\
		if (({ assert_rcu_or_wq_lock(); false; })) { }		\
		else


The same applies to for_each_pool() in a later patch.


>
> #ifdef CONFIG_DEBUG_OBJECTS_WORK
>
> @@ -463,9 +478,19 @@ static struct worker_pool *get_std_worker_pool(int cpu, bool highpri)
> return &pools[highpri];
> }
>
> +/**
> + * first_pwq - return the first pool_workqueue of the specified workqueue
> + * @wq: the target workqueue
> + *
> + * This must be called either with workqueue_lock held or sched RCU read
> + * locked. If the pwq needs to be used beyond the locking in effect, the
> + * caller is responsible for guaranteeing that the pwq stays online.
> + */
> static struct pool_workqueue *first_pwq(struct workqueue_struct *wq)
> {
> - return list_first_entry(&wq->pwqs, struct pool_workqueue, pwqs_node);
> + assert_rcu_or_wq_lock();
> + return list_first_or_null_rcu(&wq->pwqs, struct pool_workqueue,
> + pwqs_node);
> }
>
> static unsigned int work_color_to_flags(int color)
> @@ -2488,10 +2513,12 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
> atomic_set(&wq->nr_pwqs_to_flush, 1);
> }
>
> + local_irq_disable();
> +
> for_each_pwq(pwq, wq) {
> struct worker_pool *pool = pwq->pool;
>
> - spin_lock_irq(&pool->lock);
> + spin_lock(&pool->lock);
>
> if (flush_color >= 0) {
> WARN_ON_ONCE(pwq->flush_color != -1);
> @@ -2508,9 +2535,11 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
> pwq->work_color = work_color;
> }
>
> - spin_unlock_irq(&pool->lock);
> + spin_unlock(&pool->lock);
> }
>
> + local_irq_enable();
> +
> if (flush_color >= 0 && atomic_dec_and_test(&wq->nr_pwqs_to_flush))
> complete(&wq->first_flusher->done);
>
> @@ -2701,12 +2730,14 @@ void drain_workqueue(struct workqueue_struct *wq)
> reflush:
> flush_workqueue(wq);
>
> + local_irq_disable();
> +
> for_each_pwq(pwq, wq) {
> bool drained;
>
> - spin_lock_irq(&pwq->pool->lock);
> + spin_lock(&pwq->pool->lock);
> drained = !pwq->nr_active && list_empty(&pwq->delayed_works);
> - spin_unlock_irq(&pwq->pool->lock);
> + spin_unlock(&pwq->pool->lock);
>
> if (drained)
> continue;
> @@ -2715,13 +2746,17 @@ reflush:
> (flush_cnt % 100 == 0 && flush_cnt <= 1000))
> pr_warn("workqueue %s: flush on destruction isn't complete after %u tries\n",
> wq->name, flush_cnt);
> +
> + local_irq_enable();
> goto reflush;
> }
>
> - spin_lock_irq(&workqueue_lock);
> + spin_lock(&workqueue_lock);
> if (!--wq->nr_drainers)
> wq->flags &= ~WQ_DRAINING;
> - spin_unlock_irq(&workqueue_lock);
> + spin_unlock(&workqueue_lock);
> +
> + local_irq_enable();
> }
> EXPORT_SYMBOL_GPL(drain_workqueue);
>
> @@ -3087,7 +3122,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
> per_cpu_ptr(wq->cpu_pwqs, cpu);
>
> pwq->pool = get_std_worker_pool(cpu, highpri);
> - list_add_tail(&pwq->pwqs_node, &wq->pwqs);
> + list_add_tail_rcu(&pwq->pwqs_node, &wq->pwqs);
> }
> } else {
> struct pool_workqueue *pwq;
> @@ -3097,7 +3132,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
> return -ENOMEM;
>
> pwq->pool = get_std_worker_pool(WORK_CPU_UNBOUND, highpri);
> - list_add_tail(&pwq->pwqs_node, &wq->pwqs);
> + list_add_tail_rcu(&pwq->pwqs_node, &wq->pwqs);
> }
>
> return 0;
> @@ -3174,6 +3209,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
> if (alloc_and_link_pwqs(wq) < 0)
> goto err;
>
> + local_irq_disable();
> for_each_pwq(pwq, wq) {
> BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
> pwq->wq = wq;
> @@ -3182,6 +3218,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
> INIT_LIST_HEAD(&pwq->delayed_works);
> INIT_LIST_HEAD(&pwq->mayday_node);
> }
> + local_irq_enable();
>
> if (flags & WQ_RESCUER) {
> struct worker *rescuer;
> @@ -3239,24 +3276,32 @@ void destroy_workqueue(struct workqueue_struct *wq)
> /* drain it before proceeding with destruction */
> drain_workqueue(wq);
>
> + spin_lock_irq(&workqueue_lock);
> +
> /* sanity checks */
> for_each_pwq(pwq, wq) {
> int i;
>
> - for (i = 0; i < WORK_NR_COLORS; i++)
> - if (WARN_ON(pwq->nr_in_flight[i]))
> + for (i = 0; i < WORK_NR_COLORS; i++) {
> + if (WARN_ON(pwq->nr_in_flight[i])) {
> + spin_unlock_irq(&workqueue_lock);
> return;
> + }
> + }
> +
> if (WARN_ON(pwq->nr_active) ||
> - WARN_ON(!list_empty(&pwq->delayed_works)))
> + WARN_ON(!list_empty(&pwq->delayed_works))) {
> + spin_unlock_irq(&workqueue_lock);
> return;
> + }
> }
>
> /*
> * wq list is used to freeze wq, remove from list after
> * flushing is complete in case freeze races us.
> */
> - spin_lock_irq(&workqueue_lock);
> list_del(&wq->list);
> +
> spin_unlock_irq(&workqueue_lock);
>
> if (wq->flags & WQ_RESCUER) {
> @@ -3340,13 +3385,19 @@ EXPORT_SYMBOL_GPL(workqueue_set_max_active);
> bool workqueue_congested(int cpu, struct workqueue_struct *wq)
> {
> struct pool_workqueue *pwq;
> + bool ret;
> +
> + preempt_disable();
>
> if (!(wq->flags & WQ_UNBOUND))
> pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
> else
> pwq = first_pwq(wq);
>
> - return !list_empty(&pwq->delayed_works);
> + ret = !list_empty(&pwq->delayed_works);
> + preempt_enable();
> +
> + return ret;
> }
> EXPORT_SYMBOL_GPL(workqueue_congested);
>

2013-03-10 10:07:29

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 14/31] workqueue: replace POOL_MANAGING_WORKERS flag with worker_pool->manager_mutex

On 02/03/13 11:24, Tejun Heo wrote:
> POOL_MANAGING_WORKERS is used to synchronize the manager role.
> Synchronizing among workers doesn't need blocking and that's why it's
> implemented as a flag.
>
> It got converted to a mutex a while back to add blocking wait from CPU
> hotplug path - 6037315269 ("workqueue: use mutex for global_cwq
> manager exclusion"). Later it turned out that synchronization among
> workers and cpu hotplug need to be done separately. Eventually,
> POOL_MANAGING_WORKERS is restored and workqueue->manager_mutex got
> morphed into workqueue->assoc_mutex - 552a37e936 ("workqueue: restore
> POOL_MANAGING_WORKERS") and b2eb83d123 ("workqueue: rename
> manager_mutex to assoc_mutex").
>
> Now, we're gonna need to be able to lock out managers from
> destroy_workqueue() to support multiple unbound pools with custom
> attributes making it again necessary to be able to block on the
> manager role. This patch replaces POOL_MANAGING_WORKERS with
> worker_pool->manager_mutex.
>
> This patch doesn't introduce any behavior changes.
>
> Signed-off-by: Tejun Heo <[email protected]>
> ---
> kernel/workqueue.c | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 2645218..68b3443 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -64,7 +64,6 @@ enum {
> * create_worker() is in progress.
> */
> POOL_MANAGE_WORKERS = 1 << 0, /* need to manage workers */
> - POOL_MANAGING_WORKERS = 1 << 1, /* managing workers */
> POOL_DISASSOCIATED = 1 << 2, /* cpu can't serve workers */
> POOL_FREEZING = 1 << 3, /* freeze in progress */
>
> @@ -145,6 +144,7 @@ struct worker_pool {
> DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
> /* L: hash of busy workers */
>
> + struct mutex manager_mutex; /* the holder is the manager */
> struct mutex assoc_mutex; /* protect POOL_DISASSOCIATED */
> struct ida worker_ida; /* L: for worker IDs */
>
> @@ -702,7 +702,7 @@ static bool need_to_manage_workers(struct worker_pool *pool)
> /* Do we have too many workers and should some go away? */
> static bool too_many_workers(struct worker_pool *pool)
> {
> - bool managing = pool->flags & POOL_MANAGING_WORKERS;
> + bool managing = mutex_is_locked(&pool->manager_mutex);
> int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
> int nr_busy = pool->nr_workers - nr_idle;
>
> @@ -2027,15 +2027,13 @@ static bool manage_workers(struct worker *worker)
> struct worker_pool *pool = worker->pool;
> bool ret = false;
>
> - if (pool->flags & POOL_MANAGING_WORKERS)
> + if (!mutex_trylock(&pool->manager_mutex))
> return ret;
>
> - pool->flags |= POOL_MANAGING_WORKERS;


If mutex_trylock(&pool->manager_mutex) fails, it does not necessarily
mean the pool is managing workers (although with the current code it
does), so I recommend keeping POOL_MANAGING_WORKERS.

I suggest that you reuse assoc_mutex for your purpose in the later
patches (and rename assoc_mutex back to manager_mutex).


> -
> /*
> * To simplify both worker management and CPU hotplug, hold off
> * management while hotplug is in progress. CPU hotplug path can't
> - * grab %POOL_MANAGING_WORKERS to achieve this because that can
> + * grab @pool->manager_mutex to achieve this because that can
> * lead to idle worker depletion (all become busy thinking someone
> * else is managing) which in turn can result in deadlock under
> * extreme circumstances. Use @pool->assoc_mutex to synchronize
> @@ -2075,8 +2073,8 @@ static bool manage_workers(struct worker *worker)
> ret |= maybe_destroy_workers(pool);
> ret |= maybe_create_worker(pool);
>
> - pool->flags &= ~POOL_MANAGING_WORKERS;
> mutex_unlock(&pool->assoc_mutex);
> + mutex_unlock(&pool->manager_mutex);
> return ret;
> }
>
> @@ -3805,6 +3803,7 @@ static int __init init_workqueues(void)
> setup_timer(&pool->mayday_timer, pool_mayday_timeout,
> (unsigned long)pool);
>
> + mutex_init(&pool->manager_mutex);
> mutex_init(&pool->assoc_mutex);
> ida_init(&pool->worker_ida);
>

2013-03-10 10:32:22

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-3.10-tmp] workqueue: implement workqueue with custom worker attributes

On 02/03/13 11:23, Tejun Heo wrote:

Hi, Tejun,

I agree with almost the whole design (except for some of the locking),
and I found only a few small problems in this round of review.

>
> This patchset contains the following 31 patches.
>
> 0001-workqueue-make-sanity-checks-less-punshing-using-WAR.patch

> 0002-workqueue-make-workqueue_lock-irq-safe.patch

workqueue_lock protects too many things. We can introduce different locks
for different purposes later.

> 0003-workqueue-introduce-kmem_cache-for-pool_workqueues.patch
> 0004-workqueue-add-workqueue_struct-pwqs-list.patch
> 0005-workqueue-replace-for_each_pwq_cpu-with-for_each_pwq.patch
> 0006-workqueue-introduce-for_each_pool.patch
> 0007-workqueue-restructure-pool-pool_workqueue-iterations.patch
> 0008-workqueue-add-wokrqueue_struct-maydays-list-to-repla.patch
> 0009-workqueue-consistently-use-int-for-cpu-variables.patch
> 0010-workqueue-remove-workqueue_struct-pool_wq.single.patch
> 0011-workqueue-replace-get_pwq-with-explicit-per_cpu_ptr-.patch
> 0012-workqueue-update-synchronization-rules-on-workqueue-.patch
> 0013-workqueue-update-synchronization-rules-on-worker_poo.patch

> 0014-workqueue-replace-POOL_MANAGING_WORKERS-flag-with-wo.patch
> 0015-workqueue-separate-out-init_worker_pool-from-init_wo.patch
> 0016-workqueue-introduce-workqueue_attrs.patch
> 0017-workqueue-implement-attribute-based-unbound-worker_p.patch
> 0018-workqueue-remove-unbound_std_worker_pools-and-relate.patch
> 0019-workqueue-drop-std-from-cpu_std_worker_pools-and-for.patch
> 0020-workqueue-add-pool-ID-to-the-names-of-unbound-kworke.patch
> 0021-workqueue-drop-WQ_RESCUER-and-test-workqueue-rescuer.patch
> 0022-workqueue-restructure-__alloc_workqueue_key.patch


> 0023-workqueue-implement-get-put_pwq.patch

I guess this patch and patch 25 may have a very deep issue vs. RCU.

> 0024-workqueue-prepare-flush_workqueue-for-dynamic-creati.patch
> 0025-workqueue-perform-non-reentrancy-test-when-queueing-.patch
> 0026-workqueue-implement-apply_workqueue_attrs.patch
> 0027-workqueue-make-it-clear-that-WQ_DRAINING-is-an-inter.patch
> 0028-workqueue-reject-increasing-max_active-for-ordered-w.patch
> 0029-cpumask-implement-cpumask_parse.patch
> 0030-driver-base-implement-subsys_virtual_register.patch
> 0031-workqueue-implement-sysfs-interface-for-workqueues.patch
>


for 1~13,15~22,26~28, please add Reviewed-by: Lai Jiangshan <[email protected]>


> 0001-0003 are misc preps.
>
> 0004-0008 update various iterators such that they don't operate on cpu
> number.
>
> 0009-0011 are another set of misc preps / cleanups.
>
> 0012-0014 update synchronization rules to prepare for dynamic
> management of pwqs and pools.
>
> 0015-0022 introduce workqueue_attrs and prepare for dynamic management
> of pwqs and pools.
>
> 0023-0026 implement dynamic application of workqueue_attrs which
> involves creating and destroying unbound pwqs and pools dynamically.
>
> 0027-0028 prepare workqueue for sysfs exports.
>
> 0029-0030 make cpumask and driver core changes for workqueue sysfs
> exports.
>
> 0031 implements sysfs exports for workqueues.
>
> This patchset is on top of
>
> [1] wq/for-3.10-tmp 7bceeff75e ("workqueue: better define synchronization rule around rescuer->pool updates")
>
> which is scheduled to be rebased on top of v3.9-rc1 once it comes out.
> The changes are also available in the following git branch.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-attrs
>
> diffstat follows.
>
> drivers/base/base.h | 2
> drivers/base/bus.c | 73 +
> drivers/base/core.c | 2
> include/linux/cpumask.h | 15
> include/linux/device.h | 2
> include/linux/workqueue.h | 34
> kernel/workqueue.c | 1716 +++++++++++++++++++++++++++++++-------------
> kernel/workqueue_internal.h | 5
> 8 files changed, 1322 insertions(+), 527 deletions(-)
>
> Thanks.
>
> --
> tejun
>
> [1] git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-3.10-tmp

2013-03-10 11:57:09

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 30/31] driver/base: implement subsys_virtual_register()

Hey, guys.

On Fri, Mar 08, 2013 at 01:04:25AM +0100, Kay Sievers wrote:
> > Sorry for the delay, I'm at a conference all this week, and haven't had
> > much time to think about this.
> >
> > If Kay says this is ok for now, that's good enough for me.
>
> Yes, it looks fine to me. If we provide the unified handling of
> classes and buses some day, this can probably go away, but until that
> it looks fine and is straight forward to do it that way,

How should this be routed? I can take it but Kay needs it too so
workqueue tree probably isn't the best fit although I can set up a
separate branch if needed.

Thanks.

--
tejun

2013-03-10 12:01:19

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-3.10-tmp] workqueue: implement workqueue with custom worker attributes

Hey, Lai.

On Sun, Mar 10, 2013 at 06:34:33PM +0800, Lai Jiangshan wrote:
> > This patchset contains the following 31 patches.
> >
> > 0001-workqueue-make-sanity-checks-less-punshing-using-WAR.patch
>
> > 0002-workqueue-make-workqueue_lock-irq-safe.patch
>
> workqueue_lock protects too many things. We can introduce different locks
> for different purpose later.

I don't know. My general attitude toward locking is the simpler the
better. None of the paths protected by workqueue_lock are hot.
There's no actual benefit in making them finer grained.

> > 0023-workqueue-implement-get-put_pwq.patch
>
> I guess this patch and patch25 may have very deep issue VS RCU.

Hmmm... scary. I suppose you're gonna elaborate on the review of the
actual patch?

> > 0024-workqueue-prepare-flush_workqueue-for-dynamic-creati.patch
> > 0025-workqueue-perform-non-reentrancy-test-when-queueing-.patch
> > 0026-workqueue-implement-apply_workqueue_attrs.patch
> > 0027-workqueue-make-it-clear-that-WQ_DRAINING-is-an-inter.patch
> > 0028-workqueue-reject-increasing-max_active-for-ordered-w.patch
> > 0029-cpumask-implement-cpumask_parse.patch
> > 0030-driver-base-implement-subsys_virtual_register.patch
> > 0031-workqueue-implement-sysfs-interface-for-workqueues.patch
>
>
> for 1~13,15~22,26~28, please add Reviewed-by: Lai Jiangshan <[email protected]>

Done.

Thanks.

--
tejun

2013-03-10 12:34:51

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 07/31] workqueue: restructure pool / pool_workqueue iterations in freeze/thaw functions

Hello, Lai.

On Sun, Mar 10, 2013 at 06:09:17PM +0800, Lai Jiangshan wrote:
> > + /* clear FREEZING */
> > + for_each_pool(pool, id) {
> > + spin_lock(&pool->lock);
> > + WARN_ON_ONCE(!(pool->flags & POOL_FREEZING));
> > + pool->flags &= ~POOL_FREEZING;
> > + spin_unlock(&pool->lock);
> > + }
>
>
> I think it would be better if we move this code to ...
>
> >
> > - wake_up_worker(pool);
> > + /* restore max_active and repopulate worklist */
> > + list_for_each_entry(wq, &workqueues, list) {
> > + if (!(wq->flags & WQ_FREEZABLE))
> > + continue;
> >
> > - spin_unlock(&pool->lock);
> > + for_each_pwq(pwq, wq) {
> > + spin_lock(&pwq->pool->lock);
> > + pwq_set_max_active(pwq, wq->saved_max_active);
> > + spin_unlock(&pwq->pool->lock);
> > }
> > }
> >
> > + /* kick workers */
> > + for_each_pool(pool, id) {
> > + spin_lock(&pool->lock);
> > + wake_up_worker(pool);
> > + spin_unlock(&pool->lock);
> > + }
>
>
> ... to here.
>
> clear FREEZING and then kick.

Yeah, that would be prettier, but it would also change the order of
operations, which I'd like to keep the same across the conversion. We can create
a separate patch to merge the two loops later. Care to send a
separate patch?
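
For reference, the merged version would presumably look something like
this (just a sketch of the suggestion, not an actual patch):

	/* after max_active has been restored for all freezable wqs:
	 * clear FREEZING and kick workers in a single pass */
	for_each_pool(pool, id) {
		spin_lock(&pool->lock);
		WARN_ON_ONCE(!(pool->flags & POOL_FREEZING));
		pool->flags &= ~POOL_FREEZING;
		wake_up_worker(pool);
		spin_unlock(&pool->lock);
	}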

Thanks.

--
tejun

2013-03-10 12:38:49

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 12/31] workqueue: update synchronization rules on workqueue->pwqs

Hello, Lai.

On Sun, Mar 10, 2013 at 06:09:28PM +0800, Lai Jiangshan wrote:
> > #define for_each_pwq(pwq, wq) \
> > - list_for_each_entry((pwq), &(wq)->pwqs, pwqs_node)
> > + list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node) \
> > + if (({ assert_rcu_or_wq_lock(); true; }))
>
> Be aware of this:
>
> if (somecondition)
> for_each_pwq(pwq, wq)
> one_statement;
> else
> xxxxx;
>
>
> for_each_pwq() will eat the else.

Yeah, but that will also generate a compiler warning.

> To avoid this, you can use:
>
> #define for_each_pwq(pwq, wq) \
> list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node) \
> if (({ assert_rcu_or_wq_lock(); false; })) { } \
> else
>
>
> The same for for_each_pool() in later patch.

Ooh, yeah, that's better. Will do that.
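
To spell it out (a sketch, not from the patch): with the trailing
if/else inside the macro, a use like

	if (somecondition)
		for_each_pwq(pwq, wq)
			one_statement;
	else
		xxxxx;

expands to

	if (somecondition)
		list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node)
			if (({ assert_rcu_or_wq_lock(); false; })) { }
			else
				one_statement;
	else
		xxxxx;

and the user's else still binds to the outer if, because the hidden if
already carries an else of its own.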

Thanks.

--
tejun

2013-03-10 12:46:38

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 14/31] workqueue: replace POOL_MANAGING_WORKERS flag with worker_pool->manager_mutex

Hello, Lai.

On Sun, Mar 10, 2013 at 06:09:38PM +0800, Lai Jiangshan wrote:
> > - if (pool->flags & POOL_MANAGING_WORKERS)
> > + if (!mutex_trylock(&pool->manager_mutex))
> > return ret;
> >
> > - pool->flags |= POOL_MANAGING_WORKERS;
>
>
> if mutex_trylock(&pool->manager_mutex) fails, it does not mean
> the pool is managing workers. (although current code does).
> so I recommend to keep POOL_MANAGING_WORKERS.

So, that's the intention. It's gonna be used during pool destruction
and we want all the workers to think that the pool is being managed
and it's safe to proceed.

> I suggest that you reuse assoc_mutex for your purpose(later patches).
> (and rename assoc_mutex back to manager_mutex)

They are different. assoc_mutex makes the workers wait for the
managership, which shouldn't happen during pool destruction. We want
the workers to assume that the pool is managed which is what
manager_mutex achieves.
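
In other words, roughly (a simplified sketch, not the actual code):

	/* workers: whoever gets the mutex is the manager, losers skip */
	static bool manage_workers(struct worker *worker)
	{
		struct worker_pool *pool = worker->pool;

		if (!mutex_trylock(&pool->manager_mutex))
			return false;	/* assume someone is managing */
		/* ... maybe_destroy_workers(), maybe_create_worker() ... */
		mutex_unlock(&pool->manager_mutex);
		return true;
	}

	/* pool destruction: hold the manager role so that no worker
	 * enters management while the pool is being torn down */
	static void put_unbound_pool(struct worker_pool *pool)
	{
		mutex_lock(&pool->manager_mutex);
		/* ... destroy all workers ... */
		mutex_unlock(&pool->manager_mutex);
	}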

Thanks.

--
tejun

2013-03-10 12:58:26

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 17/31] workqueue: implement attribute-based unbound worker_pool management

On Sun, Mar 10, 2013 at 06:08:57PM +0800, Lai Jiangshan wrote:
> > @@ -3185,12 +3250,133 @@ static int init_worker_pool(struct worker_pool *pool)
> > mutex_init(&pool->assoc_mutex);
> > ida_init(&pool->worker_ida);
> >
> > + INIT_HLIST_NODE(&pool->hash_node);
> > + atomic_set(&pool->refcnt, 1);
>
> We should document that the code before "atomic_set(&pool->refcnt, 1);" must not fail
> (in case we add code that can fail before it and forget this requirement in the future).
> Reason: when get_unbound_pool() fails, we expect ->refcnt == 1.

Yeap, comments added.
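
(In the updated patch this ends up roughly as

	INIT_HLIST_NODE(&pool->hash_node);
	pool->refcnt = 1;

	/* shouldn't fail above this point */
	pool->attrs = alloc_workqueue_attrs(GFP_KERNEL);

i.e. nothing that can fail comes before the refcount is set to 1.)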

> > +/**
> > + * put_unbound_pool - put a worker_pool
> > + * @pool: worker_pool to put
> > + *
> > + * Put @pool. If its refcnt reaches zero, it gets destroyed in sched-RCU
> > + * safe manner.
> > + */
> > +static void put_unbound_pool(struct worker_pool *pool)
> > +{
> > + struct worker *worker;
> > +
> > + if (!atomic_dec_and_test(&pool->refcnt))
> > + return;
>
> If get_unbound_pool() happens here, it will get a destroyed pool,
> so we need to move "spin_lock_irq(&workqueue_lock);" before the statement above
> (and ->refcnt doesn't need to be atomic after the move).

Hmmm... right. Nice catch. Updating...
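
(Spelled out as a sketch: if the refcnt drops to zero outside of
workqueue_lock, a concurrent get_unbound_pool() can still find the pool
in unbound_pool_hash and bump the refcnt back up while
put_unbound_pool() goes on to destroy it.  Doing the put under
workqueue_lock closes the window, roughly:

	spin_lock_irq(&workqueue_lock);
	if (--pool->refcnt) {
		spin_unlock_irq(&workqueue_lock);
		return;
	}
	/* still under workqueue_lock: unhash so lookups can't find it */
	hash_del(&pool->hash_node);

which is how the updated patch does it.)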

> > + if (WARN_ON(pool->nr_workers != pool->nr_idle))
> > + return;
>
> This can trigger spuriously. We should remove this WARN_ON().

How would the test fail spuriously? Can you please elaborate?

Thanks.

--
tejun

2013-03-10 16:44:48

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 30/31] driver/base: implement subsys_virtual_register()

On Sun, Mar 10, 2013 at 04:57:02AM -0700, Tejun Heo wrote:
> Hey, guys.
>
> On Fri, Mar 08, 2013 at 01:04:25AM +0100, Kay Sievers wrote:
> > > Sorry for the delay, I'm at a conference all this week, and haven't had
> > > much time to think about this.
> > >
> > > If Kay says this is ok for now, that's good enough for me.
> >
> > Yes, it looks fine to me. If we provide the unified handling of
> > classes and buses some day, this can probably go away, but until that
> > it looks fine and is straight forward to do it that way,
>
> How should this be routed? I can take it but Kay needs it too so
> workqueue tree probably isn't the best fit although I can set up a
> separate branch if needed.

What patch set does Kay need it for? I have no objection for you to
take it through the workqueue tree:

Acked-by: Greg Kroah-Hartman <[email protected]>

2013-03-10 17:00:40

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH 30/31] driver/base: implement subsys_virtual_register()

On Sun, Mar 10, 2013 at 5:45 PM, Greg Kroah-Hartman
<[email protected]> wrote:
> On Sun, Mar 10, 2013 at 04:57:02AM -0700, Tejun Heo wrote:
>> Hey, guys.
>>
>> On Fri, Mar 08, 2013 at 01:04:25AM +0100, Kay Sievers wrote:
>> > > Sorry for the delay, I'm at a conference all this week, and haven't had
>> > > much time to think about this.
>> > >
>> > > If Kay says this is ok for now, that's good enough for me.
>> >
>> > Yes, it looks fine to me. If we provide the unified handling of
>> > classes and buses some day, this can probably go away, but until that
>> > it looks fine and is straight forward to do it that way,
>>
>> How should this be routed? I can take it but Kay needs it too so
>> workqueue tree probably isn't the best fit although I can set up a
>> separate branch if needed.
>
> What patch set does Kay need it for? I have no objection for you to
> take it through the workqueue tree:

The dbus bus has the same issues and needs the devices put under
virtual/ and not the devices/ root.

Kay

2013-03-10 17:23:49

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 30/31] driver/base: implement subsys_virtual_register()

On Sun, Mar 10, 2013 at 06:00:18PM +0100, Kay Sievers wrote:
> On Sun, Mar 10, 2013 at 5:45 PM, Greg Kroah-Hartman
> <[email protected]> wrote:
> > On Sun, Mar 10, 2013 at 04:57:02AM -0700, Tejun Heo wrote:
> >> Hey, guys.
> >>
> >> On Fri, Mar 08, 2013 at 01:04:25AM +0100, Kay Sievers wrote:
> >> > > Sorry for the delay, I'm at a conference all this week, and haven't had
> >> > > much time to think about this.
> >> > >
> >> > > If Kay says this is ok for now, that's good enough for me.
> >> >
> >> > Yes, it looks fine to me. If we provide the unified handling of
> >> > classes and buses some day, this can probably go away, but until that
> >> > it looks fine and is straight forward to do it that way,
> >>
> >> How should this be routed? I can take it but Kay needs it too so
> >> workqueue tree probably isn't the best fit although I can set up a
> >> separate branch if needed.
> >
> > What patch set does Kay need it for? I have no objection for you to
> > take it through the workqueue tree:
>
> The dbus bus has the same issues and needs the devices put under
> virtual/ and not the devices/ root.

Yes, but I can keep Tejun's patch in my local queue for now, dbus is
going to not make 3.10, right?

thanks,

greg k-h

2013-03-10 17:50:58

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH 30/31] driver/base: implement subsys_virtual_register()

On Sun, Mar 10, 2013 at 6:24 PM, Greg Kroah-Hartman
<[email protected]> wrote:
> On Sun, Mar 10, 2013 at 06:00:18PM +0100, Kay Sievers wrote:
>> On Sun, Mar 10, 2013 at 5:45 PM, Greg Kroah-Hartman
>> <[email protected]> wrote:
>> > On Sun, Mar 10, 2013 at 04:57:02AM -0700, Tejun Heo wrote:
>> >> Hey, guys.
>> >>
>> >> On Fri, Mar 08, 2013 at 01:04:25AM +0100, Kay Sievers wrote:
>> >> > > Sorry for the delay, I'm at a conference all this week, and haven't had
>> >> > > much time to think about this.
>> >> > >
>> >> > > If Kay says this is ok for now, that's good enough for me.
>> >> >
>> >> > Yes, it looks fine to me. If we provide the unified handling of
>> >> > classes and buses some day, this can probably go away, but until that
>> >> > it looks fine and is straight forward to do it that way,
>> >>
>> >> How should this be routed? I can take it but Kay needs it too so
>> >> workqueue tree probably isn't the best fit although I can set up a
>> >> separate branch if needed.
>> >
>> > What patch set does Kay need it for? I have no objection for you to
>> > take it through the workqueue tree:
>>
>> The dbus bus has the same issues and needs the devices put under
>> virtual/ and not the devices/ root.
>
> Yes, but I can keep Tejun's patch in my local queue for now, dbus is
> going to not make 3.10, right?

No, sure not. It's just something we will need there too, but there is
no hurry, it's only a cosmetic issue anyway and nothing that matters
functionality-wise.

Kay

2013-03-10 18:34:07

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 30/31] driver/base: implement subsys_virtual_register()

On Sun, Mar 10, 2013 at 06:50:34PM +0100, Kay Sievers wrote:
> > Yes, but I can keep Tejun's patch in my local queue for now, dbus is
> > going to not make 3.10, right?
>
> No, sure not. It's just something we will need there too, but there is
> no hurry, it's only a cosmetic issue anyway and nothing that matters
> functionality-wise.

In that case, I'll just route it together with the rest of workqueue
changes with Greg's ack added.

Thanks!

--
tejun

2013-03-10 18:36:59

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 17/31] workqueue: implement attribute-based unbound worker_pool management

On Sun, Mar 10, 2013 at 05:58:21AM -0700, Tejun Heo wrote:
> > > + if (WARN_ON(pool->nr_workers != pool->nr_idle))
> > > + return;
> >
> > This can trigger spuriously. We should remove this WARN_ON().
>
> How would the test fail spuriously? Can you please elaborate?

I got it. It'll be short by one if there's an active manager.
Removed.

Thanks.

--
tejun

2013-03-11 15:24:08

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-3.10-tmp] workqueue: implement workqueue with custom worker attributes

On Sun, Mar 10, 2013 at 05:01:13AM -0700, Tejun Heo wrote:
> Hey, Lai.
>
> On Sun, Mar 10, 2013 at 06:34:33PM +0800, Lai Jiangshan wrote:
> > > This patchset contains the following 31 patches.
> > >
> > > 0001-workqueue-make-sanity-checks-less-punshing-using-WAR.patch
> >
> > > 0002-workqueue-make-workqueue_lock-irq-safe.patch
> >
> > workqueue_lock protects too many things. We can introduce different locks
> > for different purpose later.
>
> I don't know. My general attitude toward locking is the simpler the
> better. None of the paths protected by workqueue_lock are hot.
> There's no actual benefit in making them finer grained.

Heh, I need to make workqueues and pools protected by a mutex rather
than spinlock, so I'm breaking out the locking after all. This is
gonna be a separate series of patches and it seems like there are
gonna be three locks - wq_mutex (pool and workqueues), pwq_lock
(spinlock protecting pwqs), wq_mayday_lock (lock for the mayday list).
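
Roughly, as declarations (just a sketch of the names mentioned above,
not the final code):

	static DEFINE_MUTEX(wq_mutex);		/* protects pools and workqueues list */
	static DEFINE_SPINLOCK(pwq_lock);	/* protects pool_workqueues */
	static DEFINE_SPINLOCK(wq_mayday_lock);	/* protects wq->maydays list */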

Thanks.

--
tejun

2013-03-11 15:40:30

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-3.10-tmp] workqueue: implement workqueue with custom worker attributes

On Mon, Mar 11, 2013 at 11:24 PM, Tejun Heo <[email protected]> wrote:
> On Sun, Mar 10, 2013 at 05:01:13AM -0700, Tejun Heo wrote:
>> Hey, Lai.
>>
>> On Sun, Mar 10, 2013 at 06:34:33PM +0800, Lai Jiangshan wrote:
>> > > This patchset contains the following 31 patches.
>> > >
>> > > 0001-workqueue-make-sanity-checks-less-punshing-using-WAR.patch
>> >
>> > > 0002-workqueue-make-workqueue_lock-irq-safe.patch
>> >
>> > workqueue_lock protects too many things. We can introduce different locks
>> > for different purpose later.
>>
>> I don't know. My general attitude toward locking is the simpler the
>> better. None of the paths protected by workqueue_lock are hot.
>> There's no actual benefit in making them finer grained.
>
> Heh, I need to make workqueues and pools protected by a mutex rather
> than spinlock, so I'm breaking out the locking after all. This is
> gonna be a separate series of patches and it seems like there are
> gonna be three locks - wq_mutex (pool and workqueues), pwq_lock
> (spinlock protecting pwqs), wq_mayday_lock (lock for the mayday list).

Glad to hear this.
wq_mayday_lock is needed at the very least; spin_lock_irq(workqueue_lock)
with a long loop in its critical section hurts RT people.

Thanks,
Lai

>
> Thanks.
>
> --
> tejun

2013-03-11 15:42:30

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-3.10-tmp] workqueue: implement workqueue with custom worker attributes

On Sun, Mar 10, 2013 at 8:01 PM, Tejun Heo <[email protected]> wrote:
> Hey, Lai.
>
> On Sun, Mar 10, 2013 at 06:34:33PM +0800, Lai Jiangshan wrote:
>> > This patchset contains the following 31 patches.
>> >
>> > 0001-workqueue-make-sanity-checks-less-punshing-using-WAR.patch
>>
>> > 0002-workqueue-make-workqueue_lock-irq-safe.patch
>>
>> workqueue_lock protects too many things. We can introduce different locks
>> for different purpose later.
>
> I don't know. My general attitude toward locking is the simpler the
> better. None of the paths protected by workqueue_lock are hot.
> There's no actual benefit in making them finer grained.
>
>> > 0023-workqueue-implement-get-put_pwq.patch
>>
>> I guess this patch and patch25 may have very deep issue VS RCU.
>
> Hmmm... scary. I suppose you're gonna elaborate on the review of the
> actual patch?
>
>> > 0024-workqueue-prepare-flush_workqueue-for-dynamic-creati.patch
>> > 0025-workqueue-perform-non-reentrancy-test-when-queueing-.patch
>> > 0026-workqueue-implement-apply_workqueue_attrs.patch
>> > 0027-workqueue-make-it-clear-that-WQ_DRAINING-is-an-inter.patch
>> > 0028-workqueue-reject-increasing-max_active-for-ordered-w.patch
>> > 0029-cpumask-implement-cpumask_parse.patch
>> > 0030-driver-base-implement-subsys_virtual_register.patch
>> > 0031-workqueue-implement-sysfs-interface-for-workqueues.patch
>>
>>
>> for 1~13,15~22,26~28, please add Reviewed-by: Lai Jiangshan <[email protected]>

OK. Also add my Reviewed-by to 23~25.

>
> Done.
>

I didn't see an updated branch in your tree.

Thanks,
Lai

> Thanks.
>
> --
> tejun

2013-03-11 15:44:03

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-3.10-tmp] workqueue: implement workqueue with custom worker attributes

On Mon, Mar 11, 2013 at 11:42:29PM +0800, Lai Jiangshan wrote:
> >> for 1~13,15~22,26~28, please add Reviewed-by: Lai Jiangshan <[email protected]>
>
> OK. Also add my Reviewed-by to 23~25.
>
> >
> > Done.
> >
>
> I didn't see an updated branch in your tree.

Working on something else and also waiting on the
s/PF_THREAD_BOUND/PF_NO_SETAFFINITY/ patch as the whole series has
been rebased on top. I'll probably re-post in a few days.

Thanks!

--
tejun

2013-03-12 18:10:52

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-3.10-tmp] workqueue: implement workqueue with custom worker attributes

Hello, Lai.

On Sun, Mar 10, 2013 at 05:01:13AM -0700, Tejun Heo wrote:
> > > 0023-workqueue-implement-get-put_pwq.patch
> >
> > I guess this patch and patch25 may have very deep issue VS RCU.
>
> Hmmm... scary. I suppose you're gonna elaborate on the review of the
> actual patch?

What happened with this? Was it a false alarm?

--
tejun

2013-03-12 18:19:23

by Tejun Heo

[permalink] [raw]
Subject: [PATCH v2 14/31] workqueue: replace POOL_MANAGING_WORKERS flag with worker_pool->manager_mutex

From 45fcfc6da599d190fc6355f046ef1e63d89475e4 Mon Sep 17 00:00:00 2001
From: Tejun Heo <[email protected]>
Date: Tue, 12 Mar 2013 11:14:47 -0700

POOL_MANAGING_WORKERS is used to synchronize the manager role.
Synchronizing among workers doesn't need blocking and that's why it's
implemented as a flag.

It got converted to a mutex a while back to add blocking wait from CPU
hotplug path - 6037315269 ("workqueue: use mutex for global_cwq
manager exclusion"). Later it turned out that synchronization among
workers and cpu hotplug need to be done separately. Eventually,
POOL_MANAGING_WORKERS is restored and workqueue->manager_mutex got
morphed into workqueue->assoc_mutex - 552a37e936 ("workqueue: restore
POOL_MANAGING_WORKERS") and b2eb83d123 ("workqueue: rename
manager_mutex to assoc_mutex").

Now, we're gonna need to be able to lock out managers from
destroy_workqueue() to support multiple unbound pools with custom
attributes making it again necessary to be able to block on the
manager role. This patch replaces POOL_MANAGING_WORKERS with
worker_pool->manager_arb.

This patch doesn't introduce any behavior changes.

v2: s/manager_mutex/manager_arb/

Signed-off-by: Tejun Heo <[email protected]>
---
manager_mutex renamed to manager_arb. assoc_mutex will later be
renamed to manager_mutex along with other locking cleanups.

Thanks.

kernel/workqueue.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4638149..16f7f8d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -64,7 +64,6 @@ enum {
* create_worker() is in progress.
*/
POOL_MANAGE_WORKERS = 1 << 0, /* need to manage workers */
- POOL_MANAGING_WORKERS = 1 << 1, /* managing workers */
POOL_DISASSOCIATED = 1 << 2, /* cpu can't serve workers */
POOL_FREEZING = 1 << 3, /* freeze in progress */

@@ -145,6 +144,7 @@ struct worker_pool {
DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
/* L: hash of busy workers */

+ struct mutex manager_arb; /* manager arbitration */
struct mutex assoc_mutex; /* protect POOL_DISASSOCIATED */
struct ida worker_ida; /* L: for worker IDs */

@@ -706,7 +706,7 @@ static bool need_to_manage_workers(struct worker_pool *pool)
/* Do we have too many workers and should some go away? */
static bool too_many_workers(struct worker_pool *pool)
{
- bool managing = pool->flags & POOL_MANAGING_WORKERS;
+ bool managing = mutex_is_locked(&pool->manager_arb);
int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
int nr_busy = pool->nr_workers - nr_idle;

@@ -2029,19 +2029,17 @@ static bool manage_workers(struct worker *worker)
struct worker_pool *pool = worker->pool;
bool ret = false;

- if (pool->flags & POOL_MANAGING_WORKERS)
+ if (!mutex_trylock(&pool->manager_arb))
return ret;

- pool->flags |= POOL_MANAGING_WORKERS;
-
/*
* To simplify both worker management and CPU hotplug, hold off
* management while hotplug is in progress. CPU hotplug path can't
- * grab %POOL_MANAGING_WORKERS to achieve this because that can
- * lead to idle worker depletion (all become busy thinking someone
- * else is managing) which in turn can result in deadlock under
- * extreme circumstances. Use @pool->assoc_mutex to synchronize
- * manager against CPU hotplug.
+ * grab @pool->manager_arb to achieve this because that can lead to
+ * idle worker depletion (all become busy thinking someone else is
+ * managing) which in turn can result in deadlock under extreme
+ * circumstances. Use @pool->assoc_mutex to synchronize manager
+ * against CPU hotplug.
*
* assoc_mutex would always be free unless CPU hotplug is in
* progress. trylock first without dropping @pool->lock.
@@ -2077,8 +2075,8 @@ static bool manage_workers(struct worker *worker)
ret |= maybe_destroy_workers(pool);
ret |= maybe_create_worker(pool);

- pool->flags &= ~POOL_MANAGING_WORKERS;
mutex_unlock(&pool->assoc_mutex);
+ mutex_unlock(&pool->manager_arb);
return ret;
}

@@ -3806,6 +3804,7 @@ static int __init init_workqueues(void)
setup_timer(&pool->mayday_timer, pool_mayday_timeout,
(unsigned long)pool);

+ mutex_init(&pool->manager_arb);
mutex_init(&pool->assoc_mutex);
ida_init(&pool->worker_ida);

--
1.8.1.4

2013-03-12 18:20:20

by Tejun Heo

[permalink] [raw]
Subject: [PATCH v2 12/31] workqueue: update synchronization rules on workqueue->pwqs

From afa326a1a445265a90bad26177c5e2e4d523580a Mon Sep 17 00:00:00 2001
From: Tejun Heo <[email protected]>
Date: Tue, 12 Mar 2013 11:14:45 -0700

Make workqueue->pwqs protected by workqueue_lock for writes and
sched-RCU protected for reads. Lockdep assertions are added to
for_each_pwq() and first_pwq() and all their users are converted to
either hold workqueue_lock or disable preemption/irq.

alloc_and_link_pwqs() is updated to use list_add_tail_rcu() for
consistency which isn't strictly necessary as the workqueue isn't
visible. destroy_workqueue() isn't updated to sched-RCU release pwqs.
This is okay as the workqueue should have no users left by that point.

The locking is superfluous at this point. This is to help
implementation of unbound pools/pwqs with custom attributes.

This patch doesn't introduce any behavior changes.

v2: Updated for_each_pwq() to use if/else for the hidden assertion
statement instead of just if, as suggested by Lai. This avoids
confusing a following else clause.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 87 +++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 70 insertions(+), 17 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 577ac71..e060ff2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -42,6 +42,7 @@
#include <linux/lockdep.h>
#include <linux/idr.h>
#include <linux/hashtable.h>
+#include <linux/rculist.h>

#include "workqueue_internal.h"

@@ -118,6 +119,8 @@ enum {
* F: wq->flush_mutex protected.
*
* W: workqueue_lock protected.
+ *
+ * R: workqueue_lock protected for writes. Sched-RCU protected for reads.
*/

/* struct worker is defined in workqueue_internal.h */
@@ -169,7 +172,7 @@ struct pool_workqueue {
int nr_active; /* L: nr of active works */
int max_active; /* L: max active works */
struct list_head delayed_works; /* L: delayed works */
- struct list_head pwqs_node; /* I: node on wq->pwqs */
+ struct list_head pwqs_node; /* R: node on wq->pwqs */
struct list_head mayday_node; /* W: node on wq->maydays */
} __aligned(1 << WORK_STRUCT_FLAG_BITS);

@@ -189,7 +192,7 @@ struct wq_flusher {
struct workqueue_struct {
unsigned int flags; /* W: WQ_* flags */
struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwq's */
- struct list_head pwqs; /* I: all pwqs of this wq */
+ struct list_head pwqs; /* R: all pwqs of this wq */
struct list_head list; /* W: list of all workqueues */

struct mutex flush_mutex; /* protects wq flushing */
@@ -227,6 +230,11 @@ EXPORT_SYMBOL_GPL(system_freezable_wq);
#define CREATE_TRACE_POINTS
#include <trace/events/workqueue.h>

+#define assert_rcu_or_wq_lock() \
+ rcu_lockdep_assert(rcu_read_lock_sched_held() || \
+ lockdep_is_held(&workqueue_lock), \
+ "sched RCU or workqueue lock should be held")
+
#define for_each_std_worker_pool(pool, cpu) \
for ((pool) = &std_worker_pools(cpu)[0]; \
(pool) < &std_worker_pools(cpu)[NR_STD_WORKER_POOLS]; (pool)++)
@@ -282,9 +290,18 @@ static inline int __next_wq_cpu(int cpu, const struct cpumask *mask,
* for_each_pwq - iterate through all pool_workqueues of the specified workqueue
* @pwq: iteration cursor
* @wq: the target workqueue
+ *
+ * This must be called either with workqueue_lock held or sched RCU read
+ * locked. If the pwq needs to be used beyond the locking in effect, the
+ * caller is responsible for guaranteeing that the pwq stays online.
+ *
+ * The if/else clause exists only for the lockdep assertion and can be
+ * ignored.
*/
#define for_each_pwq(pwq, wq) \
- list_for_each_entry((pwq), &(wq)->pwqs, pwqs_node)
+ list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node) \
+ if (({ assert_rcu_or_wq_lock(); false; })) { } \
+ else

#ifdef CONFIG_DEBUG_OBJECTS_WORK

@@ -463,9 +480,19 @@ static struct worker_pool *get_std_worker_pool(int cpu, bool highpri)
return &pools[highpri];
}

+/**
+ * first_pwq - return the first pool_workqueue of the specified workqueue
+ * @wq: the target workqueue
+ *
+ * This must be called either with workqueue_lock held or sched RCU read
+ * locked. If the pwq needs to be used beyond the locking in effect, the
+ * caller is responsible for guaranteeing that the pwq stays online.
+ */
static struct pool_workqueue *first_pwq(struct workqueue_struct *wq)
{
- return list_first_entry(&wq->pwqs, struct pool_workqueue, pwqs_node);
+ assert_rcu_or_wq_lock();
+ return list_first_or_null_rcu(&wq->pwqs, struct pool_workqueue,
+ pwqs_node);
}

static unsigned int work_color_to_flags(int color)
@@ -2486,10 +2513,12 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
atomic_set(&wq->nr_pwqs_to_flush, 1);
}

+ local_irq_disable();
+
for_each_pwq(pwq, wq) {
struct worker_pool *pool = pwq->pool;

- spin_lock_irq(&pool->lock);
+ spin_lock(&pool->lock);

if (flush_color >= 0) {
WARN_ON_ONCE(pwq->flush_color != -1);
@@ -2506,9 +2535,11 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
pwq->work_color = work_color;
}

- spin_unlock_irq(&pool->lock);
+ spin_unlock(&pool->lock);
}

+ local_irq_enable();
+
if (flush_color >= 0 && atomic_dec_and_test(&wq->nr_pwqs_to_flush))
complete(&wq->first_flusher->done);

@@ -2699,12 +2730,14 @@ void drain_workqueue(struct workqueue_struct *wq)
reflush:
flush_workqueue(wq);

+ local_irq_disable();
+
for_each_pwq(pwq, wq) {
bool drained;

- spin_lock_irq(&pwq->pool->lock);
+ spin_lock(&pwq->pool->lock);
drained = !pwq->nr_active && list_empty(&pwq->delayed_works);
- spin_unlock_irq(&pwq->pool->lock);
+ spin_unlock(&pwq->pool->lock);

if (drained)
continue;
@@ -2713,13 +2746,17 @@ reflush:
(flush_cnt % 100 == 0 && flush_cnt <= 1000))
pr_warn("workqueue %s: flush on destruction isn't complete after %u tries\n",
wq->name, flush_cnt);
+
+ local_irq_enable();
goto reflush;
}

- spin_lock_irq(&workqueue_lock);
+ spin_lock(&workqueue_lock);
if (!--wq->nr_drainers)
wq->flags &= ~WQ_DRAINING;
- spin_unlock_irq(&workqueue_lock);
+ spin_unlock(&workqueue_lock);
+
+ local_irq_enable();
}
EXPORT_SYMBOL_GPL(drain_workqueue);

@@ -3085,7 +3122,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
per_cpu_ptr(wq->cpu_pwqs, cpu);

pwq->pool = get_std_worker_pool(cpu, highpri);
- list_add_tail(&pwq->pwqs_node, &wq->pwqs);
+ list_add_tail_rcu(&pwq->pwqs_node, &wq->pwqs);
}
} else {
struct pool_workqueue *pwq;
@@ -3095,7 +3132,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
return -ENOMEM;

pwq->pool = get_std_worker_pool(WORK_CPU_UNBOUND, highpri);
- list_add_tail(&pwq->pwqs_node, &wq->pwqs);
+ list_add_tail_rcu(&pwq->pwqs_node, &wq->pwqs);
}

return 0;
@@ -3172,6 +3209,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
if (alloc_and_link_pwqs(wq) < 0)
goto err;

+ local_irq_disable();
for_each_pwq(pwq, wq) {
BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
pwq->wq = wq;
@@ -3180,6 +3218,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
INIT_LIST_HEAD(&pwq->delayed_works);
INIT_LIST_HEAD(&pwq->mayday_node);
}
+ local_irq_enable();

if (flags & WQ_RESCUER) {
struct worker *rescuer;
@@ -3237,24 +3276,32 @@ void destroy_workqueue(struct workqueue_struct *wq)
/* drain it before proceeding with destruction */
drain_workqueue(wq);

+ spin_lock_irq(&workqueue_lock);
+
/* sanity checks */
for_each_pwq(pwq, wq) {
int i;

- for (i = 0; i < WORK_NR_COLORS; i++)
- if (WARN_ON(pwq->nr_in_flight[i]))
+ for (i = 0; i < WORK_NR_COLORS; i++) {
+ if (WARN_ON(pwq->nr_in_flight[i])) {
+ spin_unlock_irq(&workqueue_lock);
return;
+ }
+ }
+
if (WARN_ON(pwq->nr_active) ||
- WARN_ON(!list_empty(&pwq->delayed_works)))
+ WARN_ON(!list_empty(&pwq->delayed_works))) {
+ spin_unlock_irq(&workqueue_lock);
return;
+ }
}

/*
* wq list is used to freeze wq, remove from list after
* flushing is complete in case freeze races us.
*/
- spin_lock_irq(&workqueue_lock);
list_del(&wq->list);
+
spin_unlock_irq(&workqueue_lock);

if (wq->flags & WQ_RESCUER) {
@@ -3338,13 +3385,19 @@ EXPORT_SYMBOL_GPL(workqueue_set_max_active);
bool workqueue_congested(int cpu, struct workqueue_struct *wq)
{
struct pool_workqueue *pwq;
+ bool ret;
+
+ preempt_disable();

if (!(wq->flags & WQ_UNBOUND))
pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
else
pwq = first_pwq(wq);

- return !list_empty(&pwq->delayed_works);
+ ret = !list_empty(&pwq->delayed_works);
+ preempt_enable();
+
+ return ret;
}
EXPORT_SYMBOL_GPL(workqueue_congested);

--
1.8.1.4

2013-03-12 18:20:50

by Tejun Heo

[permalink] [raw]
Subject: [PATCH v2 13/31] workqueue: update synchronization rules on worker_pool_idr

From 854a1fa808ed339280eb61b31f18c496c56c7796 Mon Sep 17 00:00:00 2001
From: Tejun Heo <[email protected]>
Date: Tue, 12 Mar 2013 11:14:45 -0700

Make worker_pool_idr protected by workqueue_lock for writes and
sched-RCU protected for reads. Lockdep assertions are added to
for_each_pool() and get_work_pool() and all their users are converted
to either hold workqueue_lock or disable preemption/irq.

worker_pool_assign_id() is updated to hold workqueue_lock when
allocating a pool ID. As idr_get_new() always performs RCU-safe
assignment, this is enough on the writer side.

As standard pools are never destroyed, there's nothing to do on that
side.

The locking is superfluous at this point. This is to help
implementation of unbound pools/pwqs with custom attributes.

This patch doesn't introduce any behavior changes.

v2: Updated for_each_pool() to use if/else for the hidden assertion
statement instead of just if, as suggested by Lai. This avoids
confusing a following else clause.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 71 +++++++++++++++++++++++++++++++++++-------------------
1 file changed, 46 insertions(+), 25 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e060ff2..4638149 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -282,9 +282,18 @@ static inline int __next_wq_cpu(int cpu, const struct cpumask *mask,
* for_each_pool - iterate through all worker_pools in the system
* @pool: iteration cursor
* @id: integer used for iteration
+ *
+ * This must be called either with workqueue_lock held or sched RCU read
+ * locked. If the pool needs to be used beyond the locking in effect, the
+ * caller is responsible for guaranteeing that the pool stays online.
+ *
+ * The if/else clause exists only for the lockdep assertion and can be
+ * ignored.
*/
#define for_each_pool(pool, id) \
- idr_for_each_entry(&worker_pool_idr, pool, id)
+ idr_for_each_entry(&worker_pool_idr, pool, id) \
+ if (({ assert_rcu_or_wq_lock(); false; })) { } \
+ else

/**
* for_each_pwq - iterate through all pool_workqueues of the specified workqueue
@@ -432,8 +441,10 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
cpu_std_worker_pools);
static struct worker_pool unbound_std_worker_pools[NR_STD_WORKER_POOLS];

-/* idr of all pools */
-static DEFINE_MUTEX(worker_pool_idr_mutex);
+/*
+ * idr of all pools. Modifications are protected by workqueue_lock. Read
+ * accesses are protected by sched-RCU protected.
+ */
static DEFINE_IDR(worker_pool_idr);

static int worker_thread(void *__worker);
@@ -456,21 +467,16 @@ static int worker_pool_assign_id(struct worker_pool *pool)
{
int ret;

- mutex_lock(&worker_pool_idr_mutex);
- idr_pre_get(&worker_pool_idr, GFP_KERNEL);
- ret = idr_get_new(&worker_pool_idr, pool, &pool->id);
- mutex_unlock(&worker_pool_idr_mutex);
+ do {
+ if (!idr_pre_get(&worker_pool_idr, GFP_KERNEL))
+ return -ENOMEM;

- return ret;
-}
+ spin_lock_irq(&workqueue_lock);
+ ret = idr_get_new(&worker_pool_idr, pool, &pool->id);
+ spin_unlock_irq(&workqueue_lock);
+ } while (ret == -EAGAIN);

-/*
- * Lookup worker_pool by id. The idr currently is built during boot and
- * never modified. Don't worry about locking for now.
- */
-static struct worker_pool *worker_pool_by_id(int pool_id)
-{
- return idr_find(&worker_pool_idr, pool_id);
+ return ret;
}

static struct worker_pool *get_std_worker_pool(int cpu, bool highpri)
@@ -586,13 +592,23 @@ static struct pool_workqueue *get_work_pwq(struct work_struct *work)
* @work: the work item of interest
*
* Return the worker_pool @work was last associated with. %NULL if none.
+ *
+ * Pools are created and destroyed under workqueue_lock, and allows read
+ * access under sched-RCU read lock. As such, this function should be
+ * called under workqueue_lock or with preemption disabled.
+ *
+ * All fields of the returned pool are accessible as long as the above
+ * mentioned locking is in effect. If the returned pool needs to be used
+ * beyond the critical section, the caller is responsible for ensuring the
+ * returned pool is and stays online.
*/
static struct worker_pool *get_work_pool(struct work_struct *work)
{
unsigned long data = atomic_long_read(&work->data);
- struct worker_pool *pool;
int pool_id;

+ assert_rcu_or_wq_lock();
+
if (data & WORK_STRUCT_PWQ)
return ((struct pool_workqueue *)
(data & WORK_STRUCT_WQ_DATA_MASK))->pool;
@@ -601,9 +617,7 @@ static struct worker_pool *get_work_pool(struct work_struct *work)
if (pool_id == WORK_OFFQ_POOL_NONE)
return NULL;

- pool = worker_pool_by_id(pool_id);
- WARN_ON_ONCE(!pool);
- return pool;
+ return idr_find(&worker_pool_idr, pool_id);
}

/**
@@ -2767,11 +2781,15 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr)
struct pool_workqueue *pwq;

might_sleep();
+
+ local_irq_disable();
pool = get_work_pool(work);
- if (!pool)
+ if (!pool) {
+ local_irq_enable();
return false;
+ }

- spin_lock_irq(&pool->lock);
+ spin_lock(&pool->lock);
/* see the comment in try_to_grab_pending() with the same code */
pwq = get_work_pwq(work);
if (pwq) {
@@ -3414,19 +3432,22 @@ EXPORT_SYMBOL_GPL(workqueue_congested);
*/
unsigned int work_busy(struct work_struct *work)
{
- struct worker_pool *pool = get_work_pool(work);
+ struct worker_pool *pool;
unsigned long flags;
unsigned int ret = 0;

if (work_pending(work))
ret |= WORK_BUSY_PENDING;

+ local_irq_save(flags);
+ pool = get_work_pool(work);
if (pool) {
- spin_lock_irqsave(&pool->lock, flags);
+ spin_lock(&pool->lock);
if (find_worker_executing_work(pool, work))
ret |= WORK_BUSY_RUNNING;
- spin_unlock_irqrestore(&pool->lock, flags);
+ spin_unlock(&pool->lock);
}
+ local_irq_restore(flags);

return ret;
}
--
1.8.1.4

2013-03-12 18:21:35

by Tejun Heo

[permalink] [raw]
Subject: [PATCH v2 17/31] workqueue: implement attribute-based unbound worker_pool management

From 9b5057946252dce279acdd899d9b1513671a1ff7 Mon Sep 17 00:00:00 2001
From: Tejun Heo <[email protected]>
Date: Tue, 12 Mar 2013 11:14:49 -0700

This patch makes unbound worker_pools reference counted and
dynamically created and destroyed as workqueues needing them come and
go. All unbound worker_pools are hashed on unbound_pool_hash which is
keyed by the content of worker_pool->attrs.

When an unbound workqueue is allocated, get_unbound_pool() is called
with the attributes of the workqueue. If there already is a matching
worker_pool, the reference count is bumped and the pool is returned.
If not, a new worker_pool with matching attributes is created and
returned.

When an unbound workqueue is destroyed, put_unbound_pool() is called
which decrements the reference count of the associated worker_pool.
If the refcnt reaches zero, the worker_pool is destroyed in sched-RCU
safe way.

Note that the standard unbound worker_pools - normal and highpri ones
with no specific cpumask affinity - are no longer created explicitly
during init_workqueues(). init_workqueues() only initializes
workqueue_attrs to be used for standard unbound pools -
unbound_std_wq_attrs[]. The pools are spawned on demand as workqueues
are created.

v2: - Comment added to init_worker_pool() explaining that @pool should
be in a condition which can be passed to put_unbound_pool() even
on failure.

- pool->refcnt reaching zero and the pool being removed from
unbound_pool_hash should be atomic. pool->refcnt is converted
to int from atomic_t and now manipulated inside workqueue_lock.

- Removed an incorrect sanity check on nr_idle in
put_unbound_pool() which may trigger spuriously.

All changes were suggested by Lai Jiangshan.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 237 ++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 224 insertions(+), 13 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b0d3cbb..7203956 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -41,6 +41,7 @@
#include <linux/debug_locks.h>
#include <linux/lockdep.h>
#include <linux/idr.h>
+#include <linux/jhash.h>
#include <linux/hashtable.h>
#include <linux/rculist.h>

@@ -80,6 +81,7 @@ enum {

NR_STD_WORKER_POOLS = 2, /* # standard pools per cpu */

+ UNBOUND_POOL_HASH_ORDER = 6, /* hashed by pool->attrs */
BUSY_WORKER_HASH_ORDER = 6, /* 64 pointers */

MAX_IDLE_WORKERS_RATIO = 4, /* 1/4 of busy can be idle */
@@ -149,6 +151,8 @@ struct worker_pool {
struct ida worker_ida; /* L: for worker IDs */

struct workqueue_attrs *attrs; /* I: worker attributes */
+ struct hlist_node hash_node; /* R: unbound_pool_hash node */
+ int refcnt; /* refcnt for unbound pools */

/*
* The current concurrency level. As it's likely to be accessed
@@ -156,6 +160,12 @@ struct worker_pool {
* cacheline.
*/
atomic_t nr_running ____cacheline_aligned_in_smp;
+
+ /*
+ * Destruction of pool is sched-RCU protected to allow dereferences
+ * from get_work_pool().
+ */
+ struct rcu_head rcu;
} ____cacheline_aligned_in_smp;

/*
@@ -218,6 +228,11 @@ struct workqueue_struct {

static struct kmem_cache *pwq_cache;

+/* hash of all unbound pools keyed by pool->attrs */
+static DEFINE_HASHTABLE(unbound_pool_hash, UNBOUND_POOL_HASH_ORDER);
+
+static struct workqueue_attrs *unbound_std_wq_attrs[NR_STD_WORKER_POOLS];
+
struct workqueue_struct *system_wq __read_mostly;
EXPORT_SYMBOL_GPL(system_wq);
struct workqueue_struct *system_highpri_wq __read_mostly;
@@ -1742,7 +1757,7 @@ static struct worker *create_worker(struct worker_pool *pool)
worker->pool = pool;
worker->id = id;

- if (pool->cpu != WORK_CPU_UNBOUND)
+ if (pool->cpu >= 0)
worker->task = kthread_create_on_node(worker_thread,
worker, cpu_to_node(pool->cpu),
"kworker/%d:%d%s", pool->cpu, id, pri);
@@ -3161,16 +3176,68 @@ fail:
return NULL;
}

+static void copy_workqueue_attrs(struct workqueue_attrs *to,
+ const struct workqueue_attrs *from)
+{
+ to->nice = from->nice;
+ cpumask_copy(to->cpumask, from->cpumask);
+}
+
+/*
+ * Hacky implementation of jhash of bitmaps which only considers the
+ * specified number of bits. We probably want a proper implementation in
+ * include/linux/jhash.h.
+ */
+static u32 jhash_bitmap(const unsigned long *bitmap, int bits, u32 hash)
+{
+ int nr_longs = bits / BITS_PER_LONG;
+ int nr_leftover = bits % BITS_PER_LONG;
+ unsigned long leftover = 0;
+
+ if (nr_longs)
+ hash = jhash(bitmap, nr_longs * sizeof(long), hash);
+ if (nr_leftover) {
+ bitmap_copy(&leftover, bitmap + nr_longs, nr_leftover);
+ hash = jhash(&leftover, sizeof(long), hash);
+ }
+ return hash;
+}
+
+/* hash value of the content of @attr */
+static u32 wqattrs_hash(const struct workqueue_attrs *attrs)
+{
+ u32 hash = 0;
+
+ hash = jhash_1word(attrs->nice, hash);
+ hash = jhash_bitmap(cpumask_bits(attrs->cpumask), nr_cpu_ids, hash);
+ return hash;
+}
+
+/* content equality test */
+static bool wqattrs_equal(const struct workqueue_attrs *a,
+ const struct workqueue_attrs *b)
+{
+ if (a->nice != b->nice)
+ return false;
+ if (!cpumask_equal(a->cpumask, b->cpumask))
+ return false;
+ return true;
+}
+
/**
* init_worker_pool - initialize a newly zalloc'd worker_pool
* @pool: worker_pool to initialize
*
* Initiailize a newly zalloc'd @pool. It also allocates @pool->attrs.
- * Returns 0 on success, -errno on failure.
+ * Returns 0 on success, -errno on failure. Even on failure, all fields
+ * inside @pool proper are initialized and put_unbound_pool() can be called
+ * on @pool safely to release it.
*/
static int init_worker_pool(struct worker_pool *pool)
{
spin_lock_init(&pool->lock);
+ pool->id = -1;
+ pool->cpu = -1;
pool->flags |= POOL_DISASSOCIATED;
INIT_LIST_HEAD(&pool->worklist);
INIT_LIST_HEAD(&pool->idle_list);
@@ -3187,12 +3254,136 @@ static int init_worker_pool(struct worker_pool *pool)
mutex_init(&pool->assoc_mutex);
ida_init(&pool->worker_ida);

+ INIT_HLIST_NODE(&pool->hash_node);
+ pool->refcnt = 1;
+
+ /* shouldn't fail above this point */
pool->attrs = alloc_workqueue_attrs(GFP_KERNEL);
if (!pool->attrs)
return -ENOMEM;
return 0;
}

+static void rcu_free_pool(struct rcu_head *rcu)
+{
+ struct worker_pool *pool = container_of(rcu, struct worker_pool, rcu);
+
+ ida_destroy(&pool->worker_ida);
+ free_workqueue_attrs(pool->attrs);
+ kfree(pool);
+}
+
+/**
+ * put_unbound_pool - put a worker_pool
+ * @pool: worker_pool to put
+ *
+ * Put @pool. If its refcnt reaches zero, it gets destroyed in sched-RCU
+ * safe manner.
+ */
+static void put_unbound_pool(struct worker_pool *pool)
+{
+ struct worker *worker;
+
+ spin_lock_irq(&workqueue_lock);
+ if (--pool->refcnt) {
+ spin_unlock_irq(&workqueue_lock);
+ return;
+ }
+
+ /* sanity checks */
+ if (WARN_ON(!(pool->flags & POOL_DISASSOCIATED)) ||
+ WARN_ON(!list_empty(&pool->worklist))) {
+ spin_unlock_irq(&workqueue_lock);
+ return;
+ }
+
+ /* release id and unhash */
+ if (pool->id >= 0)
+ idr_remove(&worker_pool_idr, pool->id);
+ hash_del(&pool->hash_node);
+
+ spin_unlock_irq(&workqueue_lock);
+
+ /* lock out manager and destroy all workers */
+ mutex_lock(&pool->manager_mutex);
+ spin_lock_irq(&pool->lock);
+
+ while ((worker = first_worker(pool)))
+ destroy_worker(worker);
+ WARN_ON(pool->nr_workers || pool->nr_idle);
+
+ spin_unlock_irq(&pool->lock);
+ mutex_unlock(&pool->manager_mutex);
+
+ /* shut down the timers */
+ del_timer_sync(&pool->idle_timer);
+ del_timer_sync(&pool->mayday_timer);
+
+ /* sched-RCU protected to allow dereferences from get_work_pool() */
+ call_rcu_sched(&pool->rcu, rcu_free_pool);
+}
+
+/**
+ * get_unbound_pool - get a worker_pool with the specified attributes
+ * @attrs: the attributes of the worker_pool to get
+ *
+ * Obtain a worker_pool which has the same attributes as @attrs, bump the
+ * reference count and return it. If there already is a matching
+ * worker_pool, it will be used; otherwise, this function attempts to
+ * create a new one. On failure, returns NULL.
+ */
+static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
+{
+ static DEFINE_MUTEX(create_mutex);
+ u32 hash = wqattrs_hash(attrs);
+ struct worker_pool *pool;
+ struct worker *worker;
+
+ mutex_lock(&create_mutex);
+
+ /* do we already have a matching pool? */
+ spin_lock_irq(&workqueue_lock);
+ hash_for_each_possible(unbound_pool_hash, pool, hash_node, hash) {
+ if (wqattrs_equal(pool->attrs, attrs)) {
+ pool->refcnt++;
+ goto out_unlock;
+ }
+ }
+ spin_unlock_irq(&workqueue_lock);
+
+ /* nope, create a new one */
+ pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+ if (!pool || init_worker_pool(pool) < 0)
+ goto fail;
+
+ copy_workqueue_attrs(pool->attrs, attrs);
+
+ if (worker_pool_assign_id(pool) < 0)
+ goto fail;
+
+ /* create and start the initial worker */
+ worker = create_worker(pool);
+ if (!worker)
+ goto fail;
+
+ spin_lock_irq(&pool->lock);
+ start_worker(worker);
+ spin_unlock_irq(&pool->lock);
+
+ /* install */
+ spin_lock_irq(&workqueue_lock);
+ hash_add(unbound_pool_hash, &pool->hash_node, hash);
+out_unlock:
+ spin_unlock_irq(&workqueue_lock);
+ mutex_unlock(&create_mutex);
+ return pool;
+fail:
+ mutex_unlock(&create_mutex);
+ if (pool)
+ put_unbound_pool(pool);
+ return NULL;
+}
+
static int alloc_and_link_pwqs(struct workqueue_struct *wq)
{
bool highpri = wq->flags & WQ_HIGHPRI;
@@ -3217,7 +3408,12 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
if (!pwq)
return -ENOMEM;

- pwq->pool = get_std_worker_pool(WORK_CPU_UNBOUND, highpri);
+ pwq->pool = get_unbound_pool(unbound_std_wq_attrs[highpri]);
+ if (!pwq->pool) {
+ kmem_cache_free(pwq_cache, pwq);
+ return -ENOMEM;
+ }
+
list_add_tail_rcu(&pwq->pwqs_node, &wq->pwqs);
}

@@ -3395,6 +3591,15 @@ void destroy_workqueue(struct workqueue_struct *wq)
kfree(wq->rescuer);
}

+ /*
+ * We're the sole accessor of @wq at this point. Directly access
+ * the first pwq and put its pool.
+ */
+ if (wq->flags & WQ_UNBOUND) {
+ pwq = list_first_entry(&wq->pwqs, struct pool_workqueue,
+ pwqs_node);
+ put_unbound_pool(pwq->pool);
+ }
free_pwqs(wq);
kfree(wq);
}
@@ -3857,19 +4062,14 @@ static int __init init_workqueues(void)
hotcpu_notifier(workqueue_cpu_down_callback, CPU_PRI_WORKQUEUE_DOWN);

/* initialize CPU pools */
- for_each_wq_cpu(cpu) {
+ for_each_possible_cpu(cpu) {
struct worker_pool *pool;

i = 0;
for_each_std_worker_pool(pool, cpu) {
BUG_ON(init_worker_pool(pool));
pool->cpu = cpu;
-
- if (cpu != WORK_CPU_UNBOUND)
- cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
- else
- cpumask_setall(pool->attrs->cpumask);
-
+ cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
pool->attrs->nice = std_nice[i++];

/* alloc pool ID */
@@ -3878,14 +4078,13 @@ static int __init init_workqueues(void)
}

/* create the initial worker */
- for_each_online_wq_cpu(cpu) {
+ for_each_online_cpu(cpu) {
struct worker_pool *pool;

for_each_std_worker_pool(pool, cpu) {
struct worker *worker;

- if (cpu != WORK_CPU_UNBOUND)
- pool->flags &= ~POOL_DISASSOCIATED;
+ pool->flags &= ~POOL_DISASSOCIATED;

worker = create_worker(pool);
BUG_ON(!worker);
@@ -3895,6 +4094,18 @@ static int __init init_workqueues(void)
}
}

+ /* create default unbound wq attrs */
+ for (i = 0; i < NR_STD_WORKER_POOLS; i++) {
+ struct workqueue_attrs *attrs;
+
+ BUG_ON(!(attrs = alloc_workqueue_attrs(GFP_KERNEL)));
+
+ attrs->nice = std_nice[i];
+ cpumask_setall(attrs->cpumask);
+
+ unbound_std_wq_attrs[i] = attrs;
+ }
+
system_wq = alloc_workqueue("events", 0, 0);
system_highpri_wq = alloc_workqueue("events_highpri", WQ_HIGHPRI, 0);
system_long_wq = alloc_workqueue("events_long", 0, 0);
--
1.8.1.4

2013-03-12 18:34:18

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET wq/for-3.10-tmp] workqueue: implement workqueue with custom worker attributes

On Fri, Mar 01, 2013 at 07:23:51PM -0800, Tejun Heo wrote:
> Finally, here's the unbound workqueue with custom worker attributes
> patchset I've been talking about. The goal is simple. We want
> unbound workqueues with custom worker attributes with a mechanism to
> expose the knobs to userland.

Applied to wq/for-3.10 with the updated patches. Some patches needed
minor cosmetic updates because wq/for-3.9-fixes hasn't been merged
yet. If somebody wants the full patchset posted again, please holler.

Thanks.

--
tejun

2013-03-12 18:40:21

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 30/31] driver/base: implement subsys_virtual_register()

On Sun, Mar 10, 2013 at 11:34:00AM -0700, Tejun Heo wrote:
> On Sun, Mar 10, 2013 at 06:50:34PM +0100, Kay Sievers wrote:
> > > Yes, but I can keep Tejun's patch in my local queue for now, dbus is
> > > going to not make 3.10, right?
> >
> > No, sure not. It's just something we will need there too, but there is
> > no hurry, it's only a cosmetic issue anyway and nothing that matters
> > functionality-wise.
>
> In that case, I'll just route it together with the rest of workqueue
> changes with Greg's ack added.

The patch is applied to wq/for-3.10-subsys_virtual_register which
contains only this patch on top of v3.9-rc1. If somebody ever needs
it in this cycle, please feel free to pull the following branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-3.10-subsys_virtual_register

Thanks.

--
tejun