2010-06-14 21:41:58

by Tejun Heo

Subject: [PATCHSET] workqueue: concurrency managed workqueue, take#5

Hello, all.

This is the fifth take of cmwq (concurrency managed workqueue)
patchset. It's on top of v2.6.35-rc3 + sched/core patches. Git tree
is available at

git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-cmwq

Changes from the last take[L] are...

* fscache patches are omitted for now.

* The patchset is rebased on cpu_stop + sched/core, which now includes
all the necessary scheduler patches. cpu_stop already reimplements
stop_machine so that it doesn't use RT workqueue, so this patchset
simply drops RT wq support.

* __set_cpus_allowed() was determined to be unnecessary with recent
scheduler changes. On cpu re-onlining, cmwq now kills all idle
workers and tells busy ones to rebind after finishing the current
work by scheduling a dedicated rebind work. This maintains proper
cpu binding without adding overhead to the hot path.

* Oleg's clear work->data patch has been moved to the head of the
queue and now lives in the for-next branch, which will be pushed to
mainline in the next merge window.

* Applied Oleg's review.

* Comments updated as suggested.

* work_flags_to_color() replaced with get_work_color().

* Fixed a nr_cwqs_to_flush bug which could cause premature flush
completion.

* Replaced rewind + list_for_each_entry_safe_continue() with
list_for_each_entry_safe_from().

* Don't write directly to *work_data_bits(); use __set_bit() instead
(see the sketch after this list).

* Fixed cpu hotplug exclusion bug.

* Other misc tweaks.
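
For the *work_data_bits() item above, a minimal before/after sketch;
mark_linked() is a made-up helper and the LINKED bit is the one this
series adds later, so the snippet is purely illustrative:

#include <linux/bitops.h>
#include <linux/workqueue.h>

/* illustrative helper, not part of the patchset */
static void mark_linked(struct work_struct *work)
{
        /* earlier take: open-coded OR with the mask constant */
        /*   *work_data_bits(work) |= WORK_STRUCT_LINKED;     */

        /* this take: standard bitops helper with the bit position */
        __set_bit(WORK_STRUCT_LINKED_BIT, work_data_bits(work));
}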

Now that all scheduler bits are in place, I'll keep the tree stable
and publish it to linux-next soonish, so this is hopefully the last of
these exhausting massive postings of this patchset.

Jeff, Arjan, I think it'll be best to route the libata and async
patches through wq tree. Would that be okay?

This patchset contains the following patches.

0001-sched-consult-online-mask-instead-of-active-in-selec.patch
0002-sched-rename-preempt_notifiers-to-sched_notifiers-an.patch
0003-sched-refactor-try_to_wake_up.patch
0004-sched-implement-__set_cpus_allowed.patch
0005-sched-make-sched_notifiers-unconditional.patch
0006-sched-add-wakeup-sleep-sched_notifiers-and-allow-NUL.patch
0007-sched-implement-try_to_wake_up_local.patch
0008-workqueue-change-cancel_work_sync-to-clear-work-data.patch
0009-acpi-use-queue_work_on-instead-of-binding-workqueue-.patch
0010-stop_machine-reimplement-without-using-workqueue.patch
0011-workqueue-misc-cosmetic-updates.patch
0012-workqueue-merge-feature-parameters-into-flags.patch
0013-workqueue-define-masks-for-work-flags-and-conditiona.patch
0014-workqueue-separate-out-process_one_work.patch
0015-workqueue-temporarily-disable-workqueue-tracing.patch
0016-workqueue-kill-cpu_populated_map.patch
0017-workqueue-update-cwq-alignement.patch
0018-workqueue-reimplement-workqueue-flushing-using-color.patch
0019-workqueue-introduce-worker.patch
0020-workqueue-reimplement-work-flushing-using-linked-wor.patch
0021-workqueue-implement-per-cwq-active-work-limit.patch
0022-workqueue-reimplement-workqueue-freeze-using-max_act.patch
0023-workqueue-introduce-global-cwq-and-unify-cwq-locks.patch
0024-workqueue-implement-worker-states.patch
0025-workqueue-reimplement-CPU-hotplugging-support-using-.patch
0026-workqueue-make-single-thread-workqueue-shared-worker.patch
0027-workqueue-add-find_worker_executing_work-and-track-c.patch
0028-workqueue-carry-cpu-number-in-work-data-once-executi.patch
0029-workqueue-implement-WQ_NON_REENTRANT.patch
0030-workqueue-use-shared-worklist-and-pool-all-workers-p.patch
0031-workqueue-implement-concurrency-managed-dynamic-work.patch
0032-workqueue-increase-max_active-of-keventd-and-kill-cu.patch
0033-workqueue-add-system_wq-system_long_wq-and-system_nr.patch
0034-workqueue-implement-DEBUGFS-workqueue.patch
0035-workqueue-implement-several-utility-APIs.patch
0036-libata-take-advantage-of-cmwq-and-remove-concurrency.patch
0037-async-use-workqueue-for-worker-pool.patch
0038-fscache-convert-object-to-use-workqueue-instead-of-s.patch
0039-fscache-convert-operation-to-use-workqueue-instead-o.patch
0040-fscache-drop-references-to-slow-work.patch
0041-cifs-use-workqueue-instead-of-slow-work.patch
0042-gfs2-use-workqueue-instead-of-slow-work.patch
0043-slow-work-kill-it.patch

diffstat.

arch/ia64/kernel/smpboot.c | 2
arch/x86/kernel/smpboot.c | 2
drivers/acpi/osl.c | 40
drivers/ata/libata-core.c | 20
drivers/ata/libata-eh.c | 4
drivers/ata/libata-scsi.c | 10
drivers/ata/libata-sff.c | 9
drivers/ata/libata.h | 1
include/linux/cpu.h | 2
include/linux/kthread.h | 1
include/linux/libata.h | 1
include/linux/workqueue.h | 146 +
kernel/async.c | 140 -
kernel/kthread.c | 15
kernel/power/process.c | 21
kernel/trace/Kconfig | 4
kernel/workqueue.c | 3313 +++++++++++++++++++++++++++++++++++++++------
kernel/workqueue_sched.h | 13
lib/Kconfig.debug | 7
19 files changed, 3128 insertions(+), 623 deletions(-)

Thanks.

--
tejun

[L] http://thread.gmane.org/gmane.linux.kernel/954759


2010-06-14 21:39:16

by Tejun Heo

Subject: [PATCH 07/30] workqueue: separate out process_one_work()

Separate process_one_work() out of run_workqueue(). This patch
doesn't cause any behavior change.
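
For reference, the CONTEXT rule documented in the new function below
(enter with cwq->lock held, drop it around the callback, retake it
before returning) can be modelled in plain user-space C; the names
here are made up and this is not kernel code:

#include <pthread.h>
#include <stdio.h>

struct item {
        struct item *next;
        void (*fn)(struct item *);
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct item *worklist;

/* called with @lock held; released and re-acquired inside */
static void process_one(struct item *it)
{
        worklist = it->next;            /* claim: unlink the head under the lock */
        pthread_mutex_unlock(&lock);

        it->fn(it);                     /* run the callback without the lock */

        pthread_mutex_lock(&lock);
}

static void run_all(void)
{
        pthread_mutex_lock(&lock);
        while (worklist)
                process_one(worklist);
        pthread_mutex_unlock(&lock);
}

static void say_hi(struct item *it)
{
        printf("hi from %p\n", (void *)it);
}

int main(void)
{
        struct item a = { .next = NULL, .fn = say_hi };

        worklist = &a;
        run_all();      /* build with -pthread */
        return 0;
}

Dropping the lock around the callback is what lets a work function
sleep or queue further works without deadlocking against the worklist
lock.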

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 100 +++++++++++++++++++++++++++++++--------------------
1 files changed, 61 insertions(+), 39 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5c49d76..8e3082b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -402,51 +402,73 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
}
EXPORT_SYMBOL_GPL(queue_delayed_work_on);

+/**
+ * process_one_work - process single work
+ * @cwq: cwq to process work for
+ * @work: work to process
+ *
+ * Process @work. This function contains all the logic necessary to
+ * process a single work including synchronization against and
+ * interaction with other workers on the same cpu, queueing and
+ * flushing. As long as the context requirement is met, any worker can
+ * call this function to process a work.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock) which is released and regrabbed.
+ */
+static void process_one_work(struct cpu_workqueue_struct *cwq,
+ struct work_struct *work)
+{
+ work_func_t f = work->func;
+#ifdef CONFIG_LOCKDEP
+ /*
+ * It is permissible to free the struct work_struct from
+ * inside the function that is called from it, this we need to
+ * take into account for lockdep too. To avoid bogus "held
+ * lock freed" warnings as well as problems when looking into
+ * work->lockdep_map, make a copy and use that here.
+ */
+ struct lockdep_map lockdep_map = work->lockdep_map;
+#endif
+ /* claim and process */
+ trace_workqueue_execution(cwq->thread, work);
+ debug_work_deactivate(work);
+ cwq->current_work = work;
+ list_del_init(&work->entry);
+
+ spin_unlock_irq(&cwq->lock);
+
+ BUG_ON(get_wq_data(work) != cwq);
+ work_clear_pending(work);
+ lock_map_acquire(&cwq->wq->lockdep_map);
+ lock_map_acquire(&lockdep_map);
+ f(work);
+ lock_map_release(&lockdep_map);
+ lock_map_release(&cwq->wq->lockdep_map);
+
+ if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
+ printk(KERN_ERR "BUG: workqueue leaked lock or atomic: "
+ "%s/0x%08x/%d\n",
+ current->comm, preempt_count(), task_pid_nr(current));
+ printk(KERN_ERR " last function: ");
+ print_symbol("%s\n", (unsigned long)f);
+ debug_show_held_locks(current);
+ dump_stack();
+ }
+
+ spin_lock_irq(&cwq->lock);
+
+ /* we're done with it, release */
+ cwq->current_work = NULL;
+}
+
static void run_workqueue(struct cpu_workqueue_struct *cwq)
{
spin_lock_irq(&cwq->lock);
while (!list_empty(&cwq->worklist)) {
struct work_struct *work = list_entry(cwq->worklist.next,
struct work_struct, entry);
- work_func_t f = work->func;
-#ifdef CONFIG_LOCKDEP
- /*
- * It is permissible to free the struct work_struct
- * from inside the function that is called from it,
- * this we need to take into account for lockdep too.
- * To avoid bogus "held lock freed" warnings as well
- * as problems when looking into work->lockdep_map,
- * make a copy and use that here.
- */
- struct lockdep_map lockdep_map = work->lockdep_map;
-#endif
- trace_workqueue_execution(cwq->thread, work);
- debug_work_deactivate(work);
- cwq->current_work = work;
- list_del_init(cwq->worklist.next);
- spin_unlock_irq(&cwq->lock);
-
- BUG_ON(get_wq_data(work) != cwq);
- work_clear_pending(work);
- lock_map_acquire(&cwq->wq->lockdep_map);
- lock_map_acquire(&lockdep_map);
- f(work);
- lock_map_release(&lockdep_map);
- lock_map_release(&cwq->wq->lockdep_map);
-
- if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
- printk(KERN_ERR "BUG: workqueue leaked lock or atomic: "
- "%s/0x%08x/%d\n",
- current->comm, preempt_count(),
- task_pid_nr(current));
- printk(KERN_ERR " last function: ");
- print_symbol("%s\n", (unsigned long)f);
- debug_show_held_locks(current);
- dump_stack();
- }
-
- spin_lock_irq(&cwq->lock);
- cwq->current_work = NULL;
+ process_one_work(cwq, work);
}
spin_unlock_irq(&cwq->lock);
}
--
1.6.4.2

2010-06-14 21:39:29

by Tejun Heo

Subject: [PATCH 15/30] workqueue: reimplement workqueue freeze using max_active

Currently, workqueue freezing is implemented by marking the worker
freezeable and calling try_to_freeze() from the dispatch loop.
Reimplement it using cwq->max_active so that the workqueue is frozen
instead of the worker.

* workqueue_struct->saved_max_active is added which stores the
max_active specified on initialization.

* On freeze, all cwq->max_active's are quenched to zero. Freezing is
complete when nr_active reaches zero on all cwqs.

* On thaw, all cwq->max_active's are restored to wq->saved_max_active
and the worklist is repopulated.

This new implementation allows having a single shared pool of workers
per cpu.
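
The mechanism above can be modelled in a few lines of stand-alone C:
freezing forces max_active to 0 so new works are parked on
delayed_works, freezing completes once nr_active drains to zero, and
thaw restores max_active and refills the worklist. Field and function
names mirror the patch, but the code is only an illustration, not
kernel code:

#include <stdbool.h>
#include <stdio.h>

struct cwq_model {
        int nr_active;          /* works on ->worklist */
        int nr_delayed;         /* works parked on ->delayed_works */
        int max_active;         /* 0 while frozen */
        int saved_max_active;   /* restored on thaw */
};

static void queue_one(struct cwq_model *cwq)
{
        if (cwq->nr_active < cwq->max_active)
                cwq->nr_active++;       /* dispatched to the worker */
        else
                cwq->nr_delayed++;      /* parked until thaw or completion */
}

static void freeze_begin(struct cwq_model *cwq) { cwq->max_active = 0; }
static bool freeze_busy(struct cwq_model *cwq)  { return cwq->nr_active > 0; }

static void thaw(struct cwq_model *cwq)
{
        cwq->max_active = cwq->saved_max_active;
        while (cwq->nr_delayed && cwq->nr_active < cwq->max_active) {
                cwq->nr_delayed--;      /* cwq_activate_first_delayed() */
                cwq->nr_active++;
        }
}

int main(void)
{
        struct cwq_model cwq = { .max_active = 2, .saved_max_active = 2 };

        queue_one(&cwq);
        queue_one(&cwq);                /* both become active */
        freeze_begin(&cwq);
        queue_one(&cwq);                /* parked: max_active is now 0 */
        cwq.nr_active = 0;              /* pretend the worker drained them */
        printf("busy after drain: %d\n", freeze_busy(&cwq));
        thaw(&cwq);
        printf("active/delayed after thaw: %d/%d\n",
               cwq.nr_active, cwq.nr_delayed);
        return 0;
}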

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 7 ++
kernel/power/process.c | 21 +++++-
kernel/workqueue.c | 163 ++++++++++++++++++++++++++++++++++++++++++---
3 files changed, 179 insertions(+), 12 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index eb753b7..ab0b7fb 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -340,4 +340,11 @@ static inline long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg)
#else
long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg);
#endif /* CONFIG_SMP */
+
+#ifdef CONFIG_FREEZER
+extern void freeze_workqueues_begin(void);
+extern bool freeze_workqueues_busy(void);
+extern void thaw_workqueues(void);
+#endif /* CONFIG_FREEZER */
+
#endif
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 71ae290..028a995 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -15,6 +15,7 @@
#include <linux/syscalls.h>
#include <linux/freezer.h>
#include <linux/delay.h>
+#include <linux/workqueue.h>

/*
* Timeout for stopping processes
@@ -35,6 +36,7 @@ static int try_to_freeze_tasks(bool sig_only)
struct task_struct *g, *p;
unsigned long end_time;
unsigned int todo;
+ bool wq_busy = false;
struct timeval start, end;
u64 elapsed_csecs64;
unsigned int elapsed_csecs;
@@ -42,6 +44,10 @@ static int try_to_freeze_tasks(bool sig_only)
do_gettimeofday(&start);

end_time = jiffies + TIMEOUT;
+
+ if (!sig_only)
+ freeze_workqueues_begin();
+
while (true) {
todo = 0;
read_lock(&tasklist_lock);
@@ -63,6 +69,12 @@ static int try_to_freeze_tasks(bool sig_only)
todo++;
} while_each_thread(g, p);
read_unlock(&tasklist_lock);
+
+ if (!sig_only) {
+ wq_busy = freeze_workqueues_busy();
+ todo += wq_busy;
+ }
+
if (!todo || time_after(jiffies, end_time))
break;

@@ -86,8 +98,12 @@ static int try_to_freeze_tasks(bool sig_only)
*/
printk("\n");
printk(KERN_ERR "Freezing of tasks failed after %d.%02d seconds "
- "(%d tasks refusing to freeze):\n",
- elapsed_csecs / 100, elapsed_csecs % 100, todo);
+ "(%d tasks refusing to freeze, wq_busy=%d):\n",
+ elapsed_csecs / 100, elapsed_csecs % 100,
+ todo - wq_busy, wq_busy);
+
+ thaw_workqueues();
+
read_lock(&tasklist_lock);
do_each_thread(g, p) {
task_lock(p);
@@ -157,6 +173,7 @@ void thaw_processes(void)
oom_killer_enable();

printk("Restarting tasks ... ");
+ thaw_workqueues();
thaw_tasks(true);
thaw_tasks(false);
schedule();
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 101b92e..44c0fb2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -78,7 +78,7 @@ struct cpu_workqueue_struct {
int nr_in_flight[WORK_NR_COLORS];
/* L: nr of in_flight works */
int nr_active; /* L: nr of active works */
- int max_active; /* I: max active works */
+ int max_active; /* L: max active works */
struct list_head delayed_works; /* L: delayed works */
};

@@ -108,6 +108,7 @@ struct workqueue_struct {
struct list_head flusher_queue; /* F: flush waiters */
struct list_head flusher_overflow; /* F: flush overflow list */

+ int saved_max_active; /* I: saved cwq max_active */
const char *name; /* I: workqueue name */
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
@@ -228,6 +229,7 @@ static inline void debug_work_deactivate(struct work_struct *work) { }
static DEFINE_SPINLOCK(workqueue_lock);
static LIST_HEAD(workqueues);
static DEFINE_PER_CPU(struct ida, worker_ida);
+static bool workqueue_freezing; /* W: have wqs started freezing? */

static int worker_thread(void *__worker);

@@ -745,19 +747,13 @@ static int worker_thread(void *__worker)
struct cpu_workqueue_struct *cwq = worker->cwq;
DEFINE_WAIT(wait);

- if (cwq->wq->flags & WQ_FREEZEABLE)
- set_freezable();
-
for (;;) {
prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
- if (!freezing(current) &&
- !kthread_should_stop() &&
+ if (!kthread_should_stop() &&
list_empty(&cwq->worklist))
schedule();
finish_wait(&cwq->more_work, &wait);

- try_to_freeze();
-
if (kthread_should_stop())
break;

@@ -1547,6 +1543,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
goto err;

wq->flags = flags;
+ wq->saved_max_active = max_active;
mutex_init(&wq->flush_mutex);
atomic_set(&wq->nr_cwqs_to_flush, 0);
INIT_LIST_HEAD(&wq->flusher_queue);
@@ -1585,8 +1582,19 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
failed = true;
}

+ /*
+ * workqueue_lock protects global freeze state and workqueues
+ * list. Grab it, set max_active accordingly and add the new
+ * workqueue to workqueues list.
+ */
spin_lock(&workqueue_lock);
+
+ if (workqueue_freezing && wq->flags & WQ_FREEZEABLE)
+ for_each_possible_cpu(cpu)
+ get_cwq(cpu, wq)->max_active = 0;
+
list_add(&wq->list, &workqueues);
+
spin_unlock(&workqueue_lock);

cpu_maps_update_done();
@@ -1615,14 +1623,18 @@ void destroy_workqueue(struct workqueue_struct *wq)
{
int cpu;

+ flush_workqueue(wq);
+
+ /*
+ * wq list is used to freeze wq, remove from list after
+ * flushing is complete in case freeze races us.
+ */
cpu_maps_update_begin();
spin_lock(&workqueue_lock);
list_del(&wq->list);
spin_unlock(&workqueue_lock);
cpu_maps_update_done();

- flush_workqueue(wq);
-
for_each_possible_cpu(cpu) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
int i;
@@ -1716,6 +1728,137 @@ long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg)
EXPORT_SYMBOL_GPL(work_on_cpu);
#endif /* CONFIG_SMP */

+#ifdef CONFIG_FREEZER
+
+/**
+ * freeze_workqueues_begin - begin freezing workqueues
+ *
+ * Start freezing workqueues. After this function returns, all
+ * freezeable workqueues will queue new works to their frozen_works
+ * list instead of the cwq ones.
+ *
+ * CONTEXT:
+ * Grabs and releases workqueue_lock and cwq->lock's.
+ */
+void freeze_workqueues_begin(void)
+{
+ struct workqueue_struct *wq;
+ unsigned int cpu;
+
+ spin_lock(&workqueue_lock);
+
+ BUG_ON(workqueue_freezing);
+ workqueue_freezing = true;
+
+ for_each_possible_cpu(cpu) {
+ list_for_each_entry(wq, &workqueues, list) {
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+ spin_lock_irq(&cwq->lock);
+
+ if (wq->flags & WQ_FREEZEABLE)
+ cwq->max_active = 0;
+
+ spin_unlock_irq(&cwq->lock);
+ }
+ }
+
+ spin_unlock(&workqueue_lock);
+}
+
+/**
+ * freeze_workqueues_busy - are freezeable workqueues still busy?
+ *
+ * Check whether freezing is complete. This function must be called
+ * between freeze_workqueues_begin() and thaw_workqueues().
+ *
+ * CONTEXT:
+ * Grabs and releases workqueue_lock.
+ *
+ * RETURNS:
+ * %true if some freezeable workqueues are still busy. %false if
+ * freezing is complete.
+ */
+bool freeze_workqueues_busy(void)
+{
+ struct workqueue_struct *wq;
+ unsigned int cpu;
+ bool busy = false;
+
+ spin_lock(&workqueue_lock);
+
+ BUG_ON(!workqueue_freezing);
+
+ for_each_possible_cpu(cpu) {
+ /*
+ * nr_active is monotonically decreasing. It's safe
+ * to peek without lock.
+ */
+ list_for_each_entry(wq, &workqueues, list) {
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+ if (!(wq->flags & WQ_FREEZEABLE))
+ continue;
+
+ BUG_ON(cwq->nr_active < 0);
+ if (cwq->nr_active) {
+ busy = true;
+ goto out_unlock;
+ }
+ }
+ }
+out_unlock:
+ spin_unlock(&workqueue_lock);
+ return busy;
+}
+
+/**
+ * thaw_workqueues - thaw workqueues
+ *
+ * Thaw workqueues. Normal queueing is restored and all collected
+ * frozen works are transferred to their respective cwq worklists.
+ *
+ * CONTEXT:
+ * Grabs and releases workqueue_lock and cwq->lock's.
+ */
+void thaw_workqueues(void)
+{
+ struct workqueue_struct *wq;
+ unsigned int cpu;
+
+ spin_lock(&workqueue_lock);
+
+ if (!workqueue_freezing)
+ goto out_unlock;
+
+ for_each_possible_cpu(cpu) {
+ list_for_each_entry(wq, &workqueues, list) {
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+ if (!(wq->flags & WQ_FREEZEABLE))
+ continue;
+
+ spin_lock_irq(&cwq->lock);
+
+ /* restore max_active and repopulate worklist */
+ cwq->max_active = wq->saved_max_active;
+
+ while (!list_empty(&cwq->delayed_works) &&
+ cwq->nr_active < cwq->max_active)
+ cwq_activate_first_delayed(cwq);
+
+ wake_up(&cwq->more_work);
+
+ spin_unlock_irq(&cwq->lock);
+ }
+ }
+
+ workqueue_freezing = false;
+out_unlock:
+ spin_unlock(&workqueue_lock);
+}
+#endif /* CONFIG_FREEZER */
+
void __init init_workqueues(void)
{
unsigned int cpu;
--
1.6.4.2

2010-06-14 21:39:27

by Tejun Heo

Subject: [PATCH 08/30] workqueue: temporarily disable workqueue tracing

Strip tracing code from workqueue and disable workqueue tracing. This
is a temporary measure until concurrency managed workqueue is complete.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/trace/Kconfig | 4 +++-
kernel/workqueue.c | 14 +++-----------
2 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 8b1797c..74f0260 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -392,7 +392,9 @@ config KMEMTRACE
If unsure, say N.

config WORKQUEUE_TRACER
- bool "Trace workqueues"
+# Temporarily disabled during workqueue reimplementation
+# bool "Trace workqueues"
+ def_bool n
select GENERIC_TRACER
help
The workqueue tracer provides some statistical information
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8e3082b..f7ab703 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -33,8 +33,6 @@
#include <linux/kallsyms.h>
#include <linux/debug_locks.h>
#include <linux/lockdep.h>
-#define CREATE_TRACE_POINTS
-#include <trace/events/workqueue.h>

/*
* Structure fields follow one of the following exclusion rules.
@@ -243,10 +241,10 @@ static inline void clear_wq_data(struct work_struct *work)
atomic_long_set(&work->data, work_static(work));
}

-static inline
-struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
+static inline struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
{
- return (void *) (atomic_long_read(&work->data) & WORK_STRUCT_WQ_DATA_MASK);
+ return (void *)(atomic_long_read(&work->data) &
+ WORK_STRUCT_WQ_DATA_MASK);
}

/**
@@ -265,8 +263,6 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
struct work_struct *work, struct list_head *head,
unsigned int extra_flags)
{
- trace_workqueue_insertion(cwq->thread, work);
-
/* we own @work, set data and link */
set_wq_data(work, cwq, extra_flags);

@@ -431,7 +427,6 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
struct lockdep_map lockdep_map = work->lockdep_map;
#endif
/* claim and process */
- trace_workqueue_execution(cwq->thread, work);
debug_work_deactivate(work);
cwq->current_work = work;
list_del_init(&work->entry);
@@ -1017,8 +1012,6 @@ static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
return PTR_ERR(p);
cwq->thread = p;

- trace_workqueue_creation(cwq->thread, cpu);
-
return 0;
}

@@ -1123,7 +1116,6 @@ static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq)
* checks list_empty(), and a "normal" queue_work() can't use
* a dead CPU.
*/
- trace_workqueue_destruction(cwq->thread);
kthread_stop(cwq->thread);
cwq->thread = NULL;
}
--
1.6.4.2

2010-06-14 21:39:46

by Tejun Heo

Subject: [PATCH 13/30] workqueue: reimplement work flushing using linked works

A work is linked to the next one by having the WORK_STRUCT_LINKED bit
set, and these links can be chained. When a linked work is dispatched
to a worker, all linked works are dispatched to the worker's newly
added ->scheduled queue and processed back-to-back.

Currently, as there's only a single worker per cwq, having linked works
doesn't make any visible behavior difference. This change prepares for
multiple shared workers per cpu.
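
As an illustration of the chaining described above, the stand-alone
sketch below models how a run of works whose predecessors have
WORK_STRUCT_LINKED set is pulled onto the worker's ->scheduled list in
one go and then processed back-to-back. The names are made up; the
real list handling is move_linked_works() in the patch:

#include <stdbool.h>
#include <stdio.h>

struct work_model {
        const char *name;
        bool linked;                    /* WORK_STRUCT_LINKED: next one is chained */
        struct work_model *next;        /* next entry on the worklist */
};

/* pull the head and everything chained behind it off the worklist */
static struct work_model *grab_linked_chain(struct work_model **worklist)
{
        struct work_model *head = *worklist;
        struct work_model *w = head;

        while (w->linked && w->next)
                w = w->next;            /* follow the chain to its last work */
        *worklist = w->next;            /* rest of the worklist stays put */
        w->next = NULL;                 /* the grabbed chain ends here */
        return head;                    /* this becomes the ->scheduled list */
}

int main(void)
{
        struct work_model c = { "C", false, NULL };     /* terminates the chain */
        struct work_model b = { "B", true,  &c };
        struct work_model a = { "A", true,  &b };       /* A and B carry LINKED */
        struct work_model *worklist = &a;
        struct work_model *scheduled = grab_linked_chain(&worklist);

        for (; scheduled; scheduled = scheduled->next)
                printf("processing %s\n", scheduled->name);     /* A, B, C */
        return 0;
}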

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 4 +-
kernel/workqueue.c | 152 ++++++++++++++++++++++++++++++++++++++------
2 files changed, 134 insertions(+), 22 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 8762f62..4f4fdba 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -24,8 +24,9 @@ typedef void (*work_func_t)(struct work_struct *work);

enum {
WORK_STRUCT_PENDING_BIT = 0, /* work item is pending execution */
+ WORK_STRUCT_LINKED_BIT = 1, /* next work is linked to this one */
#ifdef CONFIG_DEBUG_OBJECTS_WORK
- WORK_STRUCT_STATIC_BIT = 1, /* static initializer (debugobjects) */
+ WORK_STRUCT_STATIC_BIT = 2, /* static initializer (debugobjects) */
WORK_STRUCT_COLOR_SHIFT = 3, /* color for workqueue flushing */
#else
WORK_STRUCT_COLOR_SHIFT = 2, /* color for workqueue flushing */
@@ -34,6 +35,7 @@ enum {
WORK_STRUCT_COLOR_BITS = 4,

WORK_STRUCT_PENDING = 1 << WORK_STRUCT_PENDING_BIT,
+ WORK_STRUCT_LINKED = 1 << WORK_STRUCT_LINKED_BIT,
#ifdef CONFIG_DEBUG_OBJECTS_WORK
WORK_STRUCT_STATIC = 1 << WORK_STRUCT_STATIC_BIT,
#else
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0b0c360..74b399b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -51,6 +51,7 @@ struct cpu_workqueue_struct;

struct worker {
struct work_struct *current_work; /* L: work being processed */
+ struct list_head scheduled; /* L: scheduled works */
struct task_struct *task; /* I: worker task */
struct cpu_workqueue_struct *cwq; /* I: the associated cwq */
int id; /* I: worker id */
@@ -445,6 +446,8 @@ static struct worker *alloc_worker(void)
struct worker *worker;

worker = kzalloc(sizeof(*worker), GFP_KERNEL);
+ if (worker)
+ INIT_LIST_HEAD(&worker->scheduled);
return worker;
}

@@ -530,6 +533,7 @@ static void destroy_worker(struct worker *worker)

/* sanity check frenzy */
BUG_ON(worker->current_work);
+ BUG_ON(!list_empty(&worker->scheduled));

kthread_stop(worker->task);
kfree(worker);
@@ -540,6 +544,47 @@ static void destroy_worker(struct worker *worker)
}

/**
+ * move_linked_works - move linked works to a list
+ * @work: start of series of works to be scheduled
+ * @head: target list to append @work to
+ * @nextp: out parameter for nested worklist walking
+ *
+ * Schedule linked works starting from @work to @head. Work series to
+ * be scheduled starts at @work and includes any consecutive work with
+ * WORK_STRUCT_LINKED set in its predecessor.
+ *
+ * If @nextp is not NULL, it's updated to point to the next work of
+ * the last scheduled work. This allows move_linked_works() to be
+ * nested inside outer list_for_each_entry_safe().
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
+static void move_linked_works(struct work_struct *work, struct list_head *head,
+ struct work_struct **nextp)
+{
+ struct work_struct *n;
+
+ /*
+ * Linked worklist will always end before the end of the list,
+ * use NULL for list head.
+ */
+ list_for_each_entry_safe_from(work, n, NULL, entry) {
+ list_move_tail(&work->entry, head);
+ if (!(*work_data_bits(work) & WORK_STRUCT_LINKED))
+ break;
+ }
+
+ /*
+ * If we're already inside safe list traversal and have moved
+ * multiple works to the scheduled queue, the next position
+ * needs to be updated.
+ */
+ if (nextp)
+ *nextp = n;
+}
+
+/**
* cwq_dec_nr_in_flight - decrement cwq's nr_in_flight
* @cwq: cwq of interest
* @color: color of work which left the queue
@@ -639,17 +684,25 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
cwq_dec_nr_in_flight(cwq, work_color);
}

-static void run_workqueue(struct worker *worker)
+/**
+ * process_scheduled_works - process scheduled works
+ * @worker: self
+ *
+ * Process all scheduled works. Please note that the scheduled list
+ * may change while processing a work, so this function repeatedly
+ * fetches a work from the top and executes it.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock) which may be released and regrabbed
+ * multiple times.
+ */
+static void process_scheduled_works(struct worker *worker)
{
- struct cpu_workqueue_struct *cwq = worker->cwq;
-
- spin_lock_irq(&cwq->lock);
- while (!list_empty(&cwq->worklist)) {
- struct work_struct *work = list_entry(cwq->worklist.next,
+ while (!list_empty(&worker->scheduled)) {
+ struct work_struct *work = list_first_entry(&worker->scheduled,
struct work_struct, entry);
process_one_work(worker, work);
}
- spin_unlock_irq(&cwq->lock);
}

/**
@@ -684,7 +737,28 @@ static int worker_thread(void *__worker)
get_cpu_mask(cwq->cpu))))
set_cpus_allowed_ptr(worker->task,
get_cpu_mask(cwq->cpu));
- run_workqueue(worker);
+
+ spin_lock_irq(&cwq->lock);
+
+ while (!list_empty(&cwq->worklist)) {
+ struct work_struct *work =
+ list_first_entry(&cwq->worklist,
+ struct work_struct, entry);
+
+ if (likely(!(*work_data_bits(work) &
+ WORK_STRUCT_LINKED))) {
+ /* optimization path, not strictly necessary */
+ process_one_work(worker, work);
+ if (unlikely(!list_empty(&worker->scheduled)))
+ process_scheduled_works(worker);
+ } else {
+ move_linked_works(work, &worker->scheduled,
+ NULL);
+ process_scheduled_works(worker);
+ }
+ }
+
+ spin_unlock_irq(&cwq->lock);
}

return 0;
@@ -705,16 +779,33 @@ static void wq_barrier_func(struct work_struct *work)
* insert_wq_barrier - insert a barrier work
* @cwq: cwq to insert barrier into
* @barr: wq_barrier to insert
- * @head: insertion point
+ * @target: target work to attach @barr to
+ * @worker: worker currently executing @target, NULL if @target is not executing
*
- * Insert barrier @barr into @cwq before @head.
+ * @barr is linked to @target such that @barr is completed only after
+ * @target finishes execution. Please note that the ordering
+ * guarantee is observed only with respect to @target and on the local
+ * cpu.
+ *
+ * Currently, a queued barrier can't be canceled. This is because
+ * try_to_grab_pending() can't determine whether the work to be
+ * grabbed is at the head of the queue and thus can't clear LINKED
+ * flag of the previous work while there must be a valid next work
+ * after a work with LINKED flag set.
+ *
+ * Note that when @worker is non-NULL, @target may be modified
+ * underneath us, so we can't reliably determine cwq from @target.
*
* CONTEXT:
* spin_lock_irq(cwq->lock).
*/
static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
- struct wq_barrier *barr, struct list_head *head)
+ struct wq_barrier *barr,
+ struct work_struct *target, struct worker *worker)
{
+ struct list_head *head;
+ unsigned int linked = 0;
+
/*
* debugobject calls are safe here even with cwq->lock locked
* as we know for sure that this will not trigger any of the
@@ -725,8 +816,24 @@ static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
__set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&barr->work));
init_completion(&barr->done);

+ /*
+ * If @target is currently being executed, schedule the
+ * barrier to the worker; otherwise, put it after @target.
+ */
+ if (worker)
+ head = worker->scheduled.next;
+ else {
+ unsigned long *bits = work_data_bits(target);
+
+ head = target->entry.next;
+ /* there can already be other linked works, inherit and set */
+ linked = *bits & WORK_STRUCT_LINKED;
+ __set_bit(WORK_STRUCT_LINKED_BIT, bits);
+ }
+
debug_work_activate(&barr->work);
- insert_work(cwq, &barr->work, head, work_color_to_flags(WORK_NO_COLOR));
+ insert_work(cwq, &barr->work, head,
+ work_color_to_flags(WORK_NO_COLOR) | linked);
}

/**
@@ -964,8 +1071,8 @@ EXPORT_SYMBOL_GPL(flush_workqueue);
*/
int flush_work(struct work_struct *work)
{
+ struct worker *worker = NULL;
struct cpu_workqueue_struct *cwq;
- struct list_head *prev;
struct wq_barrier barr;

might_sleep();
@@ -985,14 +1092,14 @@ int flush_work(struct work_struct *work)
smp_rmb();
if (unlikely(cwq != get_wq_data(work)))
goto already_gone;
- prev = &work->entry;
} else {
- if (!cwq->worker || cwq->worker->current_work != work)
+ if (cwq->worker && cwq->worker->current_work == work)
+ worker = cwq->worker;
+ if (!worker)
goto already_gone;
- prev = &cwq->worklist;
}
- insert_wq_barrier(cwq, &barr, prev->next);

+ insert_wq_barrier(cwq, &barr, work, worker);
spin_unlock_irq(&cwq->lock);
wait_for_completion(&barr.done);
destroy_work_on_stack(&barr.work);
@@ -1048,16 +1155,19 @@ static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq,
struct work_struct *work)
{
struct wq_barrier barr;
- int running = 0;
+ struct worker *worker;

spin_lock_irq(&cwq->lock);
+
+ worker = NULL;
if (unlikely(cwq->worker && cwq->worker->current_work == work)) {
- insert_wq_barrier(cwq, &barr, cwq->worklist.next);
- running = 1;
+ worker = cwq->worker;
+ insert_wq_barrier(cwq, &barr, work, worker);
}
+
spin_unlock_irq(&cwq->lock);

- if (unlikely(running)) {
+ if (unlikely(worker)) {
wait_for_completion(&barr.done);
destroy_work_on_stack(&barr.work);
}
--
1.6.4.2

2010-06-14 21:39:58

by Tejun Heo

Subject: [PATCH 06/30] workqueue: define masks for work flags and conditionalize STATIC flags

Work flags are about to see more traditional mask handling. Define
WORK_STRUCT_*_BIT as bit position constants and redefine WORK_STRUCT_*
as bit masks. Also, make the WORK_STRUCT_STATIC_* flags conditional on
CONFIG_DEBUG_OBJECTS_WORK.

While at it, re-define these constants as enums and use
WORK_STRUCT_STATIC instead of hard-coding 2 in
WORK_DATA_STATIC_INIT().
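
A small stand-alone model of the split introduced here: the *_BIT
constants are bit positions for the bitops helpers (test_bit(),
test_and_set_bit(), __set_bit()), the unsuffixed constants are masks
for plain bitwise tests, and masking with ~WORK_STRUCT_FLAG_MASK
recovers the pointer packed into work->data. The values below mirror
the !CONFIG_DEBUG_OBJECTS_WORK layout; this is an illustration, not
kernel code:

#include <stdio.h>

enum {
        PENDING_BIT = 0,                /* WORK_STRUCT_PENDING_BIT */
        PENDING     = 1 << PENDING_BIT, /* WORK_STRUCT_PENDING */
        FLAG_MASK   = 3,                /* WORK_STRUCT_FLAG_MASK */
};

int main(void)
{
        unsigned long cwq_addr = 0x12340;               /* pretend cwq pointer */
        unsigned long data = cwq_addr | PENDING;        /* what set_wq_data() stores */

        /* mask test, cf. work_pending() */
        printf("pending: %d\n", !!(data & PENDING));
        /* strip the flag bits to get the pointer back, cf. get_wq_data() */
        printf("cwq: %#lx\n", data & ~(unsigned long)FLAG_MASK);
        return 0;
}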

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 29 +++++++++++++++++++++--------
kernel/workqueue.c | 12 ++++++------
2 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index d89cfc1..d60c570 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -22,12 +22,25 @@ typedef void (*work_func_t)(struct work_struct *work);
*/
#define work_data_bits(work) ((unsigned long *)(&(work)->data))

+enum {
+ WORK_STRUCT_PENDING_BIT = 0, /* work item is pending execution */
+#ifdef CONFIG_DEBUG_OBJECTS_WORK
+ WORK_STRUCT_STATIC_BIT = 1, /* static initializer (debugobjects) */
+#endif
+
+ WORK_STRUCT_PENDING = 1 << WORK_STRUCT_PENDING_BIT,
+#ifdef CONFIG_DEBUG_OBJECTS_WORK
+ WORK_STRUCT_STATIC = 1 << WORK_STRUCT_STATIC_BIT,
+#else
+ WORK_STRUCT_STATIC = 0,
+#endif
+
+ WORK_STRUCT_FLAG_MASK = 3UL,
+ WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
+};
+
struct work_struct {
atomic_long_t data;
-#define WORK_STRUCT_PENDING 0 /* T if work item pending execution */
-#define WORK_STRUCT_STATIC 1 /* static initializer (debugobjects) */
-#define WORK_STRUCT_FLAG_MASK (3UL)
-#define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)
struct list_head entry;
work_func_t func;
#ifdef CONFIG_LOCKDEP
@@ -36,7 +49,7 @@ struct work_struct {
};

#define WORK_DATA_INIT() ATOMIC_LONG_INIT(0)
-#define WORK_DATA_STATIC_INIT() ATOMIC_LONG_INIT(2)
+#define WORK_DATA_STATIC_INIT() ATOMIC_LONG_INIT(WORK_STRUCT_STATIC)

struct delayed_work {
struct work_struct work;
@@ -98,7 +111,7 @@ extern void __init_work(struct work_struct *work, int onstack);
extern void destroy_work_on_stack(struct work_struct *work);
static inline unsigned int work_static(struct work_struct *work)
{
- return *work_data_bits(work) & (1 << WORK_STRUCT_STATIC);
+ return *work_data_bits(work) & WORK_STRUCT_STATIC;
}
#else
static inline void __init_work(struct work_struct *work, int onstack) { }
@@ -167,7 +180,7 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
* @work: The work item in question
*/
#define work_pending(work) \
- test_bit(WORK_STRUCT_PENDING, work_data_bits(work))
+ test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))

/**
* delayed_work_pending - Find out whether a delayable work item is currently
@@ -182,7 +195,7 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
* @work: The work item in question
*/
#define work_clear_pending(work) \
- clear_bit(WORK_STRUCT_PENDING, work_data_bits(work))
+ clear_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))

enum {
WQ_FREEZEABLE = 1 << 0, /* freeze during suspend */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 68e4dd8..5c49d76 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -115,7 +115,7 @@ static int work_fixup_activate(void *addr, enum debug_obj_state state)
* statically initialized. We just make sure that it
* is tracked in the object tracker.
*/
- if (test_bit(WORK_STRUCT_STATIC, work_data_bits(work))) {
+ if (test_bit(WORK_STRUCT_STATIC_BIT, work_data_bits(work))) {
debug_object_init(work, &work_debug_descr);
debug_object_activate(work, &work_debug_descr);
return 0;
@@ -232,7 +232,7 @@ static inline void set_wq_data(struct work_struct *work,
BUG_ON(!work_pending(work));

atomic_long_set(&work->data, (unsigned long)cwq | work_static(work) |
- (1UL << WORK_STRUCT_PENDING) | extra_flags);
+ WORK_STRUCT_PENDING | extra_flags);
}

/*
@@ -330,7 +330,7 @@ queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work)
{
int ret = 0;

- if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
+ if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
__queue_work(cpu, wq, work);
ret = 1;
}
@@ -380,7 +380,7 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
struct timer_list *timer = &dwork->timer;
struct work_struct *work = &dwork->work;

- if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
+ if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
BUG_ON(timer_pending(timer));
BUG_ON(!list_empty(&work->entry));

@@ -516,7 +516,7 @@ static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
* might deadlock.
*/
INIT_WORK_ON_STACK(&barr->work, wq_barrier_func);
- __set_bit(WORK_STRUCT_PENDING, work_data_bits(&barr->work));
+ __set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&barr->work));
init_completion(&barr->done);

debug_work_activate(&barr->work);
@@ -628,7 +628,7 @@ static int try_to_grab_pending(struct work_struct *work)
struct cpu_workqueue_struct *cwq;
int ret = -1;

- if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work)))
+ if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))
return 0;

/*
--
1.6.4.2

2010-06-14 21:40:18

by Tejun Heo

Subject: [PATCH 26/30] workqueue: add system_wq, system_long_wq and system_nrt_wq

Rename keventd_wq to system_wq and export it. Also add system_long_wq
and system_nrt_wq. The former is to host long-running works separately
(so that flush_scheduled_work() doesn't take so long) and the latter
guarantees that any given work item is never executed in parallel by
multiple CPUs. These workqueues will be used by future patches to
update workqueue users.
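
A hypothetical driver-side usage sketch (the my_* names are made up
and not part of the patch): short work keeps going through
schedule_work() and thus system_wq, long-running work moves to
system_long_wq so it no longer delays flush_scheduled_work() callers,
and work that must not run concurrently with itself on another CPU
would be queued on system_nrt_wq instead.

#include <linux/workqueue.h>

static void my_quick_fn(struct work_struct *work)
{
        /* short, bounded amount of work */
}

static void my_slow_fn(struct work_struct *work)
{
        /* may run for a long time */
}

static DECLARE_WORK(my_quick_work, my_quick_fn);
static DECLARE_WORK(my_slow_work, my_slow_fn);

static void my_driver_kick(void)
{
        schedule_work(&my_quick_work);                  /* lands on system_wq */
        queue_work(system_long_wq, &my_slow_work);      /* keeps system_wq flushes short */
}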

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 19 +++++++++++++++++++
kernel/workqueue.c | 30 +++++++++++++++++++-----------
2 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 33e24e7..e8c3410 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -232,6 +232,25 @@ enum {
WQ_DFL_ACTIVE = WQ_MAX_ACTIVE / 2,
};

+/*
+ * System-wide workqueues which are always present.
+ *
+ * system_wq is the one used by schedule[_delayed]_work[_on]().
+ * Multi-CPU multi-threaded. There are users which expect relatively
+ * short queue flush time. Don't queue works which can run for too
+ * long.
+ *
+ * system_long_wq is similar to system_wq but may host long running
+ * works. Queue flushing might take relatively long.
+ *
+ * system_nrt_wq is non-reentrant and guarantees that any given work
+ * item is never executed in parallel by multiple CPUs. Queue
+ * flushing might take relatively long.
+ */
+extern struct workqueue_struct *system_wq;
+extern struct workqueue_struct *system_long_wq;
+extern struct workqueue_struct *system_nrt_wq;
+
extern struct workqueue_struct *
__create_workqueue_key(const char *name, unsigned int flags, int max_active,
struct lock_class_key *key, const char *lock_name);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9f22dbd..b829ddb 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -210,6 +210,13 @@ struct workqueue_struct {
#endif
};

+struct workqueue_struct *system_wq __read_mostly;
+struct workqueue_struct *system_long_wq __read_mostly;
+struct workqueue_struct *system_nrt_wq __read_mostly;
+EXPORT_SYMBOL_GPL(system_wq);
+EXPORT_SYMBOL_GPL(system_long_wq);
+EXPORT_SYMBOL_GPL(system_nrt_wq);
+
#define for_each_busy_worker(worker, i, pos, gcwq) \
for (i = 0; i < BUSY_WORKER_HASH_SIZE; i++) \
hlist_for_each_entry(worker, pos, &gcwq->busy_hash[i], hentry)
@@ -2251,8 +2258,6 @@ int cancel_delayed_work_sync(struct delayed_work *dwork)
}
EXPORT_SYMBOL(cancel_delayed_work_sync);

-static struct workqueue_struct *keventd_wq __read_mostly;
-
/**
* schedule_work - put work task in global workqueue
* @work: job to be done
@@ -2266,7 +2271,7 @@ static struct workqueue_struct *keventd_wq __read_mostly;
*/
int schedule_work(struct work_struct *work)
{
- return queue_work(keventd_wq, work);
+ return queue_work(system_wq, work);
}
EXPORT_SYMBOL(schedule_work);

@@ -2279,7 +2284,7 @@ EXPORT_SYMBOL(schedule_work);
*/
int schedule_work_on(int cpu, struct work_struct *work)
{
- return queue_work_on(cpu, keventd_wq, work);
+ return queue_work_on(cpu, system_wq, work);
}
EXPORT_SYMBOL(schedule_work_on);

@@ -2294,7 +2299,7 @@ EXPORT_SYMBOL(schedule_work_on);
int schedule_delayed_work(struct delayed_work *dwork,
unsigned long delay)
{
- return queue_delayed_work(keventd_wq, dwork, delay);
+ return queue_delayed_work(system_wq, dwork, delay);
}
EXPORT_SYMBOL(schedule_delayed_work);

@@ -2327,7 +2332,7 @@ EXPORT_SYMBOL(flush_delayed_work);
int schedule_delayed_work_on(int cpu,
struct delayed_work *dwork, unsigned long delay)
{
- return queue_delayed_work_on(cpu, keventd_wq, dwork, delay);
+ return queue_delayed_work_on(cpu, system_wq, dwork, delay);
}
EXPORT_SYMBOL(schedule_delayed_work_on);

@@ -2392,7 +2397,7 @@ int schedule_on_each_cpu(work_func_t func)
*/
void flush_scheduled_work(void)
{
- flush_workqueue(keventd_wq);
+ flush_workqueue(system_wq);
}
EXPORT_SYMBOL(flush_scheduled_work);

@@ -2424,7 +2429,7 @@ EXPORT_SYMBOL_GPL(execute_in_process_context);

int keventd_up(void)
{
- return keventd_wq != NULL;
+ return system_wq != NULL;
}

static struct cpu_workqueue_struct *alloc_cwqs(void)
@@ -2850,7 +2855,7 @@ static int __cpuinit trustee_thread(void *__gcwq)
continue;

debug_work_activate(rebind_work);
- insert_work(get_cwq(gcwq->cpu, keventd_wq), rebind_work,
+ insert_work(get_cwq(gcwq->cpu, system_wq), rebind_work,
worker->scheduled.next,
work_color_to_flags(WORK_NO_COLOR));
}
@@ -3235,6 +3240,9 @@ void __init init_workqueues(void)
spin_unlock_irq(&gcwq->lock);
}

- keventd_wq = __create_workqueue("events", 0, WQ_DFL_ACTIVE);
- BUG_ON(!keventd_wq);
+ system_wq = __create_workqueue("events", 0, WQ_DFL_ACTIVE);
+ system_long_wq = __create_workqueue("events_long", 0, WQ_DFL_ACTIVE);
+ system_nrt_wq = __create_workqueue("events_nrt", WQ_NON_REENTRANT,
+ WQ_DFL_ACTIVE);
+ BUG_ON(!system_wq || !system_long_wq || !system_nrt_wq);
}
--
1.6.4.2

2010-06-14 21:40:14

by Tejun Heo

Subject: [PATCH 03/30] workqueue: kill RT workqueue

With stop_machine() converted to use cpu_stop, the RT workqueue no
longer has any users. Kill RT workqueue support.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 20 +++++++++-----------
kernel/workqueue.c | 6 ------
2 files changed, 9 insertions(+), 17 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 9466e86..0697946 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -181,12 +181,11 @@ static inline void destroy_work_on_stack(struct work_struct *work) { }


extern struct workqueue_struct *
-__create_workqueue_key(const char *name, int singlethread,
- int freezeable, int rt, struct lock_class_key *key,
- const char *lock_name);
+__create_workqueue_key(const char *name, int singlethread, int freezeable,
+ struct lock_class_key *key, const char *lock_name);

#ifdef CONFIG_LOCKDEP
-#define __create_workqueue(name, singlethread, freezeable, rt) \
+#define __create_workqueue(name, singlethread, freezeable) \
({ \
static struct lock_class_key __key; \
const char *__lock_name; \
@@ -197,19 +196,18 @@ __create_workqueue_key(const char *name, int singlethread,
__lock_name = #name; \
\
__create_workqueue_key((name), (singlethread), \
- (freezeable), (rt), &__key, \
+ (freezeable), &__key, \
__lock_name); \
})
#else
-#define __create_workqueue(name, singlethread, freezeable, rt) \
- __create_workqueue_key((name), (singlethread), (freezeable), (rt), \
+#define __create_workqueue(name, singlethread, freezeable) \
+ __create_workqueue_key((name), (singlethread), (freezeable), \
NULL, NULL)
#endif

-#define create_workqueue(name) __create_workqueue((name), 0, 0, 0)
-#define create_rt_workqueue(name) __create_workqueue((name), 0, 0, 1)
-#define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1, 0)
-#define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0, 0)
+#define create_workqueue(name) __create_workqueue((name), 0, 0)
+#define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1)
+#define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0)

extern void destroy_workqueue(struct workqueue_struct *wq);

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 327d2de..1a47fbf 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -62,7 +62,6 @@ struct workqueue_struct {
const char *name;
int singlethread;
int freezeable; /* Freeze threads during suspend */
- int rt;
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
@@ -947,7 +946,6 @@ init_cpu_workqueue(struct workqueue_struct *wq, int cpu)

static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
{
- struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
struct workqueue_struct *wq = cwq->wq;
const char *fmt = is_wq_single_threaded(wq) ? "%s" : "%s/%d";
struct task_struct *p;
@@ -963,8 +961,6 @@ static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
*/
if (IS_ERR(p))
return PTR_ERR(p);
- if (cwq->wq->rt)
- sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
cwq->thread = p;

trace_workqueue_creation(cwq->thread, cpu);
@@ -986,7 +982,6 @@ static void start_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
struct workqueue_struct *__create_workqueue_key(const char *name,
int singlethread,
int freezeable,
- int rt,
struct lock_class_key *key,
const char *lock_name)
{
@@ -1008,7 +1003,6 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
wq->singlethread = singlethread;
wq->freezeable = freezeable;
- wq->rt = rt;
INIT_LIST_HEAD(&wq->list);

if (singlethread) {
--
1.6.4.2

2010-06-14 21:40:09

by Tejun Heo

Subject: [PATCH 05/30] workqueue: merge feature parameters into flags

Currently, __create_workqueue_key() takes @singlethread and
@freezeable parameters and stores them separately in workqueue_struct.
Merge them into a single flags parameter and field, and use
WQ_FREEZEABLE and WQ_SINGLE_THREAD.
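
For illustration, this is how call sites map onto the new single flags
argument; the "mydrv*" names are made up and the expansions in the
comments follow the macros in the patch below:

#include <linux/errno.h>
#include <linux/workqueue.h>

static int my_init(void)
{
        struct workqueue_struct *plain, *frozen, *single;

        plain  = create_workqueue("mydrv");
                /* -> __create_workqueue("mydrv", 0) */
        frozen = create_freezeable_workqueue("mydrv_fz");
                /* -> __create_workqueue("mydrv_fz", WQ_FREEZEABLE | WQ_SINGLE_THREAD) */
        single = create_singlethread_workqueue("mydrv_st");
                /* -> __create_workqueue("mydrv_st", WQ_SINGLE_THREAD) */

        return (plain && frozen && single) ? 0 : -ENOMEM;
}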

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 25 +++++++++++++++----------
kernel/workqueue.c | 17 +++++++----------
2 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index e724daf..d89cfc1 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -184,13 +184,17 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
#define work_clear_pending(work) \
clear_bit(WORK_STRUCT_PENDING, work_data_bits(work))

+enum {
+ WQ_FREEZEABLE = 1 << 0, /* freeze during suspend */
+ WQ_SINGLE_THREAD = 1 << 1, /* no per-cpu worker */
+};

extern struct workqueue_struct *
-__create_workqueue_key(const char *name, int singlethread, int freezeable,
+__create_workqueue_key(const char *name, unsigned int flags,
struct lock_class_key *key, const char *lock_name);

#ifdef CONFIG_LOCKDEP
-#define __create_workqueue(name, singlethread, freezeable) \
+#define __create_workqueue(name, flags) \
({ \
static struct lock_class_key __key; \
const char *__lock_name; \
@@ -200,19 +204,20 @@ __create_workqueue_key(const char *name, int singlethread, int freezeable,
else \
__lock_name = #name; \
\
- __create_workqueue_key((name), (singlethread), \
- (freezeable), &__key, \
+ __create_workqueue_key((name), (flags), &__key, \
__lock_name); \
})
#else
-#define __create_workqueue(name, singlethread, freezeable) \
- __create_workqueue_key((name), (singlethread), (freezeable), \
- NULL, NULL)
+#define __create_workqueue(name, flags) \
+ __create_workqueue_key((name), (flags), NULL, NULL)
#endif

-#define create_workqueue(name) __create_workqueue((name), 0, 0)
-#define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1)
-#define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0)
+#define create_workqueue(name) \
+ __create_workqueue((name), 0)
+#define create_freezeable_workqueue(name) \
+ __create_workqueue((name), WQ_FREEZEABLE | WQ_SINGLE_THREAD)
+#define create_singlethread_workqueue(name) \
+ __create_workqueue((name), WQ_SINGLE_THREAD)

extern void destroy_workqueue(struct workqueue_struct *wq);

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c56146a..68e4dd8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -67,11 +67,10 @@ struct cpu_workqueue_struct {
* per-CPU workqueues:
*/
struct workqueue_struct {
+ unsigned int flags; /* I: WQ_* flags */
struct cpu_workqueue_struct *cpu_wq; /* I: cwq's */
struct list_head list; /* W: list of all workqueues */
const char *name; /* I: workqueue name */
- int singlethread;
- int freezeable; /* Freeze threads during suspend */
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
@@ -203,9 +202,9 @@ static const struct cpumask *cpu_singlethread_map __read_mostly;
static cpumask_var_t cpu_populated_map __read_mostly;

/* If it's single threaded, it isn't in the list of workqueues. */
-static inline int is_wq_single_threaded(struct workqueue_struct *wq)
+static inline bool is_wq_single_threaded(struct workqueue_struct *wq)
{
- return wq->singlethread;
+ return wq->flags & WQ_SINGLE_THREAD;
}

static const struct cpumask *wq_cpu_map(struct workqueue_struct *wq)
@@ -463,7 +462,7 @@ static int worker_thread(void *__cwq)
struct cpu_workqueue_struct *cwq = __cwq;
DEFINE_WAIT(wait);

- if (cwq->wq->freezeable)
+ if (cwq->wq->flags & WQ_FREEZEABLE)
set_freezable();

for (;;) {
@@ -1013,8 +1012,7 @@ static void start_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
}

struct workqueue_struct *__create_workqueue_key(const char *name,
- int singlethread,
- int freezeable,
+ unsigned int flags,
struct lock_class_key *key,
const char *lock_name)
{
@@ -1030,13 +1028,12 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
if (!wq->cpu_wq)
goto err;

+ wq->flags = flags;
wq->name = name;
lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
- wq->singlethread = singlethread;
- wq->freezeable = freezeable;
INIT_LIST_HEAD(&wq->list);

- if (singlethread) {
+ if (flags & WQ_SINGLE_THREAD) {
cwq = init_cpu_workqueue(wq, singlethread_cpu);
err = create_workqueue_thread(cwq, singlethread_cpu);
start_workqueue_thread(cwq, -1);
--
1.6.4.2

2010-06-14 21:40:04

by Tejun Heo

Subject: [PATCH 04/30] workqueue: misc/cosmetic updates

Make the following updates in preparation for concurrency managed
workqueue. None of these changes causes any visible behavior
difference.

* Add comments and adjust indentations to data structures and several
functions.

* Rename wq_per_cpu() to get_cwq() and swap the position of two
parameters for consistency. Convert a direct per_cpu_ptr() access
to wq->cpu_wq to get_cwq().

* Add work_static() and update set_wq_data() such that it sets the
flags part to WORK_STRUCT_PENDING | @extra_flags, plus
WORK_STRUCT_STATIC if the work was statically initialized.

* Move sanity check on work->entry emptiness from queue_work_on() to
__queue_work() which all queueing paths share.

* Make __queue_work() take @cpu and @wq instead of @cwq.

* Restructure flush_work() and __create_workqueue_key() to make them
easier to modify.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 5 ++
kernel/workqueue.c | 131 +++++++++++++++++++++++++++++----------------
2 files changed, 89 insertions(+), 47 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 0697946..e724daf 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -96,9 +96,14 @@ struct execute_work {
#ifdef CONFIG_DEBUG_OBJECTS_WORK
extern void __init_work(struct work_struct *work, int onstack);
extern void destroy_work_on_stack(struct work_struct *work);
+static inline unsigned int work_static(struct work_struct *work)
+{
+ return *work_data_bits(work) & (1 << WORK_STRUCT_STATIC);
+}
#else
static inline void __init_work(struct work_struct *work, int onstack) { }
static inline void destroy_work_on_stack(struct work_struct *work) { }
+static inline unsigned int work_static(struct work_struct *work) { return 0; }
#endif

/*
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1a47fbf..c56146a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -37,6 +37,16 @@
#include <trace/events/workqueue.h>

/*
+ * Structure fields follow one of the following exclusion rules.
+ *
+ * I: Set during initialization and read-only afterwards.
+ *
+ * L: cwq->lock protected. Access with cwq->lock held.
+ *
+ * W: workqueue_lock protected.
+ */
+
+/*
* The per-CPU workqueue (if single thread, we always use the first
* possible cpu).
*/
@@ -48,8 +58,8 @@ struct cpu_workqueue_struct {
wait_queue_head_t more_work;
struct work_struct *current_work;

- struct workqueue_struct *wq;
- struct task_struct *thread;
+ struct workqueue_struct *wq; /* I: the owning workqueue */
+ struct task_struct *thread;
} ____cacheline_aligned;

/*
@@ -57,13 +67,13 @@ struct cpu_workqueue_struct {
* per-CPU workqueues:
*/
struct workqueue_struct {
- struct cpu_workqueue_struct *cpu_wq;
- struct list_head list;
- const char *name;
+ struct cpu_workqueue_struct *cpu_wq; /* I: cwq's */
+ struct list_head list; /* W: list of all workqueues */
+ const char *name; /* I: workqueue name */
int singlethread;
int freezeable; /* Freeze threads during suspend */
#ifdef CONFIG_LOCKDEP
- struct lockdep_map lockdep_map;
+ struct lockdep_map lockdep_map;
#endif
};

@@ -204,8 +214,8 @@ static const struct cpumask *wq_cpu_map(struct workqueue_struct *wq)
? cpu_singlethread_map : cpu_populated_map;
}

-static
-struct cpu_workqueue_struct *wq_per_cpu(struct workqueue_struct *wq, int cpu)
+static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
+ struct workqueue_struct *wq)
{
if (unlikely(is_wq_single_threaded(wq)))
cpu = singlethread_cpu;
@@ -217,15 +227,13 @@ struct cpu_workqueue_struct *wq_per_cpu(struct workqueue_struct *wq, int cpu)
* - Must *only* be called if the pending flag is set
*/
static inline void set_wq_data(struct work_struct *work,
- struct cpu_workqueue_struct *cwq)
+ struct cpu_workqueue_struct *cwq,
+ unsigned long extra_flags)
{
- unsigned long new;
-
BUG_ON(!work_pending(work));

- new = (unsigned long) cwq | (1UL << WORK_STRUCT_PENDING);
- new |= WORK_STRUCT_FLAG_MASK & *work_data_bits(work);
- atomic_long_set(&work->data, new);
+ atomic_long_set(&work->data, (unsigned long)cwq | work_static(work) |
+ (1UL << WORK_STRUCT_PENDING) | extra_flags);
}

/*
@@ -233,9 +241,7 @@ static inline void set_wq_data(struct work_struct *work,
*/
static inline void clear_wq_data(struct work_struct *work)
{
- unsigned long flags = *work_data_bits(work) &
- (1UL << WORK_STRUCT_STATIC);
- atomic_long_set(&work->data, flags);
+ atomic_long_set(&work->data, work_static(work));
}

static inline
@@ -244,29 +250,47 @@ struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
return (void *) (atomic_long_read(&work->data) & WORK_STRUCT_WQ_DATA_MASK);
}

+/**
+ * insert_work - insert a work into cwq
+ * @cwq: cwq @work belongs to
+ * @work: work to insert
+ * @head: insertion point
+ * @extra_flags: extra WORK_STRUCT_* flags to set
+ *
+ * Insert @work into @cwq after @head.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
static void insert_work(struct cpu_workqueue_struct *cwq,
- struct work_struct *work, struct list_head *head)
+ struct work_struct *work, struct list_head *head,
+ unsigned int extra_flags)
{
trace_workqueue_insertion(cwq->thread, work);

- set_wq_data(work, cwq);
+ /* we own @work, set data and link */
+ set_wq_data(work, cwq, extra_flags);
+
/*
* Ensure that we get the right work->data if we see the
* result of list_add() below, see try_to_grab_pending().
*/
smp_wmb();
+
list_add_tail(&work->entry, head);
wake_up(&cwq->more_work);
}

-static void __queue_work(struct cpu_workqueue_struct *cwq,
+static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
struct work_struct *work)
{
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
unsigned long flags;

debug_work_activate(work);
spin_lock_irqsave(&cwq->lock, flags);
- insert_work(cwq, work, &cwq->worklist);
+ BUG_ON(!list_empty(&work->entry));
+ insert_work(cwq, work, &cwq->worklist, 0);
spin_unlock_irqrestore(&cwq->lock, flags);
}

@@ -308,8 +332,7 @@ queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work)
int ret = 0;

if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
- BUG_ON(!list_empty(&work->entry));
- __queue_work(wq_per_cpu(wq, cpu), work);
+ __queue_work(cpu, wq, work);
ret = 1;
}
return ret;
@@ -320,9 +343,8 @@ static void delayed_work_timer_fn(unsigned long __data)
{
struct delayed_work *dwork = (struct delayed_work *)__data;
struct cpu_workqueue_struct *cwq = get_wq_data(&dwork->work);
- struct workqueue_struct *wq = cwq->wq;

- __queue_work(wq_per_cpu(wq, smp_processor_id()), &dwork->work);
+ __queue_work(smp_processor_id(), cwq->wq, &dwork->work);
}

/**
@@ -366,7 +388,7 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
timer_stats_timer_set_start_info(&dwork->timer);

/* This stores cwq for the moment, for the timer_fn */
- set_wq_data(work, wq_per_cpu(wq, raw_smp_processor_id()));
+ set_wq_data(work, get_cwq(raw_smp_processor_id(), wq), 0);
timer->expires = jiffies + delay;
timer->data = (unsigned long)dwork;
timer->function = delayed_work_timer_fn;
@@ -430,6 +452,12 @@ static void run_workqueue(struct cpu_workqueue_struct *cwq)
spin_unlock_irq(&cwq->lock);
}

+/**
+ * worker_thread - the worker thread function
+ * @__cwq: cwq to serve
+ *
+ * The cwq worker thread function.
+ */
static int worker_thread(void *__cwq)
{
struct cpu_workqueue_struct *cwq = __cwq;
@@ -468,6 +496,17 @@ static void wq_barrier_func(struct work_struct *work)
complete(&barr->done);
}

+/**
+ * insert_wq_barrier - insert a barrier work
+ * @cwq: cwq to insert barrier into
+ * @barr: wq_barrier to insert
+ * @head: insertion point
+ *
+ * Insert barrier @barr into @cwq before @head.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
struct wq_barrier *barr, struct list_head *head)
{
@@ -479,11 +518,10 @@ static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
*/
INIT_WORK_ON_STACK(&barr->work, wq_barrier_func);
__set_bit(WORK_STRUCT_PENDING, work_data_bits(&barr->work));
-
init_completion(&barr->done);

debug_work_activate(&barr->work);
- insert_work(cwq, &barr->work, head);
+ insert_work(cwq, &barr->work, head, 0);
}

static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
@@ -517,9 +555,6 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
*
* We sleep until all works which were queued on entry have been handled,
* but we are not livelocked by new incoming ones.
- *
- * This function used to run the workqueues itself. Now we just wait for the
- * helper threads to do it.
*/
void flush_workqueue(struct workqueue_struct *wq)
{
@@ -558,7 +593,6 @@ int flush_work(struct work_struct *work)
lock_map_acquire(&cwq->wq->lockdep_map);
lock_map_release(&cwq->wq->lockdep_map);

- prev = NULL;
spin_lock_irq(&cwq->lock);
if (!list_empty(&work->entry)) {
/*
@@ -567,22 +601,22 @@ int flush_work(struct work_struct *work)
*/
smp_rmb();
if (unlikely(cwq != get_wq_data(work)))
- goto out;
+ goto already_gone;
prev = &work->entry;
} else {
if (cwq->current_work != work)
- goto out;
+ goto already_gone;
prev = &cwq->worklist;
}
insert_wq_barrier(cwq, &barr, prev->next);
-out:
- spin_unlock_irq(&cwq->lock);
- if (!prev)
- return 0;

+ spin_unlock_irq(&cwq->lock);
wait_for_completion(&barr.done);
destroy_work_on_stack(&barr.work);
return 1;
+already_gone:
+ spin_unlock_irq(&cwq->lock);
+ return 0;
}
EXPORT_SYMBOL_GPL(flush_work);

@@ -665,7 +699,7 @@ static void wait_on_work(struct work_struct *work)
cpu_map = wq_cpu_map(wq);

for_each_cpu(cpu, cpu_map)
- wait_on_cpu_work(per_cpu_ptr(wq->cpu_wq, cpu), work);
+ wait_on_cpu_work(get_cwq(cpu, wq), work);
}

static int __cancel_work_timer(struct work_struct *work,
@@ -782,9 +816,8 @@ EXPORT_SYMBOL(schedule_delayed_work);
void flush_delayed_work(struct delayed_work *dwork)
{
if (del_timer_sync(&dwork->timer)) {
- struct cpu_workqueue_struct *cwq;
- cwq = wq_per_cpu(get_wq_data(&dwork->work)->wq, get_cpu());
- __queue_work(cwq, &dwork->work);
+ __queue_work(get_cpu(), get_wq_data(&dwork->work)->wq,
+ &dwork->work);
put_cpu();
}
flush_work(&dwork->work);
@@ -991,13 +1024,11 @@ struct workqueue_struct *__create_workqueue_key(const char *name,

wq = kzalloc(sizeof(*wq), GFP_KERNEL);
if (!wq)
- return NULL;
+ goto err;

wq->cpu_wq = alloc_percpu(struct cpu_workqueue_struct);
- if (!wq->cpu_wq) {
- kfree(wq);
- return NULL;
- }
+ if (!wq->cpu_wq)
+ goto err;

wq->name = name;
lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
@@ -1041,6 +1072,12 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
wq = NULL;
}
return wq;
+err:
+ if (wq) {
+ free_percpu(wq->cpu_wq);
+ kfree(wq);
+ }
+ return NULL;
}
EXPORT_SYMBOL_GPL(__create_workqueue_key);

--
1.6.4.2

2010-06-14 21:41:18

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 29/30] libata: take advantage of cmwq and remove concurrency limitations

libata has two concurrency-related limitations.

a. ata_wq, which is used for polling PIO, has a single thread per CPU.
   If multiple devices are doing polling PIO on the same CPU, they
   can't be served simultaneously.

b. ata_aux_wq, which is used for SCSI probing, has a single thread.
   If SCSI probing stalls for an extended period of time, which is
   possible for ATAPI devices, this stalls all probing.

#a is solved by increasing the maximum concurrency of ata_wq.  Note
that polling PIO might be used in the memory allocation path and thus
needs to be served by a separate wq with a rescuer.

#b is solved by using the default wq instead and achieving exclusion
via a per-port mutex.
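
For reference, a minimal sketch of the exclusion scheme used for #b
(struct ata_port_sketch and hotplug_fn() stand in for struct ata_port
and the real work functions in the patch below; error handling and the
surrounding libata context are omitted):

#include <linux/mutex.h>
#include <linux/workqueue.h>

struct ata_port_sketch {
	struct mutex		scsi_scan_mutex;	/* serializes scan/rescan */
	struct delayed_work	hotplug_task;
	struct work_struct	scsi_rescan_task;
};

/* Both work functions take the per-port mutex, so hotplug scanning and
 * SCSI rescan can run on the shared system workqueues without racing
 * against each other on the same port. */
static void hotplug_fn(struct work_struct *work)
{
	struct ata_port_sketch *ap =
		container_of(to_delayed_work(work), struct ata_port_sketch,
			     hotplug_task);

	mutex_lock(&ap->scsi_scan_mutex);
	/* ... unplug detached devices, scan for new ones ... */
	mutex_unlock(&ap->scsi_scan_mutex);
}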

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jeff Garzik <[email protected]>
---
drivers/ata/libata-core.c | 20 +++++---------------
drivers/ata/libata-eh.c | 4 ++--
drivers/ata/libata-scsi.c | 10 ++++++----
drivers/ata/libata-sff.c | 9 +--------
drivers/ata/libata.h | 1 -
include/linux/libata.h | 1 +
6 files changed, 15 insertions(+), 30 deletions(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index ddf8e48..4f78741 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -98,8 +98,6 @@ static unsigned long ata_dev_blacklisted(const struct ata_device *dev);

unsigned int ata_print_id = 1;

-struct workqueue_struct *ata_aux_wq;
-
struct ata_force_param {
const char *name;
unsigned int cbl;
@@ -5611,6 +5609,7 @@ struct ata_port *ata_port_alloc(struct ata_host *host)
ap->msg_enable = ATA_MSG_DRV | ATA_MSG_ERR | ATA_MSG_WARN;
#endif

+ mutex_init(&ap->scsi_scan_mutex);
INIT_DELAYED_WORK(&ap->hotplug_task, ata_scsi_hotplug);
INIT_WORK(&ap->scsi_rescan_task, ata_scsi_dev_rescan);
INIT_LIST_HEAD(&ap->eh_done_q);
@@ -6549,29 +6548,20 @@ static int __init ata_init(void)

ata_parse_force_param();

- ata_aux_wq = create_singlethread_workqueue("ata_aux");
- if (!ata_aux_wq)
- goto fail;
-
rc = ata_sff_init();
- if (rc)
- goto fail;
+ if (rc) {
+ kfree(ata_force_tbl);
+ return rc;
+ }

printk(KERN_DEBUG "libata version " DRV_VERSION " loaded.\n");
return 0;
-
-fail:
- kfree(ata_force_tbl);
- if (ata_aux_wq)
- destroy_workqueue(ata_aux_wq);
- return rc;
}

static void __exit ata_exit(void)
{
ata_sff_exit();
kfree(ata_force_tbl);
- destroy_workqueue(ata_aux_wq);
}

subsys_initcall(ata_init);
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index f77a673..4d2af82 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -727,7 +727,7 @@ void ata_scsi_error(struct Scsi_Host *host)
if (ap->pflags & ATA_PFLAG_LOADING)
ap->pflags &= ~ATA_PFLAG_LOADING;
else if (ap->pflags & ATA_PFLAG_SCSI_HOTPLUG)
- queue_delayed_work(ata_aux_wq, &ap->hotplug_task, 0);
+ schedule_delayed_work(&ap->hotplug_task, 0);

if (ap->pflags & ATA_PFLAG_RECOVERED)
ata_port_printk(ap, KERN_INFO, "EH complete\n");
@@ -2944,7 +2944,7 @@ static int ata_eh_revalidate_and_attach(struct ata_link *link,
ehc->i.flags |= ATA_EHI_SETMODE;

/* schedule the scsi_rescan_device() here */
- queue_work(ata_aux_wq, &(ap->scsi_rescan_task));
+ schedule_work(&(ap->scsi_rescan_task));
} else if (dev->class == ATA_DEV_UNKNOWN &&
ehc->tries[dev->devno] &&
ata_class_enabled(ehc->classes[dev->devno])) {
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index a54273d..d75c9c4 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3435,7 +3435,7 @@ void ata_scsi_scan_host(struct ata_port *ap, int sync)
" switching to async\n");
}

- queue_delayed_work(ata_aux_wq, &ap->hotplug_task,
+ queue_delayed_work(system_long_wq, &ap->hotplug_task,
round_jiffies_relative(HZ));
}

@@ -3582,6 +3582,7 @@ void ata_scsi_hotplug(struct work_struct *work)
}

DPRINTK("ENTER\n");
+ mutex_lock(&ap->scsi_scan_mutex);

/* Unplug detached devices. We cannot use link iterator here
* because PMP links have to be scanned even if PMP is
@@ -3595,6 +3596,7 @@ void ata_scsi_hotplug(struct work_struct *work)
/* scan for new ones */
ata_scsi_scan_host(ap, 0);

+ mutex_unlock(&ap->scsi_scan_mutex);
DPRINTK("EXIT\n");
}

@@ -3673,9 +3675,7 @@ static int ata_scsi_user_scan(struct Scsi_Host *shost, unsigned int channel,
* @work: Pointer to ATA port to perform scsi_rescan_device()
*
* After ATA pass thru (SAT) commands are executed successfully,
- * libata need to propagate the changes to SCSI layer. This
- * function must be executed from ata_aux_wq such that sdev
- * attach/detach don't race with rescan.
+ * libata need to propagate the changes to SCSI layer.
*
* LOCKING:
* Kernel thread context (may sleep).
@@ -3688,6 +3688,7 @@ void ata_scsi_dev_rescan(struct work_struct *work)
struct ata_device *dev;
unsigned long flags;

+ mutex_lock(&ap->scsi_scan_mutex);
spin_lock_irqsave(ap->lock, flags);

ata_for_each_link(link, ap, EDGE) {
@@ -3707,6 +3708,7 @@ void ata_scsi_dev_rescan(struct work_struct *work)
}

spin_unlock_irqrestore(ap->lock, flags);
+ mutex_unlock(&ap->scsi_scan_mutex);
}

/**
diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
index efa4a18..dd57815 100644
--- a/drivers/ata/libata-sff.c
+++ b/drivers/ata/libata-sff.c
@@ -3318,14 +3318,7 @@ void ata_sff_port_init(struct ata_port *ap)

int __init ata_sff_init(void)
{
- /*
- * FIXME: In UP case, there is only one workqueue thread and if you
- * have more than one PIO device, latency is bloody awful, with
- * occasional multi-second "hiccups" as one PIO device waits for
- * another. It's an ugly wart that users DO occasionally complain
- * about; luckily most users have at most one PIO polled device.
- */
- ata_sff_wq = create_workqueue("ata_sff");
+ ata_sff_wq = __create_workqueue("ata_sff", WQ_RESCUER, WQ_MAX_ACTIVE);
if (!ata_sff_wq)
return -ENOMEM;

diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
index 4b84ed6..9ce1ecc 100644
--- a/drivers/ata/libata.h
+++ b/drivers/ata/libata.h
@@ -54,7 +54,6 @@ enum {
};

extern unsigned int ata_print_id;
-extern struct workqueue_struct *ata_aux_wq;
extern int atapi_passthru16;
extern int libata_fua;
extern int libata_noacpi;
diff --git a/include/linux/libata.h b/include/linux/libata.h
index b85f3ff..f010f18 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -751,6 +751,7 @@ struct ata_port {
struct ata_host *host;
struct device *dev;

+ struct mutex scsi_scan_mutex;
struct delayed_work hotplug_task;
struct work_struct scsi_rescan_task;

--
1.6.4.2

2010-06-14 21:39:25

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 23/30] workqueue: use shared worklist and pool all workers per cpu

Use gcwq->worklist instead of cwq->worklist and break the strict
association between a cwq and its worker. All works queued on a cpu
are queued on gcwq->worklist and processed by any available worker on
the gcwq.

As there no longer is a strict association between a cwq and its
worker, whether a work is executing can now only be determined by
calling [__]find_worker_executing_work().

After this change, the only association between a cwq and its worker
is that a cwq puts a worker into shared worker pool on creation and
kills it on destruction. As all workqueues are still limited to
max_active of one, this means that there are always at least as many
workers as active works and thus there's no danger of deadlock.

The break of strong association between cwqs and workers requires
somewhat clumsy changes to current_is_keventd() and
destroy_workqueue(). Dynamic worker pool management will remove both
clumsy changes. current_is_keventd() won't be necessary at all, as the
only reason it exists is to avoid queueing a work from within a work,
which will be allowed just fine. The clumsy part of destroy_workqueue() is
added because a worker can only be destroyed while idle and there's no
guarantee a worker is idle when its wq is going down. With dynamic
pool management, workers are not associated with workqueues at all and
only idle ones will be submitted to destroy_workqueue() so the code
won't be necessary anymore.
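
In other words, the queueing side now boils down to something like the
following (a condensed sketch of the insert_work()/wake_up_worker()
path added below; queue_on_gcwq() is a made-up name and the real code
also handles flags, colors and max_active):

/* Condensed sketch: all cwqs of a cpu feed one shared worklist and any
 * idle worker of that cpu may pick the work up.  Must be called with
 * gcwq->lock held. */
static void queue_on_gcwq(struct global_cwq *gcwq, struct work_struct *work)
{
	struct worker *worker;

	list_add_tail(&work->entry, &gcwq->worklist);

	if (!list_empty(&gcwq->idle_list)) {
		worker = list_first_entry(&gcwq->idle_list, struct worker,
					  entry);
		wake_up_process(worker->task);
	}
}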

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 131 +++++++++++++++++++++++++++++++++++++++-------------
1 files changed, 99 insertions(+), 32 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7994edb..e0a7609 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -34,6 +34,7 @@
#include <linux/debug_locks.h>
#include <linux/lockdep.h>
#include <linux/idr.h>
+#include <linux/delay.h>

enum {
/* global_cwq flags */
@@ -72,7 +73,6 @@ enum {
*/

struct global_cwq;
-struct cpu_workqueue_struct;

struct worker {
/* on idle list while idle, on busy hash table while busy */
@@ -86,7 +86,6 @@ struct worker {
struct list_head scheduled; /* L: scheduled works */
struct task_struct *task; /* I: worker task */
struct global_cwq *gcwq; /* I: the associated gcwq */
- struct cpu_workqueue_struct *cwq; /* I: the associated cwq */
unsigned int flags; /* L: flags */
int id; /* I: worker id */
};
@@ -96,6 +95,7 @@ struct worker {
*/
struct global_cwq {
spinlock_t lock; /* the gcwq lock */
+ struct list_head worklist; /* L: list of pending works */
unsigned int cpu; /* I: the associated cpu */
unsigned int flags; /* L: GCWQ_* flags */

@@ -121,7 +121,6 @@ struct global_cwq {
*/
struct cpu_workqueue_struct {
struct global_cwq *gcwq; /* I: the associated gcwq */
- struct list_head worklist;
struct worker *worker;
struct workqueue_struct *wq; /* I: the owning workqueue */
int work_color; /* L: current color */
@@ -386,6 +385,32 @@ static struct global_cwq *get_work_gcwq(struct work_struct *work)
return get_gcwq(cpu);
}

+/* Return the first worker. Safe with preemption disabled */
+static struct worker *first_worker(struct global_cwq *gcwq)
+{
+ if (unlikely(list_empty(&gcwq->idle_list)))
+ return NULL;
+
+ return list_first_entry(&gcwq->idle_list, struct worker, entry);
+}
+
+/**
+ * wake_up_worker - wake up an idle worker
+ * @gcwq: gcwq to wake worker for
+ *
+ * Wake up the first idle worker of @gcwq.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ */
+static void wake_up_worker(struct global_cwq *gcwq)
+{
+ struct worker *worker = first_worker(gcwq);
+
+ if (likely(worker))
+ wake_up_process(worker->task);
+}
+
/**
* busy_worker_head - return the busy hash head for a work
* @gcwq: gcwq of interest
@@ -467,13 +492,14 @@ static struct worker *find_worker_executing_work(struct global_cwq *gcwq,
}

/**
- * insert_work - insert a work into cwq
+ * insert_work - insert a work into gcwq
* @cwq: cwq @work belongs to
* @work: work to insert
* @head: insertion point
* @extra_flags: extra WORK_STRUCT_* flags to set
*
- * Insert @work into @cwq after @head.
+ * Insert @work which belongs to @cwq into @gcwq after @head.
+ * @extra_flags is or'd to work_struct flags.
*
* CONTEXT:
* spin_lock_irq(gcwq->lock).
@@ -492,7 +518,7 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
smp_wmb();

list_add_tail(&work->entry, head);
- wake_up_process(cwq->worker->task);
+ wake_up_worker(cwq->gcwq);
}

/**
@@ -608,7 +634,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,

if (likely(cwq->nr_active < cwq->max_active)) {
cwq->nr_active++;
- worklist = &cwq->worklist;
+ worklist = &gcwq->worklist;
} else
worklist = &cwq->delayed_works;

@@ -793,10 +819,10 @@ static struct worker *alloc_worker(void)

/**
* create_worker - create a new workqueue worker
- * @cwq: cwq the new worker will belong to
+ * @gcwq: gcwq the new worker will belong to
* @bind: whether to set affinity to @cpu or not
*
- * Create a new worker which is bound to @cwq. The returned worker
+ * Create a new worker which is bound to @gcwq. The returned worker
* can be started by calling start_worker() or destroyed using
* destroy_worker().
*
@@ -806,9 +832,8 @@ static struct worker *alloc_worker(void)
* RETURNS:
* Pointer to the newly created worker.
*/
-static struct worker *create_worker(struct cpu_workqueue_struct *cwq, bool bind)
+static struct worker *create_worker(struct global_cwq *gcwq, bool bind)
{
- struct global_cwq *gcwq = cwq->gcwq;
int id = -1;
struct worker *worker = NULL;

@@ -826,7 +851,6 @@ static struct worker *create_worker(struct cpu_workqueue_struct *cwq, bool bind)
goto fail;

worker->gcwq = gcwq;
- worker->cwq = cwq;
worker->id = id;

worker->task = kthread_create(worker_thread, worker, "kworker/%u:%d",
@@ -953,7 +977,7 @@ static void cwq_activate_first_delayed(struct cpu_workqueue_struct *cwq)
struct work_struct *work = list_first_entry(&cwq->delayed_works,
struct work_struct, entry);

- move_linked_works(work, &cwq->worklist, NULL);
+ move_linked_works(work, &cwq->gcwq->worklist, NULL);
cwq->nr_active++;
}

@@ -1021,11 +1045,12 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)
*/
static void process_one_work(struct worker *worker, struct work_struct *work)
{
- struct cpu_workqueue_struct *cwq = worker->cwq;
+ struct cpu_workqueue_struct *cwq = get_work_cwq(work);
struct global_cwq *gcwq = cwq->gcwq;
struct hlist_head *bwh = busy_worker_head(gcwq, work);
work_func_t f = work->func;
int work_color;
+ struct worker *collision;
#ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the struct work_struct from
@@ -1036,6 +1061,18 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
*/
struct lockdep_map lockdep_map = work->lockdep_map;
#endif
+ /*
+ * A single work shouldn't be executed concurrently by
+ * multiple workers on a single cpu. Check whether anyone is
+ * already processing the work. If so, defer the work to the
+ * currently executing one.
+ */
+ collision = __find_worker_executing_work(gcwq, bwh, work);
+ if (unlikely(collision)) {
+ move_linked_works(work, &collision->scheduled, NULL);
+ return;
+ }
+
/* claim and process */
debug_work_deactivate(work);
hlist_add_head(&worker->hentry, bwh);
@@ -1043,7 +1080,6 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
worker->current_cwq = cwq;
work_color = get_work_color(work);

- BUG_ON(get_work_cwq(work) != cwq);
/* record the current cpu number in the work data and dequeue */
set_work_cpu(work, gcwq->cpu);
list_del_init(&work->entry);
@@ -1107,7 +1143,6 @@ static int worker_thread(void *__worker)
{
struct worker *worker = __worker;
struct global_cwq *gcwq = worker->gcwq;
- struct cpu_workqueue_struct *cwq = worker->cwq;

woke_up:
spin_lock_irq(&gcwq->lock);
@@ -1127,9 +1162,9 @@ recheck:
*/
BUG_ON(!list_empty(&worker->scheduled));

- while (!list_empty(&cwq->worklist)) {
+ while (!list_empty(&gcwq->worklist)) {
struct work_struct *work =
- list_first_entry(&cwq->worklist,
+ list_first_entry(&gcwq->worklist,
struct work_struct, entry);

/*
@@ -1844,18 +1879,37 @@ int keventd_up(void)

int current_is_keventd(void)
{
- struct cpu_workqueue_struct *cwq;
- int cpu = raw_smp_processor_id(); /* preempt-safe: keventd is per-cpu */
- int ret = 0;
+ bool found = false;
+ unsigned int cpu;

- BUG_ON(!keventd_wq);
+ /*
+ * There no longer is one-to-one relation between worker and
+ * work queue and a worker task might be unbound from its cpu
+ * if the cpu was offlined. Match all busy workers. This
+ * function will go away once dynamic pool is implemented.
+ */
+ for_each_possible_cpu(cpu) {
+ struct global_cwq *gcwq = get_gcwq(cpu);
+ struct worker *worker;
+ struct hlist_node *pos;
+ unsigned long flags;
+ int i;

- cwq = get_cwq(cpu, keventd_wq);
- if (current == cwq->worker->task)
- ret = 1;
+ spin_lock_irqsave(&gcwq->lock, flags);

- return ret;
+ for_each_busy_worker(worker, i, pos, gcwq) {
+ if (worker->task == current) {
+ found = true;
+ break;
+ }
+ }
+
+ spin_unlock_irqrestore(&gcwq->lock, flags);
+ if (found)
+ break;
+ }

+ return found;
}

static struct cpu_workqueue_struct *alloc_cwqs(void)
@@ -1947,12 +2001,11 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
cwq->wq = wq;
cwq->flush_color = -1;
cwq->max_active = max_active;
- INIT_LIST_HEAD(&cwq->worklist);
INIT_LIST_HEAD(&cwq->delayed_works);

if (failed)
continue;
- cwq->worker = create_worker(cwq, cpu_online(cpu));
+ cwq->worker = create_worker(gcwq, cpu_online(cpu));
if (cwq->worker)
start_worker(cwq->worker);
else
@@ -2014,13 +2067,26 @@ void destroy_workqueue(struct workqueue_struct *wq)

for_each_possible_cpu(cpu) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+ struct global_cwq *gcwq = cwq->gcwq;
int i;

if (cwq->worker) {
- spin_lock_irq(&cwq->gcwq->lock);
+ retry:
+ spin_lock_irq(&gcwq->lock);
+ /*
+ * Worker can only be destroyed while idle.
+ * Wait till it becomes idle. This is ugly
+ * and prone to starvation. It will go away
+ * once dynamic worker pool is implemented.
+ */
+ if (!(cwq->worker->flags & WORKER_IDLE)) {
+ spin_unlock_irq(&gcwq->lock);
+ msleep(100);
+ goto retry;
+ }
destroy_worker(cwq->worker);
cwq->worker = NULL;
- spin_unlock_irq(&cwq->gcwq->lock);
+ spin_unlock_irq(&gcwq->lock);
}

for (i = 0; i < WORK_NR_COLORS; i++)
@@ -2318,7 +2384,7 @@ EXPORT_SYMBOL_GPL(work_on_cpu);
*
* Start freezing workqueues. After this function returns, all
* freezeable workqueues will queue new works to their frozen_works
- * list instead of the cwq ones.
+ * list instead of gcwq->worklist.
*
* CONTEXT:
* Grabs and releases workqueue_lock and gcwq->lock's.
@@ -2404,7 +2470,7 @@ out_unlock:
* thaw_workqueues - thaw workqueues
*
* Thaw workqueues. Normal queueing is restored and all collected
- * frozen works are transferred to their respective cwq worklists.
+ * frozen works are transferred to their respective gcwq worklists.
*
* CONTEXT:
* Grabs and releases workqueue_lock and gcwq->lock's.
@@ -2485,6 +2551,7 @@ void __init init_workqueues(void)
struct global_cwq *gcwq = get_gcwq(cpu);

spin_lock_init(&gcwq->lock);
+ INIT_LIST_HEAD(&gcwq->worklist);
gcwq->cpu = cpu;

INIT_LIST_HEAD(&gcwq->idle_list);
--
1.6.4.2

2010-06-14 21:42:00

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 30/30] async: use workqueue for worker pool

Replace private worker pool with system_long_wq.
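
For callers the resulting model is just the regular work item API; a
minimal sketch (my_async_fn/my_async_work/kick_async are illustrative
names, not part of the patch):

#include <linux/workqueue.h>

/* Runs in process context on one of the shared kworker threads. */
static void my_async_fn(struct work_struct *work)
{
	/* ... do the asynchronous part ... */
}

static DECLARE_WORK(my_async_work, my_async_fn);

static void kick_async(void)
{
	/* scheduling the work replaces waking the old private async/%i threads */
	queue_work(system_long_wq, &my_async_work);
}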

Signed-off-by: Tejun Heo <[email protected]>
Cc: Arjan van de Ven <[email protected]>
---
kernel/async.c | 140 ++++++++-----------------------------------------------
1 files changed, 21 insertions(+), 119 deletions(-)

diff --git a/kernel/async.c b/kernel/async.c
index 15319d6..c285258 100644
--- a/kernel/async.c
+++ b/kernel/async.c
@@ -49,40 +49,32 @@ asynchronous and synchronous parts of the kernel.
*/

#include <linux/async.h>
-#include <linux/bug.h>
#include <linux/module.h>
#include <linux/wait.h>
#include <linux/sched.h>
-#include <linux/init.h>
-#include <linux/kthread.h>
-#include <linux/delay.h>
#include <linux/slab.h>
#include <asm/atomic.h>

static async_cookie_t next_cookie = 1;

-#define MAX_THREADS 256
#define MAX_WORK 32768

static LIST_HEAD(async_pending);
static LIST_HEAD(async_running);
static DEFINE_SPINLOCK(async_lock);

-static int async_enabled = 0;
-
struct async_entry {
- struct list_head list;
- async_cookie_t cookie;
- async_func_ptr *func;
- void *data;
- struct list_head *running;
+ struct list_head list;
+ struct work_struct work;
+ async_cookie_t cookie;
+ async_func_ptr *func;
+ void *data;
+ struct list_head *running;
};

static DECLARE_WAIT_QUEUE_HEAD(async_done);
-static DECLARE_WAIT_QUEUE_HEAD(async_new);

static atomic_t entry_count;
-static atomic_t thread_count;

extern int initcall_debug;

@@ -117,27 +109,23 @@ static async_cookie_t lowest_in_progress(struct list_head *running)
spin_unlock_irqrestore(&async_lock, flags);
return ret;
}
+
/*
* pick the first pending entry and run it
*/
-static void run_one_entry(void)
+static void async_run_entry_fn(struct work_struct *work)
{
+ struct async_entry *entry =
+ container_of(work, struct async_entry, work);
unsigned long flags;
- struct async_entry *entry;
ktime_t calltime, delta, rettime;

- /* 1) pick one task from the pending queue */
-
+ /* 1) move self to the running queue */
spin_lock_irqsave(&async_lock, flags);
- if (list_empty(&async_pending))
- goto out;
- entry = list_first_entry(&async_pending, struct async_entry, list);
-
- /* 2) move it to the running queue */
list_move_tail(&entry->list, entry->running);
spin_unlock_irqrestore(&async_lock, flags);

- /* 3) run it (and print duration)*/
+ /* 2) run (and print duration) */
if (initcall_debug && system_state == SYSTEM_BOOTING) {
printk("calling %lli_%pF @ %i\n", (long long)entry->cookie,
entry->func, task_pid_nr(current));
@@ -153,31 +141,25 @@ static void run_one_entry(void)
(long long)ktime_to_ns(delta) >> 10);
}

- /* 4) remove it from the running queue */
+ /* 3) remove self from the running queue */
spin_lock_irqsave(&async_lock, flags);
list_del(&entry->list);

- /* 5) free the entry */
+ /* 4) free the entry */
kfree(entry);
atomic_dec(&entry_count);

spin_unlock_irqrestore(&async_lock, flags);

- /* 6) wake up any waiters. */
+ /* 5) wake up any waiters */
wake_up(&async_done);
- return;
-
-out:
- spin_unlock_irqrestore(&async_lock, flags);
}

-
static async_cookie_t __async_schedule(async_func_ptr *ptr, void *data, struct list_head *running)
{
struct async_entry *entry;
unsigned long flags;
async_cookie_t newcookie;
-

/* allow irq-off callers */
entry = kzalloc(sizeof(struct async_entry), GFP_ATOMIC);
@@ -186,7 +168,7 @@ static async_cookie_t __async_schedule(async_func_ptr *ptr, void *data, struct l
* If we're out of memory or if there's too much work
* pending already, we execute synchronously.
*/
- if (!async_enabled || !entry || atomic_read(&entry_count) > MAX_WORK) {
+ if (!entry || atomic_read(&entry_count) > MAX_WORK) {
kfree(entry);
spin_lock_irqsave(&async_lock, flags);
newcookie = next_cookie++;
@@ -196,6 +178,7 @@ static async_cookie_t __async_schedule(async_func_ptr *ptr, void *data, struct l
ptr(data, newcookie);
return newcookie;
}
+ INIT_WORK(&entry->work, async_run_entry_fn);
entry->func = ptr;
entry->data = data;
entry->running = running;
@@ -205,7 +188,10 @@ static async_cookie_t __async_schedule(async_func_ptr *ptr, void *data, struct l
list_add_tail(&entry->list, &async_pending);
atomic_inc(&entry_count);
spin_unlock_irqrestore(&async_lock, flags);
- wake_up(&async_new);
+
+ /* schedule for execution */
+ queue_work(system_long_wq, &entry->work);
+
return newcookie;
}

@@ -312,87 +298,3 @@ void async_synchronize_cookie(async_cookie_t cookie)
async_synchronize_cookie_domain(cookie, &async_running);
}
EXPORT_SYMBOL_GPL(async_synchronize_cookie);
-
-
-static int async_thread(void *unused)
-{
- DECLARE_WAITQUEUE(wq, current);
- add_wait_queue(&async_new, &wq);
-
- while (!kthread_should_stop()) {
- int ret = HZ;
- set_current_state(TASK_INTERRUPTIBLE);
- /*
- * check the list head without lock.. false positives
- * are dealt with inside run_one_entry() while holding
- * the lock.
- */
- rmb();
- if (!list_empty(&async_pending))
- run_one_entry();
- else
- ret = schedule_timeout(HZ);
-
- if (ret == 0) {
- /*
- * we timed out, this means we as thread are redundant.
- * we sign off and die, but we to avoid any races there
- * is a last-straw check to see if work snuck in.
- */
- atomic_dec(&thread_count);
- wmb(); /* manager must see our departure first */
- if (list_empty(&async_pending))
- break;
- /*
- * woops work came in between us timing out and us
- * signing off; we need to stay alive and keep working.
- */
- atomic_inc(&thread_count);
- }
- }
- remove_wait_queue(&async_new, &wq);
-
- return 0;
-}
-
-static int async_manager_thread(void *unused)
-{
- DECLARE_WAITQUEUE(wq, current);
- add_wait_queue(&async_new, &wq);
-
- while (!kthread_should_stop()) {
- int tc, ec;
-
- set_current_state(TASK_INTERRUPTIBLE);
-
- tc = atomic_read(&thread_count);
- rmb();
- ec = atomic_read(&entry_count);
-
- while (tc < ec && tc < MAX_THREADS) {
- if (IS_ERR(kthread_run(async_thread, NULL, "async/%i",
- tc))) {
- msleep(100);
- continue;
- }
- atomic_inc(&thread_count);
- tc++;
- }
-
- schedule();
- }
- remove_wait_queue(&async_new, &wq);
-
- return 0;
-}
-
-static int __init async_init(void)
-{
- async_enabled =
- !IS_ERR(kthread_run(async_manager_thread, NULL, "async/mgr"));
-
- WARN_ON(!async_enabled);
- return 0;
-}
-
-core_initcall(async_init);
--
1.6.4.2

2010-06-14 21:41:55

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 27/30] workqueue: implement DEBUGFS/workqueue

Implement DEBUGFS/workqueue which lists all workers and works for
debugging. Workqueues can also have a ->show_work() callback which
describes a pending or running work in a custom way. If ->show_work()
is missing or returns %false, the wchan is printed.

# cat /sys/kernel/debug/workqueue

 CPU   ID   PID        WORK ADDR WORKQUEUE     TIME DESC
==== ==== ===== ================ ============ ===== ============================
   0    0    15 ffffffffa0004708 test-wq-04     1 s test_work_fn+0x469/0x690 [test_wq]
   0    2  4146                  <IDLE>         0us
   0    1    21                  <IDLE>         4 s
   0 DELA       ffffffffa00047b0 test-wq-04     1 s test work 2
   1    1   418                  <IDLE>       780ms
   1    0    16                  <IDLE>        40 s
   1    2   443                  <IDLE>        40 s

Workqueue debugfs support was suggested by David Howells and the
implementation mostly mimics that of slow-work.

* Anton Blanchard spotted that the ITER_* constants overflow with high
  CPU configurations. This was caused by using roundup_pow_of_two()
  where order_base_2() should have been used. Fixed.
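
For reference, hooking into this from a workqueue user looks like the
following (sketch only; test_show_work()/test_wq_init() and the output
format are made up, the callback signature and workqueue_set_show_work()
are the ones added below):

#ifdef CONFIG_WORKQUEUE_DEBUGFS
#include <linux/seq_file.h>
#include <linux/workqueue.h>

/* Print a one-line description of @work; no trailing newline.
 * Returning true tells the debugfs code not to print the wchan. */
static bool test_show_work(struct seq_file *m, struct work_struct *work,
			   bool running)
{
	seq_printf(m, "test work %p%s", work, running ? " (running)" : "");
	return true;
}

static void test_wq_init(struct workqueue_struct *wq)
{
	workqueue_set_show_work(wq, test_show_work);
}
#endif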

Signed-off-by: Tejun Heo <[email protected]>
Cc: David Howells <[email protected]>
Cc: Anton Blanchard <[email protected]>
---
include/linux/workqueue.h | 12 ++
kernel/workqueue.c | 369 ++++++++++++++++++++++++++++++++++++++++++++-
lib/Kconfig.debug | 7 +
3 files changed, 384 insertions(+), 4 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index e8c3410..850942a 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -17,6 +17,10 @@ struct workqueue_struct;
struct work_struct;
typedef void (*work_func_t)(struct work_struct *work);

+struct seq_file;
+typedef bool (*show_work_func_t)(struct seq_file *m, struct work_struct *work,
+ bool running);
+
/*
* The first word is the work queue pointer and the flags rolled into
* one
@@ -70,6 +74,9 @@ struct work_struct {
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
+#ifdef CONFIG_WORKQUEUE_DEBUGFS
+ unsigned long timestamp; /* timestamp for debugfs */
+#endif
};

#define WORK_DATA_INIT() ATOMIC_LONG_INIT(WORK_STRUCT_NO_CPU)
@@ -282,6 +289,11 @@ __create_workqueue_key(const char *name, unsigned int flags, int max_active,
#define create_singlethread_workqueue(name) \
__create_workqueue((name), WQ_SINGLE_CPU | WQ_RESCUER, 1)

+#ifdef CONFIG_WORKQUEUE_DEBUGFS
+extern void workqueue_set_show_work(struct workqueue_struct *wq,
+ show_work_func_t show);
+#endif
+
extern void destroy_workqueue(struct workqueue_struct *wq);

extern int queue_work(struct workqueue_struct *wq, struct work_struct *work);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b829ddb..ae6e4c7 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -119,7 +119,7 @@ struct worker {
struct task_struct *task; /* I: worker task */
struct global_cwq *gcwq; /* I: the associated gcwq */
/* 64 bytes boundary on 64bit, 32 on 32bit */
- unsigned long last_active; /* L: last active timestamp */
+ unsigned long timestamp; /* L: last active timestamp */
unsigned int flags; /* ?: flags */
int id; /* I: worker id */
struct work_struct rebind_work; /* L: rebind worker to cpu */
@@ -153,6 +153,9 @@ struct global_cwq {
unsigned int trustee_state; /* L: trustee state */
wait_queue_head_t trustee_wait; /* trustee wait */
struct worker *first_idle; /* L: first idle worker */
+#ifdef CONFIG_WORKQUEUE_DEBUGFS
+ struct worker *manager; /* L: the current manager */
+#endif
} ____cacheline_aligned_in_smp;

/*
@@ -208,6 +211,9 @@ struct workqueue_struct {
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
+#ifdef CONFIG_WORKQUEUE_DEBUGFS
+ show_work_func_t show_work; /* I: show work to debugfs */
+#endif
};

struct workqueue_struct *system_wq __read_mostly;
@@ -331,6 +337,27 @@ static inline void debug_work_activate(struct work_struct *work) { }
static inline void debug_work_deactivate(struct work_struct *work) { }
#endif

+#ifdef CONFIG_WORKQUEUE_DEBUGFS
+static void work_set_queued_at(struct work_struct *work)
+{
+ work->timestamp = jiffies;
+}
+
+static void worker_set_started_at(struct worker *worker)
+{
+ worker->timestamp = jiffies;
+}
+
+static void gcwq_set_manager(struct global_cwq *gcwq, struct worker *worker)
+{
+ gcwq->manager = worker;
+}
+#else
+static void work_set_queued_at(struct work_struct *work) { }
+static void worker_set_started_at(struct worker *worker) { }
+static void gcwq_set_manager(struct global_cwq *gcwq, struct worker *worker) { }
+#endif
+
/* Serializes the accesses to the list of workqueues. */
static DEFINE_SPINLOCK(workqueue_lock);
static LIST_HEAD(workqueues);
@@ -685,6 +712,8 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
{
struct global_cwq *gcwq = cwq->gcwq;

+ work_set_queued_at(work);
+
/* we own @work, set data and link */
set_work_cwq(work, cwq, extra_flags);

@@ -964,7 +993,7 @@ static void worker_enter_idle(struct worker *worker)

worker->flags |= WORKER_IDLE;
gcwq->nr_idle++;
- worker->last_active = jiffies;
+ worker->timestamp = jiffies;

/* idle_list is LIFO */
list_add(&worker->entry, &gcwq->idle_list);
@@ -1219,7 +1248,7 @@ static void idle_worker_timeout(unsigned long __gcwq)

/* idle_list is kept in LIFO order, check the last one */
worker = list_entry(gcwq->idle_list.prev, struct worker, entry);
- expires = worker->last_active + IDLE_WORKER_TIMEOUT;
+ expires = worker->timestamp + IDLE_WORKER_TIMEOUT;

if (time_before(jiffies, expires))
mod_timer(&gcwq->idle_timer, expires);
@@ -1357,7 +1386,7 @@ static bool maybe_destroy_workers(struct global_cwq *gcwq)
unsigned long expires;

worker = list_entry(gcwq->idle_list.prev, struct worker, entry);
- expires = worker->last_active + IDLE_WORKER_TIMEOUT;
+ expires = worker->timestamp + IDLE_WORKER_TIMEOUT;

if (time_before(jiffies, expires)) {
mod_timer(&gcwq->idle_timer, expires);
@@ -1401,6 +1430,7 @@ static bool manage_workers(struct worker *worker)

gcwq->flags &= ~GCWQ_MANAGE_WORKERS;
gcwq->flags |= GCWQ_MANAGING_WORKERS;
+ gcwq_set_manager(gcwq, worker);

/*
* Destroy and then create so that may_start_working() is true
@@ -1410,6 +1440,7 @@ static bool manage_workers(struct worker *worker)
ret |= maybe_create_worker(gcwq);

gcwq->flags &= ~GCWQ_MANAGING_WORKERS;
+ gcwq_set_manager(gcwq, NULL);

/*
* The trustee might be waiting to take over the manager
@@ -1574,6 +1605,8 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
set_work_cpu(work, gcwq->cpu);
list_del_init(&work->entry);

+ worker_set_started_at(worker);
+
spin_unlock_irq(&gcwq->lock);

work_clear_pending(work);
@@ -3180,6 +3213,334 @@ out_unlock:
}
#endif /* CONFIG_FREEZER */

+#ifdef CONFIG_WORKQUEUE_DEBUGFS
+
+#include <linux/seq_file.h>
+#include <linux/log2.h>
+#include <linux/debugfs.h>
+
+/**
+ * workqueue_set_show_work - set show_work callback for a workqueue
+ * @wq: workqueue of interest
+ * @show: show_work callback
+ *
+ * Set show_work callback of @wq to @show. This is used by workqueue
+ * debugfs support to allow workqueue users to describe the work with
+ * specific details.
+ *
+ * bool (*@show)(struct seq_file *m, struct work_struct *work, bool running);
+ *
+ * It should print to @m without new line. If @running is set, @show
+ * is responsible for ensuring @work is still accessible. %true
+ * return suppresses wchan printout.
+ */
+void workqueue_set_show_work(struct workqueue_struct *wq, show_work_func_t show)
+{
+ wq->show_work = show;
+}
+EXPORT_SYMBOL_GPL(workqueue_set_show_work);
+
+enum {
+ ITER_TYPE_MANAGER = 0,
+ ITER_TYPE_BUSY_WORKER,
+ ITER_TYPE_IDLE_WORKER,
+ ITER_TYPE_PENDING_WORK,
+ ITER_TYPE_DELAYED_WORK,
+ ITER_NR_TYPES,
+
+ /* iter: sign [started bit] [cpu bits] [type bits] [idx bits] */
+ ITER_CPU_BITS = order_base_2(NR_CPUS),
+ ITER_CPU_MASK = ((loff_t)1 << ITER_CPU_BITS) - 1,
+ ITER_TYPE_BITS = order_base_2(ITER_NR_TYPES),
+ ITER_TYPE_MASK = ((loff_t)1 << ITER_TYPE_BITS) - 1,
+
+ ITER_BITS = BITS_PER_BYTE * sizeof(loff_t) - 1,
+ ITER_STARTED_BIT = ITER_BITS - 1,
+ ITER_CPU_SHIFT = ITER_STARTED_BIT - ITER_CPU_BITS,
+ ITER_TYPE_SHIFT = ITER_CPU_SHIFT - ITER_TYPE_BITS,
+
+ ITER_IDX_MASK = ((loff_t)1 << ITER_TYPE_SHIFT) - 1,
+};
+
+struct wq_debugfs_token {
+ struct global_cwq *gcwq;
+ struct worker *worker;
+ struct work_struct *work;
+ bool work_delayed;
+};
+
+static void wq_debugfs_decode_pos(loff_t pos, unsigned int *cpup, int *typep,
+ int *idxp)
+{
+ *cpup = (pos >> ITER_CPU_SHIFT) & ITER_CPU_MASK;
+ *typep = (pos >> ITER_TYPE_SHIFT) & ITER_TYPE_MASK;
+ *idxp = pos & ITER_IDX_MASK;
+}
+
+/* try to dereference @pos and set @tok accordingly, %true if successful */
+static bool wq_debugfs_deref_pos(loff_t pos, struct wq_debugfs_token *tok)
+{
+ int type, idx, i;
+ unsigned int cpu;
+ struct global_cwq *gcwq;
+ struct worker *worker;
+ struct work_struct *work;
+ struct hlist_node *hnode;
+ struct workqueue_struct *wq;
+ struct cpu_workqueue_struct *cwq;
+
+ wq_debugfs_decode_pos(pos, &cpu, &type, &idx);
+
+ /* make sure the right gcwq is locked and init @tok */
+ gcwq = get_gcwq(cpu);
+ if (tok->gcwq != gcwq) {
+ if (tok->gcwq)
+ spin_unlock_irq(&tok->gcwq->lock);
+ if (gcwq)
+ spin_lock_irq(&gcwq->lock);
+ }
+ memset(tok, 0, sizeof(*tok));
+ tok->gcwq = gcwq;
+
+ /* dereference index@type and record it in @tok */
+ switch (type) {
+ case ITER_TYPE_MANAGER:
+ if (!idx)
+ tok->worker = gcwq->manager;
+ return tok->worker;
+
+ case ITER_TYPE_BUSY_WORKER:
+ if (idx < gcwq->nr_workers - gcwq->nr_idle)
+ for_each_busy_worker(worker, i, hnode, gcwq)
+ if (!idx--) {
+ tok->worker = worker;
+ return true;
+ }
+ break;
+
+ case ITER_TYPE_IDLE_WORKER:
+ if (idx < gcwq->nr_idle)
+ list_for_each_entry(worker, &gcwq->idle_list, entry)
+ if (!idx--) {
+ tok->worker = worker;
+ return true;
+ }
+ break;
+
+ case ITER_TYPE_PENDING_WORK:
+ list_for_each_entry(work, &gcwq->worklist, entry)
+ if (!idx--) {
+ tok->work = work;
+ return true;
+ }
+ break;
+
+ case ITER_TYPE_DELAYED_WORK:
+ list_for_each_entry(wq, &workqueues, list) {
+ cwq = get_cwq(gcwq->cpu, wq);
+ list_for_each_entry(work, &cwq->delayed_works, entry)
+ if (!idx--) {
+ tok->work = work;
+ tok->work_delayed = true;
+ return true;
+ }
+ }
+ break;
+ }
+ return false;
+}
+
+static bool wq_debugfs_next_pos(loff_t *ppos, bool next_type)
+{
+ int type, idx;
+ unsigned int cpu;
+
+ wq_debugfs_decode_pos(*ppos, &cpu, &type, &idx);
+
+ if (next_type) {
+ /* proceed to the next type */
+ if (++type >= ITER_NR_TYPES) {
+ /* oops, was the last type, to the next cpu */
+ cpu = cpumask_next(cpu, cpu_possible_mask);
+ if (cpu >= nr_cpu_ids)
+ return false;
+ type = ITER_TYPE_MANAGER;
+ }
+ idx = 0;
+ } else /* bump up the index */
+ idx++;
+
+ *ppos = ((loff_t)1 << ITER_STARTED_BIT) |
+ ((loff_t)cpu << ITER_CPU_SHIFT) |
+ ((loff_t)type << ITER_TYPE_SHIFT) | idx;
+ return true;
+}
+
+static void wq_debugfs_free_tok(struct wq_debugfs_token *tok)
+{
+ if (tok && tok->gcwq)
+ spin_unlock_irq(&tok->gcwq->lock);
+ kfree(tok);
+}
+
+static void *wq_debugfs_start(struct seq_file *m, loff_t *ppos)
+{
+ struct wq_debugfs_token *tok;
+
+ if (*ppos == 0) {
+ seq_puts(m, "CPU ID PID WORK ADDR WORKQUEUE TIME DESC\n");
+ seq_puts(m, "==== ==== ===== ================ ============ ===== ============================\n");
+ *ppos = (loff_t)1 << ITER_STARTED_BIT |
+ (loff_t)cpumask_first(cpu_possible_mask) << ITER_CPU_BITS;
+ }
+
+ tok = kzalloc(sizeof(*tok), GFP_KERNEL);
+ if (!tok)
+ return ERR_PTR(-ENOMEM);
+
+ spin_lock(&workqueue_lock);
+
+ while (!wq_debugfs_deref_pos(*ppos, tok)) {
+ if (!wq_debugfs_next_pos(ppos, true)) {
+ wq_debugfs_free_tok(tok);
+ return NULL;
+ }
+ }
+ return tok;
+}
+
+static void *wq_debugfs_next(struct seq_file *m, void *p, loff_t *ppos)
+{
+ struct wq_debugfs_token *tok = p;
+
+ wq_debugfs_next_pos(ppos, false);
+
+ while (!wq_debugfs_deref_pos(*ppos, tok)) {
+ if (!wq_debugfs_next_pos(ppos, true)) {
+ wq_debugfs_free_tok(tok);
+ return NULL;
+ }
+ }
+ return tok;
+}
+
+static void wq_debugfs_stop(struct seq_file *m, void *p)
+{
+ wq_debugfs_free_tok(p);
+ spin_unlock(&workqueue_lock);
+}
+
+static void wq_debugfs_print_duration(struct seq_file *m,
+ unsigned long timestamp)
+{
+ const char *units[] = { "us", "ms", " s", " m", " h", " d" };
+ const int factors[] = { 1000, 1000, 60, 60, 24, 0 };
+ unsigned long v = jiffies_to_usecs(jiffies - timestamp);
+ int i;
+
+ for (i = 0; v > 999 && i < ARRAY_SIZE(units) - 1; i++)
+ v /= factors[i];
+
+ seq_printf(m, "%3lu%s ", v, units[i]);
+}
+
+static int wq_debugfs_show(struct seq_file *m, void *p)
+{
+ struct wq_debugfs_token *tok = p;
+ struct worker *worker = NULL;
+ struct global_cwq *gcwq;
+ struct work_struct *work;
+ struct workqueue_struct *wq;
+ const char *name;
+ unsigned long timestamp;
+ char id_buf[11] = "", pid_buf[11] = "", addr_buf[17] = "";
+ bool showed = false;
+
+ if (tok->work) {
+ work = tok->work;
+ gcwq = get_work_gcwq(work);
+ wq = get_work_cwq(work)->wq;
+ name = wq->name;
+ timestamp = work->timestamp;
+
+ if (tok->work_delayed)
+ strncpy(id_buf, "DELA", sizeof(id_buf));
+ else
+ strncpy(id_buf, "PEND", sizeof(id_buf));
+ } else {
+ worker = tok->worker;
+ gcwq = worker->gcwq;
+ work = worker->current_work;
+ timestamp = worker->timestamp;
+
+ snprintf(id_buf, sizeof(id_buf), "%4d", worker->id);
+ snprintf(pid_buf, sizeof(pid_buf), "%4d",
+ task_pid_nr(worker->task));
+
+ if (work) {
+ wq = worker->current_cwq->wq;
+ name = wq->name;
+ } else {
+ wq = NULL;
+ if (worker->gcwq->manager == worker)
+ name = "<MANAGER>";
+ else
+ name = "<IDLE>";
+ }
+ }
+
+ if (work)
+ snprintf(addr_buf, sizeof(addr_buf), "%16p", work);
+
+ seq_printf(m, "%4d %4s %5s %16s %-12s ",
+ gcwq->cpu, id_buf, pid_buf, addr_buf, name);
+
+ wq_debugfs_print_duration(m, timestamp);
+
+ if (wq && work && wq->show_work)
+ showed = wq->show_work(m, work, worker != NULL);
+ if (!showed && worker && work) {
+ char buf[KSYM_SYMBOL_LEN];
+
+ sprint_symbol(buf, get_wchan(worker->task));
+ seq_printf(m, "%s", buf);
+ }
+
+ seq_putc(m, '\n');
+
+ return 0;
+}
+
+static const struct seq_operations wq_debugfs_ops = {
+ .start = wq_debugfs_start,
+ .next = wq_debugfs_next,
+ .stop = wq_debugfs_stop,
+ .show = wq_debugfs_show,
+};
+
+static int wq_debugfs_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &wq_debugfs_ops);
+}
+
+static const struct file_operations wq_debugfs_fops = {
+ .owner = THIS_MODULE,
+ .open = wq_debugfs_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+static int __init init_workqueue_debugfs(void)
+{
+ debugfs_create_file("workqueue", S_IFREG | 0400, NULL, NULL,
+ &wq_debugfs_fops);
+ return 0;
+}
+late_initcall(init_workqueue_debugfs);
+
+#endif /* CONFIG_WORKQUEUE_DEBUGFS */
+
void __init init_workqueues(void)
{
unsigned int cpu;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index e722e9d..99b1690 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1117,6 +1117,13 @@ config ATOMIC64_SELFTEST

If unsure, say N.

+config WORKQUEUE_DEBUGFS
+ bool "Enable workqueue debugging info via debugfs"
+ depends on DEBUG_FS
+ help
+ Enable debug FS support for workqueue. Information about all the
+ current workers and works is available through <debugfs>/workqueue.
+
source "samples/Kconfig"

source "lib/Kconfig.kgdb"
--
1.6.4.2

2010-06-14 21:39:23

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 16/30] workqueue: introduce global cwq and unify cwq locks

There is one gcwq (global cwq) per cpu and all cwqs on a cpu point to
it. A gcwq contains a lock to be used by all cwqs on the cpu and an
ida to give IDs to workers belonging to the cpu.

This patch introduces gcwq, moves worker_ida into gcwq and makes all
cwqs on the same cpu use the cpu's gcwq->lock instead of separate
locks. gcwq->worker_ida is now protected by gcwq->lock too.
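
The resulting locking pattern for any per-cpu workqueue operation is
simply (sketch; touch_cwq() is a made-up caller, the gcwq fields and
cwq->gcwq backpointer are the ones introduced below):

/* Sketch: every cwq of a cpu now serializes on that cpu's gcwq->lock
 * instead of a private cwq->lock. */
static void touch_cwq(struct cpu_workqueue_struct *cwq)
{
	struct global_cwq *gcwq = cwq->gcwq;

	spin_lock_irq(&gcwq->lock);
	/* ... manipulate cwq->worklist, flush colors, gcwq->worker_ida, ... */
	spin_unlock_irq(&gcwq->lock);
}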

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 160 ++++++++++++++++++++++++++++++++--------------------
1 files changed, 98 insertions(+), 62 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 44c0fb2..d0ca750 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -40,38 +40,45 @@
*
* I: Set during initialization and read-only afterwards.
*
- * L: cwq->lock protected. Access with cwq->lock held.
+ * L: gcwq->lock protected. Access with gcwq->lock held.
*
* F: wq->flush_mutex protected.
*
* W: workqueue_lock protected.
*/

+struct global_cwq;
struct cpu_workqueue_struct;

struct worker {
struct work_struct *current_work; /* L: work being processed */
struct list_head scheduled; /* L: scheduled works */
struct task_struct *task; /* I: worker task */
+ struct global_cwq *gcwq; /* I: the associated gcwq */
struct cpu_workqueue_struct *cwq; /* I: the associated cwq */
int id; /* I: worker id */
};

/*
+ * Global per-cpu workqueue.
+ */
+struct global_cwq {
+ spinlock_t lock; /* the gcwq lock */
+ unsigned int cpu; /* I: the associated cpu */
+ struct ida worker_ida; /* L: for worker IDs */
+} ____cacheline_aligned_in_smp;
+
+/*
* The per-CPU workqueue (if single thread, we always use the first
* possible cpu). The lower WORK_STRUCT_FLAG_BITS of
* work_struct->data are used for flags and thus cwqs need to be
* aligned at two's power of the number of flag bits.
*/
struct cpu_workqueue_struct {
-
- spinlock_t lock;
-
+ struct global_cwq *gcwq; /* I: the associated gcwq */
struct list_head worklist;
wait_queue_head_t more_work;
- unsigned int cpu;
struct worker *worker;
-
struct workqueue_struct *wq; /* I: the owning workqueue */
int work_color; /* L: current color */
int flush_color; /* L: flushing color */
@@ -228,13 +235,19 @@ static inline void debug_work_deactivate(struct work_struct *work) { }
/* Serializes the accesses to the list of workqueues. */
static DEFINE_SPINLOCK(workqueue_lock);
static LIST_HEAD(workqueues);
-static DEFINE_PER_CPU(struct ida, worker_ida);
static bool workqueue_freezing; /* W: have wqs started freezing? */

+static DEFINE_PER_CPU(struct global_cwq, global_cwq);
+
static int worker_thread(void *__worker);

static int singlethread_cpu __read_mostly;

+static struct global_cwq *get_gcwq(unsigned int cpu)
+{
+ return &per_cpu(global_cwq, cpu);
+}
+
static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
struct workqueue_struct *wq)
{
@@ -303,7 +316,7 @@ static inline struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
* Insert @work into @cwq after @head.
*
* CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
*/
static void insert_work(struct cpu_workqueue_struct *cwq,
struct work_struct *work, struct list_head *head,
@@ -326,12 +339,13 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
struct work_struct *work)
{
struct cpu_workqueue_struct *cwq = target_cwq(cpu, wq);
+ struct global_cwq *gcwq = cwq->gcwq;
struct list_head *worklist;
unsigned long flags;

debug_work_activate(work);

- spin_lock_irqsave(&cwq->lock, flags);
+ spin_lock_irqsave(&gcwq->lock, flags);
BUG_ON(!list_empty(&work->entry));

cwq->nr_in_flight[cwq->work_color]++;
@@ -344,7 +358,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,

insert_work(cwq, work, worklist, work_color_to_flags(cwq->work_color));

- spin_unlock_irqrestore(&cwq->lock, flags);
+ spin_unlock_irqrestore(&gcwq->lock, flags);
}

/**
@@ -483,39 +497,41 @@ static struct worker *alloc_worker(void)
*/
static struct worker *create_worker(struct cpu_workqueue_struct *cwq, bool bind)
{
+ struct global_cwq *gcwq = cwq->gcwq;
int id = -1;
struct worker *worker = NULL;

- spin_lock(&workqueue_lock);
- while (ida_get_new(&per_cpu(worker_ida, cwq->cpu), &id)) {
- spin_unlock(&workqueue_lock);
- if (!ida_pre_get(&per_cpu(worker_ida, cwq->cpu), GFP_KERNEL))
+ spin_lock_irq(&gcwq->lock);
+ while (ida_get_new(&gcwq->worker_ida, &id)) {
+ spin_unlock_irq(&gcwq->lock);
+ if (!ida_pre_get(&gcwq->worker_ida, GFP_KERNEL))
goto fail;
- spin_lock(&workqueue_lock);
+ spin_lock_irq(&gcwq->lock);
}
- spin_unlock(&workqueue_lock);
+ spin_unlock_irq(&gcwq->lock);

worker = alloc_worker();
if (!worker)
goto fail;

+ worker->gcwq = gcwq;
worker->cwq = cwq;
worker->id = id;

worker->task = kthread_create(worker_thread, worker, "kworker/%u:%d",
- cwq->cpu, id);
+ gcwq->cpu, id);
if (IS_ERR(worker->task))
goto fail;

if (bind)
- kthread_bind(worker->task, cwq->cpu);
+ kthread_bind(worker->task, gcwq->cpu);

return worker;
fail:
if (id >= 0) {
- spin_lock(&workqueue_lock);
- ida_remove(&per_cpu(worker_ida, cwq->cpu), id);
- spin_unlock(&workqueue_lock);
+ spin_lock_irq(&gcwq->lock);
+ ida_remove(&gcwq->worker_ida, id);
+ spin_unlock_irq(&gcwq->lock);
}
kfree(worker);
return NULL;
@@ -528,7 +544,7 @@ fail:
* Start @worker.
*
* CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
*/
static void start_worker(struct worker *worker)
{
@@ -543,7 +559,7 @@ static void start_worker(struct worker *worker)
*/
static void destroy_worker(struct worker *worker)
{
- int cpu = worker->cwq->cpu;
+ struct global_cwq *gcwq = worker->gcwq;
int id = worker->id;

/* sanity check frenzy */
@@ -553,9 +569,9 @@ static void destroy_worker(struct worker *worker)
kthread_stop(worker->task);
kfree(worker);

- spin_lock(&workqueue_lock);
- ida_remove(&per_cpu(worker_ida, cpu), id);
- spin_unlock(&workqueue_lock);
+ spin_lock_irq(&gcwq->lock);
+ ida_remove(&gcwq->worker_ida, id);
+ spin_unlock_irq(&gcwq->lock);
}

/**
@@ -573,7 +589,7 @@ static void destroy_worker(struct worker *worker)
* nested inside outer list_for_each_entry_safe().
*
* CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
*/
static void move_linked_works(struct work_struct *work, struct list_head *head,
struct work_struct **nextp)
@@ -617,7 +633,7 @@ static void cwq_activate_first_delayed(struct cpu_workqueue_struct *cwq)
* decrement nr_in_flight of its cwq and handle workqueue flushing.
*
* CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
*/
static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)
{
@@ -664,11 +680,12 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)
* call this function to process a work.
*
* CONTEXT:
- * spin_lock_irq(cwq->lock) which is released and regrabbed.
+ * spin_lock_irq(gcwq->lock) which is released and regrabbed.
*/
static void process_one_work(struct worker *worker, struct work_struct *work)
{
struct cpu_workqueue_struct *cwq = worker->cwq;
+ struct global_cwq *gcwq = cwq->gcwq;
work_func_t f = work->func;
int work_color;
#ifdef CONFIG_LOCKDEP
@@ -687,7 +704,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
work_color = get_work_color(work);
list_del_init(&work->entry);

- spin_unlock_irq(&cwq->lock);
+ spin_unlock_irq(&gcwq->lock);

BUG_ON(get_wq_data(work) != cwq);
work_clear_pending(work);
@@ -707,7 +724,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
dump_stack();
}

- spin_lock_irq(&cwq->lock);
+ spin_lock_irq(&gcwq->lock);

/* we're done with it, release */
worker->current_work = NULL;
@@ -723,7 +740,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
* fetches a work from the top and executes it.
*
* CONTEXT:
- * spin_lock_irq(cwq->lock) which may be released and regrabbed
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
* multiple times.
*/
static void process_scheduled_works(struct worker *worker)
@@ -744,6 +761,7 @@ static void process_scheduled_works(struct worker *worker)
static int worker_thread(void *__worker)
{
struct worker *worker = __worker;
+ struct global_cwq *gcwq = worker->gcwq;
struct cpu_workqueue_struct *cwq = worker->cwq;
DEFINE_WAIT(wait);

@@ -758,11 +776,11 @@ static int worker_thread(void *__worker)
break;

if (unlikely(!cpumask_equal(&worker->task->cpus_allowed,
- get_cpu_mask(cwq->cpu))))
+ get_cpu_mask(gcwq->cpu))))
set_cpus_allowed_ptr(worker->task,
- get_cpu_mask(cwq->cpu));
+ get_cpu_mask(gcwq->cpu));

- spin_lock_irq(&cwq->lock);
+ spin_lock_irq(&gcwq->lock);

while (!list_empty(&cwq->worklist)) {
struct work_struct *work =
@@ -782,7 +800,7 @@ static int worker_thread(void *__worker)
}
}

- spin_unlock_irq(&cwq->lock);
+ spin_unlock_irq(&gcwq->lock);
}

return 0;
@@ -821,7 +839,7 @@ static void wq_barrier_func(struct work_struct *work)
* underneath us, so we can't reliably determine cwq from @target.
*
* CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
*/
static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
struct wq_barrier *barr,
@@ -831,7 +849,7 @@ static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
unsigned int linked = 0;

/*
- * debugobject calls are safe here even with cwq->lock locked
+ * debugobject calls are safe here even with gcwq->lock locked
* as we know for sure that this will not trigger any of the
* checks and call back into the fixup functions where we
* might deadlock.
@@ -904,8 +922,9 @@ static bool flush_workqueue_prep_cwqs(struct workqueue_struct *wq,

for_each_possible_cpu(cpu) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+ struct global_cwq *gcwq = cwq->gcwq;

- spin_lock_irq(&cwq->lock);
+ spin_lock_irq(&gcwq->lock);

if (flush_color >= 0) {
BUG_ON(cwq->flush_color != -1);
@@ -922,7 +941,7 @@ static bool flush_workqueue_prep_cwqs(struct workqueue_struct *wq,
cwq->work_color = work_color;
}

- spin_unlock_irq(&cwq->lock);
+ spin_unlock_irq(&gcwq->lock);
}

if (flush_color >= 0 && atomic_dec_and_test(&wq->nr_cwqs_to_flush))
@@ -1097,17 +1116,19 @@ int flush_work(struct work_struct *work)
{
struct worker *worker = NULL;
struct cpu_workqueue_struct *cwq;
+ struct global_cwq *gcwq;
struct wq_barrier barr;

might_sleep();
cwq = get_wq_data(work);
if (!cwq)
return 0;
+ gcwq = cwq->gcwq;

lock_map_acquire(&cwq->wq->lockdep_map);
lock_map_release(&cwq->wq->lockdep_map);

- spin_lock_irq(&cwq->lock);
+ spin_lock_irq(&gcwq->lock);
if (!list_empty(&work->entry)) {
/*
* See the comment near try_to_grab_pending()->smp_rmb().
@@ -1124,12 +1145,12 @@ int flush_work(struct work_struct *work)
}

insert_wq_barrier(cwq, &barr, work, worker);
- spin_unlock_irq(&cwq->lock);
+ spin_unlock_irq(&gcwq->lock);
wait_for_completion(&barr.done);
destroy_work_on_stack(&barr.work);
return 1;
already_gone:
- spin_unlock_irq(&cwq->lock);
+ spin_unlock_irq(&gcwq->lock);
return 0;
}
EXPORT_SYMBOL_GPL(flush_work);
@@ -1140,6 +1161,7 @@ EXPORT_SYMBOL_GPL(flush_work);
*/
static int try_to_grab_pending(struct work_struct *work)
{
+ struct global_cwq *gcwq;
struct cpu_workqueue_struct *cwq;
int ret = -1;

@@ -1154,8 +1176,9 @@ static int try_to_grab_pending(struct work_struct *work)
cwq = get_wq_data(work);
if (!cwq)
return ret;
+ gcwq = cwq->gcwq;

- spin_lock_irq(&cwq->lock);
+ spin_lock_irq(&gcwq->lock);
if (!list_empty(&work->entry)) {
/*
* This work is queued, but perhaps we locked the wrong cwq.
@@ -1170,7 +1193,7 @@ static int try_to_grab_pending(struct work_struct *work)
ret = 1;
}
}
- spin_unlock_irq(&cwq->lock);
+ spin_unlock_irq(&gcwq->lock);

return ret;
}
@@ -1178,10 +1201,11 @@ static int try_to_grab_pending(struct work_struct *work)
static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq,
struct work_struct *work)
{
+ struct global_cwq *gcwq = cwq->gcwq;
struct wq_barrier barr;
struct worker *worker;

- spin_lock_irq(&cwq->lock);
+ spin_lock_irq(&gcwq->lock);

worker = NULL;
if (unlikely(cwq->worker && cwq->worker->current_work == work)) {
@@ -1189,7 +1213,7 @@ static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq,
insert_wq_barrier(cwq, &barr, work, worker);
}

- spin_unlock_irq(&cwq->lock);
+ spin_unlock_irq(&gcwq->lock);

if (unlikely(worker)) {
wait_for_completion(&barr.done);
@@ -1561,13 +1585,13 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
*/
for_each_possible_cpu(cpu) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+ struct global_cwq *gcwq = get_gcwq(cpu);

BUG_ON((unsigned long)cwq & WORK_STRUCT_FLAG_MASK);
- cwq->cpu = cpu;
+ cwq->gcwq = gcwq;
cwq->wq = wq;
cwq->flush_color = -1;
cwq->max_active = max_active;
- spin_lock_init(&cwq->lock);
INIT_LIST_HEAD(&cwq->worklist);
INIT_LIST_HEAD(&cwq->delayed_works);
init_waitqueue_head(&cwq->more_work);
@@ -1738,7 +1762,7 @@ EXPORT_SYMBOL_GPL(work_on_cpu);
* list instead of the cwq ones.
*
* CONTEXT:
- * Grabs and releases workqueue_lock and cwq->lock's.
+ * Grabs and releases workqueue_lock and gcwq->lock's.
*/
void freeze_workqueues_begin(void)
{
@@ -1751,16 +1775,18 @@ void freeze_workqueues_begin(void)
workqueue_freezing = true;

for_each_possible_cpu(cpu) {
+ struct global_cwq *gcwq = get_gcwq(cpu);
+
+ spin_lock_irq(&gcwq->lock);
+
list_for_each_entry(wq, &workqueues, list) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);

- spin_lock_irq(&cwq->lock);
-
if (wq->flags & WQ_FREEZEABLE)
cwq->max_active = 0;
-
- spin_unlock_irq(&cwq->lock);
}
+
+ spin_unlock_irq(&gcwq->lock);
}

spin_unlock(&workqueue_lock);
@@ -1819,7 +1845,7 @@ out_unlock:
* frozen works are transferred to their respective cwq worklists.
*
* CONTEXT:
- * Grabs and releases workqueue_lock and cwq->lock's.
+ * Grabs and releases workqueue_lock and gcwq->lock's.
*/
void thaw_workqueues(void)
{
@@ -1832,14 +1858,16 @@ void thaw_workqueues(void)
goto out_unlock;

for_each_possible_cpu(cpu) {
+ struct global_cwq *gcwq = get_gcwq(cpu);
+
+ spin_lock_irq(&gcwq->lock);
+
list_for_each_entry(wq, &workqueues, list) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);

if (!(wq->flags & WQ_FREEZEABLE))
continue;

- spin_lock_irq(&cwq->lock);
-
/* restore max_active and repopulate worklist */
cwq->max_active = wq->saved_max_active;

@@ -1848,9 +1876,9 @@ void thaw_workqueues(void)
cwq_activate_first_delayed(cwq);

wake_up(&cwq->more_work);
-
- spin_unlock_irq(&cwq->lock);
}
+
+ spin_unlock_irq(&gcwq->lock);
}

workqueue_freezing = false;
@@ -1871,11 +1899,19 @@ void __init init_workqueues(void)
BUILD_BUG_ON(__alignof__(struct cpu_workqueue_struct) <
__alignof__(unsigned long long));

- for_each_possible_cpu(cpu)
- ida_init(&per_cpu(worker_ida, cpu));
-
singlethread_cpu = cpumask_first(cpu_possible_mask);
hotcpu_notifier(workqueue_cpu_callback, 0);
+
+ /* initialize gcwqs */
+ for_each_possible_cpu(cpu) {
+ struct global_cwq *gcwq = get_gcwq(cpu);
+
+ spin_lock_init(&gcwq->lock);
+ gcwq->cpu = cpu;
+
+ ida_init(&gcwq->worker_ida);
+ }
+
keventd_wq = create_workqueue("events");
BUG_ON(!keventd_wq);
}
--
1.6.4.2

2010-06-14 21:43:21

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 20/30] workqueue: add find_worker_executing_work() and track current_cwq

Now that all the workers are tracked by gcwq, we can find which worker
is executing a work from gcwq. Implement find_worker_executing_work()
and make each worker track its current_cwq so that we can find things the
other way around. This will be used to implement non-reentrant wqs.
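
For illustration, here is roughly how a non-reentrancy check could use
the new lookup on the queueing path. This is only a hedged sketch --
the surrounding queueing code and the collision handling are
assumptions, not part of this patch.

  /* sketch: with gcwq->lock held, check whether @work already runs here */
  struct worker *collision = find_worker_executing_work(gcwq, work);

  if (collision) {
          /*
           * @work is already executing on this gcwq.  A non-reentrant
           * wq would queue it on the cwq it is currently running on
           * instead of the originally requested one.
           */
          cwq = collision->current_cwq;
  }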

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2ce895e..7111683 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -82,6 +82,7 @@ struct worker {
};

struct work_struct *current_work; /* L: work being processed */
+ struct cpu_workqueue_struct *current_cwq; /* L: current_work's cwq */
struct list_head scheduled; /* L: scheduled works */
struct task_struct *task; /* I: worker task */
struct global_cwq *gcwq; /* I: the associated gcwq */
@@ -373,6 +374,59 @@ static struct hlist_head *busy_worker_head(struct global_cwq *gcwq,
}

/**
+ * __find_worker_executing_work - find worker which is executing a work
+ * @gcwq: gcwq of interest
+ * @bwh: hash head as returned by busy_worker_head()
+ * @work: work to find worker for
+ *
+ * Find a worker which is executing @work on @gcwq. @bwh should be
+ * the hash head obtained by calling busy_worker_head() with the same
+ * work.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ *
+ * RETURNS:
+ * Pointer to worker which is executing @work if found, NULL
+ * otherwise.
+ */
+static struct worker *__find_worker_executing_work(struct global_cwq *gcwq,
+ struct hlist_head *bwh,
+ struct work_struct *work)
+{
+ struct worker *worker;
+ struct hlist_node *tmp;
+
+ hlist_for_each_entry(worker, tmp, bwh, hentry)
+ if (worker->current_work == work)
+ return worker;
+ return NULL;
+}
+
+/**
+ * find_worker_executing_work - find worker which is executing a work
+ * @gcwq: gcwq of interest
+ * @work: work to find worker for
+ *
+ * Find a worker which is executing @work on @gcwq. This function is
+ * identical to __find_worker_executing_work() except that this
+ * function calculates @bwh itself.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ *
+ * RETURNS:
+ * Pointer to worker which is executing @work if found, NULL
+ * otherwise.
+ */
+static struct worker *find_worker_executing_work(struct global_cwq *gcwq,
+ struct work_struct *work)
+{
+ return __find_worker_executing_work(gcwq, busy_worker_head(gcwq, work),
+ work);
+}
+
+/**
* insert_work - insert a work into cwq
* @cwq: cwq @work belongs to
* @work: work to insert
@@ -914,6 +968,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
debug_work_deactivate(work);
hlist_add_head(&worker->hentry, bwh);
worker->current_work = work;
+ worker->current_cwq = cwq;
work_color = get_work_color(work);
list_del_init(&work->entry);

@@ -942,6 +997,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
/* we're done with it, release */
hlist_del_init(&worker->hentry);
worker->current_work = NULL;
+ worker->current_cwq = NULL;
cwq_dec_nr_in_flight(cwq, work_color);
}

--
1.6.4.2

2010-06-14 21:42:59

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 28/30] workqueue: implement several utility APIs

Implement the following utility APIs.

workqueue_set_max_active() : adjust max_active of a wq
workqueue_congested() : test whether a wq is congested
work_cpu() : determine the last / current cpu of a work
work_busy() : query whether a work is busy

* Anton Blanchard fixed missing ret initialization in work_busy().
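
As a hedged usage sketch (my_wq and my_work are made-up names for the
example, not part of this patch):

  /* back off if the per-cpu queue of a workqueue looks congested */
  if (workqueue_congested(smp_processor_id(), my_wq))
          return -EBUSY;

  /* shrink concurrency of the workqueue at runtime */
  workqueue_set_max_active(my_wq, 1);

  /* poll the state of a work item (advisory only, may change anytime) */
  if (work_busy(&my_work) & WORK_BUSY_RUNNING)
          pr_debug("work still running, last cpu %u\n", work_cpu(&my_work));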

Signed-off-by: Tejun Heo <[email protected]>
Cc: Anton Blanchard <[email protected]>
---
include/linux/workqueue.h | 11 ++++-
kernel/workqueue.c | 108 ++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 117 insertions(+), 2 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 850942a..5d1d9be 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -65,6 +65,10 @@ enum {
WORK_STRUCT_FLAG_MASK = (1UL << WORK_STRUCT_FLAG_BITS) - 1,
WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
WORK_STRUCT_NO_CPU = NR_CPUS << WORK_STRUCT_FLAG_BITS,
+
+ /* bit mask for work_busy() return values */
+ WORK_BUSY_PENDING = 1 << 0,
+ WORK_BUSY_RUNNING = 1 << 1,
};

struct work_struct {
@@ -320,9 +324,14 @@ extern void init_workqueues(void);
int execute_in_process_context(work_func_t fn, struct execute_work *);

extern int flush_work(struct work_struct *work);
-
extern int cancel_work_sync(struct work_struct *work);

+extern void workqueue_set_max_active(struct workqueue_struct *wq,
+ int max_active);
+extern bool workqueue_congested(unsigned int cpu, struct workqueue_struct *wq);
+extern unsigned int work_cpu(struct work_struct *work);
+extern unsigned int work_busy(struct work_struct *work);
+
/*
* Kill off a pending schedule_delayed_work(). Note that the work callback
* function may still be running on return from cancel_delayed_work(), unless
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ae6e4c7..aad64f5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -206,7 +206,7 @@ struct workqueue_struct {
cpumask_var_t mayday_mask; /* cpus requesting rescue */
struct worker *rescuer; /* I: rescue worker */

- int saved_max_active; /* I: saved cwq max_active */
+ int saved_max_active; /* W: saved cwq max_active */
const char *name; /* I: workqueue name */
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
@@ -2646,6 +2646,112 @@ void destroy_workqueue(struct workqueue_struct *wq)
}
EXPORT_SYMBOL_GPL(destroy_workqueue);

+/**
+ * workqueue_set_max_active - adjust max_active of a workqueue
+ * @wq: target workqueue
+ * @max_active: new max_active value.
+ *
+ * Set max_active of @wq to @max_active.
+ *
+ * CONTEXT:
+ * Don't call from IRQ context.
+ */
+void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
+{
+ unsigned int cpu;
+
+ max_active = wq_clamp_max_active(max_active, wq->name);
+
+ spin_lock(&workqueue_lock);
+
+ wq->saved_max_active = max_active;
+
+ for_each_possible_cpu(cpu) {
+ struct global_cwq *gcwq = get_gcwq(cpu);
+
+ spin_lock_irq(&gcwq->lock);
+
+ if (!(wq->flags & WQ_FREEZEABLE) ||
+ !(gcwq->flags & GCWQ_FREEZING))
+ get_cwq(gcwq->cpu, wq)->max_active = max_active;
+
+ spin_unlock_irq(&gcwq->lock);
+ }
+
+ spin_unlock(&workqueue_lock);
+}
+EXPORT_SYMBOL_GPL(workqueue_set_max_active);
+
+/**
+ * workqueue_congested - test whether a workqueue is congested
+ * @cpu: CPU in question
+ * @wq: target workqueue
+ *
+ * Test whether @wq's cpu workqueue for @cpu is congested. There is
+ * no synchronization around this function and the test result is
+ * unreliable and only useful as advisory hints or for debugging.
+ *
+ * RETURNS:
+ * %true if congested, %false otherwise.
+ */
+bool workqueue_congested(unsigned int cpu, struct workqueue_struct *wq)
+{
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+ return !list_empty(&cwq->delayed_works);
+}
+EXPORT_SYMBOL_GPL(workqueue_congested);
+
+/**
+ * work_cpu - return the last known associated cpu for @work
+ * @work: the work of interest
+ *
+ * RETURNS:
+ * CPU number if @work was ever queued. NR_CPUS otherwise.
+ */
+unsigned int work_cpu(struct work_struct *work)
+{
+ struct global_cwq *gcwq = get_work_gcwq(work);
+
+ return gcwq ? gcwq->cpu : NR_CPUS;
+}
+EXPORT_SYMBOL_GPL(work_cpu);
+
+/**
+ * work_busy - test whether a work is currently pending or running
+ * @work: the work to be tested
+ *
+ * Test whether @work is currently pending or running. There is no
+ * synchronization around this function and the test result is
+ * unreliable and only useful as advisory hints or for debugging.
+ * Especially for reentrant wqs, the pending state might hide the
+ * running state.
+ *
+ * RETURNS:
+ * OR'd bitmask of WORK_BUSY_* bits.
+ */
+unsigned int work_busy(struct work_struct *work)
+{
+ struct global_cwq *gcwq = get_work_gcwq(work);
+ unsigned long flags;
+ unsigned int ret = 0;
+
+ if (!gcwq)
+ return false;
+
+ spin_lock_irqsave(&gcwq->lock, flags);
+
+ if (work_pending(work))
+ ret |= WORK_BUSY_PENDING;
+ if (find_worker_executing_work(gcwq, work))
+ ret |= WORK_BUSY_RUNNING;
+
+ spin_unlock_irqrestore(&gcwq->lock, flags);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(work_busy);
+
/*
* CPU hotplug.
*
--
1.6.4.2

2010-06-14 21:41:54

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 21/30] workqueue: carry cpu number in work data once execution starts

To implement non-reentrant workqueues, the last gcwq a work was
executed on must be reliably obtainable as long as the work structure
is valid even if the previous workqueue has been destroyed.

To achieve this, work->data will be overloaded to carry the last cpu
number once execution starts so that the previous gcwq can be located
reliably. This means that, once execution starts, only the gcwq can
be obtained from the work, not the cwq.

Implement set_work_{cwq|cpu}(), get_work_[g]cwq() and
clear_work_data() to set work data to the cpu number when starting
execution, access the overloaded work data and clear it after
cancellation.

queue_delayed_work_on() is updated to preserve the last cpu while the
work is in flight in the timer, and other callers which depended on
getting the cwq from a work after execution starts are converted to
depend on the gcwq instead.
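
The distinction works because cwq pointers are kernel addresses (>=
PAGE_OFFSET) while cpu << WORK_STRUCT_FLAG_BITS always stays below it.
An illustrative decode, restating the get_work_gcwq() logic added
below:

  unsigned long data = atomic_long_read(&work->data) &
                       WORK_STRUCT_WQ_DATA_MASK;
  struct global_cwq *gcwq;

  if (data >= PAGE_OFFSET)
          /* still queued: data is the cwq pointer */
          gcwq = ((struct cpu_workqueue_struct *)data)->gcwq;
  else if (data >> WORK_STRUCT_FLAG_BITS != NR_CPUS)
          /* executed before: data carries the last cpu number */
          gcwq = get_gcwq(data >> WORK_STRUCT_FLAG_BITS);
  else
          /* never queued since init: WORK_STRUCT_NO_CPU */
          gcwq = NULL;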

* Anton Blanchard fixed compile error on powerpc due to missing
linux/threads.h include.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Anton Blanchard <[email protected]>
---
include/linux/workqueue.h | 7 ++-
kernel/workqueue.c | 163 ++++++++++++++++++++++++++++----------------
2 files changed, 109 insertions(+), 61 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 10611f7..0a78141 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -9,6 +9,7 @@
#include <linux/linkage.h>
#include <linux/bitops.h>
#include <linux/lockdep.h>
+#include <linux/threads.h>
#include <asm/atomic.h>

struct workqueue_struct;
@@ -59,6 +60,7 @@ enum {

WORK_STRUCT_FLAG_MASK = (1UL << WORK_STRUCT_FLAG_BITS) - 1,
WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
+ WORK_STRUCT_NO_CPU = NR_CPUS << WORK_STRUCT_FLAG_BITS,
};

struct work_struct {
@@ -70,8 +72,9 @@ struct work_struct {
#endif
};

-#define WORK_DATA_INIT() ATOMIC_LONG_INIT(0)
-#define WORK_DATA_STATIC_INIT() ATOMIC_LONG_INIT(WORK_STRUCT_STATIC)
+#define WORK_DATA_INIT() ATOMIC_LONG_INIT(WORK_STRUCT_NO_CPU)
+#define WORK_DATA_STATIC_INIT() \
+ ATOMIC_LONG_INIT(WORK_STRUCT_NO_CPU | WORK_STRUCT_STATIC)

struct delayed_work {
struct work_struct work;
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7111683..f606c44 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -319,31 +319,71 @@ static int work_next_color(int color)
}

/*
- * Set the workqueue on which a work item is to be run
- * - Must *only* be called if the pending flag is set
+ * Work data points to the cwq while a work is on queue. Once
+ * execution starts, it points to the cpu the work was last on. This
+ * can be distinguished by comparing the data value against
+ * PAGE_OFFSET.
+ *
+ * set_work_{cwq|cpu}() and clear_work_data() can be used to set the
+ * cwq, cpu or clear work->data. These functions should only be
+ * called while the work is owned - ie. while the PENDING bit is set.
+ *
+ * get_work_[g]cwq() can be used to obtain the gcwq or cwq
+ * corresponding to a work. gcwq is available once the work has been
+ * queued anywhere after initialization. cwq is available only from
+ * queueing until execution starts.
*/
-static inline void set_wq_data(struct work_struct *work,
- struct cpu_workqueue_struct *cwq,
- unsigned long extra_flags)
+static inline void set_work_data(struct work_struct *work, unsigned long data,
+ unsigned long flags)
{
BUG_ON(!work_pending(work));
+ atomic_long_set(&work->data, data | flags | work_static(work));
+}

- atomic_long_set(&work->data, (unsigned long)cwq | work_static(work) |
- WORK_STRUCT_PENDING | extra_flags);
+static void set_work_cwq(struct work_struct *work,
+ struct cpu_workqueue_struct *cwq,
+ unsigned long extra_flags)
+{
+ set_work_data(work, (unsigned long)cwq,
+ WORK_STRUCT_PENDING | extra_flags);
}

-/*
- * Clear WORK_STRUCT_PENDING and the workqueue on which it was queued.
- */
-static inline void clear_wq_data(struct work_struct *work)
+static void set_work_cpu(struct work_struct *work, unsigned int cpu)
+{
+ set_work_data(work, cpu << WORK_STRUCT_FLAG_BITS, WORK_STRUCT_PENDING);
+}
+
+static void clear_work_data(struct work_struct *work)
+{
+ set_work_data(work, WORK_STRUCT_NO_CPU, 0);
+}
+
+static inline unsigned long get_work_data(struct work_struct *work)
+{
+ return atomic_long_read(&work->data) & WORK_STRUCT_WQ_DATA_MASK;
+}
+
+static struct cpu_workqueue_struct *get_work_cwq(struct work_struct *work)
{
- atomic_long_set(&work->data, work_static(work));
+ unsigned long data = get_work_data(work);
+
+ return data >= PAGE_OFFSET ? (void *)data : NULL;
}

-static inline struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
+static struct global_cwq *get_work_gcwq(struct work_struct *work)
{
- return (void *)(atomic_long_read(&work->data) &
- WORK_STRUCT_WQ_DATA_MASK);
+ unsigned long data = get_work_data(work);
+ unsigned int cpu;
+
+ if (data >= PAGE_OFFSET)
+ return ((struct cpu_workqueue_struct *)data)->gcwq;
+
+ cpu = data >> WORK_STRUCT_FLAG_BITS;
+ if (cpu == NR_CPUS)
+ return NULL;
+
+ BUG_ON(cpu >= num_possible_cpus());
+ return get_gcwq(cpu);
}

/**
@@ -443,7 +483,7 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
unsigned int extra_flags)
{
/* we own @work, set data and link */
- set_wq_data(work, cwq, extra_flags);
+ set_work_cwq(work, cwq, extra_flags);

/*
* Ensure that we get the right work->data if we see the
@@ -599,7 +639,7 @@ EXPORT_SYMBOL_GPL(queue_work_on);
static void delayed_work_timer_fn(unsigned long __data)
{
struct delayed_work *dwork = (struct delayed_work *)__data;
- struct cpu_workqueue_struct *cwq = get_wq_data(&dwork->work);
+ struct cpu_workqueue_struct *cwq = get_work_cwq(&dwork->work);

__queue_work(smp_processor_id(), cwq->wq, &dwork->work);
}
@@ -639,13 +679,19 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
struct work_struct *work = &dwork->work;

if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+ struct global_cwq *gcwq = get_work_gcwq(work);
+ unsigned int lcpu = gcwq ? gcwq->cpu : raw_smp_processor_id();
+
BUG_ON(timer_pending(timer));
BUG_ON(!list_empty(&work->entry));

timer_stats_timer_set_start_info(&dwork->timer);
-
- /* This stores cwq for the moment, for the timer_fn */
- set_wq_data(work, get_cwq(raw_smp_processor_id(), wq), 0);
+ /*
+ * This stores cwq for the moment, for the timer_fn.
+ * Note that the work's gcwq is preserved to allow
+ * reentrance detection for delayed works.
+ */
+ set_work_cwq(work, get_cwq(lcpu, wq), 0);
timer->expires = jiffies + delay;
timer->data = (unsigned long)dwork;
timer->function = delayed_work_timer_fn;
@@ -970,11 +1016,14 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
worker->current_work = work;
worker->current_cwq = cwq;
work_color = get_work_color(work);
+
+ BUG_ON(get_work_cwq(work) != cwq);
+ /* record the current cpu number in the work data and dequeue */
+ set_work_cpu(work, gcwq->cpu);
list_del_init(&work->entry);

spin_unlock_irq(&gcwq->lock);

- BUG_ON(get_wq_data(work) != cwq);
work_clear_pending(work);
lock_map_acquire(&cwq->wq->lockdep_map);
lock_map_acquire(&lockdep_map);
@@ -1406,37 +1455,39 @@ EXPORT_SYMBOL_GPL(flush_workqueue);
int flush_work(struct work_struct *work)
{
struct worker *worker = NULL;
- struct cpu_workqueue_struct *cwq;
struct global_cwq *gcwq;
+ struct cpu_workqueue_struct *cwq;
struct wq_barrier barr;

might_sleep();
- cwq = get_wq_data(work);
- if (!cwq)
+ gcwq = get_work_gcwq(work);
+ if (!gcwq)
return 0;
- gcwq = cwq->gcwq;
-
- lock_map_acquire(&cwq->wq->lockdep_map);
- lock_map_release(&cwq->wq->lockdep_map);

spin_lock_irq(&gcwq->lock);
if (!list_empty(&work->entry)) {
/*
* See the comment near try_to_grab_pending()->smp_rmb().
- * If it was re-queued under us we are not going to wait.
+ * If it was re-queued to a different gcwq under us, we
+ * are not going to wait.
*/
smp_rmb();
- if (unlikely(cwq != get_wq_data(work)))
+ cwq = get_work_cwq(work);
+ if (unlikely(!cwq || gcwq != cwq->gcwq))
goto already_gone;
} else {
- if (cwq->worker && cwq->worker->current_work == work)
- worker = cwq->worker;
+ worker = find_worker_executing_work(gcwq, work);
if (!worker)
goto already_gone;
+ cwq = worker->current_cwq;
}

insert_wq_barrier(cwq, &barr, work, worker);
spin_unlock_irq(&gcwq->lock);
+
+ lock_map_acquire(&cwq->wq->lockdep_map);
+ lock_map_release(&cwq->wq->lockdep_map);
+
wait_for_completion(&barr.done);
destroy_work_on_stack(&barr.work);
return 1;
@@ -1453,7 +1504,6 @@ EXPORT_SYMBOL_GPL(flush_work);
static int try_to_grab_pending(struct work_struct *work)
{
struct global_cwq *gcwq;
- struct cpu_workqueue_struct *cwq;
int ret = -1;

if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))
@@ -1463,24 +1513,23 @@ static int try_to_grab_pending(struct work_struct *work)
* The queueing is in progress, or it is already queued. Try to
* steal it from ->worklist without clearing WORK_STRUCT_PENDING.
*/
-
- cwq = get_wq_data(work);
- if (!cwq)
+ gcwq = get_work_gcwq(work);
+ if (!gcwq)
return ret;
- gcwq = cwq->gcwq;

spin_lock_irq(&gcwq->lock);
if (!list_empty(&work->entry)) {
/*
- * This work is queued, but perhaps we locked the wrong cwq.
+ * This work is queued, but perhaps we locked the wrong gcwq.
* In that case we must see the new value after rmb(), see
* insert_work()->wmb().
*/
smp_rmb();
- if (cwq == get_wq_data(work)) {
+ if (gcwq == get_work_gcwq(work)) {
debug_work_deactivate(work);
list_del_init(&work->entry);
- cwq_dec_nr_in_flight(cwq, get_work_color(work));
+ cwq_dec_nr_in_flight(get_work_cwq(work),
+ get_work_color(work));
ret = 1;
}
}
@@ -1489,20 +1538,16 @@ static int try_to_grab_pending(struct work_struct *work)
return ret;
}

-static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq,
- struct work_struct *work)
+static void wait_on_cpu_work(struct global_cwq *gcwq, struct work_struct *work)
{
- struct global_cwq *gcwq = cwq->gcwq;
struct wq_barrier barr;
struct worker *worker;

spin_lock_irq(&gcwq->lock);

- worker = NULL;
- if (unlikely(cwq->worker && cwq->worker->current_work == work)) {
- worker = cwq->worker;
- insert_wq_barrier(cwq, &barr, work, worker);
- }
+ worker = find_worker_executing_work(gcwq, work);
+ if (unlikely(worker))
+ insert_wq_barrier(worker->current_cwq, &barr, work, worker);

spin_unlock_irq(&gcwq->lock);

@@ -1514,8 +1559,6 @@ static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq,

static void wait_on_work(struct work_struct *work)
{
- struct cpu_workqueue_struct *cwq;
- struct workqueue_struct *wq;
int cpu;

might_sleep();
@@ -1523,14 +1566,8 @@ static void wait_on_work(struct work_struct *work)
lock_map_acquire(&work->lockdep_map);
lock_map_release(&work->lockdep_map);

- cwq = get_wq_data(work);
- if (!cwq)
- return;
-
- wq = cwq->wq;
-
for_each_possible_cpu(cpu)
- wait_on_cpu_work(get_cwq(cpu, wq), work);
+ wait_on_cpu_work(get_gcwq(cpu), work);
}

static int __cancel_work_timer(struct work_struct *work,
@@ -1545,7 +1582,7 @@ static int __cancel_work_timer(struct work_struct *work,
wait_on_work(work);
} while (unlikely(ret < 0));

- clear_wq_data(work);
+ clear_work_data(work);
return ret;
}

@@ -1647,7 +1684,7 @@ EXPORT_SYMBOL(schedule_delayed_work);
void flush_delayed_work(struct delayed_work *dwork)
{
if (del_timer_sync(&dwork->timer)) {
- __queue_work(get_cpu(), get_wq_data(&dwork->work)->wq,
+ __queue_work(get_cpu(), get_work_cwq(&dwork->work)->wq,
&dwork->work);
put_cpu();
}
@@ -2407,6 +2444,14 @@ void __init init_workqueues(void)
BUILD_BUG_ON(__alignof__(struct cpu_workqueue_struct) <
__alignof__(unsigned long long));

+ /*
+ * The pointer part of work->data is either pointing to the
+ * cwq or contains the cpu number the work ran last on. Make
+ * sure cpu number won't overflow into kernel pointer area so
+ * that they can be distinguished.
+ */
+ BUILD_BUG_ON(NR_CPUS << WORK_STRUCT_FLAG_BITS >= PAGE_OFFSET);
+
hotcpu_notifier(workqueue_cpu_callback, CPU_PRI_WORKQUEUE);

/* initialize gcwqs */
--
1.6.4.2

2010-06-14 21:42:48

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 09/30] workqueue: kill cpu_populated_map

Worker management is about to be overhauled. Simplify things by
removing cpu_populated_map, creating workers for all possible cpus and
making single threaded workqueues behave more like multi threaded
ones.

After this patch, all cwqs are always initialized, all workqueues are
linked on the workqueues list and workers for all possible cpus
always exist. This also makes CPU hotplug support simpler - checking
->cpus_allowed before processing works in worker_thread() and flushing
cwqs on CPU_POST_DEAD are enough.

While at it, make get_cwq() always return the cwq for the specified
cpu, add target_cwq() for cases where the single thread distinction is
necessary, and drop all direct usage of per_cpu_ptr() on wq->cpu_wq.
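
In other words (an illustrative sketch of the calling convention):
queueing goes through target_cwq() so that single threaded workqueues
collapse onto singlethread_cpu, while the flush/wait paths walk
get_cwq() for every possible cpu.

  /* queueing path: honors WQ_SINGLE_THREAD */
  struct cpu_workqueue_struct *cwq = target_cwq(cpu, wq);

  /* flush path: always visits every possible cpu's cwq */
  for_each_possible_cpu(cpu)
          flush_cpu_workqueue(get_cwq(cpu, wq));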

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 173 ++++++++++++++++++----------------------------------
1 files changed, 59 insertions(+), 114 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f7ab703..dc78956 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -55,6 +55,7 @@ struct cpu_workqueue_struct {
struct list_head worklist;
wait_queue_head_t more_work;
struct work_struct *current_work;
+ unsigned int cpu;

struct workqueue_struct *wq; /* I: the owning workqueue */
struct task_struct *thread;
@@ -189,34 +190,19 @@ static DEFINE_SPINLOCK(workqueue_lock);
static LIST_HEAD(workqueues);

static int singlethread_cpu __read_mostly;
-static const struct cpumask *cpu_singlethread_map __read_mostly;
-/*
- * _cpu_down() first removes CPU from cpu_online_map, then CPU_DEAD
- * flushes cwq->worklist. This means that flush_workqueue/wait_on_work
- * which comes in between can't use for_each_online_cpu(). We could
- * use cpu_possible_map, the cpumask below is more a documentation
- * than optimization.
- */
-static cpumask_var_t cpu_populated_map __read_mostly;
-
-/* If it's single threaded, it isn't in the list of workqueues. */
-static inline bool is_wq_single_threaded(struct workqueue_struct *wq)
-{
- return wq->flags & WQ_SINGLE_THREAD;
-}

-static const struct cpumask *wq_cpu_map(struct workqueue_struct *wq)
+static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
+ struct workqueue_struct *wq)
{
- return is_wq_single_threaded(wq)
- ? cpu_singlethread_map : cpu_populated_map;
+ return per_cpu_ptr(wq->cpu_wq, cpu);
}

-static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
- struct workqueue_struct *wq)
+static struct cpu_workqueue_struct *target_cwq(unsigned int cpu,
+ struct workqueue_struct *wq)
{
- if (unlikely(is_wq_single_threaded(wq)))
+ if (unlikely(wq->flags & WQ_SINGLE_THREAD))
cpu = singlethread_cpu;
- return per_cpu_ptr(wq->cpu_wq, cpu);
+ return get_cwq(cpu, wq);
}

/*
@@ -279,7 +265,7 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
struct work_struct *work)
{
- struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+ struct cpu_workqueue_struct *cwq = target_cwq(cpu, wq);
unsigned long flags;

debug_work_activate(work);
@@ -383,7 +369,7 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
timer_stats_timer_set_start_info(&dwork->timer);

/* This stores cwq for the moment, for the timer_fn */
- set_wq_data(work, get_cwq(raw_smp_processor_id(), wq), 0);
+ set_wq_data(work, target_cwq(raw_smp_processor_id(), wq), 0);
timer->expires = jiffies + delay;
timer->data = (unsigned long)dwork;
timer->function = delayed_work_timer_fn;
@@ -495,6 +481,10 @@ static int worker_thread(void *__cwq)
if (kthread_should_stop())
break;

+ if (unlikely(!cpumask_equal(&cwq->thread->cpus_allowed,
+ get_cpu_mask(cwq->cpu))))
+ set_cpus_allowed_ptr(cwq->thread,
+ get_cpu_mask(cwq->cpu));
run_workqueue(cwq);
}

@@ -574,14 +564,13 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
*/
void flush_workqueue(struct workqueue_struct *wq)
{
- const struct cpumask *cpu_map = wq_cpu_map(wq);
int cpu;

might_sleep();
lock_map_acquire(&wq->lockdep_map);
lock_map_release(&wq->lockdep_map);
- for_each_cpu(cpu, cpu_map)
- flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
+ for_each_possible_cpu(cpu)
+ flush_cpu_workqueue(get_cwq(cpu, wq));
}
EXPORT_SYMBOL_GPL(flush_workqueue);

@@ -699,7 +688,6 @@ static void wait_on_work(struct work_struct *work)
{
struct cpu_workqueue_struct *cwq;
struct workqueue_struct *wq;
- const struct cpumask *cpu_map;
int cpu;

might_sleep();
@@ -712,9 +700,8 @@ static void wait_on_work(struct work_struct *work)
return;

wq = cwq->wq;
- cpu_map = wq_cpu_map(wq);

- for_each_cpu(cpu, cpu_map)
+ for_each_possible_cpu(cpu)
wait_on_cpu_work(get_cwq(cpu, wq), work);
}

@@ -972,7 +959,7 @@ int current_is_keventd(void)

BUG_ON(!keventd_wq);

- cwq = per_cpu_ptr(keventd_wq->cpu_wq, cpu);
+ cwq = get_cwq(cpu, keventd_wq);
if (current == cwq->thread)
ret = 1;

@@ -980,26 +967,12 @@ int current_is_keventd(void)

}

-static struct cpu_workqueue_struct *
-init_cpu_workqueue(struct workqueue_struct *wq, int cpu)
-{
- struct cpu_workqueue_struct *cwq = per_cpu_ptr(wq->cpu_wq, cpu);
-
- cwq->wq = wq;
- spin_lock_init(&cwq->lock);
- INIT_LIST_HEAD(&cwq->worklist);
- init_waitqueue_head(&cwq->more_work);
-
- return cwq;
-}
-
static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
{
struct workqueue_struct *wq = cwq->wq;
- const char *fmt = is_wq_single_threaded(wq) ? "%s" : "%s/%d";
struct task_struct *p;

- p = kthread_create(worker_thread, cwq, fmt, wq->name, cpu);
+ p = kthread_create(worker_thread, cwq, "%s/%d", wq->name, cpu);
/*
* Nobody can add the work_struct to this cwq,
* if (caller is __create_workqueue)
@@ -1031,8 +1004,8 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
struct lock_class_key *key,
const char *lock_name)
{
+ bool singlethread = flags & WQ_SINGLE_THREAD;
struct workqueue_struct *wq;
- struct cpu_workqueue_struct *cwq;
int err = 0, cpu;

wq = kzalloc(sizeof(*wq), GFP_KERNEL);
@@ -1048,37 +1021,37 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
INIT_LIST_HEAD(&wq->list);

- if (flags & WQ_SINGLE_THREAD) {
- cwq = init_cpu_workqueue(wq, singlethread_cpu);
- err = create_workqueue_thread(cwq, singlethread_cpu);
- start_workqueue_thread(cwq, -1);
- } else {
- cpu_maps_update_begin();
- /*
- * We must place this wq on list even if the code below fails.
- * cpu_down(cpu) can remove cpu from cpu_populated_map before
- * destroy_workqueue() takes the lock, in that case we leak
- * cwq[cpu]->thread.
- */
- spin_lock(&workqueue_lock);
- list_add(&wq->list, &workqueues);
- spin_unlock(&workqueue_lock);
- /*
- * We must initialize cwqs for each possible cpu even if we
- * are going to call destroy_workqueue() finally. Otherwise
- * cpu_up() can hit the uninitialized cwq once we drop the
- * lock.
- */
- for_each_possible_cpu(cpu) {
- cwq = init_cpu_workqueue(wq, cpu);
- if (err || !cpu_online(cpu))
- continue;
- err = create_workqueue_thread(cwq, cpu);
+ cpu_maps_update_begin();
+ /*
+ * We must initialize cwqs for each possible cpu even if we
+ * are going to call destroy_workqueue() finally. Otherwise
+ * cpu_up() can hit the uninitialized cwq once we drop the
+ * lock.
+ */
+ for_each_possible_cpu(cpu) {
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+ cwq->wq = wq;
+ cwq->cpu = cpu;
+ spin_lock_init(&cwq->lock);
+ INIT_LIST_HEAD(&cwq->worklist);
+ init_waitqueue_head(&cwq->more_work);
+
+ if (err)
+ continue;
+ err = create_workqueue_thread(cwq, cpu);
+ if (cpu_online(cpu) && !singlethread)
start_workqueue_thread(cwq, cpu);
- }
- cpu_maps_update_done();
+ else
+ start_workqueue_thread(cwq, -1);
}

+ spin_lock(&workqueue_lock);
+ list_add(&wq->list, &workqueues);
+ spin_unlock(&workqueue_lock);
+
+ cpu_maps_update_done();
+
if (err) {
destroy_workqueue(wq);
wq = NULL;
@@ -1128,17 +1101,16 @@ static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq)
*/
void destroy_workqueue(struct workqueue_struct *wq)
{
- const struct cpumask *cpu_map = wq_cpu_map(wq);
int cpu;

cpu_maps_update_begin();
spin_lock(&workqueue_lock);
list_del(&wq->list);
spin_unlock(&workqueue_lock);
+ cpu_maps_update_done();

- for_each_cpu(cpu, cpu_map)
- cleanup_workqueue_thread(per_cpu_ptr(wq->cpu_wq, cpu));
- cpu_maps_update_done();
+ for_each_possible_cpu(cpu)
+ cleanup_workqueue_thread(get_cwq(cpu, wq));

free_percpu(wq->cpu_wq);
kfree(wq);
@@ -1152,48 +1124,25 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
unsigned int cpu = (unsigned long)hcpu;
struct cpu_workqueue_struct *cwq;
struct workqueue_struct *wq;
- int err = 0;

action &= ~CPU_TASKS_FROZEN;

- switch (action) {
- case CPU_UP_PREPARE:
- cpumask_set_cpu(cpu, cpu_populated_map);
- }
-undo:
list_for_each_entry(wq, &workqueues, list) {
- cwq = per_cpu_ptr(wq->cpu_wq, cpu);
+ if (wq->flags & WQ_SINGLE_THREAD)
+ continue;

- switch (action) {
- case CPU_UP_PREPARE:
- err = create_workqueue_thread(cwq, cpu);
- if (!err)
- break;
- printk(KERN_ERR "workqueue [%s] for %i failed\n",
- wq->name, cpu);
- action = CPU_UP_CANCELED;
- err = -ENOMEM;
- goto undo;
-
- case CPU_ONLINE:
- start_workqueue_thread(cwq, cpu);
- break;
+ cwq = get_cwq(cpu, wq);

- case CPU_UP_CANCELED:
- start_workqueue_thread(cwq, -1);
+ switch (action) {
case CPU_POST_DEAD:
- cleanup_workqueue_thread(cwq);
+ lock_map_acquire(&cwq->wq->lockdep_map);
+ lock_map_release(&cwq->wq->lockdep_map);
+ flush_cpu_workqueue(cwq);
break;
}
}

- switch (action) {
- case CPU_UP_CANCELED:
- case CPU_POST_DEAD:
- cpumask_clear_cpu(cpu, cpu_populated_map);
- }
-
- return notifier_from_errno(err);
+ return notifier_from_errno(0);
}

#ifdef CONFIG_SMP
@@ -1245,11 +1194,7 @@ EXPORT_SYMBOL_GPL(work_on_cpu);

void __init init_workqueues(void)
{
- alloc_cpumask_var(&cpu_populated_map, GFP_KERNEL);
-
- cpumask_copy(cpu_populated_map, cpu_online_mask);
singlethread_cpu = cpumask_first(cpu_possible_mask);
- cpu_singlethread_map = cpumask_of(singlethread_cpu);
hotcpu_notifier(workqueue_cpu_callback, 0);
keventd_wq = create_workqueue("events");
BUG_ON(!keventd_wq);
--
1.6.4.2

2010-06-14 21:39:21

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 10/30] workqueue: update cwq alignment

The work->data field is used for two purposes. It points to the cwq
the work is queued on, and the lower bits are used for flags.
Currently, two bits are reserved, which is always safe as 4 byte
alignment is guaranteed on every architecture. However, future
changes will need more flag bits.

On SMP, the percpu allocator is capable of honoring larger alignment
(there are other users which depend on it) and larger alignment works
just fine. On UP, the percpu allocator is a thin wrapper around
kzalloc/kfree() and doesn't honor the alignment request.

This patch introduces WORK_STRUCT_FLAG_BITS and implements
alloc/free_cwqs() which guarantee (1 << WORK_STRUCT_FLAG_BITS)
alignment both on SMP and UP. On SMP, simply wrapping the percpu
allocator is enough. On UP, extra space is allocated so that the cwq
can be aligned and the original pointer can be stored after it, which
is used in the free path.
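
The UP trick is the usual over-allocate-and-align pattern. A rough,
self-contained user-space analogue (plain malloc/free instead of the
percpu allocator; align is assumed to be a power of two):

  #include <stdlib.h>
  #include <stdint.h>

  /*
   * Over-allocate, align the object and stash the original pointer
   * right after the aligned object so it can be recovered on free.
   */
  static void *alloc_aligned(size_t size, size_t align)
  {
          void *ptr = malloc(size + align + sizeof(void *));
          void *obj;

          if (!ptr)
                  return NULL;
          obj = (void *)(((uintptr_t)ptr + align - 1) & ~(uintptr_t)(align - 1));
          *(void **)((char *)obj + size) = ptr;
          return obj;
  }

  static void free_aligned(void *obj, size_t size)
  {
          if (obj)
                  free(*(void **)((char *)obj + size));
  }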

While at it, as cwqs are now forced aligned, make sure the resulting
alignment is at least equal to or larger than that of long long.

Alignment problem on UP is reported by Michal Simek.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Ingo Molnar <[email protected]>
Reported-by: Michal Simek <[email protected]>
---
include/linux/workqueue.h | 5 +++-
kernel/workqueue.c | 62 +++++++++++++++++++++++++++++++++++++++++---
2 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index d60c570..b90958a 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -26,6 +26,9 @@ enum {
WORK_STRUCT_PENDING_BIT = 0, /* work item is pending execution */
#ifdef CONFIG_DEBUG_OBJECTS_WORK
WORK_STRUCT_STATIC_BIT = 1, /* static initializer (debugobjects) */
+ WORK_STRUCT_FLAG_BITS = 2,
+#else
+ WORK_STRUCT_FLAG_BITS = 1,
#endif

WORK_STRUCT_PENDING = 1 << WORK_STRUCT_PENDING_BIT,
@@ -35,7 +38,7 @@ enum {
WORK_STRUCT_STATIC = 0,
#endif

- WORK_STRUCT_FLAG_MASK = 3UL,
+ WORK_STRUCT_FLAG_MASK = (1UL << WORK_STRUCT_FLAG_BITS) - 1,
WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
};

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index dc78956..878546e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -46,7 +46,9 @@

/*
* The per-CPU workqueue (if single thread, we always use the first
- * possible cpu).
+ * possible cpu). The lower WORK_STRUCT_FLAG_BITS of
+ * work_struct->data are used for flags and thus cwqs need to be
+ * aligned at two's power of the number of flag bits.
*/
struct cpu_workqueue_struct {

@@ -59,7 +61,7 @@ struct cpu_workqueue_struct {

struct workqueue_struct *wq; /* I: the owning workqueue */
struct task_struct *thread;
-} ____cacheline_aligned;
+};

/*
* The externally visible workqueue abstraction is an array of
@@ -967,6 +969,47 @@ int current_is_keventd(void)

}

+static struct cpu_workqueue_struct *alloc_cwqs(void)
+{
+ const size_t size = sizeof(struct cpu_workqueue_struct);
+ const size_t align = 1 << WORK_STRUCT_FLAG_BITS;
+ struct cpu_workqueue_struct *cwqs;
+#ifndef CONFIG_SMP
+ void *ptr;
+
+ /*
+ * On UP, percpu allocator doesn't honor alignment parameter
+ * and simply uses arch-dependent default. Allocate enough
+ * room to align cwq and put an extra pointer at the end
+ * pointing back to the originally allocated pointer which
+ * will be used for free.
+ *
+ * FIXME: This really belongs to UP percpu code. Update UP
+ * percpu code to honor alignment and remove this ugliness.
+ */
+ ptr = __alloc_percpu(size + align + sizeof(void *), 1);
+ cwqs = PTR_ALIGN(ptr, align);
+ *(void **)per_cpu_ptr(cwqs + 1, 0) = ptr;
+#else
+ /* On SMP, percpu allocator can do it itself */
+ cwqs = __alloc_percpu(size, align);
+#endif
+ /* just in case, make sure it's actually aligned */
+ BUG_ON(!IS_ALIGNED((unsigned long)cwqs, align));
+ return cwqs;
+}
+
+static void free_cwqs(struct cpu_workqueue_struct *cwqs)
+{
+#ifndef CONFIG_SMP
+ /* on UP, the pointer to free is stored right after the cwq */
+ if (cwqs)
+ free_percpu(*(void **)per_cpu_ptr(cwqs + 1, 0));
+#else
+ free_percpu(cwqs);
+#endif
+}
+
static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
{
struct workqueue_struct *wq = cwq->wq;
@@ -1012,7 +1055,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
if (!wq)
goto err;

- wq->cpu_wq = alloc_percpu(struct cpu_workqueue_struct);
+ wq->cpu_wq = alloc_cwqs();
if (!wq->cpu_wq)
goto err;

@@ -1031,6 +1074,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
for_each_possible_cpu(cpu) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);

+ BUG_ON((unsigned long)cwq & WORK_STRUCT_FLAG_MASK);
cwq->wq = wq;
cwq->cpu = cpu;
spin_lock_init(&cwq->lock);
@@ -1059,7 +1103,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
return wq;
err:
if (wq) {
- free_percpu(wq->cpu_wq);
+ free_cwqs(wq->cpu_wq);
kfree(wq);
}
return NULL;
@@ -1112,7 +1156,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
for_each_possible_cpu(cpu)
cleanup_workqueue_thread(get_cwq(cpu, wq));

- free_percpu(wq->cpu_wq);
+ free_cwqs(wq->cpu_wq);
kfree(wq);
}
EXPORT_SYMBOL_GPL(destroy_workqueue);
@@ -1194,6 +1238,14 @@ EXPORT_SYMBOL_GPL(work_on_cpu);

void __init init_workqueues(void)
{
+ /*
+ * cwqs are forced aligned according to WORK_STRUCT_FLAG_BITS.
+ * Make sure that the alignment isn't lower than that of
+ * unsigned long long.
+ */
+ BUILD_BUG_ON(__alignof__(struct cpu_workqueue_struct) <
+ __alignof__(unsigned long long));
+
singlethread_cpu = cpumask_first(cpu_possible_mask);
hotcpu_notifier(workqueue_cpu_callback, 0);
keventd_wq = create_workqueue("events");
--
1.6.4.2

2010-06-14 21:43:37

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 01/30] kthread: implement kthread_data()

Implement kthread_data() which takes @task pointing to a kthread and
returns @data specified when creating the kthread. The caller is
responsible for ensuring the validity of @task when calling this
function.
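
A hedged sketch of the intended pairing (struct worker stands in for
whatever @data the caller passed to kthread_create(); it is not part
of this patch):

  /* creation side: @worker becomes the kthread's data */
  worker->task = kthread_create(worker_thread, worker, "kworker/%u:%d",
                                cpu, id);

  /*
   * Elsewhere, given only the task (e.g. from a scheduler notifier),
   * the original data pointer can be recovered.
   */
  struct worker *w = kthread_data(task);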

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/kthread.h | 1 +
kernel/kthread.c | 15 +++++++++++++++
2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index aabc8a1..14f63e8 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -30,6 +30,7 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
void kthread_bind(struct task_struct *k, unsigned int cpu);
int kthread_stop(struct task_struct *k);
int kthread_should_stop(void);
+void *kthread_data(struct task_struct *k);

int kthreadd(void *unused);
extern struct task_struct *kthreadd_task;
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 83911c7..d176202 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -35,6 +35,7 @@ struct kthread_create_info

struct kthread {
int should_stop;
+ void *data;
struct completion exited;
};

@@ -54,6 +55,19 @@ int kthread_should_stop(void)
}
EXPORT_SYMBOL(kthread_should_stop);

+/**
+ * kthread_data - return data value specified on kthread creation
+ * @task: kthread task in question
+ *
+ * Return the data value specified when kthread @task was created.
+ * The caller is responsible for ensuring the validity of @task when
+ * calling this function.
+ */
+void *kthread_data(struct task_struct *task)
+{
+ return to_kthread(task)->data;
+}
+
static int kthread(void *_create)
{
/* Copy data: it's on kthread's stack */
@@ -64,6 +78,7 @@ static int kthread(void *_create)
int ret;

self.should_stop = 0;
+ self.data = data;
init_completion(&self.exited);
current->vfork_done = &self.exited;

--
1.6.4.2

2010-06-14 21:43:56

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 18/30] workqueue: reimplement CPU hotplugging support using trustee

Reimplement CPU hotplugging support using a trustee thread. On CPU
down, a trustee thread is created, each step of CPU down is executed
by the trustee, and workqueue_cpu_callback() simply drives and waits
for trustee state transitions.

The CPU down operation no longer waits for works to be drained;
instead, the trustee sticks around till all pending works have been
completed. If the CPU is brought back up while works are still
draining, workqueue_cpu_callback() tells the trustee to step down and
tells workers to rebind to the cpu.

As it's difficult to tell whether cwqs are empty while freezing or
frozen, the trustee doesn't consider draining to be complete while a
gcwq is freezing or frozen (tracked by the new GCWQ_FREEZING flag).
Also, workers which get unbound from their cpu are marked with
WORKER_ROGUE.

The trustee based implementation doesn't bring any new feature at
this point, but it will be used to manage the worker pool when the
dynamic shared worker pool is implemented.
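
The drive-and-wait pattern on the callback side boils down to
something like the following (simplified from wait_trustee_state() in
the patch below):

  /*
   * cpu callback: wait for the trustee to reach @state, dropping
   * gcwq->lock while sleeping.
   */
  spin_unlock_irq(&gcwq->lock);
  wait_event(gcwq->trustee_wait,
             gcwq->trustee_state == state ||
             gcwq->trustee_state == TRUSTEE_DONE);
  spin_lock_irq(&gcwq->lock);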

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cpu.h | 2 +
kernel/workqueue.c | 293 ++++++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 279 insertions(+), 16 deletions(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index de6b172..4823af6 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -71,6 +71,8 @@ enum {
/* migration should happen before other stuff but after perf */
CPU_PRI_PERF = 20,
CPU_PRI_MIGRATION = 10,
+ /* prepare workqueues for other notifiers */
+ CPU_PRI_WORKQUEUE = 5,
};

#ifdef CONFIG_SMP
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 62d7cfd..5cd155d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -36,14 +36,27 @@
#include <linux/idr.h>

enum {
+ /* global_cwq flags */
+ GCWQ_FREEZING = 1 << 3, /* freeze in progress */
+
/* worker flags */
WORKER_STARTED = 1 << 0, /* started */
WORKER_DIE = 1 << 1, /* die die die */
WORKER_IDLE = 1 << 2, /* is idle */
+ WORKER_ROGUE = 1 << 4, /* not bound to any cpu */
+
+ /* gcwq->trustee_state */
+ TRUSTEE_START = 0, /* start */
+ TRUSTEE_IN_CHARGE = 1, /* trustee in charge of gcwq */
+ TRUSTEE_BUTCHER = 2, /* butcher workers */
+ TRUSTEE_RELEASE = 3, /* release workers */
+ TRUSTEE_DONE = 4, /* trustee is done */

BUSY_WORKER_HASH_ORDER = 6, /* 64 pointers */
BUSY_WORKER_HASH_SIZE = 1 << BUSY_WORKER_HASH_ORDER,
BUSY_WORKER_HASH_MASK = BUSY_WORKER_HASH_SIZE - 1,
+
+ TRUSTEE_COOLDOWN = HZ / 10, /* for trustee draining */
};

/*
@@ -83,6 +96,7 @@ struct worker {
struct global_cwq {
spinlock_t lock; /* the gcwq lock */
unsigned int cpu; /* I: the associated cpu */
+ unsigned int flags; /* L: GCWQ_* flags */

int nr_workers; /* L: total number of workers */
int nr_idle; /* L: currently idle ones */
@@ -93,6 +107,10 @@ struct global_cwq {
/* L: hash of busy workers */

struct ida worker_ida; /* L: for worker IDs */
+
+ struct task_struct *trustee; /* L: for gcwq shutdown */
+ unsigned int trustee_state; /* L: trustee state */
+ wait_queue_head_t trustee_wait; /* trustee wait */
} ____cacheline_aligned_in_smp;

/*
@@ -148,6 +166,10 @@ struct workqueue_struct {
#endif
};

+#define for_each_busy_worker(worker, i, pos, gcwq) \
+ for (i = 0; i < BUSY_WORKER_HASH_SIZE; i++) \
+ hlist_for_each_entry(worker, pos, &gcwq->busy_hash[i], hentry)
+
#ifdef CONFIG_DEBUG_OBJECTS_WORK

static struct debug_obj_descr work_debug_descr;
@@ -546,6 +568,9 @@ static void worker_enter_idle(struct worker *worker)

/* idle_list is LIFO */
list_add(&worker->entry, &gcwq->idle_list);
+
+ if (unlikely(worker->flags & WORKER_ROGUE))
+ wake_up_all(&gcwq->trustee_wait);
}

/**
@@ -622,8 +647,15 @@ static struct worker *create_worker(struct cpu_workqueue_struct *cwq, bool bind)
if (IS_ERR(worker->task))
goto fail;

+ /*
+ * A rogue worker will become a regular one if CPU comes
+ * online later on. Make sure every worker has
+ * PF_THREAD_BOUND set.
+ */
if (bind)
kthread_bind(worker->task, gcwq->cpu);
+ else
+ worker->task->flags |= PF_THREAD_BOUND;

return worker;
fail:
@@ -882,10 +914,6 @@ static int worker_thread(void *__worker)
struct cpu_workqueue_struct *cwq = worker->cwq;

woke_up:
- if (unlikely(!cpumask_equal(&worker->task->cpus_allowed,
- get_cpu_mask(gcwq->cpu))))
- set_cpus_allowed_ptr(worker->task, get_cpu_mask(gcwq->cpu));
-
spin_lock_irq(&gcwq->lock);

/* DIE can be set only while we're idle, checking here is enough */
@@ -895,7 +923,7 @@ woke_up:
}

worker_leave_idle(worker);
-
+recheck:
/*
* ->scheduled list can only be filled while a worker is
* preparing to process a work or actually processing it.
@@ -908,6 +936,22 @@ woke_up:
list_first_entry(&cwq->worklist,
struct work_struct, entry);

+ /*
+ * The following is a rather inefficient way to close
+ * race window against cpu hotplug operations. Will
+ * be replaced soon.
+ */
+ if (unlikely(!(worker->flags & WORKER_ROGUE) &&
+ !cpumask_equal(&worker->task->cpus_allowed,
+ get_cpu_mask(gcwq->cpu)))) {
+ spin_unlock_irq(&gcwq->lock);
+ set_cpus_allowed_ptr(worker->task,
+ get_cpu_mask(gcwq->cpu));
+ cpu_relax();
+ spin_lock_irq(&gcwq->lock);
+ goto recheck;
+ }
+
if (likely(!(*work_data_bits(work) & WORK_STRUCT_LINKED))) {
/* optimization path, not strictly necessary */
process_one_work(worker, work);
@@ -1806,29 +1850,237 @@ void destroy_workqueue(struct workqueue_struct *wq)
}
EXPORT_SYMBOL_GPL(destroy_workqueue);

+/*
+ * CPU hotplug.
+ *
+ * CPU hotplug is implemented by allowing cwqs to be detached from
+ * CPU, running with unbound workers and allowing them to be
+ * reattached later if the cpu comes back online. A separate thread
+ * is created to govern cwqs in such state and is called the trustee.
+ *
+ * Trustee states and their descriptions.
+ *
+ * START Command state used on startup. On CPU_DOWN_PREPARE, a
+ * new trustee is started with this state.
+ *
+ * IN_CHARGE Once started, trustee will enter this state after
+ * making all existing workers rogue. DOWN_PREPARE waits
+ * for trustee to enter this state. After reaching
+ * IN_CHARGE, trustee tries to execute the pending
+ * worklist until it's empty and the state is set to
+ * BUTCHER, or the state is set to RELEASE.
+ *
+ * BUTCHER Command state which is set by the cpu callback after
+ * the cpu has gone down. Once this state is set trustee
+ * knows that there will be no new works on the worklist
+ * and once the worklist is empty it can proceed to
+ * killing idle workers.
+ *
+ * RELEASE Command state which is set by the cpu callback if the
+ * cpu down has been canceled or it has come online
+ * again. After recognizing this state, trustee stops
+ * trying to drain or butcher and transits to DONE.
+ *
+ * DONE Trustee will enter this state after BUTCHER or RELEASE
+ * is complete.
+ *
+ * trustee CPU draining
+ * took over down complete
+ * START -----------> IN_CHARGE -----------> BUTCHER -----------> DONE
+ * | | ^
+ * | CPU is back online v return workers |
+ * ----------------> RELEASE --------------
+ */
+
+/**
+ * trustee_wait_event_timeout - timed event wait for trustee
+ * @cond: condition to wait for
+ * @timeout: timeout in jiffies
+ *
+ * wait_event_timeout() for trustee to use. Handles locking and
+ * checks for RELEASE request.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times. To be used by trustee.
+ *
+ * RETURNS:
+ * Positive indicating left time if @cond is satisfied, 0 if timed
+ * out, -1 if canceled.
+ */
+#define trustee_wait_event_timeout(cond, timeout) ({ \
+ long __ret = (timeout); \
+ while (!((cond) || (gcwq->trustee_state == TRUSTEE_RELEASE)) && \
+ __ret) { \
+ spin_unlock_irq(&gcwq->lock); \
+ __wait_event_timeout(gcwq->trustee_wait, (cond) || \
+ (gcwq->trustee_state == TRUSTEE_RELEASE), \
+ __ret); \
+ spin_lock_irq(&gcwq->lock); \
+ } \
+ gcwq->trustee_state == TRUSTEE_RELEASE ? -1 : (__ret); \
+})
+
+/**
+ * trustee_wait_event - event wait for trustee
+ * @cond: condition to wait for
+ *
+ * wait_event() for trustee to use. Automatically handles locking and
+ * checks for CANCEL request.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times. To be used by trustee.
+ *
+ * RETURNS:
+ * 0 if @cond is satisfied, -1 if canceled.
+ */
+#define trustee_wait_event(cond) ({ \
+ long __ret1; \
+ __ret1 = trustee_wait_event_timeout(cond, MAX_SCHEDULE_TIMEOUT);\
+ __ret1 < 0 ? -1 : 0; \
+})
+
+static int __cpuinit trustee_thread(void *__gcwq)
+{
+ struct global_cwq *gcwq = __gcwq;
+ struct worker *worker;
+ struct hlist_node *pos;
+ int i;
+
+ BUG_ON(gcwq->cpu != smp_processor_id());
+
+ spin_lock_irq(&gcwq->lock);
+ /*
+ * Make all multithread workers rogue. Trustee must be bound
+ * to the target cpu and can't be cancelled.
+ */
+ BUG_ON(gcwq->cpu != smp_processor_id());
+
+ list_for_each_entry(worker, &gcwq->idle_list, entry)
+ if (!(worker->cwq->wq->flags & WQ_SINGLE_THREAD))
+ worker->flags |= WORKER_ROGUE;
+
+ for_each_busy_worker(worker, i, pos, gcwq)
+ if (!(worker->cwq->wq->flags & WQ_SINGLE_THREAD))
+ worker->flags |= WORKER_ROGUE;
+
+ /*
+ * We're now in charge. Notify and proceed to drain. We need
+ * to keep the gcwq running during the whole CPU down
+ * procedure as other cpu hotunplug callbacks may need to
+ * flush currently running tasks.
+ */
+ gcwq->trustee_state = TRUSTEE_IN_CHARGE;
+ wake_up_all(&gcwq->trustee_wait);
+
+ /*
+ * The original cpu is in the process of dying and may go away
+ * anytime now. When that happens, we and all workers would
+ * be migrated to other cpus. Try draining any left work.
+ * Note that if the gcwq is frozen, there may be frozen works
+ * in freezeable cwqs. Don't declare completion while frozen.
+ */
+ while (gcwq->nr_workers != gcwq->nr_idle ||
+ gcwq->flags & GCWQ_FREEZING ||
+ gcwq->trustee_state == TRUSTEE_IN_CHARGE) {
+ /* give a breather */
+ if (trustee_wait_event_timeout(false, TRUSTEE_COOLDOWN) < 0)
+ break;
+ }
+
+ /* notify completion */
+ gcwq->trustee = NULL;
+ gcwq->trustee_state = TRUSTEE_DONE;
+ wake_up_all(&gcwq->trustee_wait);
+ spin_unlock_irq(&gcwq->lock);
+ return 0;
+}
+
+/**
+ * wait_trustee_state - wait for trustee to enter the specified state
+ * @gcwq: gcwq the trustee of interest belongs to
+ * @state: target state to wait for
+ *
+ * Wait for the trustee to reach @state. DONE is already matched.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times. To be used by cpu_callback.
+ */
+static void __cpuinit wait_trustee_state(struct global_cwq *gcwq, int state)
+{
+ if (!(gcwq->trustee_state == state ||
+ gcwq->trustee_state == TRUSTEE_DONE)) {
+ spin_unlock_irq(&gcwq->lock);
+ __wait_event(gcwq->trustee_wait,
+ gcwq->trustee_state == state ||
+ gcwq->trustee_state == TRUSTEE_DONE);
+ spin_lock_irq(&gcwq->lock);
+ }
+}
+
static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
unsigned long action,
void *hcpu)
{
unsigned int cpu = (unsigned long)hcpu;
- struct cpu_workqueue_struct *cwq;
- struct workqueue_struct *wq;
+ struct global_cwq *gcwq = get_gcwq(cpu);
+ struct task_struct *new_trustee = NULL;
+ struct worker *worker;
+ struct hlist_node *pos;
+ unsigned long flags;
+ int i;

action &= ~CPU_TASKS_FROZEN;

- list_for_each_entry(wq, &workqueues, list) {
- if (wq->flags & WQ_SINGLE_THREAD)
- continue;
+ switch (action) {
+ case CPU_DOWN_PREPARE:
+ new_trustee = kthread_create(trustee_thread, gcwq,
+ "workqueue_trustee/%d\n", cpu);
+ if (IS_ERR(new_trustee))
+ return notifier_from_errno(PTR_ERR(new_trustee));
+ kthread_bind(new_trustee, cpu);
+ }

- cwq = get_cwq(cpu, wq);
+ /* some are called w/ irq disabled, don't disturb irq status */
+ spin_lock_irqsave(&gcwq->lock, flags);

- switch (action) {
- case CPU_POST_DEAD:
- flush_workqueue(wq);
- break;
+ switch (action) {
+ case CPU_DOWN_PREPARE:
+ /* initialize trustee and tell it to acquire the gcwq */
+ BUG_ON(gcwq->trustee || gcwq->trustee_state != TRUSTEE_DONE);
+ gcwq->trustee = new_trustee;
+ gcwq->trustee_state = TRUSTEE_START;
+ wake_up_process(gcwq->trustee);
+ wait_trustee_state(gcwq, TRUSTEE_IN_CHARGE);
+ break;
+
+ case CPU_POST_DEAD:
+ gcwq->trustee_state = TRUSTEE_BUTCHER;
+ break;
+
+ case CPU_DOWN_FAILED:
+ case CPU_ONLINE:
+ if (gcwq->trustee_state != TRUSTEE_DONE) {
+ gcwq->trustee_state = TRUSTEE_RELEASE;
+ wake_up_process(gcwq->trustee);
+ wait_trustee_state(gcwq, TRUSTEE_DONE);
}
+
+ /* clear ROGUE from all multithread workers */
+ list_for_each_entry(worker, &gcwq->idle_list, entry)
+ if (!(worker->cwq->wq->flags & WQ_SINGLE_THREAD))
+ worker->flags &= ~WORKER_ROGUE;
+
+ for_each_busy_worker(worker, i, pos, gcwq)
+ if (!(worker->cwq->wq->flags & WQ_SINGLE_THREAD))
+ worker->flags &= ~WORKER_ROGUE;
+ break;
}

+ spin_unlock_irqrestore(&gcwq->lock, flags);
+
return notifier_from_errno(0);
}

@@ -1906,6 +2158,9 @@ void freeze_workqueues_begin(void)

spin_lock_irq(&gcwq->lock);

+ BUG_ON(gcwq->flags & GCWQ_FREEZING);
+ gcwq->flags |= GCWQ_FREEZING;
+
list_for_each_entry(wq, &workqueues, list) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);

@@ -1989,6 +2244,9 @@ void thaw_workqueues(void)

spin_lock_irq(&gcwq->lock);

+ BUG_ON(!(gcwq->flags & GCWQ_FREEZING));
+ gcwq->flags &= ~GCWQ_FREEZING;
+
list_for_each_entry(wq, &workqueues, list) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);

@@ -2028,7 +2286,7 @@ void __init init_workqueues(void)
__alignof__(unsigned long long));

singlethread_cpu = cpumask_first(cpu_possible_mask);
- hotcpu_notifier(workqueue_cpu_callback, 0);
+ hotcpu_notifier(workqueue_cpu_callback, CPU_PRI_WORKQUEUE);

/* initialize gcwqs */
for_each_possible_cpu(cpu) {
@@ -2042,6 +2300,9 @@ void __init init_workqueues(void)
INIT_HLIST_HEAD(&gcwq->busy_hash[i]);

ida_init(&gcwq->worker_ida);
+
+ gcwq->trustee_state = TRUSTEE_DONE;
+ init_waitqueue_head(&gcwq->trustee_wait);
}

keventd_wq = create_workqueue("events");
--
1.6.4.2

2010-06-14 21:43:54

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 12/30] workqueue: introduce worker

Separate out worker thread related information from struct
cpu_workqueue_struct into struct worker and implement helper functions
to deal with the new struct worker. The only externally visible
change is that workqueue workers are now all named
"kworker/CPUID:WORKERID", where WORKERID is allocated from a per-cpu
ida.

This is in preparation for concurrency managed workqueue, where
multiple shared workers will be available per cpu.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 211 +++++++++++++++++++++++++++++++++++++---------------
1 files changed, 150 insertions(+), 61 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index eeec736..0b0c360 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -33,6 +33,7 @@
#include <linux/kallsyms.h>
#include <linux/debug_locks.h>
#include <linux/lockdep.h>
+#include <linux/idr.h>

/*
* Structure fields follow one of the following exclusion rules.
@@ -46,6 +47,15 @@
* W: workqueue_lock protected.
*/

+struct cpu_workqueue_struct;
+
+struct worker {
+ struct work_struct *current_work; /* L: work being processed */
+ struct task_struct *task; /* I: worker task */
+ struct cpu_workqueue_struct *cwq; /* I: the associated cwq */
+ int id; /* I: worker id */
+};
+
/*
* The per-CPU workqueue (if single thread, we always use the first
* possible cpu). The lower WORK_STRUCT_FLAG_BITS of
@@ -58,15 +68,14 @@ struct cpu_workqueue_struct {

struct list_head worklist;
wait_queue_head_t more_work;
- struct work_struct *current_work;
unsigned int cpu;
+ struct worker *worker;

struct workqueue_struct *wq; /* I: the owning workqueue */
int work_color; /* L: current color */
int flush_color; /* L: flushing color */
int nr_in_flight[WORK_NR_COLORS];
/* L: nr of in_flight works */
- struct task_struct *thread;
};

/*
@@ -214,6 +223,9 @@ static inline void debug_work_deactivate(struct work_struct *work) { }
/* Serializes the accesses to the list of workqueues. */
static DEFINE_SPINLOCK(workqueue_lock);
static LIST_HEAD(workqueues);
+static DEFINE_PER_CPU(struct ida, worker_ida);
+
+static int worker_thread(void *__worker);

static int singlethread_cpu __read_mostly;

@@ -428,6 +440,105 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
}
EXPORT_SYMBOL_GPL(queue_delayed_work_on);

+static struct worker *alloc_worker(void)
+{
+ struct worker *worker;
+
+ worker = kzalloc(sizeof(*worker), GFP_KERNEL);
+ return worker;
+}
+
+/**
+ * create_worker - create a new workqueue worker
+ * @cwq: cwq the new worker will belong to
+ * @bind: whether to set affinity to @cpu or not
+ *
+ * Create a new worker which is bound to @cwq. The returned worker
+ * can be started by calling start_worker() or destroyed using
+ * destroy_worker().
+ *
+ * CONTEXT:
+ * Might sleep. Does GFP_KERNEL allocations.
+ *
+ * RETURNS:
+ * Pointer to the newly created worker.
+ */
+static struct worker *create_worker(struct cpu_workqueue_struct *cwq, bool bind)
+{
+ int id = -1;
+ struct worker *worker = NULL;
+
+ spin_lock(&workqueue_lock);
+ while (ida_get_new(&per_cpu(worker_ida, cwq->cpu), &id)) {
+ spin_unlock(&workqueue_lock);
+ if (!ida_pre_get(&per_cpu(worker_ida, cwq->cpu), GFP_KERNEL))
+ goto fail;
+ spin_lock(&workqueue_lock);
+ }
+ spin_unlock(&workqueue_lock);
+
+ worker = alloc_worker();
+ if (!worker)
+ goto fail;
+
+ worker->cwq = cwq;
+ worker->id = id;
+
+ worker->task = kthread_create(worker_thread, worker, "kworker/%u:%d",
+ cwq->cpu, id);
+ if (IS_ERR(worker->task))
+ goto fail;
+
+ if (bind)
+ kthread_bind(worker->task, cwq->cpu);
+
+ return worker;
+fail:
+ if (id >= 0) {
+ spin_lock(&workqueue_lock);
+ ida_remove(&per_cpu(worker_ida, cwq->cpu), id);
+ spin_unlock(&workqueue_lock);
+ }
+ kfree(worker);
+ return NULL;
+}
+
+/**
+ * start_worker - start a newly created worker
+ * @worker: worker to start
+ *
+ * Start @worker.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
+static void start_worker(struct worker *worker)
+{
+ wake_up_process(worker->task);
+}
+
+/**
+ * destroy_worker - destroy a workqueue worker
+ * @worker: worker to be destroyed
+ *
+ * Destroy @worker.
+ */
+static void destroy_worker(struct worker *worker)
+{
+ int cpu = worker->cwq->cpu;
+ int id = worker->id;
+
+ /* sanity check frenzy */
+ BUG_ON(worker->current_work);
+
+ kthread_stop(worker->task);
+ kfree(worker);
+
+ spin_lock(&workqueue_lock);
+ ida_remove(&per_cpu(worker_ida, cpu), id);
+ spin_unlock(&workqueue_lock);
+}
+
/**
* cwq_dec_nr_in_flight - decrement cwq's nr_in_flight
* @cwq: cwq of interest
@@ -468,7 +579,7 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)

/**
* process_one_work - process single work
- * @cwq: cwq to process work for
+ * @worker: self
* @work: work to process
*
* Process @work. This function contains all the logics necessary to
@@ -480,9 +591,9 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)
* CONTEXT:
* spin_lock_irq(cwq->lock) which is released and regrabbed.
*/
-static void process_one_work(struct cpu_workqueue_struct *cwq,
- struct work_struct *work)
+static void process_one_work(struct worker *worker, struct work_struct *work)
{
+ struct cpu_workqueue_struct *cwq = worker->cwq;
work_func_t f = work->func;
int work_color;
#ifdef CONFIG_LOCKDEP
@@ -497,7 +608,7 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
#endif
/* claim and process */
debug_work_deactivate(work);
- cwq->current_work = work;
+ worker->current_work = work;
work_color = get_work_color(work);
list_del_init(&work->entry);

@@ -524,30 +635,33 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
spin_lock_irq(&cwq->lock);

/* we're done with it, release */
- cwq->current_work = NULL;
+ worker->current_work = NULL;
cwq_dec_nr_in_flight(cwq, work_color);
}

-static void run_workqueue(struct cpu_workqueue_struct *cwq)
+static void run_workqueue(struct worker *worker)
{
+ struct cpu_workqueue_struct *cwq = worker->cwq;
+
spin_lock_irq(&cwq->lock);
while (!list_empty(&cwq->worklist)) {
struct work_struct *work = list_entry(cwq->worklist.next,
struct work_struct, entry);
- process_one_work(cwq, work);
+ process_one_work(worker, work);
}
spin_unlock_irq(&cwq->lock);
}

/**
* worker_thread - the worker thread function
- * @__cwq: cwq to serve
+ * @__worker: self
*
* The cwq worker thread function.
*/
-static int worker_thread(void *__cwq)
+static int worker_thread(void *__worker)
{
- struct cpu_workqueue_struct *cwq = __cwq;
+ struct worker *worker = __worker;
+ struct cpu_workqueue_struct *cwq = worker->cwq;
DEFINE_WAIT(wait);

if (cwq->wq->flags & WQ_FREEZEABLE)
@@ -566,11 +680,11 @@ static int worker_thread(void *__cwq)
if (kthread_should_stop())
break;

- if (unlikely(!cpumask_equal(&cwq->thread->cpus_allowed,
+ if (unlikely(!cpumask_equal(&worker->task->cpus_allowed,
get_cpu_mask(cwq->cpu))))
- set_cpus_allowed_ptr(cwq->thread,
+ set_cpus_allowed_ptr(worker->task,
get_cpu_mask(cwq->cpu));
- run_workqueue(cwq);
+ run_workqueue(worker);
}

return 0;
@@ -873,7 +987,7 @@ int flush_work(struct work_struct *work)
goto already_gone;
prev = &work->entry;
} else {
- if (cwq->current_work != work)
+ if (!cwq->worker || cwq->worker->current_work != work)
goto already_gone;
prev = &cwq->worklist;
}
@@ -937,7 +1051,7 @@ static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq,
int running = 0;

spin_lock_irq(&cwq->lock);
- if (unlikely(cwq->current_work == work)) {
+ if (unlikely(cwq->worker && cwq->worker->current_work == work)) {
insert_wq_barrier(cwq, &barr, cwq->worklist.next);
running = 1;
}
@@ -1225,7 +1339,7 @@ int current_is_keventd(void)
BUG_ON(!keventd_wq);

cwq = get_cwq(cpu, keventd_wq);
- if (current == cwq->thread)
+ if (current == cwq->worker->task)
ret = 1;

return ret;
@@ -1273,38 +1387,6 @@ static void free_cwqs(struct cpu_workqueue_struct *cwqs)
#endif
}

-static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
-{
- struct workqueue_struct *wq = cwq->wq;
- struct task_struct *p;
-
- p = kthread_create(worker_thread, cwq, "%s/%d", wq->name, cpu);
- /*
- * Nobody can add the work_struct to this cwq,
- * if (caller is __create_workqueue)
- * nobody should see this wq
- * else // caller is CPU_UP_PREPARE
- * cpu is not on cpu_online_map
- * so we can abort safely.
- */
- if (IS_ERR(p))
- return PTR_ERR(p);
- cwq->thread = p;
-
- return 0;
-}
-
-static void start_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
-{
- struct task_struct *p = cwq->thread;
-
- if (p != NULL) {
- if (cpu >= 0)
- kthread_bind(p, cpu);
- wake_up_process(p);
- }
-}
-
struct workqueue_struct *__create_workqueue_key(const char *name,
unsigned int flags,
struct lock_class_key *key,
@@ -1312,7 +1394,8 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
{
bool singlethread = flags & WQ_SINGLE_THREAD;
struct workqueue_struct *wq;
- int err = 0, cpu;
+ bool failed = false;
+ unsigned int cpu;

wq = kzalloc(sizeof(*wq), GFP_KERNEL);
if (!wq)
@@ -1342,20 +1425,21 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);

BUG_ON((unsigned long)cwq & WORK_STRUCT_FLAG_MASK);
- cwq->wq = wq;
cwq->cpu = cpu;
+ cwq->wq = wq;
cwq->flush_color = -1;
spin_lock_init(&cwq->lock);
INIT_LIST_HEAD(&cwq->worklist);
init_waitqueue_head(&cwq->more_work);

- if (err)
+ if (failed)
continue;
- err = create_workqueue_thread(cwq, cpu);
- if (cpu_online(cpu) && !singlethread)
- start_workqueue_thread(cwq, cpu);
+ cwq->worker = create_worker(cwq,
+ cpu_online(cpu) && !singlethread);
+ if (cwq->worker)
+ start_worker(cwq->worker);
else
- start_workqueue_thread(cwq, -1);
+ failed = true;
}

spin_lock(&workqueue_lock);
@@ -1364,7 +1448,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,

cpu_maps_update_done();

- if (err) {
+ if (failed) {
destroy_workqueue(wq);
wq = NULL;
}
@@ -1400,9 +1484,9 @@ void destroy_workqueue(struct workqueue_struct *wq)
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
int i;

- if (cwq->thread) {
- kthread_stop(cwq->thread);
- cwq->thread = NULL;
+ if (cwq->worker) {
+ destroy_worker(cwq->worker);
+ cwq->worker = NULL;
}

for (i = 0; i < WORK_NR_COLORS; i++)
@@ -1489,6 +1573,8 @@ EXPORT_SYMBOL_GPL(work_on_cpu);

void __init init_workqueues(void)
{
+ unsigned int cpu;
+
/*
* cwqs are forced aligned according to WORK_STRUCT_FLAG_BITS.
* Make sure that the alignment isn't lower than that of
@@ -1497,6 +1583,9 @@ void __init init_workqueues(void)
BUILD_BUG_ON(__alignof__(struct cpu_workqueue_struct) <
__alignof__(unsigned long long));

+ for_each_possible_cpu(cpu)
+ ida_init(&per_cpu(worker_ida, cpu));
+
singlethread_cpu = cpumask_first(cpu_possible_mask);
hotcpu_notifier(workqueue_cpu_callback, 0);
keventd_wq = create_workqueue("events");
--
1.6.4.2

2010-06-14 21:39:18

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 14/30] workqueue: implement per-cwq active work limit

Add cwq->nr_active, cwq->max_active and cwq->delayed_works. nr_active
counts the number of active works per cwq. A work is active if it's
flushable (colored) and is on the cwq's worklist. If nr_active reaches
max_active, new works are queued on cwq->delayed_works and activated
later as works on the cwq complete and decrement nr_active.

cwq->max_active can be specified via the new @max_active parameter to
__create_workqueue() and is set to 1 for all workqueues for now. As
each cwq still has only a single worker, this double queueing doesn't
cause any behavior difference visible to its users.

This will be used to reimplement freeze/thaw and to implement the
shared worker pool.
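
For illustration, the queueing decision reduces to the condensed sketch
below; locking, work colors and the linked-works handling are omitted,
and the sketch_* helpers are made-up names rather than part of the
patch (the real code is __queue_work() and cwq_dec_nr_in_flight() in
the diff):

/* condensed model of the per-cwq active limit; locking elided */
static void sketch_queue_work(struct cpu_workqueue_struct *cwq,
			      struct work_struct *work)
{
	if (cwq->nr_active < cwq->max_active) {
		/* under the limit: goes straight onto the worklist */
		cwq->nr_active++;
		list_add_tail(&work->entry, &cwq->worklist);
	} else {
		/* over the limit: parked until an active work completes */
		list_add_tail(&work->entry, &cwq->delayed_works);
	}
}

static void sketch_work_completed(struct cpu_workqueue_struct *cwq)
{
	cwq->nr_active--;
	/* a slot opened up, activate the oldest delayed work if any */
	if (!list_empty(&cwq->delayed_works) &&
	    cwq->nr_active < cwq->max_active) {
		list_move_tail(cwq->delayed_works.next, &cwq->worklist);
		cwq->nr_active++;
	}
}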

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 18 +++++++++---------
kernel/workqueue.c | 39 +++++++++++++++++++++++++++++++++++++--
2 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 4f4fdba..eb753b7 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -225,11 +225,11 @@ enum {
};

extern struct workqueue_struct *
-__create_workqueue_key(const char *name, unsigned int flags,
+__create_workqueue_key(const char *name, unsigned int flags, int max_active,
struct lock_class_key *key, const char *lock_name);

#ifdef CONFIG_LOCKDEP
-#define __create_workqueue(name, flags) \
+#define __create_workqueue(name, flags, max_active) \
({ \
static struct lock_class_key __key; \
const char *__lock_name; \
@@ -239,20 +239,20 @@ __create_workqueue_key(const char *name, unsigned int flags,
else \
__lock_name = #name; \
\
- __create_workqueue_key((name), (flags), &__key, \
- __lock_name); \
+ __create_workqueue_key((name), (flags), (max_active), \
+ &__key, __lock_name); \
})
#else
-#define __create_workqueue(name, flags) \
- __create_workqueue_key((name), (flags), NULL, NULL)
+#define __create_workqueue(name, flags, max_active) \
+ __create_workqueue_key((name), (flags), (max_active), NULL, NULL)
#endif

#define create_workqueue(name) \
- __create_workqueue((name), 0)
+ __create_workqueue((name), 0, 1)
#define create_freezeable_workqueue(name) \
- __create_workqueue((name), WQ_FREEZEABLE | WQ_SINGLE_THREAD)
+ __create_workqueue((name), WQ_FREEZEABLE | WQ_SINGLE_THREAD, 1)
#define create_singlethread_workqueue(name) \
- __create_workqueue((name), WQ_SINGLE_THREAD)
+ __create_workqueue((name), WQ_SINGLE_THREAD, 1)

extern void destroy_workqueue(struct workqueue_struct *wq);

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 74b399b..101b92e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -77,6 +77,9 @@ struct cpu_workqueue_struct {
int flush_color; /* L: flushing color */
int nr_in_flight[WORK_NR_COLORS];
/* L: nr of in_flight works */
+ int nr_active; /* L: nr of active works */
+ int max_active; /* I: max active works */
+ struct list_head delayed_works; /* L: delayed works */
};

/*
@@ -321,14 +324,24 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
struct work_struct *work)
{
struct cpu_workqueue_struct *cwq = target_cwq(cpu, wq);
+ struct list_head *worklist;
unsigned long flags;

debug_work_activate(work);
+
spin_lock_irqsave(&cwq->lock, flags);
BUG_ON(!list_empty(&work->entry));
+
cwq->nr_in_flight[cwq->work_color]++;
- insert_work(cwq, work, &cwq->worklist,
- work_color_to_flags(cwq->work_color));
+
+ if (likely(cwq->nr_active < cwq->max_active)) {
+ cwq->nr_active++;
+ worklist = &cwq->worklist;
+ } else
+ worklist = &cwq->delayed_works;
+
+ insert_work(cwq, work, worklist, work_color_to_flags(cwq->work_color));
+
spin_unlock_irqrestore(&cwq->lock, flags);
}

@@ -584,6 +597,15 @@ static void move_linked_works(struct work_struct *work, struct list_head *head,
*nextp = n;
}

+static void cwq_activate_first_delayed(struct cpu_workqueue_struct *cwq)
+{
+ struct work_struct *work = list_first_entry(&cwq->delayed_works,
+ struct work_struct, entry);
+
+ move_linked_works(work, &cwq->worklist, NULL);
+ cwq->nr_active++;
+}
+
/**
* cwq_dec_nr_in_flight - decrement cwq's nr_in_flight
* @cwq: cwq of interest
@@ -602,6 +624,12 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)
return;

cwq->nr_in_flight[color]--;
+ cwq->nr_active--;
+
+ /* one down, submit a delayed one */
+ if (!list_empty(&cwq->delayed_works) &&
+ cwq->nr_active < cwq->max_active)
+ cwq_activate_first_delayed(cwq);

/* is flush in progress and are we at the flushing tip? */
if (likely(cwq->flush_color != color))
@@ -1499,6 +1527,7 @@ static void free_cwqs(struct cpu_workqueue_struct *cwqs)

struct workqueue_struct *__create_workqueue_key(const char *name,
unsigned int flags,
+ int max_active,
struct lock_class_key *key,
const char *lock_name)
{
@@ -1507,6 +1536,8 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
bool failed = false;
unsigned int cpu;

+ max_active = clamp_val(max_active, 1, INT_MAX);
+
wq = kzalloc(sizeof(*wq), GFP_KERNEL);
if (!wq)
goto err;
@@ -1538,8 +1569,10 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
cwq->cpu = cpu;
cwq->wq = wq;
cwq->flush_color = -1;
+ cwq->max_active = max_active;
spin_lock_init(&cwq->lock);
INIT_LIST_HEAD(&cwq->worklist);
+ INIT_LIST_HEAD(&cwq->delayed_works);
init_waitqueue_head(&cwq->more_work);

if (failed)
@@ -1601,6 +1634,8 @@ void destroy_workqueue(struct workqueue_struct *wq)

for (i = 0; i < WORK_NR_COLORS; i++)
BUG_ON(cwq->nr_in_flight[i]);
+ BUG_ON(cwq->nr_active);
+ BUG_ON(!list_empty(&cwq->delayed_works));
}

free_cwqs(wq->cpu_wq);
--
1.6.4.2

2010-06-14 21:43:52

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 24/30] workqueue: implement concurrency managed dynamic worker pool

Instead of creating a worker for each cwq and putting it into the
shared pool, manage per-cpu workers dynamically.

Works aren't supposed to be cpu cycle hogs, so maintaining just enough
concurrency to keep work processing from stalling for lack of execution
context is optimal. gcwq keeps the number of concurrently running
workers to the minimum necessary but no less. As long as one or more
workers are running on the cpu, no new worker is scheduled, so that
works can be processed in batches as much as possible; but when the
last running worker blocks, gcwq immediately schedules a new worker so
that the cpu doesn't sit idle while there are works to be processed.
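
Stripped of flag handling and locking rules, the scheduler hooks that
implement this behave roughly as follows (a simplified sketch, not the
exact wq_worker_waking_up()/wq_worker_sleeping() added below; the
sketch_* names are illustrative only):

/* nr_running counts workers currently runnable on this cpu */
static void sketch_worker_waking_up(unsigned int cpu)
{
	atomic_inc(get_gcwq_nr_running(cpu));
}

static struct task_struct *sketch_worker_sleeping(unsigned int cpu)
{
	struct global_cwq *gcwq = get_gcwq(cpu);

	/* last running worker is blocking while works are pending */
	if (atomic_dec_and_test(get_gcwq_nr_running(cpu)) &&
	    !list_empty(&gcwq->worklist)) {
		struct worker *first = first_worker(gcwq);

		return first ? first->task : NULL;
	}
	return NULL;
}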

gcwq always keeps at least one idle worker around. When a new worker
is necessary and that worker is the last idle one, the worker assumes
the role of "manager" and manages the worker pool - i.e. it creates
another worker. Forward progress is guaranteed by having dedicated
rescue workers for workqueues which may be needed while creating a new
worker. When the manager has trouble creating a new worker, the mayday
timer activates and rescue workers are summoned to the cpu to execute
works which might be necessary to create new workers.
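
Condensed, the create-with-mayday-backstop loop looks like this (a
simplified sketch of maybe_create_worker() from the diff below;
gcwq->lock handling and the recheck on exit are elided):

static void sketch_create_worker_with_mayday(struct global_cwq *gcwq)
{
	/* if creation stalls, the timer summons rescuers */
	mod_timer(&gcwq->mayday_timer,
		  jiffies + MAYDAY_INITIAL_TIMEOUT);

	while (need_to_create_worker(gcwq)) {
		struct worker *worker = create_worker(gcwq, true);

		if (worker) {
			start_worker(worker);
			break;
		}
		/*
		 * Creation failed, most likely memory pressure.
		 * Sleep briefly; meanwhile rescuers execute works
		 * which may be blocking the allocation.
		 */
		__set_current_state(TASK_INTERRUPTIBLE);
		schedule_timeout(CREATE_COOLDOWN);
	}

	del_timer_sync(&gcwq->mayday_timer);
}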

The trustee is expanded to serve the manager role while a CPU is being
taken down and stays down. As no new works are supposed to be queued
on a dead cpu, it just needs to drain all the existing ones. The
trustee keeps trying to create new workers and summon rescuers as long
as there are pending works. If the CPU is brought back up while the
trustee is still trying to drain the gcwq from the previous offlining,
the trustee kills all idle workers, tells the workers which are still
busy to rebind to the cpu, and passes control over to the gcwq, which
assumes the manager role as necessary.

The concurrency managed worker pool reduces the number of workers
drastically. Only the workers necessary to keep processing going are
created and kept. It also reduces cache footprint by avoiding
unnecessary context switches between different workers.

Please note that this patch does not increase max_active of any
workqueue. All workqueues can still only process one work per cpu.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 8 +-
kernel/workqueue.c | 908 ++++++++++++++++++++++++++++++++++++++++-----
kernel/workqueue_sched.h | 13 +-
3 files changed, 816 insertions(+), 113 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 07cf5e5..b8f4ec4 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -226,6 +226,7 @@ enum {
WQ_FREEZEABLE = 1 << 0, /* freeze during suspend */
WQ_SINGLE_CPU = 1 << 1, /* only single cpu at a time */
WQ_NON_REENTRANT = 1 << 2, /* guarantee non-reentrance */
+ WQ_RESCUER = 1 << 3, /* has an rescue worker */
};

extern struct workqueue_struct *
@@ -252,11 +253,12 @@ __create_workqueue_key(const char *name, unsigned int flags, int max_active,
#endif

#define create_workqueue(name) \
- __create_workqueue((name), 0, 1)
+ __create_workqueue((name), WQ_RESCUER, 1)
#define create_freezeable_workqueue(name) \
- __create_workqueue((name), WQ_FREEZEABLE | WQ_SINGLE_CPU, 1)
+ __create_workqueue((name), \
+ WQ_FREEZEABLE | WQ_SINGLE_CPU | WQ_RESCUER, 1)
#define create_singlethread_workqueue(name) \
- __create_workqueue((name), WQ_SINGLE_CPU, 1)
+ __create_workqueue((name), WQ_SINGLE_CPU | WQ_RESCUER, 1)

extern void destroy_workqueue(struct workqueue_struct *wq);

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e0a7609..dd8c38b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -34,17 +34,25 @@
#include <linux/debug_locks.h>
#include <linux/lockdep.h>
#include <linux/idr.h>
-#include <linux/delay.h>
+
+#include "workqueue_sched.h"

enum {
/* global_cwq flags */
+ GCWQ_MANAGE_WORKERS = 1 << 0, /* need to manage workers */
+ GCWQ_MANAGING_WORKERS = 1 << 1, /* managing workers */
+ GCWQ_DISASSOCIATED = 1 << 2, /* cpu can't serve workers */
GCWQ_FREEZING = 1 << 3, /* freeze in progress */

/* worker flags */
WORKER_STARTED = 1 << 0, /* started */
WORKER_DIE = 1 << 1, /* die die die */
WORKER_IDLE = 1 << 2, /* is idle */
+ WORKER_PREP = 1 << 3, /* preparing to run works */
WORKER_ROGUE = 1 << 4, /* not bound to any cpu */
+ WORKER_REBIND = 1 << 5, /* mom is home, come back */
+
+ WORKER_IGN_RUNNING = WORKER_PREP | WORKER_ROGUE | WORKER_REBIND,

/* gcwq->trustee_state */
TRUSTEE_START = 0, /* start */
@@ -57,7 +65,19 @@ enum {
BUSY_WORKER_HASH_SIZE = 1 << BUSY_WORKER_HASH_ORDER,
BUSY_WORKER_HASH_MASK = BUSY_WORKER_HASH_SIZE - 1,

+ MAX_IDLE_WORKERS_RATIO = 4, /* 1/4 of busy can be idle */
+ IDLE_WORKER_TIMEOUT = 300 * HZ, /* keep idle ones for 5 mins */
+
+ MAYDAY_INITIAL_TIMEOUT = HZ / 100, /* call for help after 10ms */
+ MAYDAY_INTERVAL = HZ / 10, /* and then every 100ms */
+ CREATE_COOLDOWN = HZ, /* time to breath after fail */
TRUSTEE_COOLDOWN = HZ / 10, /* for trustee draining */
+
+ /*
+ * Rescue workers are used only on emergencies and shared by
+ * all cpus. Give -20.
+ */
+ RESCUER_NICE_LEVEL = -20,
};

/*
@@ -65,8 +85,16 @@ enum {
*
* I: Set during initialization and read-only afterwards.
*
+ * P: Preemption protected. Disabling preemption is enough and should
+ * only be modified and accessed from the local cpu.
+ *
* L: gcwq->lock protected. Access with gcwq->lock held.
*
+ * X: During normal operation, modification requires gcwq->lock and
+ * should be done only from local cpu. Either disabling preemption
+ * on local cpu or grabbing gcwq->lock is enough for read access.
+ * While trustee is in charge, it's identical to L.
+ *
* F: wq->flush_mutex protected.
*
* W: workqueue_lock protected.
@@ -74,6 +102,10 @@ enum {

struct global_cwq;

+/*
+ * The poor guys doing the actual heavy lifting. All on-duty workers
+ * are either serving the manager role, on idle list or on busy hash.
+ */
struct worker {
/* on idle list while idle, on busy hash table while busy */
union {
@@ -86,12 +118,17 @@ struct worker {
struct list_head scheduled; /* L: scheduled works */
struct task_struct *task; /* I: worker task */
struct global_cwq *gcwq; /* I: the associated gcwq */
- unsigned int flags; /* L: flags */
+ /* 64 bytes boundary on 64bit, 32 on 32bit */
+ unsigned long last_active; /* L: last active timestamp */
+ unsigned int flags; /* ?: flags */
int id; /* I: worker id */
+ struct work_struct rebind_work; /* L: rebind worker to cpu */
};

/*
- * Global per-cpu workqueue.
+ * Global per-cpu workqueue. There's one and only one for each cpu
+ * and all works are queued and processed here regardless of their
+ * target workqueues.
*/
struct global_cwq {
spinlock_t lock; /* the gcwq lock */
@@ -103,15 +140,19 @@ struct global_cwq {
int nr_idle; /* L: currently idle ones */

/* workers are chained either in the idle_list or busy_hash */
- struct list_head idle_list; /* L: list of idle workers */
+ struct list_head idle_list; /* ?: list of idle workers */
struct hlist_head busy_hash[BUSY_WORKER_HASH_SIZE];
/* L: hash of busy workers */

+ struct timer_list idle_timer; /* L: worker idle timeout */
+ struct timer_list mayday_timer; /* L: SOS timer for dworkers */
+
struct ida worker_ida; /* L: for worker IDs */

struct task_struct *trustee; /* L: for gcwq shutdown */
unsigned int trustee_state; /* L: trustee state */
wait_queue_head_t trustee_wait; /* trustee wait */
+ struct worker *first_idle; /* L: first idle worker */
} ____cacheline_aligned_in_smp;

/*
@@ -121,7 +162,6 @@ struct global_cwq {
*/
struct cpu_workqueue_struct {
struct global_cwq *gcwq; /* I: the associated gcwq */
- struct worker *worker;
struct workqueue_struct *wq; /* I: the owning workqueue */
int work_color; /* L: current color */
int flush_color; /* L: flushing color */
@@ -160,6 +200,9 @@ struct workqueue_struct {

unsigned long single_cpu; /* cpu for single cpu wq */

+ cpumask_var_t mayday_mask; /* cpus requesting rescue */
+ struct worker *rescuer; /* I: rescue worker */
+
int saved_max_active; /* I: saved cwq max_active */
const char *name; /* I: workqueue name */
#ifdef CONFIG_LOCKDEP
@@ -286,7 +329,13 @@ static DEFINE_SPINLOCK(workqueue_lock);
static LIST_HEAD(workqueues);
static bool workqueue_freezing; /* W: have wqs started freezing? */

+/*
+ * The almighty global cpu workqueues. nr_running is the only field
+ * which is expected to be used frequently by other cpus via
+ * try_to_wake_up(). Put it in a separate cacheline.
+ */
static DEFINE_PER_CPU(struct global_cwq, global_cwq);
+static DEFINE_PER_CPU_SHARED_ALIGNED(atomic_t, gcwq_nr_running);

static int worker_thread(void *__worker);

@@ -295,6 +344,11 @@ static struct global_cwq *get_gcwq(unsigned int cpu)
return &per_cpu(global_cwq, cpu);
}

+static atomic_t *get_gcwq_nr_running(unsigned int cpu)
+{
+ return &per_cpu(gcwq_nr_running, cpu);
+}
+
static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
struct workqueue_struct *wq)
{
@@ -385,6 +439,63 @@ static struct global_cwq *get_work_gcwq(struct work_struct *work)
return get_gcwq(cpu);
}

+/*
+ * Policy functions. These define the policies on how the global
+ * worker pool is managed. Unless noted otherwise, these functions
+ * assume that they're being called with gcwq->lock held.
+ */
+
+/*
+ * Need to wake up a worker? Called from anything but currently
+ * running workers.
+ */
+static bool need_more_worker(struct global_cwq *gcwq)
+{
+ atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+
+ return !list_empty(&gcwq->worklist) && !atomic_read(nr_running);
+}
+
+/* Can I start working? Called from busy but !running workers. */
+static bool may_start_working(struct global_cwq *gcwq)
+{
+ return gcwq->nr_idle;
+}
+
+/* Do I need to keep working? Called from currently running workers. */
+static bool keep_working(struct global_cwq *gcwq)
+{
+ atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+
+ return !list_empty(&gcwq->worklist) && atomic_read(nr_running) <= 1;
+}
+
+/* Do we need a new worker? Called from manager. */
+static bool need_to_create_worker(struct global_cwq *gcwq)
+{
+ return need_more_worker(gcwq) && !may_start_working(gcwq);
+}
+
+/* Do I need to be the manager? */
+static bool need_to_manage_workers(struct global_cwq *gcwq)
+{
+ return need_to_create_worker(gcwq) || gcwq->flags & GCWQ_MANAGE_WORKERS;
+}
+
+/* Do we have too many workers and should some go away? */
+static bool too_many_workers(struct global_cwq *gcwq)
+{
+ bool managing = gcwq->flags & GCWQ_MANAGING_WORKERS;
+ int nr_idle = gcwq->nr_idle + managing; /* manager is considered idle */
+ int nr_busy = gcwq->nr_workers - nr_idle;
+
+ return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
+}
+
+/*
+ * Wake up functions.
+ */
+
/* Return the first worker. Safe with preemption disabled */
static struct worker *first_worker(struct global_cwq *gcwq)
{
@@ -412,6 +523,63 @@ static void wake_up_worker(struct global_cwq *gcwq)
}

/**
+ * wq_worker_waking_up - a worker is waking up
+ * @task: task waking up
+ * @cpu: CPU @task is waking up to
+ *
+ * This function is called during try_to_wake_up() when a worker is
+ * being awoken.
+ *
+ * CONTEXT:
+ * spin_lock_irq(rq->lock)
+ */
+void wq_worker_waking_up(struct task_struct *task, unsigned int cpu)
+{
+ struct worker *worker = kthread_data(task);
+
+ if (likely(!(worker->flags & WORKER_IGN_RUNNING)))
+ atomic_inc(get_gcwq_nr_running(cpu));
+}
+
+/**
+ * wq_worker_sleeping - a worker is going to sleep
+ * @task: task going to sleep
+ * @cpu: CPU in question, must be the current CPU number
+ *
+ * This function is called during schedule() when a busy worker is
+ * going to sleep. Worker on the same cpu can be woken up by
+ * returning pointer to its task.
+ *
+ * CONTEXT:
+ * spin_lock_irq(rq->lock)
+ *
+ * RETURNS:
+ * Worker task on @cpu to wake up, %NULL if none.
+ */
+struct task_struct *wq_worker_sleeping(struct task_struct *task,
+ unsigned int cpu)
+{
+ struct worker *worker = kthread_data(task), *to_wakeup = NULL;
+ struct global_cwq *gcwq = get_gcwq(cpu);
+ atomic_t *nr_running = get_gcwq_nr_running(cpu);
+
+ if (unlikely(worker->flags & WORKER_IGN_RUNNING))
+ return NULL;
+
+ /* this can only happen on the local cpu */
+ BUG_ON(cpu != raw_smp_processor_id());
+
+ /*
+ * The counterpart of the following dec_and_test, implied mb,
+ * worklist not empty test sequence is in insert_work().
+ * Please read comment there.
+ */
+ if (atomic_dec_and_test(nr_running) && !list_empty(&gcwq->worklist))
+ to_wakeup = first_worker(gcwq);
+ return to_wakeup ? to_wakeup->task : NULL;
+}
+
+/**
* busy_worker_head - return the busy hash head for a work
* @gcwq: gcwq of interest
* @work: work to be hashed
@@ -508,6 +676,8 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
struct work_struct *work, struct list_head *head,
unsigned int extra_flags)
{
+ struct global_cwq *gcwq = cwq->gcwq;
+
/* we own @work, set data and link */
set_work_cwq(work, cwq, extra_flags);

@@ -518,7 +688,16 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
smp_wmb();

list_add_tail(&work->entry, head);
- wake_up_worker(cwq->gcwq);
+
+ /*
+ * Ensure either worker_sched_deactivated() sees the above
+ * list_add_tail() or we see zero nr_running to avoid workers
+ * lying around lazily while there are works to be processed.
+ */
+ smp_mb();
+
+ if (!atomic_read(get_gcwq_nr_running(gcwq->cpu)))
+ wake_up_worker(gcwq);
}

/**
@@ -778,11 +957,16 @@ static void worker_enter_idle(struct worker *worker)

worker->flags |= WORKER_IDLE;
gcwq->nr_idle++;
+ worker->last_active = jiffies;

/* idle_list is LIFO */
list_add(&worker->entry, &gcwq->idle_list);

- if (unlikely(worker->flags & WORKER_ROGUE))
+ if (likely(!(worker->flags & WORKER_ROGUE))) {
+ if (too_many_workers(gcwq) && !timer_pending(&gcwq->idle_timer))
+ mod_timer(&gcwq->idle_timer,
+ jiffies + IDLE_WORKER_TIMEOUT);
+ } else
wake_up_all(&gcwq->trustee_wait);
}

@@ -805,6 +989,89 @@ static void worker_leave_idle(struct worker *worker)
list_del_init(&worker->entry);
}

+/**
+ * worker_maybe_bind_and_lock - bind worker to its cpu if possible and lock gcwq
+ * @worker: self
+ *
+ * Works which are scheduled while the cpu is online must at least be
+ * scheduled to a worker which is bound to the cpu so that if they are
+ * flushed from cpu callbacks while cpu is going down, they are
+ * guaranteed to execute on the cpu.
+ *
+ * This function is to be used by rogue workers and rescuers to bind
+ * themselves to the target cpu and may race with cpu going down or
+ * coming online. kthread_bind() can't be used because it may put the
+ * worker to already dead cpu and set_cpus_allowed_ptr() can't be used
+ * verbatim as it's best effort and blocking and gcwq may be
+ * [dis]associated in the meantime.
+ *
+ * This function tries set_cpus_allowed() and locks gcwq and verifies
+ * the binding against GCWQ_DISASSOCIATED which is set during
+ * CPU_DYING and cleared during CPU_ONLINE, so if the worker enters
+ * idle state or fetches works without dropping lock, it can guarantee
+ * the scheduling requirement described in the first paragraph.
+ *
+ * CONTEXT:
+ * Might sleep. Called without any lock but returns with gcwq->lock
+ * held.
+ *
+ * RETURNS:
+ * %true if the associated gcwq is online (@worker is successfully
+ * bound), %false if offline.
+ */
+static bool worker_maybe_bind_and_lock(struct worker *worker)
+{
+ struct global_cwq *gcwq = worker->gcwq;
+ struct task_struct *task = worker->task;
+
+ while (true) {
+ /*
+ * The following call may fail, succeed or succeed
+ * without actually migrating the task to the cpu if
+ * it races with cpu hotunplug operation. Verify
+ * against GCWQ_DISASSOCIATED.
+ */
+ set_cpus_allowed_ptr(task, get_cpu_mask(gcwq->cpu));
+
+ spin_lock_irq(&gcwq->lock);
+ if (gcwq->flags & GCWQ_DISASSOCIATED)
+ return false;
+ if (task_cpu(task) == gcwq->cpu &&
+ cpumask_equal(&current->cpus_allowed,
+ get_cpu_mask(gcwq->cpu)))
+ return true;
+ spin_unlock_irq(&gcwq->lock);
+
+ /* CPU has come up inbetween, retry migration */
+ cpu_relax();
+ }
+}
+
+/*
+ * Function for worker->rebind_work used to rebind rogue busy workers
+ * to the associated cpu which is coming back online.
+ */
+static void worker_rebind_fn(struct work_struct *work)
+{
+ struct worker *worker = container_of(work, struct worker, rebind_work);
+ struct global_cwq *gcwq = worker->gcwq;
+
+ /*
+ * This is scheduled by cpu up but can race with other cpu
+ * hotplug operations and may be executed twice without
+ * intervening cpu down. Bump nr_running only if we clear
+ * REBIND and nothing else tells us to skip running state
+ * tracking.
+ */
+ if (worker_maybe_bind_and_lock(worker) &&
+ (worker->flags & WORKER_REBIND)) {
+ worker->flags &= ~WORKER_REBIND;
+ if (!(worker->flags & WORKER_IGN_RUNNING))
+ atomic_inc(get_gcwq_nr_running(gcwq->cpu));
+ }
+ spin_unlock_irq(&gcwq->lock);
+}
+
static struct worker *alloc_worker(void)
{
struct worker *worker;
@@ -813,6 +1080,8 @@ static struct worker *alloc_worker(void)
if (worker) {
INIT_LIST_HEAD(&worker->entry);
INIT_LIST_HEAD(&worker->scheduled);
+ INIT_WORK(&worker->rebind_work, worker_rebind_fn);
+ /* on creation a worker is not idle */
}
return worker;
}
@@ -890,7 +1159,7 @@ fail:
*/
static void start_worker(struct worker *worker)
{
- worker->flags |= WORKER_STARTED;
+ worker->flags |= WORKER_STARTED | WORKER_PREP;
worker->gcwq->nr_workers++;
worker_enter_idle(worker);
wake_up_process(worker->task);
@@ -931,6 +1200,220 @@ static void destroy_worker(struct worker *worker)
ida_remove(&gcwq->worker_ida, id);
}

+static void idle_worker_timeout(unsigned long __gcwq)
+{
+ struct global_cwq *gcwq = (void *)__gcwq;
+
+ spin_lock_irq(&gcwq->lock);
+
+ if (too_many_workers(gcwq)) {
+ struct worker *worker;
+ unsigned long expires;
+
+ /* idle_list is kept in LIFO order, check the last one */
+ worker = list_entry(gcwq->idle_list.prev, struct worker, entry);
+ expires = worker->last_active + IDLE_WORKER_TIMEOUT;
+
+ if (time_before(jiffies, expires))
+ mod_timer(&gcwq->idle_timer, expires);
+ else {
+ /* it's been idle for too long, wake up manager */
+ gcwq->flags |= GCWQ_MANAGE_WORKERS;
+ wake_up_worker(gcwq);
+ }
+ }
+
+ spin_unlock_irq(&gcwq->lock);
+}
+
+static bool send_mayday(struct work_struct *work)
+{
+ struct cpu_workqueue_struct *cwq = get_work_cwq(work);
+ struct workqueue_struct *wq = cwq->wq;
+
+ if (!(wq->flags & WQ_RESCUER))
+ return false;
+
+ /* mayday mayday mayday */
+ if (!cpumask_test_and_set_cpu(cwq->gcwq->cpu, wq->mayday_mask))
+ wake_up_process(wq->rescuer->task);
+ return true;
+}
+
+static void gcwq_mayday_timeout(unsigned long __gcwq)
+{
+ struct global_cwq *gcwq = (void *)__gcwq;
+ struct work_struct *work;
+
+ spin_lock_irq(&gcwq->lock);
+
+ if (need_to_create_worker(gcwq)) {
+ /*
+ * We've been trying to create a new worker but
+ * haven't been successful. We might be hitting an
+ * allocation deadlock. Send distress signals to
+ * rescuers.
+ */
+ list_for_each_entry(work, &gcwq->worklist, entry)
+ send_mayday(work);
+ }
+
+ spin_unlock_irq(&gcwq->lock);
+
+ mod_timer(&gcwq->mayday_timer, jiffies + MAYDAY_INTERVAL);
+}
+
+/**
+ * maybe_create_worker - create a new worker if necessary
+ * @gcwq: gcwq to create a new worker for
+ *
+ * Create a new worker for @gcwq if necessary. @gcwq is guaranteed to
+ * have at least one idle worker on return from this function. If
+ * creating a new worker takes longer than MAYDAY_INTERVAL, mayday is
+ * sent to all rescuers with works scheduled on @gcwq to resolve
+ * possible allocation deadlock.
+ *
+ * On return, need_to_create_worker() is guaranteed to be false and
+ * may_start_working() true.
+ *
+ * LOCKING:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times. Does GFP_KERNEL allocations. Called only from
+ * manager.
+ *
+ * RETURNS:
+ * false if no action was taken and gcwq->lock stayed locked, true
+ * otherwise.
+ */
+static bool maybe_create_worker(struct global_cwq *gcwq)
+{
+ if (!need_to_create_worker(gcwq))
+ return false;
+restart:
+ /* if we don't make progress in MAYDAY_INITIAL_TIMEOUT, call for help */
+ mod_timer(&gcwq->mayday_timer, jiffies + MAYDAY_INITIAL_TIMEOUT);
+
+ while (true) {
+ struct worker *worker;
+
+ spin_unlock_irq(&gcwq->lock);
+
+ worker = create_worker(gcwq, true);
+ if (worker) {
+ del_timer_sync(&gcwq->mayday_timer);
+ spin_lock_irq(&gcwq->lock);
+ start_worker(worker);
+ BUG_ON(need_to_create_worker(gcwq));
+ return true;
+ }
+
+ if (!need_to_create_worker(gcwq))
+ break;
+
+ spin_unlock_irq(&gcwq->lock);
+ __set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(CREATE_COOLDOWN);
+ spin_lock_irq(&gcwq->lock);
+ if (!need_to_create_worker(gcwq))
+ break;
+ }
+
+ spin_unlock_irq(&gcwq->lock);
+ del_timer_sync(&gcwq->mayday_timer);
+ spin_lock_irq(&gcwq->lock);
+ if (need_to_create_worker(gcwq))
+ goto restart;
+ return true;
+}
+
+/**
+ * maybe_destroy_worker - destroy workers which have been idle for a while
+ * @gcwq: gcwq to destroy workers for
+ *
+ * Destroy @gcwq workers which have been idle for longer than
+ * IDLE_WORKER_TIMEOUT.
+ *
+ * LOCKING:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times. Called only from manager.
+ *
+ * RETURNS:
+ * false if no action was taken and gcwq->lock stayed locked, true
+ * otherwise.
+ */
+static bool maybe_destroy_workers(struct global_cwq *gcwq)
+{
+ bool ret = false;
+
+ while (too_many_workers(gcwq)) {
+ struct worker *worker;
+ unsigned long expires;
+
+ worker = list_entry(gcwq->idle_list.prev, struct worker, entry);
+ expires = worker->last_active + IDLE_WORKER_TIMEOUT;
+
+ if (time_before(jiffies, expires)) {
+ mod_timer(&gcwq->idle_timer, expires);
+ break;
+ }
+
+ destroy_worker(worker);
+ ret = true;
+ }
+
+ return ret;
+}
+
+/**
+ * manage_workers - manage worker pool
+ * @worker: self
+ *
+ * Assume the manager role and manage gcwq worker pool @worker belongs
+ * to. At any given time, there can be only zero or one manager per
+ * gcwq. The exclusion is handled automatically by this function.
+ *
+ * The caller can safely start processing works on false return. On
+ * true return, it's guaranteed that need_to_create_worker() is false
+ * and may_start_working() is true.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times. Does GFP_KERNEL allocations.
+ *
+ * RETURNS:
+ * false if no action was taken and gcwq->lock stayed locked, true if
+ * some action was taken.
+ */
+static bool manage_workers(struct worker *worker)
+{
+ struct global_cwq *gcwq = worker->gcwq;
+ bool ret = false;
+
+ if (gcwq->flags & GCWQ_MANAGING_WORKERS)
+ return ret;
+
+ gcwq->flags &= ~GCWQ_MANAGE_WORKERS;
+ gcwq->flags |= GCWQ_MANAGING_WORKERS;
+
+ /*
+ * Destroy and then create so that may_start_working() is true
+ * on return.
+ */
+ ret |= maybe_destroy_workers(gcwq);
+ ret |= maybe_create_worker(gcwq);
+
+ gcwq->flags &= ~GCWQ_MANAGING_WORKERS;
+
+ /*
+ * The trustee might be waiting to take over the manager
+ * position, tell it we're done.
+ */
+ if (unlikely(gcwq->trustee))
+ wake_up_all(&gcwq->trustee_wait);
+
+ return ret;
+}
+
/**
* move_linked_works - move linked works to a list
* @work: start of series of works to be scheduled
@@ -1137,24 +1620,40 @@ static void process_scheduled_works(struct worker *worker)
* worker_thread - the worker thread function
* @__worker: self
*
- * The cwq worker thread function.
+ * The gcwq worker thread function. There's a single dynamic pool of
+ * these per each cpu. These workers process all works regardless of
+ * their specific target workqueue. The only exception is works which
+ * belong to workqueues with a rescuer which will be explained in
+ * rescuer_thread().
*/
static int worker_thread(void *__worker)
{
struct worker *worker = __worker;
struct global_cwq *gcwq = worker->gcwq;
+ atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);

+ /* tell the scheduler that this is a workqueue worker */
+ worker->task->flags |= PF_WQ_WORKER;
woke_up:
spin_lock_irq(&gcwq->lock);

/* DIE can be set only while we're idle, checking here is enough */
if (worker->flags & WORKER_DIE) {
spin_unlock_irq(&gcwq->lock);
+ worker->task->flags &= ~PF_WQ_WORKER;
return 0;
}

worker_leave_idle(worker);
recheck:
+ /* no more worker necessary? */
+ if (!need_more_worker(gcwq))
+ goto sleep;
+
+ /* do we need to manage? */
+ if (unlikely(!may_start_working(gcwq)) && manage_workers(worker))
+ goto recheck;
+
/*
* ->scheduled list can only be filled while a worker is
* preparing to process a work or actually processing it.
@@ -1162,27 +1661,20 @@ recheck:
*/
BUG_ON(!list_empty(&worker->scheduled));

- while (!list_empty(&gcwq->worklist)) {
+ /*
+ * When control reaches this point, we're guaranteed to have
+ * at least one idle worker or that someone else has already
+ * assumed the manager role.
+ */
+ worker->flags &= ~WORKER_PREP;
+ if (likely(!(worker->flags & WORKER_IGN_RUNNING)))
+ atomic_inc(nr_running);
+
+ do {
struct work_struct *work =
list_first_entry(&gcwq->worklist,
struct work_struct, entry);

- /*
- * The following is a rather inefficient way to close
- * race window against cpu hotplug operations. Will
- * be replaced soon.
- */
- if (unlikely(!(worker->flags & WORKER_ROGUE) &&
- !cpumask_equal(&worker->task->cpus_allowed,
- get_cpu_mask(gcwq->cpu)))) {
- spin_unlock_irq(&gcwq->lock);
- set_cpus_allowed_ptr(worker->task,
- get_cpu_mask(gcwq->cpu));
- cpu_relax();
- spin_lock_irq(&gcwq->lock);
- goto recheck;
- }
-
if (likely(!(*work_data_bits(work) & WORK_STRUCT_LINKED))) {
/* optimization path, not strictly necessary */
process_one_work(worker, work);
@@ -1192,13 +1684,21 @@ recheck:
move_linked_works(work, &worker->scheduled, NULL);
process_scheduled_works(worker);
}
- }
+ } while (keep_working(gcwq));

+ if (likely(!(worker->flags & WORKER_IGN_RUNNING)))
+ atomic_dec(nr_running);
+ worker->flags |= WORKER_PREP;
+
+ if (unlikely(need_to_manage_workers(gcwq)) && manage_workers(worker))
+ goto recheck;
+sleep:
/*
- * gcwq->lock is held and there's no work to process, sleep.
- * Workers are woken up only while holding gcwq->lock, so
- * setting the current state before releasing gcwq->lock is
- * enough to prevent losing any event.
+ * gcwq->lock is held and there's no work to process and no
+ * need to manage, sleep. Workers are woken up only while
+ * holding gcwq->lock or from local cpu, so setting the
+ * current state before releasing gcwq->lock is enough to
+ * prevent losing any event.
*/
worker_enter_idle(worker);
__set_current_state(TASK_INTERRUPTIBLE);
@@ -1207,6 +1707,68 @@ recheck:
goto woke_up;
}

+/**
+ * rescuer_thread - the rescuer thread function
+ * @__wq: the associated workqueue
+ *
+ * Workqueue rescuer thread function. There's one rescuer for each
+ * workqueue which has WQ_RESCUER set.
+ *
+ * Regular work processing on a gcwq may block trying to create a new
+ * worker which uses GFP_KERNEL allocation which has slight chance of
+ * developing into deadlock if some works currently on the same queue
+ * need to be processed to satisfy the GFP_KERNEL allocation. This is
+ * the problem rescuer solves.
+ *
+ * When such condition is possible, the gcwq summons rescuers of all
+ * workqueues which have works queued on the gcwq and let them process
+ * those works so that forward progress can be guaranteed.
+ *
+ * This should happen rarely.
+ */
+static int rescuer_thread(void *__wq)
+{
+ struct workqueue_struct *wq = __wq;
+ struct worker *rescuer = wq->rescuer;
+ struct list_head *scheduled = &rescuer->scheduled;
+ unsigned int cpu;
+
+ set_user_nice(current, RESCUER_NICE_LEVEL);
+repeat:
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ if (kthread_should_stop())
+ return 0;
+
+ for_each_cpu(cpu, wq->mayday_mask) {
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+ struct global_cwq *gcwq = cwq->gcwq;
+ struct work_struct *work, *n;
+
+ __set_current_state(TASK_RUNNING);
+ cpumask_clear_cpu(cpu, wq->mayday_mask);
+
+ /* migrate to the target cpu if possible */
+ rescuer->gcwq = gcwq;
+ worker_maybe_bind_and_lock(rescuer);
+
+ /*
+ * Slurp in all works issued via this workqueue and
+ * process'em.
+ */
+ BUG_ON(!list_empty(&rescuer->scheduled));
+ list_for_each_entry_safe(work, n, &gcwq->worklist, entry)
+ if (get_work_cwq(work) == cwq)
+ move_linked_works(work, scheduled, &n);
+
+ process_scheduled_works(rescuer);
+ spin_unlock_irq(&gcwq->lock);
+ }
+
+ schedule();
+ goto repeat;
+}
+
struct wq_barrier {
struct work_struct work;
struct completion done;
@@ -1960,7 +2522,6 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
const char *lock_name)
{
struct workqueue_struct *wq;
- bool failed = false;
unsigned int cpu;

max_active = clamp_val(max_active, 1, INT_MAX);
@@ -1985,13 +2546,6 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
INIT_LIST_HEAD(&wq->list);

- cpu_maps_update_begin();
- /*
- * We must initialize cwqs for each possible cpu even if we
- * are going to call destroy_workqueue() finally. Otherwise
- * cpu_up() can hit the uninitialized cwq once we drop the
- * lock.
- */
for_each_possible_cpu(cpu) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
struct global_cwq *gcwq = get_gcwq(cpu);
@@ -2002,14 +2556,25 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
cwq->flush_color = -1;
cwq->max_active = max_active;
INIT_LIST_HEAD(&cwq->delayed_works);
+ }

- if (failed)
- continue;
- cwq->worker = create_worker(gcwq, cpu_online(cpu));
- if (cwq->worker)
- start_worker(cwq->worker);
- else
- failed = true;
+ if (flags & WQ_RESCUER) {
+ struct worker *rescuer;
+
+ if (!alloc_cpumask_var(&wq->mayday_mask, GFP_KERNEL))
+ goto err;
+
+ wq->rescuer = rescuer = alloc_worker();
+ if (!rescuer)
+ goto err;
+
+ rescuer->task = kthread_create(rescuer_thread, wq, "%s", name);
+ if (IS_ERR(rescuer->task))
+ goto err;
+
+ wq->rescuer = rescuer;
+ rescuer->task->flags |= PF_THREAD_BOUND;
+ wake_up_process(rescuer->task);
}

/*
@@ -2027,16 +2592,12 @@ struct workqueue_struct *__create_workqueue_key(const char *name,

spin_unlock(&workqueue_lock);

- cpu_maps_update_done();
-
- if (failed) {
- destroy_workqueue(wq);
- wq = NULL;
- }
return wq;
err:
if (wq) {
free_cwqs(wq->cpu_wq);
+ free_cpumask_var(wq->mayday_mask);
+ kfree(wq->rescuer);
kfree(wq);
}
return NULL;
@@ -2059,42 +2620,26 @@ void destroy_workqueue(struct workqueue_struct *wq)
* wq list is used to freeze wq, remove from list after
* flushing is complete in case freeze races us.
*/
- cpu_maps_update_begin();
spin_lock(&workqueue_lock);
list_del(&wq->list);
spin_unlock(&workqueue_lock);
- cpu_maps_update_done();

+ /* sanity check */
for_each_possible_cpu(cpu) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
- struct global_cwq *gcwq = cwq->gcwq;
int i;

- if (cwq->worker) {
- retry:
- spin_lock_irq(&gcwq->lock);
- /*
- * Worker can only be destroyed while idle.
- * Wait till it becomes idle. This is ugly
- * and prone to starvation. It will go away
- * once dynamic worker pool is implemented.
- */
- if (!(cwq->worker->flags & WORKER_IDLE)) {
- spin_unlock_irq(&gcwq->lock);
- msleep(100);
- goto retry;
- }
- destroy_worker(cwq->worker);
- cwq->worker = NULL;
- spin_unlock_irq(&gcwq->lock);
- }
-
for (i = 0; i < WORK_NR_COLORS; i++)
BUG_ON(cwq->nr_in_flight[i]);
BUG_ON(cwq->nr_active);
BUG_ON(!list_empty(&cwq->delayed_works));
}

+ if (wq->flags & WQ_RESCUER) {
+ kthread_stop(wq->rescuer->task);
+ free_cpumask_var(wq->mayday_mask);
+ }
+
free_cwqs(wq->cpu_wq);
kfree(wq);
}
@@ -2103,10 +2648,18 @@ EXPORT_SYMBOL_GPL(destroy_workqueue);
/*
* CPU hotplug.
*
- * CPU hotplug is implemented by allowing cwqs to be detached from
- * CPU, running with unbound workers and allowing them to be
- * reattached later if the cpu comes back online. A separate thread
- * is created to govern cwqs in such state and is called the trustee.
+ * There are two challenges in supporting CPU hotplug. Firstly, there
+ * are a lot of assumptions on strong associations among work, cwq and
+ * gcwq which make migrating pending and scheduled works very
+ * difficult to implement without impacting hot paths. Secondly,
+ * gcwqs serve mix of short, long and very long running works making
+ * blocked draining impractical.
+ *
+ * This is solved by allowing a gcwq to be detached from CPU, running
+ * it with unbound (rogue) workers and allowing it to be reattached
+ * later if the cpu comes back online. A separate thread is created
+ * to govern a gcwq in such state and is called the trustee of the
+ * gcwq.
*
* Trustee states and their descriptions.
*
@@ -2114,11 +2667,12 @@ EXPORT_SYMBOL_GPL(destroy_workqueue);
* new trustee is started with this state.
*
* IN_CHARGE Once started, trustee will enter this state after
- * making all existing workers rogue. DOWN_PREPARE waits
- * for trustee to enter this state. After reaching
- * IN_CHARGE, trustee tries to execute the pending
- * worklist until it's empty and the state is set to
- * BUTCHER, or the state is set to RELEASE.
+ * assuming the manager role and making all existing
+ * workers rogue. DOWN_PREPARE waits for trustee to
+ * enter this state. After reaching IN_CHARGE, trustee
+ * tries to execute the pending worklist until it's empty
+ * and the state is set to BUTCHER, or the state is set
+ * to RELEASE.
*
* BUTCHER Command state which is set by the cpu callback after
* the cpu has went down. Once this state is set trustee
@@ -2129,7 +2683,9 @@ EXPORT_SYMBOL_GPL(destroy_workqueue);
* RELEASE Command state which is set by the cpu callback if the
* cpu down has been canceled or it has come online
* again. After recognizing this state, trustee stops
- * trying to drain or butcher and transits to DONE.
+ * trying to drain or butcher and clears ROGUE, rebinds
+ * all remaining workers back to the cpu and releases
+ * manager role.
*
* DONE Trustee will enter this state after BUTCHER or RELEASE
* is complete.
@@ -2194,18 +2750,26 @@ EXPORT_SYMBOL_GPL(destroy_workqueue);
static int __cpuinit trustee_thread(void *__gcwq)
{
struct global_cwq *gcwq = __gcwq;
+ atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
struct worker *worker;
+ struct work_struct *work;
struct hlist_node *pos;
+ long rc;
int i;

BUG_ON(gcwq->cpu != smp_processor_id());

spin_lock_irq(&gcwq->lock);
/*
- * Make all workers rogue. Trustee must be bound to the
- * target cpu and can't be cancelled.
+ * Claim the manager position and make all workers rogue.
+ * Trustee must be bound to the target cpu and can't be
+ * cancelled.
*/
BUG_ON(gcwq->cpu != smp_processor_id());
+ rc = trustee_wait_event(!(gcwq->flags & GCWQ_MANAGING_WORKERS));
+ BUG_ON(rc < 0);
+
+ gcwq->flags |= GCWQ_MANAGING_WORKERS;

list_for_each_entry(worker, &gcwq->idle_list, entry)
worker->flags |= WORKER_ROGUE;
@@ -2214,6 +2778,28 @@ static int __cpuinit trustee_thread(void *__gcwq)
worker->flags |= WORKER_ROGUE;

/*
+ * Call schedule() so that we cross rq->lock and thus can
+ * guarantee sched callbacks see the rogue flag. This is
+ * necessary as scheduler callbacks may be invoked from other
+ * cpus.
+ */
+ spin_unlock_irq(&gcwq->lock);
+ schedule();
+ spin_lock_irq(&gcwq->lock);
+
+ /*
+ * Sched callbacks are disabled now. Zap nr_running. After
+ * this, gcwq->nr_running stays zero and need_more_worker()
+ * and keep_working() are always true as long as the worklist
+ * is not empty.
+ */
+ atomic_set(nr_running, 0);
+
+ spin_unlock_irq(&gcwq->lock);
+ del_timer_sync(&gcwq->idle_timer);
+ spin_lock_irq(&gcwq->lock);
+
+ /*
* We're now in charge. Notify and proceed to drain. We need
* to keep the gcwq running during the whole CPU down
* procedure as other cpu hotunplug callbacks may need to
@@ -2225,18 +2811,90 @@ static int __cpuinit trustee_thread(void *__gcwq)
/*
* The original cpu is in the process of dying and may go away
* anytime now. When that happens, we and all workers would
- * be migrated to other cpus. Try draining any left work.
- * Note that if the gcwq is frozen, there may be frozen works
- * in freezeable cwqs. Don't declare completion while frozen.
+ * be migrated to other cpus. Try draining any left work. We
+ * want to get it over with ASAP - spam rescuers, wake up as
+ * many idlers as necessary and create new ones till the
+ * worklist is empty. Note that if the gcwq is frozen, there
+ * may be frozen works in freezeable cwqs. Don't declare
+ * completion while frozen.
*/
while (gcwq->nr_workers != gcwq->nr_idle ||
gcwq->flags & GCWQ_FREEZING ||
gcwq->trustee_state == TRUSTEE_IN_CHARGE) {
+ int nr_works = 0;
+
+ list_for_each_entry(work, &gcwq->worklist, entry) {
+ send_mayday(work);
+ nr_works++;
+ }
+
+ list_for_each_entry(worker, &gcwq->idle_list, entry) {
+ if (!nr_works--)
+ break;
+ wake_up_process(worker->task);
+ }
+
+ if (need_to_create_worker(gcwq)) {
+ spin_unlock_irq(&gcwq->lock);
+ worker = create_worker(gcwq, false);
+ spin_lock_irq(&gcwq->lock);
+ if (worker) {
+ worker->flags |= WORKER_ROGUE;
+ start_worker(worker);
+ }
+ }
+
/* give a breather */
if (trustee_wait_event_timeout(false, TRUSTEE_COOLDOWN) < 0)
break;
}

+ /*
+ * Either all works have been scheduled and cpu is down, or
+ * cpu down has already been canceled. Wait for and butcher
+ * all workers till we're canceled.
+ */
+ do {
+ rc = trustee_wait_event(!list_empty(&gcwq->idle_list));
+ while (!list_empty(&gcwq->idle_list))
+ destroy_worker(list_first_entry(&gcwq->idle_list,
+ struct worker, entry));
+ } while (gcwq->nr_workers && rc >= 0);
+
+ /*
+ * At this point, either draining has completed and no worker
+ * is left, or cpu down has been canceled or the cpu is being
+ * brought back up. There shouldn't be any idle one left.
+ * Tell the remaining busy ones to rebind once it finishes the
+ * currently scheduled works by scheduling the rebind_work.
+ */
+ WARN_ON(!list_empty(&gcwq->idle_list));
+
+ for_each_busy_worker(worker, i, pos, gcwq) {
+ struct work_struct *rebind_work = &worker->rebind_work;
+
+ /*
+ * Rebind_work may race with future cpu hotplug
+ * operations. Use a separate flag to mark that
+ * rebinding is scheduled.
+ */
+ worker->flags &= ~WORKER_ROGUE;
+ worker->flags |= WORKER_REBIND;
+
+ /* queue rebind_work, wq doesn't matter, use the default one */
+ if (test_and_set_bit(WORK_STRUCT_PENDING_BIT,
+ work_data_bits(rebind_work)))
+ continue;
+
+ debug_work_activate(rebind_work);
+ insert_work(get_cwq(gcwq->cpu, keventd_wq), rebind_work,
+ worker->scheduled.next,
+ work_color_to_flags(WORK_NO_COLOR));
+ }
+
+ /* relinquish manager role */
+ gcwq->flags &= ~GCWQ_MANAGING_WORKERS;
+
/* notify completion */
gcwq->trustee = NULL;
gcwq->trustee_state = TRUSTEE_DONE;
@@ -2275,10 +2933,8 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
unsigned int cpu = (unsigned long)hcpu;
struct global_cwq *gcwq = get_gcwq(cpu);
struct task_struct *new_trustee = NULL;
- struct worker *worker;
- struct hlist_node *pos;
+ struct worker *uninitialized_var(new_worker);
unsigned long flags;
- int i;

action &= ~CPU_TASKS_FROZEN;

@@ -2289,6 +2945,15 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
if (IS_ERR(new_trustee))
return notifier_from_errno(PTR_ERR(new_trustee));
kthread_bind(new_trustee, cpu);
+ /* fall through */
+ case CPU_UP_PREPARE:
+ BUG_ON(gcwq->first_idle);
+ new_worker = create_worker(gcwq, false);
+ if (!new_worker) {
+ if (new_trustee)
+ kthread_stop(new_trustee);
+ return NOTIFY_BAD;
+ }
}

/* some are called w/ irq disabled, don't disturb irq status */
@@ -2302,26 +2967,50 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
gcwq->trustee_state = TRUSTEE_START;
wake_up_process(gcwq->trustee);
wait_trustee_state(gcwq, TRUSTEE_IN_CHARGE);
+ /* fall through */
+ case CPU_UP_PREPARE:
+ BUG_ON(gcwq->first_idle);
+ gcwq->first_idle = new_worker;
+ break;
+
+ case CPU_DYING:
+ /*
+ * Before this, the trustee and all workers except for
+ * the ones which are still executing works from
+ * before the last CPU down must be on the cpu. After
+ * this, they'll all be diasporas.
+ */
+ gcwq->flags |= GCWQ_DISASSOCIATED;
break;

case CPU_POST_DEAD:
gcwq->trustee_state = TRUSTEE_BUTCHER;
+ /* fall through */
+ case CPU_UP_CANCELED:
+ destroy_worker(gcwq->first_idle);
+ gcwq->first_idle = NULL;
break;

case CPU_DOWN_FAILED:
case CPU_ONLINE:
+ gcwq->flags &= ~GCWQ_DISASSOCIATED;
if (gcwq->trustee_state != TRUSTEE_DONE) {
gcwq->trustee_state = TRUSTEE_RELEASE;
wake_up_process(gcwq->trustee);
wait_trustee_state(gcwq, TRUSTEE_DONE);
}

- /* clear ROGUE from all workers */
- list_for_each_entry(worker, &gcwq->idle_list, entry)
- worker->flags &= ~WORKER_ROGUE;
-
- for_each_busy_worker(worker, i, pos, gcwq)
- worker->flags &= ~WORKER_ROGUE;
+ /*
+ * Trustee is done and there might be no worker left.
+ * Put the first_idle in and request a real manager to
+ * take a look.
+ */
+ spin_unlock_irq(&gcwq->lock);
+ kthread_bind(gcwq->first_idle->task, cpu);
+ spin_lock_irq(&gcwq->lock);
+ gcwq->flags |= GCWQ_MANAGE_WORKERS;
+ start_worker(gcwq->first_idle);
+ gcwq->first_idle = NULL;
break;
}

@@ -2510,10 +3199,10 @@ void thaw_workqueues(void)
if (wq->single_cpu == gcwq->cpu &&
!cwq->nr_active && list_empty(&cwq->delayed_works))
cwq_unbind_single_cpu(cwq);
-
- wake_up_process(cwq->worker->task);
}

+ wake_up_worker(gcwq);
+
spin_unlock_irq(&gcwq->lock);
}

@@ -2558,12 +3247,31 @@ void __init init_workqueues(void)
for (i = 0; i < BUSY_WORKER_HASH_SIZE; i++)
INIT_HLIST_HEAD(&gcwq->busy_hash[i]);

+ init_timer_deferrable(&gcwq->idle_timer);
+ gcwq->idle_timer.function = idle_worker_timeout;
+ gcwq->idle_timer.data = (unsigned long)gcwq;
+
+ setup_timer(&gcwq->mayday_timer, gcwq_mayday_timeout,
+ (unsigned long)gcwq);
+
ida_init(&gcwq->worker_ida);

gcwq->trustee_state = TRUSTEE_DONE;
init_waitqueue_head(&gcwq->trustee_wait);
}

+ /* create the initial worker */
+ for_each_online_cpu(cpu) {
+ struct global_cwq *gcwq = get_gcwq(cpu);
+ struct worker *worker;
+
+ worker = create_worker(gcwq, true);
+ BUG_ON(!worker);
+ spin_lock_irq(&gcwq->lock);
+ start_worker(worker);
+ spin_unlock_irq(&gcwq->lock);
+ }
+
keventd_wq = create_workqueue("events");
BUG_ON(!keventd_wq);
}
diff --git a/kernel/workqueue_sched.h b/kernel/workqueue_sched.h
index af040ba..2d10fc9 100644
--- a/kernel/workqueue_sched.h
+++ b/kernel/workqueue_sched.h
@@ -4,13 +4,6 @@
* Scheduler hooks for concurrency managed workqueue. Only to be
* included from sched.c and workqueue.c.
*/
-static inline void wq_worker_waking_up(struct task_struct *task,
- unsigned int cpu)
-{
-}
-
-static inline struct task_struct *wq_worker_sleeping(struct task_struct *task,
- unsigned int cpu)
-{
- return NULL;
-}
+void wq_worker_waking_up(struct task_struct *task, unsigned int cpu);
+struct task_struct *wq_worker_sleeping(struct task_struct *task,
+ unsigned int cpu);
--
1.6.4.2

2010-06-14 21:44:32

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 25/30] workqueue: increase max_active of keventd and kill current_is_keventd()

Define WQ_MAX_ACTIVE and create keventd with max_active set to half of
it, which means that keventd can now process up to WQ_MAX_ACTIVE / 2 - 1
works concurrently. Unless some combination can result in a dependency
loop longer than max_active, deadlock won't happen and thus it's
unnecessary to check current_is_keventd() before trying to schedule a
work. Kill current_is_keventd().

(Lockdep annotations are broken. We need lock_map_acquire_read_norecurse())
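
As an illustration of why the check can go away: with keventd's
max_active well above one, a work may queue and even flush another work
on the same cpu without deadlocking, since the helper simply occupies
another active slot. A hedged sketch with made-up work functions
(helper_fn/example_fn are not part of the patch):

static void helper_fn(struct work_struct *work)
{
	/* dependent processing that used to require the special case */
}

static DECLARE_WORK(helper_work, helper_fn);

static void example_fn(struct work_struct *work)
{
	/*
	 * This pattern used to need the current_is_keventd() check;
	 * with max_active > 1 the helper runs in another active slot,
	 * so waiting for it here doesn't deadlock.
	 */
	schedule_work(&helper_work);
	flush_work(&helper_work);
}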

NOT_SIGNED_OFF_YET: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Oleg Nesterov <[email protected]>
---
arch/ia64/kernel/smpboot.c | 2 +-
arch/x86/kernel/smpboot.c | 2 +-
include/linux/workqueue.h | 4 ++-
kernel/workqueue.c | 63 +++++++++----------------------------------
4 files changed, 18 insertions(+), 53 deletions(-)

diff --git a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c
index 6a1380e..99dcc85 100644
--- a/arch/ia64/kernel/smpboot.c
+++ b/arch/ia64/kernel/smpboot.c
@@ -519,7 +519,7 @@ do_boot_cpu (int sapicid, int cpu)
/*
* We can't use kernel_thread since we must avoid to reschedule the child.
*/
- if (!keventd_up() || current_is_keventd())
+ if (!keventd_up())
c_idle.work.func(&c_idle.work);
else {
schedule_work(&c_idle.work);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index c4f33b2..4d90f37 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -735,7 +735,7 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu)
goto do_rest;
}

- if (!keventd_up() || current_is_keventd())
+ if (!keventd_up())
c_idle.work.func(&c_idle.work);
else {
schedule_work(&c_idle.work);
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index b8f4ec4..33e24e7 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -227,6 +227,9 @@ enum {
WQ_SINGLE_CPU = 1 << 1, /* only single cpu at a time */
WQ_NON_REENTRANT = 1 << 2, /* guarantee non-reentrance */
WQ_RESCUER = 1 << 3, /* has an rescue worker */
+
+ WQ_MAX_ACTIVE = 512, /* I like 512, better ideas? */
+ WQ_DFL_ACTIVE = WQ_MAX_ACTIVE / 2,
};

extern struct workqueue_struct *
@@ -280,7 +283,6 @@ extern int schedule_delayed_work(struct delayed_work *work, unsigned long delay)
extern int schedule_delayed_work_on(int cpu, struct delayed_work *work,
unsigned long delay);
extern int schedule_on_each_cpu(work_func_t func);
-extern int current_is_keventd(void);
extern int keventd_up(void);

extern void init_workqueues(void);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index dd8c38b..9f22dbd 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2343,7 +2343,6 @@ EXPORT_SYMBOL(schedule_delayed_work_on);
int schedule_on_each_cpu(work_func_t func)
{
int cpu;
- int orig = -1;
struct work_struct *works;

works = alloc_percpu(struct work_struct);
@@ -2352,23 +2351,12 @@ int schedule_on_each_cpu(work_func_t func)

get_online_cpus();

- /*
- * When running in keventd don't schedule a work item on
- * itself. Can just call directly because the work queue is
- * already bound. This also is faster.
- */
- if (current_is_keventd())
- orig = raw_smp_processor_id();
-
for_each_online_cpu(cpu) {
struct work_struct *work = per_cpu_ptr(works, cpu);

INIT_WORK(work, func);
- if (cpu != orig)
- schedule_work_on(cpu, work);
+ schedule_work_on(cpu, work);
}
- if (orig >= 0)
- func(per_cpu_ptr(works, orig));

for_each_online_cpu(cpu)
flush_work(per_cpu_ptr(works, cpu));
@@ -2439,41 +2427,6 @@ int keventd_up(void)
return keventd_wq != NULL;
}

-int current_is_keventd(void)
-{
- bool found = false;
- unsigned int cpu;
-
- /*
- * There no longer is one-to-one relation between worker and
- * work queue and a worker task might be unbound from its cpu
- * if the cpu was offlined. Match all busy workers. This
- * function will go away once dynamic pool is implemented.
- */
- for_each_possible_cpu(cpu) {
- struct global_cwq *gcwq = get_gcwq(cpu);
- struct worker *worker;
- struct hlist_node *pos;
- unsigned long flags;
- int i;
-
- spin_lock_irqsave(&gcwq->lock, flags);
-
- for_each_busy_worker(worker, i, pos, gcwq) {
- if (worker->task == current) {
- found = true;
- break;
- }
- }
-
- spin_unlock_irqrestore(&gcwq->lock, flags);
- if (found)
- break;
- }
-
- return found;
-}
-
static struct cpu_workqueue_struct *alloc_cwqs(void)
{
const size_t size = sizeof(struct cpu_workqueue_struct);
@@ -2515,6 +2468,16 @@ static void free_cwqs(struct cpu_workqueue_struct *cwqs)
#endif
}

+static int wq_clamp_max_active(int max_active, const char *name)
+{
+ if (max_active < 1 || max_active > WQ_MAX_ACTIVE)
+ printk(KERN_WARNING "workqueue: max_active %d requested for %s "
+ "is out of range, clamping between %d and %d\n",
+ max_active, name, 1, WQ_MAX_ACTIVE);
+
+ return clamp_val(max_active, 1, WQ_MAX_ACTIVE);
+}
+
struct workqueue_struct *__create_workqueue_key(const char *name,
unsigned int flags,
int max_active,
@@ -2524,7 +2487,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
struct workqueue_struct *wq;
unsigned int cpu;

- max_active = clamp_val(max_active, 1, INT_MAX);
+ max_active = wq_clamp_max_active(max_active, name);

wq = kzalloc(sizeof(*wq), GFP_KERNEL);
if (!wq)
@@ -3272,6 +3235,6 @@ void __init init_workqueues(void)
spin_unlock_irq(&gcwq->lock);
}

- keventd_wq = create_workqueue("events");
+ keventd_wq = __create_workqueue("events", 0, WQ_DFL_ACTIVE);
BUG_ON(!keventd_wq);
}
--
1.6.4.2

2010-06-14 21:44:42

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 11/30] workqueue: reimplement workqueue flushing using color coded works

Reimplement workqueue flushing using color coded works. wq has the
current work color which is painted on the works being issued via
cwqs. Flushing a workqueue is achieved by advancing the current work
colors of cwqs and waiting for all the works which have any of the
previous colors to drain.

Currently there are 16 possible colors: one is reserved for no color
and 15 colors are usable, allowing 14 concurrent flushes.  When the
color space gets full, flush attempts are batched up and processed
together when a color frees up, so even with many concurrent flushers
the new implementation won't build up a huge queue of flushers that
has to be processed one after another.

Only works which are queued via __queue_work() are colored.  Works
which are put directly on the queue using insert_work() use NO_COLOR
and don't participate in workqueue flushing.  Currently only the works
used for work-specific flushing fall into this category.

This new implementation leaves cleanup_workqueue_thread() as the only
user of flush_cpu_workqueue().  Just make its users call
flush_workqueue() and kthread_stop() directly and kill
cleanup_workqueue_thread().  As workqueue flushing no longer uses a
barrier request, the comment describing the complex synchronization
around it in cleanup_workqueue_thread() is removed together with the
function.

This new implementation is to allow having and sharing multiple
workers per cpu.

Please note that one more bit is reserved for a future work flag by
this patch. This is to avoid shifting bits and updating comments
later.
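
As a stand-alone sketch of the color arithmetic described above (the
constants mirror the ones added to workqueue.h by this patch; the demo
loop is illustrative only):

#include <stdio.h>

enum {
        COLOR_BITS = 4,
        NR_COLORS  = (1 << COLOR_BITS) - 1,     /* 15 usable colors */
        NO_COLOR   = NR_COLORS,                 /* 16th value marks uncolored works */
};

static int next_color(int color)
{
        return (color + 1) % NR_COLORS;         /* wraps 14 -> 0, never hits NO_COLOR */
}

int main(void)
{
        int color = 0, i;

        /* advancing the work color walks 0..14 and then wraps around */
        for (i = 0; i < NR_COLORS + 2; i++) {
                printf("%d ", color);
                color = next_color(color);
        }
        printf("\nNO_COLOR = %d (reserved, never returned by next_color())\n",
               NO_COLOR);
        return 0;
}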

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 21 +++-
kernel/workqueue.c | 355 ++++++++++++++++++++++++++++++++++++++-------
2 files changed, 322 insertions(+), 54 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index b90958a..8762f62 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -26,11 +26,13 @@ enum {
WORK_STRUCT_PENDING_BIT = 0, /* work item is pending execution */
#ifdef CONFIG_DEBUG_OBJECTS_WORK
WORK_STRUCT_STATIC_BIT = 1, /* static initializer (debugobjects) */
- WORK_STRUCT_FLAG_BITS = 2,
+ WORK_STRUCT_COLOR_SHIFT = 3, /* color for workqueue flushing */
#else
- WORK_STRUCT_FLAG_BITS = 1,
+ WORK_STRUCT_COLOR_SHIFT = 2, /* color for workqueue flushing */
#endif

+ WORK_STRUCT_COLOR_BITS = 4,
+
WORK_STRUCT_PENDING = 1 << WORK_STRUCT_PENDING_BIT,
#ifdef CONFIG_DEBUG_OBJECTS_WORK
WORK_STRUCT_STATIC = 1 << WORK_STRUCT_STATIC_BIT,
@@ -38,6 +40,21 @@ enum {
WORK_STRUCT_STATIC = 0,
#endif

+ /*
+ * The last color is no color used for works which don't
+ * participate in workqueue flushing.
+ */
+ WORK_NR_COLORS = (1 << WORK_STRUCT_COLOR_BITS) - 1,
+ WORK_NO_COLOR = WORK_NR_COLORS,
+
+ /*
+ * Reserve 6 bits off of cwq pointer w/ debugobjects turned
+ * off. This makes cwqs aligned to 64 bytes which isn't too
+ * excessive while allowing 15 workqueue flush colors.
+ */
+ WORK_STRUCT_FLAG_BITS = WORK_STRUCT_COLOR_SHIFT +
+ WORK_STRUCT_COLOR_BITS,
+
WORK_STRUCT_FLAG_MASK = (1UL << WORK_STRUCT_FLAG_BITS) - 1,
WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
};
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 878546e..eeec736 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -41,6 +41,8 @@
*
* L: cwq->lock protected. Access with cwq->lock held.
*
+ * F: wq->flush_mutex protected.
+ *
* W: workqueue_lock protected.
*/

@@ -60,10 +62,23 @@ struct cpu_workqueue_struct {
unsigned int cpu;

struct workqueue_struct *wq; /* I: the owning workqueue */
+ int work_color; /* L: current color */
+ int flush_color; /* L: flushing color */
+ int nr_in_flight[WORK_NR_COLORS];
+ /* L: nr of in_flight works */
struct task_struct *thread;
};

/*
+ * Structure used to wait for workqueue flush.
+ */
+struct wq_flusher {
+ struct list_head list; /* F: list of flushers */
+ int flush_color; /* F: flush color waiting for */
+ struct completion done; /* flush completion */
+};
+
+/*
* The externally visible workqueue abstraction is an array of
* per-CPU workqueues:
*/
@@ -71,6 +86,15 @@ struct workqueue_struct {
unsigned int flags; /* I: WQ_* flags */
struct cpu_workqueue_struct *cpu_wq; /* I: cwq's */
struct list_head list; /* W: list of all workqueues */
+
+ struct mutex flush_mutex; /* protects wq flushing */
+ int work_color; /* F: current work color */
+ int flush_color; /* F: current flush color */
+ atomic_t nr_cwqs_to_flush; /* flush in progress */
+ struct wq_flusher *first_flusher; /* F: first flusher */
+ struct list_head flusher_queue; /* F: flush waiters */
+ struct list_head flusher_overflow; /* F: flush overflow list */
+
const char *name; /* I: workqueue name */
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
@@ -207,6 +231,22 @@ static struct cpu_workqueue_struct *target_cwq(unsigned int cpu,
return get_cwq(cpu, wq);
}

+static unsigned int work_color_to_flags(int color)
+{
+ return color << WORK_STRUCT_COLOR_SHIFT;
+}
+
+static int get_work_color(struct work_struct *work)
+{
+ return (*work_data_bits(work) >> WORK_STRUCT_COLOR_SHIFT) &
+ ((1 << WORK_STRUCT_COLOR_BITS) - 1);
+}
+
+static int work_next_color(int color)
+{
+ return (color + 1) % WORK_NR_COLORS;
+}
+
/*
* Set the workqueue on which a work item is to be run
* - Must *only* be called if the pending flag is set
@@ -273,7 +313,9 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
debug_work_activate(work);
spin_lock_irqsave(&cwq->lock, flags);
BUG_ON(!list_empty(&work->entry));
- insert_work(cwq, work, &cwq->worklist, 0);
+ cwq->nr_in_flight[cwq->work_color]++;
+ insert_work(cwq, work, &cwq->worklist,
+ work_color_to_flags(cwq->work_color));
spin_unlock_irqrestore(&cwq->lock, flags);
}

@@ -387,6 +429,44 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
EXPORT_SYMBOL_GPL(queue_delayed_work_on);

/**
+ * cwq_dec_nr_in_flight - decrement cwq's nr_in_flight
+ * @cwq: cwq of interest
+ * @color: color of work which left the queue
+ *
+ * A work either has completed or is removed from pending queue,
+ * decrement nr_in_flight of its cwq and handle workqueue flushing.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
+static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)
+{
+ /* ignore uncolored works */
+ if (color == WORK_NO_COLOR)
+ return;
+
+ cwq->nr_in_flight[color]--;
+
+ /* is flush in progress and are we at the flushing tip? */
+ if (likely(cwq->flush_color != color))
+ return;
+
+ /* are there still in-flight works? */
+ if (cwq->nr_in_flight[color])
+ return;
+
+ /* this cwq is done, clear flush_color */
+ cwq->flush_color = -1;
+
+ /*
+ * If this was the last cwq, wake up the first flusher. It
+ * will handle the rest.
+ */
+ if (atomic_dec_and_test(&cwq->wq->nr_cwqs_to_flush))
+ complete(&cwq->wq->first_flusher->done);
+}
+
+/**
* process_one_work - process single work
* @cwq: cwq to process work for
* @work: work to process
@@ -404,6 +484,7 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
struct work_struct *work)
{
work_func_t f = work->func;
+ int work_color;
#ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the struct work_struct from
@@ -417,6 +498,7 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
/* claim and process */
debug_work_deactivate(work);
cwq->current_work = work;
+ work_color = get_work_color(work);
list_del_init(&work->entry);

spin_unlock_irq(&cwq->lock);
@@ -443,6 +525,7 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,

/* we're done with it, release */
cwq->current_work = NULL;
+ cwq_dec_nr_in_flight(cwq, work_color);
}

static void run_workqueue(struct cpu_workqueue_struct *cwq)
@@ -529,29 +612,78 @@ static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
init_completion(&barr->done);

debug_work_activate(&barr->work);
- insert_work(cwq, &barr->work, head, 0);
+ insert_work(cwq, &barr->work, head, work_color_to_flags(WORK_NO_COLOR));
}

-static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
+/**
+ * flush_workqueue_prep_cwqs - prepare cwqs for workqueue flushing
+ * @wq: workqueue being flushed
+ * @flush_color: new flush color, < 0 for no-op
+ * @work_color: new work color, < 0 for no-op
+ *
+ * Prepare cwqs for workqueue flushing.
+ *
+ * If @flush_color is non-negative, flush_color on all cwqs should be
+ * -1. If no cwq has in-flight commands at the specified color, all
+ * cwq->flush_color's stay at -1 and %false is returned. If any cwq
+ * has in flight commands, its cwq->flush_color is set to
+ * @flush_color, @wq->nr_cwqs_to_flush is updated accordingly, cwq
+ * wakeup logic is armed and %true is returned.
+ *
+ * The caller should have initialized @wq->first_flusher prior to
+ * calling this function with non-negative @flush_color. If
+ * @flush_color is negative, no flush color update is done and %false
+ * is returned.
+ *
+ * If @work_color is non-negative, all cwqs should have the same
+ * work_color which is previous to @work_color and all will be
+ * advanced to @work_color.
+ *
+ * CONTEXT:
+ * mutex_lock(wq->flush_mutex).
+ *
+ * RETURNS:
+ * %true if @flush_color >= 0 and there's something to flush. %false
+ * otherwise.
+ */
+static bool flush_workqueue_prep_cwqs(struct workqueue_struct *wq,
+ int flush_color, int work_color)
{
- int active = 0;
- struct wq_barrier barr;
+ bool wait = false;
+ unsigned int cpu;

- WARN_ON(cwq->thread == current);
-
- spin_lock_irq(&cwq->lock);
- if (!list_empty(&cwq->worklist) || cwq->current_work != NULL) {
- insert_wq_barrier(cwq, &barr, &cwq->worklist);
- active = 1;
+ if (flush_color >= 0) {
+ BUG_ON(atomic_read(&wq->nr_cwqs_to_flush));
+ atomic_set(&wq->nr_cwqs_to_flush, 1);
}
- spin_unlock_irq(&cwq->lock);

- if (active) {
- wait_for_completion(&barr.done);
- destroy_work_on_stack(&barr.work);
+ for_each_possible_cpu(cpu) {
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+ spin_lock_irq(&cwq->lock);
+
+ if (flush_color >= 0) {
+ BUG_ON(cwq->flush_color != -1);
+
+ if (cwq->nr_in_flight[flush_color]) {
+ cwq->flush_color = flush_color;
+ atomic_inc(&wq->nr_cwqs_to_flush);
+ wait = true;
+ }
+ }
+
+ if (work_color >= 0) {
+ BUG_ON(work_color != work_next_color(cwq->work_color));
+ cwq->work_color = work_color;
+ }
+
+ spin_unlock_irq(&cwq->lock);
}

- return active;
+ if (flush_color >= 0 && atomic_dec_and_test(&wq->nr_cwqs_to_flush))
+ complete(&wq->first_flusher->done);
+
+ return wait;
}

/**
@@ -566,13 +698,143 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
*/
void flush_workqueue(struct workqueue_struct *wq)
{
- int cpu;
+ struct wq_flusher this_flusher = {
+ .list = LIST_HEAD_INIT(this_flusher.list),
+ .flush_color = -1,
+ .done = COMPLETION_INITIALIZER_ONSTACK(this_flusher.done),
+ };
+ int next_color;

- might_sleep();
lock_map_acquire(&wq->lockdep_map);
lock_map_release(&wq->lockdep_map);
- for_each_possible_cpu(cpu)
- flush_cpu_workqueue(get_cwq(cpu, wq));
+
+ mutex_lock(&wq->flush_mutex);
+
+ /*
+ * Start-to-wait phase
+ */
+ next_color = work_next_color(wq->work_color);
+
+ if (next_color != wq->flush_color) {
+ /*
+ * Color space is not full. The current work_color
+ * becomes our flush_color and work_color is advanced
+ * by one.
+ */
+ BUG_ON(!list_empty(&wq->flusher_overflow));
+ this_flusher.flush_color = wq->work_color;
+ wq->work_color = next_color;
+
+ if (!wq->first_flusher) {
+ /* no flush in progress, become the first flusher */
+ BUG_ON(wq->flush_color != this_flusher.flush_color);
+
+ wq->first_flusher = &this_flusher;
+
+ if (!flush_workqueue_prep_cwqs(wq, wq->flush_color,
+ wq->work_color)) {
+ /* nothing to flush, done */
+ wq->flush_color = next_color;
+ wq->first_flusher = NULL;
+ goto out_unlock;
+ }
+ } else {
+ /* wait in queue */
+ BUG_ON(wq->flush_color == this_flusher.flush_color);
+ list_add_tail(&this_flusher.list, &wq->flusher_queue);
+ flush_workqueue_prep_cwqs(wq, -1, wq->work_color);
+ }
+ } else {
+ /*
+ * Oops, color space is full, wait on overflow queue.
+ * The next flush completion will assign us
+ * flush_color and transfer to flusher_queue.
+ */
+ list_add_tail(&this_flusher.list, &wq->flusher_overflow);
+ }
+
+ mutex_unlock(&wq->flush_mutex);
+
+ wait_for_completion(&this_flusher.done);
+
+ /*
+ * Wake-up-and-cascade phase
+ *
+ * First flushers are responsible for cascading flushes and
+ * handling overflow. Non-first flushers can simply return.
+ */
+ if (wq->first_flusher != &this_flusher)
+ return;
+
+ mutex_lock(&wq->flush_mutex);
+
+ wq->first_flusher = NULL;
+
+ BUG_ON(!list_empty(&this_flusher.list));
+ BUG_ON(wq->flush_color != this_flusher.flush_color);
+
+ while (true) {
+ struct wq_flusher *next, *tmp;
+
+ /* complete all the flushers sharing the current flush color */
+ list_for_each_entry_safe(next, tmp, &wq->flusher_queue, list) {
+ if (next->flush_color != wq->flush_color)
+ break;
+ list_del_init(&next->list);
+ complete(&next->done);
+ }
+
+ BUG_ON(!list_empty(&wq->flusher_overflow) &&
+ wq->flush_color != work_next_color(wq->work_color));
+
+ /* this flush_color is finished, advance by one */
+ wq->flush_color = work_next_color(wq->flush_color);
+
+ /* one color has been freed, handle overflow queue */
+ if (!list_empty(&wq->flusher_overflow)) {
+ /*
+ * Assign the same color to all overflowed
+ * flushers, advance work_color and append to
+ * flusher_queue. This is the start-to-wait
+ * phase for these overflowed flushers.
+ */
+ list_for_each_entry(tmp, &wq->flusher_overflow, list)
+ tmp->flush_color = wq->work_color;
+
+ wq->work_color = work_next_color(wq->work_color);
+
+ list_splice_tail_init(&wq->flusher_overflow,
+ &wq->flusher_queue);
+ flush_workqueue_prep_cwqs(wq, -1, wq->work_color);
+ }
+
+ if (list_empty(&wq->flusher_queue)) {
+ BUG_ON(wq->flush_color != wq->work_color);
+ break;
+ }
+
+ /*
+ * Need to flush more colors. Make the next flusher
+ * the new first flusher and arm cwqs.
+ */
+ BUG_ON(wq->flush_color == wq->work_color);
+ BUG_ON(wq->flush_color != next->flush_color);
+
+ list_del_init(&next->list);
+ wq->first_flusher = next;
+
+ if (flush_workqueue_prep_cwqs(wq, wq->flush_color, -1))
+ break;
+
+ /*
+ * Meh... this color is already done, clear first
+ * flusher and repeat cascading.
+ */
+ wq->first_flusher = NULL;
+ }
+
+out_unlock:
+ mutex_unlock(&wq->flush_mutex);
}
EXPORT_SYMBOL_GPL(flush_workqueue);

@@ -659,6 +921,7 @@ static int try_to_grab_pending(struct work_struct *work)
if (cwq == get_wq_data(work)) {
debug_work_deactivate(work);
list_del_init(&work->entry);
+ cwq_dec_nr_in_flight(cwq, get_work_color(work));
ret = 1;
}
}
@@ -1060,6 +1323,10 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
goto err;

wq->flags = flags;
+ mutex_init(&wq->flush_mutex);
+ atomic_set(&wq->nr_cwqs_to_flush, 0);
+ INIT_LIST_HEAD(&wq->flusher_queue);
+ INIT_LIST_HEAD(&wq->flusher_overflow);
wq->name = name;
lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
INIT_LIST_HEAD(&wq->list);
@@ -1077,6 +1344,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
BUG_ON((unsigned long)cwq & WORK_STRUCT_FLAG_MASK);
cwq->wq = wq;
cwq->cpu = cpu;
+ cwq->flush_color = -1;
spin_lock_init(&cwq->lock);
INIT_LIST_HEAD(&cwq->worklist);
init_waitqueue_head(&cwq->more_work);
@@ -1110,33 +1378,6 @@ err:
}
EXPORT_SYMBOL_GPL(__create_workqueue_key);

-static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq)
-{
- /*
- * Our caller is either destroy_workqueue() or CPU_POST_DEAD,
- * cpu_add_remove_lock protects cwq->thread.
- */
- if (cwq->thread == NULL)
- return;
-
- lock_map_acquire(&cwq->wq->lockdep_map);
- lock_map_release(&cwq->wq->lockdep_map);
-
- flush_cpu_workqueue(cwq);
- /*
- * If the caller is CPU_POST_DEAD and cwq->worklist was not empty,
- * a concurrent flush_workqueue() can insert a barrier after us.
- * However, in that case run_workqueue() won't return and check
- * kthread_should_stop() until it flushes all work_struct's.
- * When ->worklist becomes empty it is safe to exit because no
- * more work_structs can be queued on this cwq: flush_workqueue
- * checks list_empty(), and a "normal" queue_work() can't use
- * a dead CPU.
- */
- kthread_stop(cwq->thread);
- cwq->thread = NULL;
-}
-
/**
* destroy_workqueue - safely terminate a workqueue
* @wq: target workqueue
@@ -1153,8 +1394,20 @@ void destroy_workqueue(struct workqueue_struct *wq)
spin_unlock(&workqueue_lock);
cpu_maps_update_done();

- for_each_possible_cpu(cpu)
- cleanup_workqueue_thread(get_cwq(cpu, wq));
+ flush_workqueue(wq);
+
+ for_each_possible_cpu(cpu) {
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+ int i;
+
+ if (cwq->thread) {
+ kthread_stop(cwq->thread);
+ cwq->thread = NULL;
+ }
+
+ for (i = 0; i < WORK_NR_COLORS; i++)
+ BUG_ON(cwq->nr_in_flight[i]);
+ }

free_cwqs(wq->cpu_wq);
kfree(wq);
@@ -1179,9 +1432,7 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,

switch (action) {
case CPU_POST_DEAD:
- lock_map_acquire(&cwq->wq->lockdep_map);
- lock_map_release(&cwq->wq->lockdep_map);
- flush_cpu_workqueue(cwq);
+ flush_workqueue(wq);
break;
}
}
--
1.6.4.2

2010-06-14 21:44:56

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 02/30] acpi: use queue_work_on() instead of binding workqueue worker to cpu0

ACPI works need to be executed on cpu0 and acpi/osl.c achieves this by
creating a singlethread workqueue and then binding it to cpu0 from a
work item, which is quite unorthodox.  Make it create regular
workqueues and use queue_work_on() instead.  This is in preparation for
concurrency managed workqueue; the extra workers won't be a problem
once it's implemented.

Signed-off-by: Tejun Heo <[email protected]>
---
drivers/acpi/osl.c | 40 +++++++++++-----------------------------
1 files changed, 11 insertions(+), 29 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 78418ce..46cce39 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -191,36 +191,11 @@ acpi_status __init acpi_os_initialize(void)
return AE_OK;
}

-static void bind_to_cpu0(struct work_struct *work)
-{
- set_cpus_allowed_ptr(current, cpumask_of(0));
- kfree(work);
-}
-
-static void bind_workqueue(struct workqueue_struct *wq)
-{
- struct work_struct *work;
-
- work = kzalloc(sizeof(struct work_struct), GFP_KERNEL);
- INIT_WORK(work, bind_to_cpu0);
- queue_work(wq, work);
-}
-
acpi_status acpi_os_initialize1(void)
{
- /*
- * On some machines, a software-initiated SMI causes corruption unless
- * the SMI runs on CPU 0. An SMI can be initiated by any AML, but
- * typically it's done in GPE-related methods that are run via
- * workqueues, so we can avoid the known corruption cases by binding
- * the workqueues to CPU 0.
- */
- kacpid_wq = create_singlethread_workqueue("kacpid");
- bind_workqueue(kacpid_wq);
- kacpi_notify_wq = create_singlethread_workqueue("kacpi_notify");
- bind_workqueue(kacpi_notify_wq);
- kacpi_hotplug_wq = create_singlethread_workqueue("kacpi_hotplug");
- bind_workqueue(kacpi_hotplug_wq);
+ kacpid_wq = create_workqueue("kacpid");
+ kacpi_notify_wq = create_workqueue("kacpi_notify");
+ kacpi_hotplug_wq = create_workqueue("kacpi_hotplug");
BUG_ON(!kacpid_wq);
BUG_ON(!kacpi_notify_wq);
BUG_ON(!kacpi_hotplug_wq);
@@ -766,7 +741,14 @@ static acpi_status __acpi_os_execute(acpi_execute_type type,
else
INIT_WORK(&dpc->work, acpi_os_execute_deferred);

- ret = queue_work(queue, &dpc->work);
+ /*
+ * On some machines, a software-initiated SMI causes corruption unless
+ * the SMI runs on CPU 0. An SMI can be initiated by any AML, but
+ * typically it's done in GPE-related methods that are run via
+ * workqueues, so we can avoid the known corruption cases by always
+ * queueing on CPU 0.
+ */
+ ret = queue_work_on(0, queue, &dpc->work);

if (!ret) {
printk(KERN_ERR PREFIX
--
1.6.4.2

2010-06-14 21:45:39

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 19/30] workqueue: make single thread workqueue shared worker pool friendly

Reimplement the st (single thread) workqueue so that it's friendly to
a shared worker pool.  It was originally implemented by confining st
workqueues to the cwq of a fixed cpu and always having a worker for
that cpu.  This implementation isn't very friendly to a shared worker
pool and is suboptimal in that it often ends up crossing cpu boundaries.

Reimplement the st workqueue using dynamic single cpu binding and
cwq->limit.  WQ_SINGLE_THREAD is replaced with WQ_SINGLE_CPU.  In a
single cpu workqueue, at most one cwq is bound to the wq at any given
time.  Arbitration is done using atomic accesses to wq->single_cpu
when queueing a work.  Once bound, the binding stays until the
workqueue is drained.

Note that the binding is never broken while a workqueue is frozen.
This is because idle cwqs may have works waiting in the delayed_works
queue while frozen.  On thaw, the cwq is restarted if there are any
delayed works, or unbound otherwise.

When combined with a max_active limit of 1, a single cpu workqueue has
exactly the same execution properties as the original single thread
workqueue while allowing sharing of per-cpu workers.
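
Roughly, the arbitration boils down to the following stand-alone
sketch: whoever installs its cpu into the shared slot first wins and
everyone else follows that cpu.  Locking is omitted, NR_CPUS and the
gcc builtin are stand-ins for the kernel's NR_CPUS and cmpxchg(), and
the cpu numbers are illustrative only.

#include <stdio.h>

#define NR_CPUS 8UL

static unsigned long single_cpu = NR_CPUS;      /* NR_CPUS means "unbound" */

static unsigned long pick_cpu(unsigned long req_cpu)
{
        unsigned long cpu;

        for (;;) {
                cpu = single_cpu;
                if (cpu == NR_CPUS)             /* unbound: try to claim req_cpu */
                        __sync_val_compare_and_swap(&single_cpu, NR_CPUS,
                                                    req_cpu);
                if (single_cpu != NR_CPUS)      /* someone (maybe us) bound it */
                        return single_cpu;
        }
}

int main(void)
{
        printf("first queue on cpu 3 -> runs on cpu %lu\n", pick_cpu(3));
        printf("later queue on cpu 5 -> runs on cpu %lu\n", pick_cpu(5));
        return 0;
}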

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 6 +-
kernel/workqueue.c | 135 +++++++++++++++++++++++++++++++++------------
2 files changed, 103 insertions(+), 38 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index ab0b7fb..10611f7 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -221,7 +221,7 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }

enum {
WQ_FREEZEABLE = 1 << 0, /* freeze during suspend */
- WQ_SINGLE_THREAD = 1 << 1, /* no per-cpu worker */
+ WQ_SINGLE_CPU = 1 << 1, /* only single cpu at a time */
};

extern struct workqueue_struct *
@@ -250,9 +250,9 @@ __create_workqueue_key(const char *name, unsigned int flags, int max_active,
#define create_workqueue(name) \
__create_workqueue((name), 0, 1)
#define create_freezeable_workqueue(name) \
- __create_workqueue((name), WQ_FREEZEABLE | WQ_SINGLE_THREAD, 1)
+ __create_workqueue((name), WQ_FREEZEABLE | WQ_SINGLE_CPU, 1)
#define create_singlethread_workqueue(name) \
- __create_workqueue((name), WQ_SINGLE_THREAD, 1)
+ __create_workqueue((name), WQ_SINGLE_CPU, 1)

extern void destroy_workqueue(struct workqueue_struct *wq);

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5cd155d..2ce895e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -114,8 +114,7 @@ struct global_cwq {
} ____cacheline_aligned_in_smp;

/*
- * The per-CPU workqueue (if single thread, we always use the first
- * possible cpu). The lower WORK_STRUCT_FLAG_BITS of
+ * The per-CPU workqueue. The lower WORK_STRUCT_FLAG_BITS of
* work_struct->data are used for flags and thus cwqs need to be
* aligned at two's power of the number of flag bits.
*/
@@ -159,6 +158,8 @@ struct workqueue_struct {
struct list_head flusher_queue; /* F: flush waiters */
struct list_head flusher_overflow; /* F: flush overflow list */

+ unsigned long single_cpu; /* cpu for single cpu wq */
+
int saved_max_active; /* I: saved cwq max_active */
const char *name; /* I: workqueue name */
#ifdef CONFIG_LOCKDEP
@@ -289,8 +290,6 @@ static DEFINE_PER_CPU(struct global_cwq, global_cwq);

static int worker_thread(void *__worker);

-static int singlethread_cpu __read_mostly;
-
static struct global_cwq *get_gcwq(unsigned int cpu)
{
return &per_cpu(global_cwq, cpu);
@@ -302,14 +301,6 @@ static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
return per_cpu_ptr(wq->cpu_wq, cpu);
}

-static struct cpu_workqueue_struct *target_cwq(unsigned int cpu,
- struct workqueue_struct *wq)
-{
- if (unlikely(wq->flags & WQ_SINGLE_THREAD))
- cpu = singlethread_cpu;
- return get_cwq(cpu, wq);
-}
-
static unsigned int work_color_to_flags(int color)
{
return color << WORK_STRUCT_COLOR_SHIFT;
@@ -410,17 +401,87 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
wake_up_process(cwq->worker->task);
}

+/**
+ * cwq_unbind_single_cpu - unbind cwq from single cpu workqueue processing
+ * @cwq: cwq to unbind
+ *
+ * Try to unbind @cwq from single cpu workqueue processing. If
+ * @cwq->wq is frozen, unbind is delayed till the workqueue is thawed.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ */
+static void cwq_unbind_single_cpu(struct cpu_workqueue_struct *cwq)
+{
+ struct workqueue_struct *wq = cwq->wq;
+ struct global_cwq *gcwq = cwq->gcwq;
+
+ BUG_ON(wq->single_cpu != gcwq->cpu);
+ /*
+ * Unbind from workqueue if @cwq is not frozen. If frozen,
+ * thaw_workqueues() will either restart processing on this
+ * cpu or unbind if empty. This keeps works queued while
+ * frozen fully ordered and flushable.
+ */
+ if (likely(!(gcwq->flags & GCWQ_FREEZING))) {
+ smp_wmb(); /* paired with cmpxchg() in __queue_work() */
+ wq->single_cpu = NR_CPUS;
+ }
+}
+
static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
struct work_struct *work)
{
- struct cpu_workqueue_struct *cwq = target_cwq(cpu, wq);
- struct global_cwq *gcwq = cwq->gcwq;
+ struct global_cwq *gcwq;
+ struct cpu_workqueue_struct *cwq;
struct list_head *worklist;
unsigned long flags;
+ bool arbitrate;

debug_work_activate(work);

- spin_lock_irqsave(&gcwq->lock, flags);
+ /* determine gcwq to use */
+ if (!(wq->flags & WQ_SINGLE_CPU)) {
+ /* just use the requested cpu for multicpu workqueues */
+ gcwq = get_gcwq(cpu);
+ spin_lock_irqsave(&gcwq->lock, flags);
+ } else {
+ unsigned int req_cpu = cpu;
+
+ /*
+ * It's a bit more complex for single cpu workqueues.
+ * We first need to determine which cpu is going to be
+ * used. If no cpu is currently serving this
+ * workqueue, arbitrate using atomic accesses to
+ * wq->single_cpu; otherwise, use the current one.
+ */
+ retry:
+ cpu = wq->single_cpu;
+ arbitrate = cpu == NR_CPUS;
+ if (arbitrate)
+ cpu = req_cpu;
+
+ gcwq = get_gcwq(cpu);
+ spin_lock_irqsave(&gcwq->lock, flags);
+
+ /*
+ * The following cmpxchg() is a full barrier paired
+ * with smp_wmb() in cwq_unbind_single_cpu() and
+ * guarantees that all changes to wq->st_* fields are
+ * visible on the new cpu after this point.
+ */
+ if (arbitrate)
+ cmpxchg(&wq->single_cpu, NR_CPUS, cpu);
+
+ if (unlikely(wq->single_cpu != cpu)) {
+ spin_unlock_irqrestore(&gcwq->lock, flags);
+ goto retry;
+ }
+ }
+
+ /* gcwq determined, get cwq and queue */
+ cwq = get_cwq(gcwq->cpu, wq);
+
BUG_ON(!list_empty(&work->entry));

cwq->nr_in_flight[cwq->work_color]++;
@@ -530,7 +591,7 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
timer_stats_timer_set_start_info(&dwork->timer);

/* This stores cwq for the moment, for the timer_fn */
- set_wq_data(work, target_cwq(raw_smp_processor_id(), wq), 0);
+ set_wq_data(work, get_cwq(raw_smp_processor_id(), wq), 0);
timer->expires = jiffies + delay;
timer->data = (unsigned long)dwork;
timer->function = delayed_work_timer_fn;
@@ -790,10 +851,14 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)
cwq->nr_in_flight[color]--;
cwq->nr_active--;

- /* one down, submit a delayed one */
- if (!list_empty(&cwq->delayed_works) &&
- cwq->nr_active < cwq->max_active)
- cwq_activate_first_delayed(cwq);
+ if (!list_empty(&cwq->delayed_works)) {
+ /* one down, submit a delayed one */
+ if (cwq->nr_active < cwq->max_active)
+ cwq_activate_first_delayed(cwq);
+ } else if (!cwq->nr_active && cwq->wq->flags & WQ_SINGLE_CPU) {
+ /* this was the last work, unbind from single cpu */
+ cwq_unbind_single_cpu(cwq);
+ }

/* is flush in progress and are we at the flushing tip? */
if (likely(cwq->flush_color != color))
@@ -1721,7 +1786,6 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
struct lock_class_key *key,
const char *lock_name)
{
- bool singlethread = flags & WQ_SINGLE_THREAD;
struct workqueue_struct *wq;
bool failed = false;
unsigned int cpu;
@@ -1742,6 +1806,8 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
atomic_set(&wq->nr_cwqs_to_flush, 0);
INIT_LIST_HEAD(&wq->flusher_queue);
INIT_LIST_HEAD(&wq->flusher_overflow);
+ wq->single_cpu = NR_CPUS;
+
wq->name = name;
lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
INIT_LIST_HEAD(&wq->list);
@@ -1767,8 +1833,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,

if (failed)
continue;
- cwq->worker = create_worker(cwq,
- cpu_online(cpu) && !singlethread);
+ cwq->worker = create_worker(cwq, cpu_online(cpu));
if (cwq->worker)
start_worker(cwq->worker);
else
@@ -1952,18 +2017,16 @@ static int __cpuinit trustee_thread(void *__gcwq)

spin_lock_irq(&gcwq->lock);
/*
- * Make all multithread workers rogue. Trustee must be bound
- * to the target cpu and can't be cancelled.
+ * Make all workers rogue. Trustee must be bound to the
+ * target cpu and can't be cancelled.
*/
BUG_ON(gcwq->cpu != smp_processor_id());

list_for_each_entry(worker, &gcwq->idle_list, entry)
- if (!(worker->cwq->wq->flags & WQ_SINGLE_THREAD))
- worker->flags |= WORKER_ROGUE;
+ worker->flags |= WORKER_ROGUE;

for_each_busy_worker(worker, i, pos, gcwq)
- if (!(worker->cwq->wq->flags & WQ_SINGLE_THREAD))
- worker->flags |= WORKER_ROGUE;
+ worker->flags |= WORKER_ROGUE;

/*
* We're now in charge. Notify and proceed to drain. We need
@@ -2068,14 +2131,12 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
wait_trustee_state(gcwq, TRUSTEE_DONE);
}

- /* clear ROGUE from all multithread workers */
+ /* clear ROGUE from all workers */
list_for_each_entry(worker, &gcwq->idle_list, entry)
- if (!(worker->cwq->wq->flags & WQ_SINGLE_THREAD))
- worker->flags &= ~WORKER_ROGUE;
+ worker->flags &= ~WORKER_ROGUE;

for_each_busy_worker(worker, i, pos, gcwq)
- if (!(worker->cwq->wq->flags & WQ_SINGLE_THREAD))
- worker->flags &= ~WORKER_ROGUE;
+ worker->flags &= ~WORKER_ROGUE;
break;
}

@@ -2260,6 +2321,11 @@ void thaw_workqueues(void)
cwq->nr_active < cwq->max_active)
cwq_activate_first_delayed(cwq);

+ /* perform delayed unbind from single cpu if empty */
+ if (wq->single_cpu == gcwq->cpu &&
+ !cwq->nr_active && list_empty(&cwq->delayed_works))
+ cwq_unbind_single_cpu(cwq);
+
wake_up_process(cwq->worker->task);
}

@@ -2285,7 +2351,6 @@ void __init init_workqueues(void)
BUILD_BUG_ON(__alignof__(struct cpu_workqueue_struct) <
__alignof__(unsigned long long));

- singlethread_cpu = cpumask_first(cpu_possible_mask);
hotcpu_notifier(workqueue_cpu_callback, CPU_PRI_WORKQUEUE);

/* initialize gcwqs */
--
1.6.4.2

2010-06-14 21:45:42

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 17/30] workqueue: implement worker states

Implement worker states.  After creation, a worker is STARTED.  While
a worker isn't processing a work, it's IDLE and chained on
gcwq->idle_list.  While processing a work, a worker is BUSY and
chained on gcwq->busy_hash.  The gcwq now also counts the total number
of workers and the number of idle ones.

worker_thread() is restructured to reflect the state transitions.
cwq->more_work is removed and waking up a worker makes it check for
events.  A worker is killed by setting the DIE flag while it's IDLE
and waking it up.

This gives the gcwq better visibility of what's going on and allows it
to quickly find out whether a work is executing, which is necessary
for having multiple workers process the same cwq.
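
A stand-alone sketch of the resulting worker life cycle (the flag
values mirror the enum added by this patch; the transition trace
itself is purely illustrative):

#include <stdio.h>

enum {
        WORKER_STARTED  = 1 << 0,
        WORKER_DIE      = 1 << 1,
        WORKER_IDLE     = 1 << 2,
};

static void show(const char *what, unsigned int flags)
{
        printf("%-16s started=%d idle=%d die=%d\n", what,
               !!(flags & WORKER_STARTED), !!(flags & WORKER_IDLE),
               !!(flags & WORKER_DIE));
}

int main(void)
{
        unsigned int flags = 0;

        flags |= WORKER_STARTED | WORKER_IDLE;  /* start_worker(): on idle_list */
        show("started, idle", flags);
        flags &= ~WORKER_IDLE;                  /* picked up a work: on busy_hash */
        show("busy", flags);
        flags |= WORKER_IDLE;                   /* worklist empty again */
        show("idle again", flags);
        flags |= WORKER_DIE;                    /* destroy_worker() while idle */
        show("told to die", flags);
        return 0;
}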

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/workqueue.c | 214 ++++++++++++++++++++++++++++++++++++++++++----------
1 files changed, 173 insertions(+), 41 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d0ca750..62d7cfd 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -35,6 +35,17 @@
#include <linux/lockdep.h>
#include <linux/idr.h>

+enum {
+ /* worker flags */
+ WORKER_STARTED = 1 << 0, /* started */
+ WORKER_DIE = 1 << 1, /* die die die */
+ WORKER_IDLE = 1 << 2, /* is idle */
+
+ BUSY_WORKER_HASH_ORDER = 6, /* 64 pointers */
+ BUSY_WORKER_HASH_SIZE = 1 << BUSY_WORKER_HASH_ORDER,
+ BUSY_WORKER_HASH_MASK = BUSY_WORKER_HASH_SIZE - 1,
+};
+
/*
* Structure fields follow one of the following exclusion rules.
*
@@ -51,11 +62,18 @@ struct global_cwq;
struct cpu_workqueue_struct;

struct worker {
+ /* on idle list while idle, on busy hash table while busy */
+ union {
+ struct list_head entry; /* L: while idle */
+ struct hlist_node hentry; /* L: while busy */
+ };
+
struct work_struct *current_work; /* L: work being processed */
struct list_head scheduled; /* L: scheduled works */
struct task_struct *task; /* I: worker task */
struct global_cwq *gcwq; /* I: the associated gcwq */
struct cpu_workqueue_struct *cwq; /* I: the associated cwq */
+ unsigned int flags; /* L: flags */
int id; /* I: worker id */
};

@@ -65,6 +83,15 @@ struct worker {
struct global_cwq {
spinlock_t lock; /* the gcwq lock */
unsigned int cpu; /* I: the associated cpu */
+
+ int nr_workers; /* L: total number of workers */
+ int nr_idle; /* L: currently idle ones */
+
+ /* workers are chained either in the idle_list or busy_hash */
+ struct list_head idle_list; /* L: list of idle workers */
+ struct hlist_head busy_hash[BUSY_WORKER_HASH_SIZE];
+ /* L: hash of busy workers */
+
struct ida worker_ida; /* L: for worker IDs */
} ____cacheline_aligned_in_smp;

@@ -77,7 +104,6 @@ struct global_cwq {
struct cpu_workqueue_struct {
struct global_cwq *gcwq; /* I: the associated gcwq */
struct list_head worklist;
- wait_queue_head_t more_work;
struct worker *worker;
struct workqueue_struct *wq; /* I: the owning workqueue */
int work_color; /* L: current color */
@@ -307,6 +333,33 @@ static inline struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
}

/**
+ * busy_worker_head - return the busy hash head for a work
+ * @gcwq: gcwq of interest
+ * @work: work to be hashed
+ *
+ * Return hash head of @gcwq for @work.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ *
+ * RETURNS:
+ * Pointer to the hash head.
+ */
+static struct hlist_head *busy_worker_head(struct global_cwq *gcwq,
+ struct work_struct *work)
+{
+ const int base_shift = ilog2(sizeof(struct work_struct));
+ unsigned long v = (unsigned long)work;
+
+ /* simple shift and fold hash, do we need something better? */
+ v >>= base_shift;
+ v += v >> BUSY_WORKER_HASH_ORDER;
+ v &= BUSY_WORKER_HASH_MASK;
+
+ return &gcwq->busy_hash[v];
+}
+
+/**
* insert_work - insert a work into cwq
* @cwq: cwq @work belongs to
* @work: work to insert
@@ -332,7 +385,7 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
smp_wmb();

list_add_tail(&work->entry, head);
- wake_up(&cwq->more_work);
+ wake_up_process(cwq->worker->task);
}

static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
@@ -470,13 +523,59 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
}
EXPORT_SYMBOL_GPL(queue_delayed_work_on);

+/**
+ * worker_enter_idle - enter idle state
+ * @worker: worker which is entering idle state
+ *
+ * @worker is entering idle state. Update stats and idle timer if
+ * necessary.
+ *
+ * LOCKING:
+ * spin_lock_irq(gcwq->lock).
+ */
+static void worker_enter_idle(struct worker *worker)
+{
+ struct global_cwq *gcwq = worker->gcwq;
+
+ BUG_ON(worker->flags & WORKER_IDLE);
+ BUG_ON(!list_empty(&worker->entry) &&
+ (worker->hentry.next || worker->hentry.pprev));
+
+ worker->flags |= WORKER_IDLE;
+ gcwq->nr_idle++;
+
+ /* idle_list is LIFO */
+ list_add(&worker->entry, &gcwq->idle_list);
+}
+
+/**
+ * worker_leave_idle - leave idle state
+ * @worker: worker which is leaving idle state
+ *
+ * @worker is leaving idle state. Update stats.
+ *
+ * LOCKING:
+ * spin_lock_irq(gcwq->lock).
+ */
+static void worker_leave_idle(struct worker *worker)
+{
+ struct global_cwq *gcwq = worker->gcwq;
+
+ BUG_ON(!(worker->flags & WORKER_IDLE));
+ worker->flags &= ~WORKER_IDLE;
+ gcwq->nr_idle--;
+ list_del_init(&worker->entry);
+}
+
static struct worker *alloc_worker(void)
{
struct worker *worker;

worker = kzalloc(sizeof(*worker), GFP_KERNEL);
- if (worker)
+ if (worker) {
+ INIT_LIST_HEAD(&worker->entry);
INIT_LIST_HEAD(&worker->scheduled);
+ }
return worker;
}

@@ -541,13 +640,16 @@ fail:
* start_worker - start a newly created worker
* @worker: worker to start
*
- * Start @worker.
+ * Make the gcwq aware of @worker and start it.
*
* CONTEXT:
* spin_lock_irq(gcwq->lock).
*/
static void start_worker(struct worker *worker)
{
+ worker->flags |= WORKER_STARTED;
+ worker->gcwq->nr_workers++;
+ worker_enter_idle(worker);
wake_up_process(worker->task);
}

@@ -555,7 +657,10 @@ static void start_worker(struct worker *worker)
* destroy_worker - destroy a workqueue worker
* @worker: worker to be destroyed
*
- * Destroy @worker.
+ * Destroy @worker and adjust @gcwq stats accordingly.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which is released and regrabbed.
*/
static void destroy_worker(struct worker *worker)
{
@@ -566,12 +671,21 @@ static void destroy_worker(struct worker *worker)
BUG_ON(worker->current_work);
BUG_ON(!list_empty(&worker->scheduled));

+ if (worker->flags & WORKER_STARTED)
+ gcwq->nr_workers--;
+ if (worker->flags & WORKER_IDLE)
+ gcwq->nr_idle--;
+
+ list_del_init(&worker->entry);
+ worker->flags |= WORKER_DIE;
+
+ spin_unlock_irq(&gcwq->lock);
+
kthread_stop(worker->task);
kfree(worker);

spin_lock_irq(&gcwq->lock);
ida_remove(&gcwq->worker_ida, id);
- spin_unlock_irq(&gcwq->lock);
}

/**
@@ -686,6 +800,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
{
struct cpu_workqueue_struct *cwq = worker->cwq;
struct global_cwq *gcwq = cwq->gcwq;
+ struct hlist_head *bwh = busy_worker_head(gcwq, work);
work_func_t f = work->func;
int work_color;
#ifdef CONFIG_LOCKDEP
@@ -700,6 +815,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
#endif
/* claim and process */
debug_work_deactivate(work);
+ hlist_add_head(&worker->hentry, bwh);
worker->current_work = work;
work_color = get_work_color(work);
list_del_init(&work->entry);
@@ -727,6 +843,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
spin_lock_irq(&gcwq->lock);

/* we're done with it, release */
+ hlist_del_init(&worker->hentry);
worker->current_work = NULL;
cwq_dec_nr_in_flight(cwq, work_color);
}
@@ -763,47 +880,56 @@ static int worker_thread(void *__worker)
struct worker *worker = __worker;
struct global_cwq *gcwq = worker->gcwq;
struct cpu_workqueue_struct *cwq = worker->cwq;
- DEFINE_WAIT(wait);

- for (;;) {
- prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
- if (!kthread_should_stop() &&
- list_empty(&cwq->worklist))
- schedule();
- finish_wait(&cwq->more_work, &wait);
+woke_up:
+ if (unlikely(!cpumask_equal(&worker->task->cpus_allowed,
+ get_cpu_mask(gcwq->cpu))))
+ set_cpus_allowed_ptr(worker->task, get_cpu_mask(gcwq->cpu));

- if (kthread_should_stop())
- break;
+ spin_lock_irq(&gcwq->lock);

- if (unlikely(!cpumask_equal(&worker->task->cpus_allowed,
- get_cpu_mask(gcwq->cpu))))
- set_cpus_allowed_ptr(worker->task,
- get_cpu_mask(gcwq->cpu));
+ /* DIE can be set only while we're idle, checking here is enough */
+ if (worker->flags & WORKER_DIE) {
+ spin_unlock_irq(&gcwq->lock);
+ return 0;
+ }

- spin_lock_irq(&gcwq->lock);
+ worker_leave_idle(worker);

- while (!list_empty(&cwq->worklist)) {
- struct work_struct *work =
- list_first_entry(&cwq->worklist,
- struct work_struct, entry);
-
- if (likely(!(*work_data_bits(work) &
- WORK_STRUCT_LINKED))) {
- /* optimization path, not strictly necessary */
- process_one_work(worker, work);
- if (unlikely(!list_empty(&worker->scheduled)))
- process_scheduled_works(worker);
- } else {
- move_linked_works(work, &worker->scheduled,
- NULL);
+ /*
+ * ->scheduled list can only be filled while a worker is
+ * preparing to process a work or actually processing it.
+ * Make sure nobody diddled with it while I was sleeping.
+ */
+ BUG_ON(!list_empty(&worker->scheduled));
+
+ while (!list_empty(&cwq->worklist)) {
+ struct work_struct *work =
+ list_first_entry(&cwq->worklist,
+ struct work_struct, entry);
+
+ if (likely(!(*work_data_bits(work) & WORK_STRUCT_LINKED))) {
+ /* optimization path, not strictly necessary */
+ process_one_work(worker, work);
+ if (unlikely(!list_empty(&worker->scheduled)))
process_scheduled_works(worker);
- }
+ } else {
+ move_linked_works(work, &worker->scheduled, NULL);
+ process_scheduled_works(worker);
}
-
- spin_unlock_irq(&gcwq->lock);
}

- return 0;
+ /*
+ * gcwq->lock is held and there's no work to process, sleep.
+ * Workers are woken up only while holding gcwq->lock, so
+ * setting the current state before releasing gcwq->lock is
+ * enough to prevent losing any event.
+ */
+ worker_enter_idle(worker);
+ __set_current_state(TASK_INTERRUPTIBLE);
+ spin_unlock_irq(&gcwq->lock);
+ schedule();
+ goto woke_up;
}

struct wq_barrier {
@@ -1594,7 +1720,6 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
cwq->max_active = max_active;
INIT_LIST_HEAD(&cwq->worklist);
INIT_LIST_HEAD(&cwq->delayed_works);
- init_waitqueue_head(&cwq->more_work);

if (failed)
continue;
@@ -1645,7 +1770,7 @@ EXPORT_SYMBOL_GPL(__create_workqueue_key);
*/
void destroy_workqueue(struct workqueue_struct *wq)
{
- int cpu;
+ unsigned int cpu;

flush_workqueue(wq);

@@ -1664,8 +1789,10 @@ void destroy_workqueue(struct workqueue_struct *wq)
int i;

if (cwq->worker) {
+ spin_lock_irq(&cwq->gcwq->lock);
destroy_worker(cwq->worker);
cwq->worker = NULL;
+ spin_unlock_irq(&cwq->gcwq->lock);
}

for (i = 0; i < WORK_NR_COLORS; i++)
@@ -1875,7 +2002,7 @@ void thaw_workqueues(void)
cwq->nr_active < cwq->max_active)
cwq_activate_first_delayed(cwq);

- wake_up(&cwq->more_work);
+ wake_up_process(cwq->worker->task);
}

spin_unlock_irq(&gcwq->lock);
@@ -1890,6 +2017,7 @@ out_unlock:
void __init init_workqueues(void)
{
unsigned int cpu;
+ int i;

/*
* cwqs are forced aligned according to WORK_STRUCT_FLAG_BITS.
@@ -1909,6 +2037,10 @@ void __init init_workqueues(void)
spin_lock_init(&gcwq->lock);
gcwq->cpu = cpu;

+ INIT_LIST_HEAD(&gcwq->idle_list);
+ for (i = 0; i < BUSY_WORKER_HASH_SIZE; i++)
+ INIT_HLIST_HEAD(&gcwq->busy_hash[i]);
+
ida_init(&gcwq->worker_ida);
}

--
1.6.4.2

2010-06-14 21:46:09

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 22/30] workqueue: implement WQ_NON_REENTRANT

With the gcwq managing all the workers and work->data pointing to the
last gcwq it was on, non-reentrance can be implemented easily by
checking, at queueing time, whether the work is still running on the
previous gcwq.  Implement it.
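
The queueing decision amounts to the following stand-alone sketch: if
the work is still running on the gcwq it last ran on, queue it there
again instead of on the locally requested cpu.  The types and the
running-check are simplified stand-ins for the kernel structures, and
the work ids and cpu numbers are made up.

#include <stdio.h>
#include <stdbool.h>

struct gcwq { int cpu; };

/* pretend lookup: is @work_id currently executing on @g? */
static bool still_running_on(const struct gcwq *g, int work_id)
{
        return g->cpu == 1 && work_id == 42;    /* illustrative only */
}

static const struct gcwq *pick_gcwq(const struct gcwq *req,
                                    const struct gcwq *last, int work_id)
{
        if (last && last != req && still_running_on(last, work_id))
                return last;    /* still running there: guarantee non-reentrance */
        return req;             /* not running elsewhere: queue locally */
}

int main(void)
{
        struct gcwq cpu0 = { .cpu = 0 }, cpu1 = { .cpu = 1 };

        printf("work 42 last on cpu1 -> queue on cpu%d\n",
               pick_gcwq(&cpu0, &cpu1, 42)->cpu);
        printf("work 7  last on cpu1 -> queue on cpu%d\n",
               pick_gcwq(&cpu0, &cpu1, 7)->cpu);
        return 0;
}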

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/workqueue.h | 1 +
kernel/workqueue.c | 32 +++++++++++++++++++++++++++++---
2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 0a78141..07cf5e5 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -225,6 +225,7 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
enum {
WQ_FREEZEABLE = 1 << 0, /* freeze during suspend */
WQ_SINGLE_CPU = 1 << 1, /* only single cpu at a time */
+ WQ_NON_REENTRANT = 1 << 2, /* guarantee non-reentrance */
};

extern struct workqueue_struct *
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f606c44..7994edb 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -534,11 +534,37 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,

debug_work_activate(work);

- /* determine gcwq to use */
+ /*
+ * Determine gcwq to use. SINGLE_CPU is inherently
+ * NON_REENTRANT, so test it first.
+ */
if (!(wq->flags & WQ_SINGLE_CPU)) {
- /* just use the requested cpu for multicpu workqueues */
+ struct global_cwq *last_gcwq;
+
+ /*
+ * It's multi cpu. If @wq is non-reentrant and @work
+ * was previously on a different cpu, it might still
+ * be running there, in which case the work needs to
+ * be queued on that cpu to guarantee non-reentrance.
+ */
gcwq = get_gcwq(cpu);
- spin_lock_irqsave(&gcwq->lock, flags);
+ if (wq->flags & WQ_NON_REENTRANT &&
+ (last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) {
+ struct worker *worker;
+
+ spin_lock_irqsave(&last_gcwq->lock, flags);
+
+ worker = find_worker_executing_work(last_gcwq, work);
+
+ if (worker && worker->current_cwq->wq == wq)
+ gcwq = last_gcwq;
+ else {
+ /* meh... not running there, queue here */
+ spin_unlock_irqrestore(&last_gcwq->lock, flags);
+ spin_lock_irqsave(&gcwq->lock, flags);
+ }
+ } else
+ spin_lock_irqsave(&gcwq->lock, flags);
} else {
unsigned int req_cpu = cpu;

--
1.6.4.2

2010-06-14 22:00:19

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCHSET] workqueue: concurrency managed workqueue, take#5

On Mon, 14 Jun 2010 23:37:17 +0200
Tejun Heo <[email protected]> wrote:

> This is the fifth take of cmwq (concurrency managed workqueue)
> patchset.

What is a concurrency managed workqueue and why do we want one?

2010-06-14 22:19:10

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] workqueue: concurrency managed workqueue, take#5

On 06/14/2010 11:58 PM, Andrew Morton wrote:
> On Mon, 14 Jun 2010 23:37:17 +0200
> Tejun Heo <[email protected]> wrote:
>
>> This is the fifth take of cmwq (concurrency managed workqueue)
>> patchset.
>
> What is a concurrency managed workqueue and why do we want one?

That goes back about 9 months. You were cc'd there too. :-)

http://thread.gmane.org/gmane.linux.kernel/896268

Thanks.

--
tejun

2010-06-14 22:31:19

by Daniel Walker

[permalink] [raw]
Subject: Re: [PATCHSET] workqueue: concurrency managed workqueue, take#5

On Tue, 2010-06-15 at 00:17 +0200, Tejun Heo wrote:
> On 06/14/2010 11:58 PM, Andrew Morton wrote:
> > On Mon, 14 Jun 2010 23:37:17 +0200
> > Tejun Heo <[email protected]> wrote:
> >
> >> This is the fifth take of cmwq (concurrency managed workqueue)
> >> patchset.
> >
> > What is a concurrency managed workqueue and why do we want one?
>
> That goes back about 9 months. You were cc'd there too. :-)
>
> http://thread.gmane.org/gmane.linux.kernel/896268

Could you explain how you view "concurrency" w.r.t this patchset?

Daniel

2010-06-14 22:34:26

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] workqueue: concurrency managed workqueue, take#5

On 06/15/2010 12:31 AM, Daniel Walker wrote:
> Could you explain how you view "concurrency" w.r.t this patchset?

Hmm... can you be a bit more specific? I don't understand what the
question is.

Thanks.

--
tejun

2010-06-14 22:35:58

by Daniel Walker

[permalink] [raw]
Subject: Re: [PATCHSET] workqueue: concurrency managed workqueue, take#5

On Tue, 2010-06-15 at 00:33 +0200, Tejun Heo wrote:
> On 06/15/2010 12:31 AM, Daniel Walker wrote:
> > Could you explain how you view "concurrency" w.r.t this patchset?
>
> Hmm... can you be a bit more specific? I don't understand what the
> question is.

What is your definition of "concurrency"? And try to explain it in
terms of your patchset.

Daniel

2010-06-14 22:36:48

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCHSET] workqueue: concurrency managed workqueue, take#5

On Tue, 15 Jun 2010 00:17:25 +0200
Tejun Heo <[email protected]> wrote:

> On 06/14/2010 11:58 PM, Andrew Morton wrote:
> > On Mon, 14 Jun 2010 23:37:17 +0200
> > Tejun Heo <[email protected]> wrote:
> >
> >> This is the fifth take of cmwq (concurrency managed workqueue)
> >> patchset.
> >
> > What is a concurrency managed workqueue and why do we want one?
>
> That goes back about 9 months. You were cc'd there too. :-)
>
> http://thread.gmane.org/gmane.linux.kernel/896268
>

Nobody's going to remember all that stuff except yourself, and the info
might be out of date. So please update and maintain that information
and retain it with the patchset.

eg: "<NEED SOME BACKING NUMBERS>". And "Please read the patch
description of the last patch for more details" is out of date.

Because right now I have a bunch of code in my inbox and little
(actually "no") idea why anyone might want to merge it into anything.

Trying to review a large patchset when you don't have an overall
picture of what it's trying to do and how it's trying to do it is
rather painful - you have to work all that stuff out from the
implementation. It's also error-prone if the implementation doesn't
implement that which the author thinks it implements (ie: if it has
design bugs).

2010-06-14 22:44:03

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] workqueue: concurrency managed workqueue, take#5

Hello,

On 06/15/2010 12:35 AM, Andrew Morton wrote:
> Nobody's going to remember all that stuff except yourself, and the info
> might be out of date. So please update and maintain that information
> and retain it with the patchset.
>
> eg: "<NEED SOME BACKING NUMBERS>". And "Please read the patch
> description of the last patch for more details" is out of date.
>
> Because right now I have a bunch of code in my inbox and little
> (actually "no") idea why anyone might want to merge it into anything.
>
> Trying to review a large patchset when you don't have an overall
> picture of what it's trying to do and how it's trying to do it is
> rather painful - you have to work all that stuff out from the
> implementation. It's also error-prone if the implementation doesn't
> implement that which the author thinks it implements (ie: if it has
> design bugs).

Well, basics of the whole thing didn't change all that much since the
first take and most people on cc list were cc'd on each take. The
biggest reason I'm still carrying the whole patchset is due to the
scheduler changes. The numbers are in the third take (which you can
follow the links to find out). Anyways, I'll write up another summary
tomorrow.

Thanks.

--
tejun

2010-06-14 22:45:20

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] workqueue: concurrency managed workqueue, take#5

On 06/15/2010 12:35 AM, Daniel Walker wrote:
> What is your definition of "concurrency"? And try to explain it in
> terms of your patchset.

How many workers are concurrently executing? What else? Eh... I'll
write up another summary tomorrow. Let's talk about it there.

Thanks.

--
tejun

2010-06-14 22:49:19

by Daniel Walker

[permalink] [raw]
Subject: Re: [PATCHSET] workqueue: concurrency managed workqueue, take#5

On Tue, 2010-06-15 at 00:44 +0200, Tejun Heo wrote:
> On 06/15/2010 12:35 AM, Daniel Walker wrote:
> > What is your definition of "concurrency"? And try to explain it in
> > terms of your patchset.
>
> How many workers are concurrently executing? What else? Eh... I'll
> write up another summary tomorrow. Let's talk about it there.

If you write up another description, I'd suggest trying not to use the
word concurrency.  That term can be used for all sorts of things, and
it doesn't convey enough detail for anyone to know what your
implementation is actually doing.

Daniel

2010-06-14 22:53:54

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] workqueue: concurrency managed workqueue, take#5

On 06/15/2010 12:49 AM, Daniel Walker wrote:
> On Tue, 2010-06-15 at 00:44 +0200, Tejun Heo wrote:
>> On 06/15/2010 12:35 AM, Daniel Walker wrote:
>>> What is your definition of "concurrency"? And try to explain it in
>>> terms of your patchset.
>>
>> How many workers are concurrently executing? What else? Eh... I'll
>> write up another summary tomorrow. Let's talk about it there.
>
> If you write up another description, I'd suggest trying not to use the
> word concurrency.  That term can be used for all sorts of things, and
> it doesn't convey enough detail for anyone to know what your
> implementation is actually doing.

So is "manage" and "level of concurrency" is often used to describe
exactly this type of property. I'll try to be clear.

Thanks.

--
tejun

2010-06-14 23:07:37

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCHSET] workqueue: concurrency managed workqueue, take#5

On Tue, 15 Jun 2010 00:43:17 +0200
Tejun Heo <[email protected]> wrote:

> Hello,
>
> On 06/15/2010 12:35 AM, Andrew Morton wrote:
> > Nobody's going to remember all that stuff except yourself, and the info
> > might be out of date. So please update and maintain that information
> > and retain it with the patchset.
> >
> > eg: "<NEED SOME BACKING NUMBERS>". And "Please read the patch
> > description of the last patch for more details" is out of date.
> >
> > Because right now I have a bunch of code in my inbox and little
> > (actually "no") idea why anyone might want to merge it into anything.
> >
> > Trying to review a large patchset when you don't have an overall
> > picture of what it's trying to do and how it's trying to do it is
> > rather painful - you have to work all that stuff out from the
> > implementation. It's also error-prone if the implementation doesn't
> > implement that which the author thinks it implements (ie: if it has
> > design bugs).
>
> Well, basics of the whole thing didn't change all that much since the
> first take and most people on cc list were cc'd on each take. The
> biggest reason I'm still carrying the whole patchset is due to the
> scheduler changes. The numbers are in the third take (which you can
> follow the links to find out). Anyways, I'll write up another summary
> tomorrow.
>

Thanks. I don't think I've looked at these patches at all since the
first version, and I'd like to. That was many many thousands of
patches ago and I don't remember anything useful at all about them.

2010-06-15 01:20:38

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCHSET] workqueue: concurrency managed workqueue, take#5

On 06/14/2010 05:37 PM, Tejun Heo wrote:
> Hello, all.
>
> This is the fifth take of cmwq (concurrency managed workqueue)
> patchset. It's on top of v2.6.35-rc3 + sched/core patches. Git tree
> is available at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-cmwq
>
> Changes from the last take[L] are...
>
> * fscache patches are omitted for now.
>
> * The patchset is rebased on cpu_stop + sched/core, which now includes
> all the necessary scheduler patches. cpu_stop already reimplements
> stop_machine so that it doesn't use RT workqueue, so this patchset
> simply drops RT wq support.
>
> * __set_cpus_allowed() was determined to be unnecessary with recent
> scheduler changes. On cpu re-onlining, cmwq now kills all idle
> workers and tells busy ones to rebind after finishing the current
> work by scheduling a dedicated work. This allows managing proper
> cpu binding without adding overhead to hotpath.
>
> * Oleg's clear work->data patch moved at the head of the queue and now
> lives in the for-next branch which will be pushed to mainline on the
> next merge window.
>
> * Applied Oleg's review.
>
> * Comments updated as suggested.
>
> * work_flags_to_color() replaced w/ get_work_color()
>
> * nr_cwqs_to_flush bug which could cause premature flush completion
> fixed.
>
> * Replace rewind + list_for_each_entry_safe_continue() w/
> list_for_each_entry_safe_from().
>
> * Don't directly write to *work_data_bits() but use __set_bit()
> instead.
>
> * Fixed cpu hotplug exclusion bug.
>
> * Other misc tweaks.
>
> Now that all scheduler bits are in place, I'll keep the tree stable
> and publish it to linux-next soonish, so this hopefully is the last of
> exhausting massive postings of this patchset.
>
> Jeff, Arjan, I think it'll be best to route the libata and async
> patches through wq tree. Would that be okay?

ACK for libata bits routing through wq tree... you know I support this
work, as libata (and the kernel, generally speaking) has needed
something like this for a long time.

2010-06-15 12:53:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCHSET] workqueue: concurrency managed workqueue, take#5

On Tue, Jun 15, 2010 at 12:43:17AM +0200, Tejun Heo wrote:
>
> Well, basics of the whole thing didn't change all that much since the
> first take and most people on cc list were cc'd on each take. The
> biggest reason I'm still carrying the whole patchset is due to the
> scheduler changes. The numbers are in the third take (which you can
> follow the links to find out). Anyways, I'll write up another summary
> tomorrow.

It really helps if patch summaries are self contained and don't
require a bunch of kernel developers who are trying to review things
to have to do research and then figure out which links are the right
ones to chase down. It's also not reasonable to expect your reviewers
to diff your patches to determine how much has changed and whether
they should expect benchmarks run from months ago to still be
applicable or not.

Many of us get literally hundreds of e-mail messages a day, and
e-mails are read with one finger hovering over the 'd' key. It
simply scales better if you don't assume that everybody else considers
the patch as important as you do, and instead assume that most people
have forgotten patches sent months ago....

Regards,

- Ted

2010-06-15 13:29:38

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 08/30] workqueue: temporarily disable workqueue tracing

On Mon, Jun 14, 2010 at 11:37:25PM +0200, Tejun Heo wrote:
> Strip tracing code from workqueue and disable workqueue tracing. This
> is temporary measure till concurrency managed workqueue is complete.
>
> Signed-off-by: Tejun Heo <[email protected]>
> ---
> kernel/trace/Kconfig | 4 +++-
> kernel/workqueue.c | 14 +++-----------
> 2 files changed, 6 insertions(+), 12 deletions(-)



If the new workqueue implementation makes the workqueue tracing useless,
then please remove it. We can think about providing a more suitable
tracing for that later, once we have determined the interesting organic
points.


I'll be happy to help.

2010-06-15 13:54:48

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 27/30] workqueue: implement DEBUGFS/workqueue

On Mon, Jun 14, 2010 at 11:37:44PM +0200, Tejun Heo wrote:
> Implement DEBUGFS/workqueue which lists all workers and works for
> debugging. Workqueues can also have ->show_work() callback which
> describes a pending or running work in custom way. If ->show_work()
> is missing or returns %false, wchan is printed.
>
> # cat /sys/kernel/debug/workqueue
>
> CPU ID PID WORK ADDR WORKQUEUE TIME DESC
> ==== ==== ===== ================ ============ ===== ============================
> 0 0 15 ffffffffa0004708 test-wq-04 1 s test_work_fn+0x469/0x690 [test_wq]
> 0 2 4146 <IDLE> 0us
> 0 1 21 <IDLE> 4 s
> 0 DELA ffffffffa00047b0 test-wq-04 1 s test work 2
> 1 1 418 <IDLE> 780ms
> 1 0 16 <IDLE> 40 s
> 1 2 443 <IDLE> 40 s
>
> Workqueue debugfs support was suggested by David Howells and the
> implementation mostly mimics that of slow-work.
>
> * Anton Blanchard spotted that ITER_* constants are overflowing w/
> high cpu configuration. This was caused by using
> roundup_pow_of_two() where order_base_2() should have been used.
> Fixed.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: David Howells <[email protected]>
> Cc: Anton Blanchard <[email protected]>
> ---
> include/linux/workqueue.h | 12 ++
> kernel/workqueue.c | 369 ++++++++++++++++++++++++++++++++++++++++++++-
> lib/Kconfig.debug | 7 +
> 3 files changed, 384 insertions(+), 4 deletions(-)



I don't like this. This adds 300 lines of ad hoc in-kernel instrumentation code while
we now have a nice kernel tracing API (trace events) coupled with easy userspace
tools to post-process that (perf trace scripting). And this is going to provide
a much more powerful view of your new workqueue implementation runtime behaviour.

We already have kernel/trace/trace_workqueue.c that has been obsoleted for these
very reasons and we are even going to remove it soon, probably for .36

Please work with us on that; if everybody makes his own corner instrumentation, we
are not going to make any progress toward powerful and unified tracing/profiling.

The first step is to pinpoint the important places that need tracepoints,
and then just write a perf trace script to use the information provided
by these tracepoints.
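
To make the suggestion concrete, here is a rough sketch of the kind of
tracepoint I have in mind; the event name and fields are only
illustrative, not taken from any existing patch:

	#undef TRACE_SYSTEM
	#define TRACE_SYSTEM workqueue

	#if !defined(_TRACE_WORKQUEUE_H) || defined(TRACE_HEADER_MULTI_READ)
	#define _TRACE_WORKQUEUE_H

	#include <linux/tracepoint.h>
	#include <linux/workqueue.h>

	/* fired right before a work function is invoked */
	TRACE_EVENT(workqueue_execute_start,

		TP_PROTO(struct work_struct *work),

		TP_ARGS(work),

		TP_STRUCT__entry(
			__field(void *,	work)
			__field(void *,	function)
		),

		TP_fast_assign(
			__entry->work		= work;
			__entry->function	= work->func;
		),

		TP_printk("work struct %p: function %pf",
			  __entry->work, __entry->function)
	);

	#endif /* _TRACE_WORKQUEUE_H */

	#include <trace/define_trace.h>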

I can help with that if needed.

2010-06-15 16:18:41

by Randy Dunlap

[permalink] [raw]
Subject: [PATCH] SubmittingPatches: add more about patch descriptions

On Tue, 15 Jun 2010 08:53:02 -0400 [email protected] wrote:

> On Tue, Jun 15, 2010 at 12:43:17AM +0200, Tejun Heo wrote:
> >
> > Well, basics of the whole thing didn't change all that much since the
> > first take and most people on cc list were cc'd on each take. The
> > biggest reason I'm still carrying the whole patchset is due to the
> > scheduler changes. The numbers are in the third take (which you can
> > follow the links to find out). Anyways, I'll write up another summary
> > tomorrow.
>
> It really helps if patch summaries are self contained and don't
> require a bunch of kernel developers who are trying to review things
> to have to do research and then figure out which links are the right
> ones to chase down. It's also not reasonable to expect your reviewers
> to diff your patches to determine how much has changed and whether
> they should expect benchmarks run from months ago to still be
> applicable or not.
>
> Many of us get literally hundreds of e-mail messages a day, and
> e-mails are read with one finger hovering over the 'd' key. It
> simply scales better if you don't assume that everybody else considers
> the patch as important as you do, and instead assume that most people
> have forgotten patches sent months ago....

Ack that.

Does this help? anything need to be added to it?

---
From: Randy Dunlap <[email protected]>

Add more information about patch descriptions.

Signed-off-by: Randy Dunlap <[email protected]>
---
Documentation/SubmittingPatches | 11 +++++++++++
1 file changed, 11 insertions(+)

--- lnx-2635-rc3.orig/Documentation/SubmittingPatches
+++ lnx-2635-rc3/Documentation/SubmittingPatches
@@ -98,6 +98,17 @@ system, git, as a "commit log". See #15
If your description starts to get long, that's a sign that you probably
need to split up your patch. See #3, next.

+When you submit or resubmit a patch or patch series, include the
+complete patch description and justification for it. Don't just
+say that this is version N of the patch (series). Don't expect the
+patch merger to refer back to earlier patch versions or referenced
+URLs to find the patch description and put that into the patch.
+I.e., the patch (series) and its description should be self-contained.
+This benefits both the patch merger(s) and reviewers. Some reviewers
+probably didn't even receive earlier versions of the patch.
+
+If the patch fixes a logged bug entry, refer to that bug entry by
+number and URL.


3) Separate your changes.

2010-06-15 16:38:25

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 08/30] workqueue: temporarily disable workqueue tracing

On 06/15/2010 03:29 PM, Frederic Weisbecker wrote:
> On Mon, Jun 14, 2010 at 11:37:25PM +0200, Tejun Heo wrote:
>> Strip tracing code from workqueue and disable workqueue tracing. This
>> is temporary measure till concurrency managed workqueue is complete.
>>
>> Signed-off-by: Tejun Heo <[email protected]>
>> ---
>> kernel/trace/Kconfig | 4 +++-
>> kernel/workqueue.c | 14 +++-----------
>> 2 files changed, 6 insertions(+), 12 deletions(-)
>
> If the new workqueue implementation makes the workqueue tracing useless,
> then please remove it. We can think about providing a more suitable
> tracing for that later, once we have determined the interesting organic
> points.
>
> I'll be happy to help.

Cool, yeah, will rip it out.

Thanks.

--
tejun

2010-06-15 16:39:20

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] SubmittingPatches: add more about patch descriptions

On Tue, 15 Jun 2010, Randy Dunlap wrote:

> Does this help? anything need to be added to it?

Good.

Reviewed-by: Christoph Lameter <[email protected]>

2010-06-15 16:43:38

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 27/30] workqueue: implement DEBUGFS/workqueue

Hello,

On 06/15/2010 03:54 PM, Frederic Weisbecker wrote:
> I don't like this. This adds 300 lines of ad hoc in-kernel
> instrumentation code while we now have a nice kernel tracing API
> (trace events) coupled with easy userspace tools to post-process
> that (perf trace scripting). And this is going to provide a much
> more powerful view of your new workqueue implementation runtime
> behaviour.
>
> We already have kernel/trace/trace_workqueue.c that has been
> obsoleted for these very reasons and we are even going to remove it
> soon, probably for .36
>
> Please work with us for that, if everybody makes his own corner
> instrumentation, we are not going to make any progress in having a
> powerful and unified tracing/profiling.
>
> The first step is to pinpoint the important places that need
> tracepoints, and then just write a perf trace script to use the
> information provided by these tracepoints.
>
> I can help with that if needed.

Yeah, I agree that tracing would be a better way to do it. This patch was
added because slow-work had a similar facility and David was unhappy
about losing an easy way to monitor things if cmwq replaces slow-work. I'll be
happy to drop this one. David, what do you think?

Thanks.

--
tejun

2010-06-15 18:26:25

by Tejun Heo

[permalink] [raw]
Subject: Overview of concurrency managed workqueue

Hello, all.

So, here's the overview I wrote up today. If anything needs more
clarification, just ask. Thanks.

== Overview

There are many cases where an execution context is needed and there
already are several mechanisms for them. The most commonly used one
is workqueue and there are slow_work, async and a few others. Although
workqueue has been serving the kernel for quite some time now, it has
some limitations.

There are two types of workqueues, single and multi threaded. MT wq
keeps a bound thread for each online CPU, while ST wq uses a single
unbound thread. With the quickly rising number of CPU cores, there
already are systems in which just booting up saturates the default 32k
PID space.

Frustratingly, although MT wqs end up spending a lot of resources, the
level of concurrency provided is unsatisfactory. The concurrency
limitation is common to both ST and MT wqs although it's less severe
on MT ones. Worker pools of wqs are completely separate from each
other. A MT wq provides one execution context per CPU while a ST wq
provides one for the whole system. This leads to various problems.

One such problem is possible deadlock through dependency on the same
execution resource. These can be detected quite reliably with lockdep
these days but in most cases the only solution is to create a
dedicated wq for one of the parties involved in the deadlock, which
feeds back into the waste of resources. Also, when creating such a
dedicated wq to avoid deadlock, ST wqs are often used to avoid wasting
a large number of threads just for that work, but in most cases ST wqs
are suboptimal compared to MT wqs.

The tension between the provided level of concurrency and resource
usage forces its users to make unnecessary tradeoffs like libata
choosing to use ST wq for polling PIOs and accepting a silly
limitation that no two polling PIOs can be in progress at the same
time. As MT wqs don't provide much better concurrency, users which
require a higher level of concurrency, like async or fscache, end up
having to implement their own worker pool.

cmwq extends workqueue with focus on the following goals.

* Workqueue is already very widely used. Maintain compatibility with
the current API while removing limitations of the current
implementation.

* Provide single unified worker pool per cpu which can be shared by
all users. The worker pool and level of concurrency should be
regulated automatically so that the API users don't need to worry
about that.

* Use what's necessary and allocate resources lazily on demand while
still maintaining forward progress guarantee where necessary.


== Unified worklist

There's a single global cwq, or gcwq, per each possible cpu which
actually serves out the execution contexts. cpu_workqueues or cwqs of
each wq are mostly simple frontends to the associated gcwq. Under
normal operation, when a work is queued, it's queued to the gcwq on
the same cpu. Each gcwq has its own pool of workers bound to the gcwq
which will be used to process all the works queued on the cpu. For
the most part, works don't care which wq they're queued to, and
using a unified worklist is pretty straightforward. There are a
couple of areas where things are a bit more complicated.

First, when queueing works from different wqs on the same queue,
ordering of works needs special care. Originally, a MT wq allows a
work to be executed simultaneously on multiple cpus although it
doesn't allow the same one to execute simultaneously on the same cpu
(reentrant). A ST wq allows only a single work to be executed on any
cpu which guarantees both non-reentrancy and single-threadedness.

cmwq provides three different ordering modes - reentrant (default),
non-reentrant and single-cpu, where single-cpu can be used to achieve
single-threadedness and full ordering combined with in-flight work
limit of 1. The default mode is basically the same as the original
implementation. The distinction between non-reentrancy and single-cpu
was made because some ST wq users didn't really need single
threadedness but just non-reentrancy.
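
As a usage illustration only (queue_work() and DECLARE_WORK() are the
existing interfaces; system_wq and system_nrt_wq are the default and
non-reentrant system wq as named in this patchset, and single-cpu
behavior is selected with WQ_SINGLE_CPU at wq creation time):

	static void my_work_fn(struct work_struct *work);
	static DECLARE_WORK(my_work, my_work_fn);

	/* reentrant (default): if requeued while running, it may run
	 * concurrently with itself on another cpu */
	queue_work(system_wq, &my_work);

	/* non-reentrant: never runs concurrently with itself, on any cpu */
	queue_work(system_nrt_wq, &my_work);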

Another area where things get more involved is workqueue flushing, as
for flushing it matters to which wq a work is queued. cmwq tracks this
using colors. When a work is queued to a cwq, it's assigned a color
and each cwq maintains counters for each work color. The color
assignment changes on each wq flush attempt. A cwq can tell that all
works queued before a certain wq flush attempt have finished by
waiting for all the colors up to that point to drain. This maintains
the original workqueue flush semantics without adding unscalable
overhead.
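
In pseudo-C, the per-cwq bookkeeping amounts to something like the
following (field and constant names here are illustrative, not the ones
used in the actual patches):

	#include <linux/types.h>

	#define NR_COLORS	16	/* illustrative */

	struct cwq_sketch {
		unsigned int	work_color;		/* color given to new works */
		unsigned int	flush_color;		/* color the flusher waits on */
		int		nr_in_flight[NR_COLORS];
	};

	/* on queueing, the work carries the cwq's current color */
	static void sketch_queue(struct cwq_sketch *cwq)
	{
		cwq->nr_in_flight[cwq->work_color]++;
	}

	/* on completion; true when this cwq has drained the flushed color */
	static bool sketch_complete(struct cwq_sketch *cwq, unsigned int color)
	{
		return --cwq->nr_in_flight[color] == 0 && color == cwq->flush_color;
	}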


== Automatically regulated shared worker pool

For any worker pool, managing the concurrency level (how many workers
are executing simultaneously) is an important issue. cmwq tries to
keep the concurrency at a minimal but sufficient level.

Concurrency management is implemented by hooking into the scheduler.
gcwq is notified whenever a busy worker wakes up or sleeps and thus
can keep track of the current level of concurrency. Works aren't
supposed to be cpu cycle hogs and maintaining just enough concurrency
to prevent work processing from stalling due to lack of processing
context should be optimal. gcwq keeps the number of concurrent active
workers to minimum but no less. As long as there's one or more
running workers on the cpu, no new worker is scheduled so that works
can be processed in batch as much as possible but when the last
running worker blocks, gcwq immediately schedules new worker so that
the cpu doesn't sit idle while there are works to be processed.
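
In sketch form, the scheduler-side hooks boil down to roughly this (all
names here are illustrative; the real patches hook the scheduler's
wakeup/sleep notifications rather than open-coding it like this):

	#include <linux/list.h>

	struct gcwq_sketch {
		int			nr_running;	/* workers not currently sleeping */
		struct list_head	worklist;	/* pending works on this cpu */
	};

	/* assumed helper which picks an idle worker and wakes it up */
	static void wake_up_idle_worker(struct gcwq_sketch *gcwq);

	/* called when a busy worker is about to sleep */
	static void sketch_worker_sleeping(struct gcwq_sketch *gcwq)
	{
		if (--gcwq->nr_running == 0 && !list_empty(&gcwq->worklist))
			wake_up_idle_worker(gcwq);	/* don't let the cpu sit idle */
	}

	/* called when a sleeping busy worker wakes up again */
	static void sketch_worker_waking_up(struct gcwq_sketch *gcwq)
	{
		gcwq->nr_running++;
	}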

This allows using a minimal number of workers without losing execution
bandwidth. Keeping idle workers around doesn't cost much other than
the memory space, so cmwq holds onto idle ones for a while before
killing them.

As multiple execution contexts are available for each wq, deadlocks
around execution contexts are much harder to create. The default
workqueue, system_wq, has a maximum concurrency level of 256 and unless
there is a use case which can result in a dependency loop involving
more than 254 workers, it won't deadlock.

Such forward progress guarantee relies on the fact that workers can be
created when more execution contexts are necessary. This is guaranteed by
using emergency workers. All wqs which can be used in allocation path
are required to have emergency workers which are reserved for
execution of that specific workqueue so that allocation needed for
worker creation doesn't deadlock on workers.


== Benefits

* Less to worry about causing deadlocks around execution resources.

* Far fewer number of kthreads.

* More flexibility without runtime overhead.

* As concurrency is no longer a problem, workloads which needed
separate mechanisms can now use generic workqueue instead. This
easy access to concurrency also allows stuff which wasn't worth
implementing a dedicated mechanism for but still needed flexible
concurrency.


== Numbers (this is with the third take but nothing which could affect
performance has changed since then. Eh well, very little has
changed since then in fact.)

wq workload is generated by the perf-wq.c module, which is a very simple
synthetic wq load generator (I'll attach it to this message). A work
is described by five parameters - burn_usecs, mean_sleep_msecs,
mean_resched_msecs and factor. It randomly splits burn_usecs into
two, burns the first part, sleeps for 0 - 2 * mean_sleep_msecs, burns
what's left of burn_usecs and then reschedules itself in 0 - 2 *
mean_resched_msecs. factor is used to tune the number of cycles to
match execution duration.
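
(Since the module is attached rather than inlined, here is roughly what
each work does, in sketch form with made-up names; the exact splitting
and randomization in perf-wq.c may differ:)

	#include <linux/workqueue.h>
	#include <linux/delay.h>
	#include <linux/random.h>
	#include <linux/jiffies.h>

	struct sketch_work {
		struct delayed_work	dwork;
		struct workqueue_struct	*wq;
		unsigned int		burn_usecs;
		unsigned int		mean_sleep_msecs;
		unsigned int		mean_resched_msecs;
	};

	static void sketch_work_fn(struct work_struct *work)
	{
		struct sketch_work *sw =
			container_of(to_delayed_work(work), struct sketch_work, dwork);
		unsigned int burn1 = random32() % (sw->burn_usecs + 1);

		udelay(burn1);					/* burn part of burn_usecs */
		msleep(random32() % (2 * sw->mean_sleep_msecs + 1));	/* 0 - 2 * mean_sleep */
		udelay(sw->burn_usecs - burn1);			/* burn what's left */

		/* reschedule itself in 0 - 2 * mean_resched_msecs */
		queue_delayed_work(sw->wq, &sw->dwork,
			msecs_to_jiffies(random32() % (2 * sw->mean_resched_msecs + 1)));
	}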

It issues three types of works - short, medium and long, each with two
burn durations L and S.

          burn/L(us)  burn/S(us)  mean_sleep(ms)  mean_resched(ms)  cycles
 short            50           1               1               10      454
 medium           50           2              10               50      125
 long             50           4             100              250       42

And then these works are put into the following workloads. The lower
numbered workloads have more short/medium works.

workload 0
* 12 wqs with 4 short works
* 2 wqs with 2 short and 2 medium works
* 4 wqs with 2 medium and 1 long works
* 8 wqs with 1 long work

workload 1
* 8 wqs with 4 short works
* 2 wqs with 2 short and 2 medium works
* 4 wqs with 2 medium and 1 long works
* 8 wqs with 1 long work

workload 2
* 4 wqs with 4 short works
* 2 wqs with 2 short and 2 medium works
* 4 wqs with 2 medium and 1 long works
* 8 wqs with 1 long work

workload 3
* 2 wqs with 4 short works
* 2 wqs with 2 short and 2 medium works
* 4 wqs with 2 medium and 1 long works
* 8 wqs with 1 long work

workload 4
* 2 wqs with 4 short works
* 2 wqs with 2 medium works
* 4 wqs with 2 medium and 1 long works
* 8 wqs with 1 long work

workload 5
* 2 wqs with 2 medium works
* 4 wqs with 2 medium and 1 long works
* 8 wqs with 1 long work

The above wq loads are run in parallel with mencoder converting a 76M
mjpeg file into mpeg4, which takes 25.59 seconds with a standard
deviation of 0.19 without wq loading. The CPU was an Intel Netburst
Celeron running at 2.66GHz (chosen for its small cache size and
slowness). wl0 and 1 are only tested for burn/S. Each test case was
run 11 times and the first run was discarded.

        vanilla/L      cmwq/L         vanilla/S      cmwq/S
 wl0                                   26.18 d0.24    26.27 d0.29
 wl1                                   26.50 d0.45    26.52 d0.23
 wl2    26.62 d0.35    26.53 d0.23     26.14 d0.22    26.12 d0.32
 wl3    26.30 d0.25    26.29 d0.26     25.94 d0.25    26.17 d0.30
 wl4    26.26 d0.23    25.93 d0.24     25.90 d0.23    25.91 d0.29
 wl5    25.81 d0.33    25.88 d0.25     25.63 d0.27    25.59 d0.26

There is no significant difference between the two. Maybe the code
overhead and benefits coming from context sharing are canceling each
other nicely. With longer burns, cmwq looks better but it's nothing
significant. With shorter burns, other than wl3 spiking up for
vanilla which probably would go away if the test is repeated, the two
are performing virtually identically.

The above is an exaggerated synthetic test result and the performance
difference will be even less noticeable in either direction under
realistic workloads.

cmwq extends workqueue such that it can serve as a robust async
mechanism which can be used (mostly) universally without introducing
any noticeable performance degradation.

Thanks.

--
tejun


Attachments:
perf-wq.c (6.00 kB)

2010-06-15 18:29:57

by Stefan Richter

[permalink] [raw]
Subject: Re: [PATCHSET] workqueue: concurrency managed workqueue, take#5

Tejun Heo wrote:
> This is the fifth take of cmwq (concurrency managed workqueue)
> patchset. It's on top of v2.6.35-rc3 + sched/core patches. Git tree
> is available at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-cmwq

A comment and a question:

As a driver maintainer, I would find it helpful if the WQ_flags in
include/linux/workqueue.h and/or __create_workqueue_key() in
kernel/workqueue.c (or its wrappers in include/linux/workqueue.h) were
better documented.

How about the global workqueue, i.e. schedule_work() and friends? At
your current review-cmwq head, they use system_wq, not system_nrt_wq.
But doesn't the present global workqueue have WQ_NON_REENTRANT
semantics? In fact, don't _all_ workqueues have WQ_NON_REENTRANT
semantics presently? If so, a good deal of existing users probably
relies on non-reentrant behaviour. Or am I thoroughly misunderstanding
the meaning of WQ_NON_REENTRANT?

(Sorry if this had been discussed before; I followed the discussions of
some of your previous submissions but not all. And PS, I am eagerly
waiting for this to go into the mainline.)
--
Stefan Richter
-=====-==-=- -==- -====
http://arcgraph.de/sr/

2010-06-15 18:41:54

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] workqueue: concurrency managed workqueue, take#5

Hello,

On 06/15/2010 08:29 PM, Stefan Richter wrote:
> Tejun Heo wrote:
>> This is the fifth take of cmwq (concurrency managed workqueue)
>> patchset. It's on top of v2.6.35-rc3 + sched/core patches. Git tree
>> is available at
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-cmwq
>
> A comment and a question:
>
> As a driver maintainer, I would find it helpful if the WQ_flags in
> include/linux/workqueue.h and/or __create_workqueue_key() in
> kernel/workqueue.c (or its wrappers in include/linux/workqueue.h) were
> better documented.

Sure, it can definitely be improved.

> How about the global workqueue, i.e. schedule_work() and friends? At
> your current review-cmwq head, they use system_wq, not system_nrt_wq.
> But doesn't the present global workqueue have WQ_NON_REENTRANT
> semantics? In fact, don't _all_ workqueues have WQ_NON_REENTRANT
> semantics presently? If so, a good deal of existing users probably
> relies on non-reentrant behaviour. Or am I thoroughly misunderstanding
> the meaning of WQ_NON_REENTRANT?

Yeah, it's a bit confusing. :-( The current workqueue semantics is
non-reentrant on the same cpu but reentrant on different cpus.
WQ_NON_REENTRANT is non-reentrant regardless of cpu, so it's a stronger
guarantee than before. To summarize,

current MT == !WQ_NON_REENTRANT < WQ_NON_REENTRANT <
WQ_SINGLE_CPU < current ST == WQ_SINGLE_CPU + max in_flight of 1.
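
In creation terms, a sketch (assuming a creation wrapper along the lines
of __create_workqueue(name, flags, max_active); the actual wrapper and
its arguments in the patchset may differ):

	struct workqueue_struct *wq;

	/* non-reentrant anywhere, still served by the shared per-cpu pool;
	 * 0 stands for the default in-flight limit in this sketch */
	wq = __create_workqueue("mydrv", WQ_NON_REENTRANT, 0);

	/* like the old ST wq: single cpu + max 1 in flight => fully ordered */
	wq = __create_workqueue("mydrv-ordered", WQ_SINGLE_CPU, 1);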

> (Sorry if this had been discussed before; I followed the discussions of
> some of your previous submissions but not all. And PS, I am eagerly
> awaiting for this to go into the mainline.)

Ah, yeah, after ten months, I'm pretty eager too. :-)

Thanks.

--
tejun

2010-06-15 18:45:47

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hello,

On 06/15/2010 08:40 PM, Christoph Lameter wrote:
> On Tue, 15 Jun 2010, Tejun Heo wrote:
>
>> == Benefits
>>
>> * Less to worry about causing deadlocks around execution resources.
>>
>> * Far fewer number of kthreads.
>>
>> * More flexibility without runtime overhead.
>>
>> * As concurrency is no longer a problem, workloads which needed
>> separate mechanisms can now use generic workqueue instead. This
>> easy access to concurrency also allows stuff which wasn't worth
>> implementing a dedicated mechanism for but still needed flexible
>> concurrency.
>
> Start the whole thing with the above? Otherwise people get tired of reading
> before finding out what the point of the exercise is?

Yeah, maybe that would have been better. I was going for a nice
closing, but who cares about the closing if the opening is boring. I'll
reorder on the next round.

Thanks.

--
tejun

2010-06-15 18:46:26

by Christoph Lameter

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Tue, 15 Jun 2010, Tejun Heo wrote:

> == Benefits
>
> * Less to worry about causing deadlocks around execution resources.
>
> * Far fewer number of kthreads.
>
> * More flexibility without runtime overhead.
>
> * As concurrency is no longer a problem, workloads which needed
> separate mechanisms can now use generic workqueue instead. This
> easy access to concurrency also allows stuff which wasn't worth
> implementing a dedicated mechanism for but still needed flexible
> concurrency.

Start the whole thing with the above? Otherwise people get tired of reading
before finding out what the point of the exercise is?


Attachments:
perf-wq.c (6.25 kB)

2010-06-15 18:46:36

by Stefan Richter

[permalink] [raw]
Subject: Re: [PATCHSET] workqueue: concurrency managed workqueue, take#5

Andrew Morton wrote:
> On Mon, 14 Jun 2010 23:37:17 +0200
> Tejun Heo <[email protected]> wrote:
>
>> This is the fifth take of cmwq (concurrency managed workqueue)
>> patchset.
>
> What is a concurrency managed workqueue and why do we want one?

From what I understood, this is about the following:

- Right now, a workqueue is backed by either 1 or by #_of_CPUs
kernel threads. There is no other option.

- To avoid creating half a million kernel threads, driver authors
resort to either
- using the globally shared workqueue even if they might queue
high-latency work in corner cases,
or
- creating a single-threaded workqueue even if they put unrelated
jobs into that queue that should better be executed in
parallel, not serially.
(I for one have both cases in drivers/firewire/, and I have similar
issues in the old drivers/ieee1394/.)

The cmwq patch series reforms workqueues to be backed by a global thread
pool. Hence:

+ Driver authors can and should simply register one queue for any one
purpose now. They don't need to worry anymore about having too many
or too few backing threads.

+ [A side effect: In some cases, a driver that currently uses a
thread pool can be simplified by migrating to the workqueue API.]

Tejun, please correct me if I misunderstood.
--
Stefan Richter
-=====-==-=- -==- -====
http://arcgraph.de/sr/

2010-06-15 19:40:45

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] workqueue: concurrency managed workqueue, take#5

Hello,

On 06/15/2010 08:15 PM, Stefan Richter wrote:
> From what I understood, this is about the following:
>
> - Right now, a workqueue is backed by either 1 or by #_of_CPUs
> kernel threads. There is no other option.
>
> - To avoid creating half a million of kernel threads, driver authors
> resort to either
> - using the globally shared workqueue even if they might queue
> high-latency work in corner cases,
> or
> - creating a single-threaded workqueue even if they put unrelated
> jobs into that queue that should better be executed in
> parallel, not serially.
> (I for one have both cases in drivers/firewire/, and I have similar
> issues in the old drivers/ieee1394/.)
>
> The cmwq patch series reforms workqueues to be backed by a global thread
> pool. Hence:
>
> + Driver authors can and should simply register one queue for any one
> purpose now. They don't need to worry anymore about having too many
> or too few backing threads.

Wq now serves more as a flushing and max-inflight controlling domain,
so unless it needs to flush the workqueue (as opposed to each work) or
throttle max-inflight or might be used in the allocation path (in
which case an emergency worker should also be used), the default wq
should work fine too.

> + [A side effect: In some cases, a driver that currently uses a
> thread pool can be simplified by migrating to the workqueue API.]
>
> Tejun, please correct me if I misunderstood.

Yeap, from driver's POV, mostly precise. The reason I started this
whole thing is that I was trying to implement in-kernel media presence
polling (mostly for cdroms but may also be useful for polling other stuff
for other types of devices) and I got immediately stuck on how to
manage concurrency.

I can create a single kthread per drive which should work fine in most
cases but there are configurations with a lot of devices and it's not
only wasteful but might actually cause scalability issues. For most
common cases, ST or MT wq could be enough but then again when
something gets stuck (unfortunately somewhat common with cheap
drives), the whole thing will get stuck. So, I was thinking about
creating a worker pool for it and managing concurrency, which felt
very silly. I just needed some context to host those pollings on
demand and this is not something I should be worrying about when I'm
trying to implement media presence polling.

I think there are many similar situations for drivers. I already
wrote about libata but it's just silly to worry about how to manage
execution contexts for polling PIO, EH and hotplug from individual
drivers, and drivers often have to make suboptimal choices because it's
not worth solving fully at that layer. So, cmwq tries to provide an
easy way to get hold of execution contexts on demand.

Thanks.

--
tejun

2010-06-15 19:44:06

by Daniel Walker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Tue, 2010-06-15 at 20:25 +0200, Tejun Heo wrote:
> Hello, all.
>
> So, here's the overview I wrote up today. If anything needs more
> clarification, just ask. Thanks.
>
> == Overview
>
> There are many cases where an execution context is needed and there
> already are several mechanisms for them. The most commonly used one
> is workqueue and there are slow_work, async and a few other. Although
> workqueue has been serving the kernel for quite some time now, it has
> some limitations.

I noticed that you removed the RT workqueue since it's no longer used,
but it's possible that a user can raise the priority of a given work
queue thread into real time priorities. So with single threaded, and
multithreaded workqueues specific to certain areas of the kernel the
user would have a greater ability to control priorities of those areas.

It looks like with your patches it would remove that level of
flexibility, effectively making all the work items the same priority with
no ability to raise or lower them .. Is that accurate?

btw, Thanks for the write up.

Daniel

2010-06-15 20:29:56

by Stefan Richter

[permalink] [raw]
Subject: Re: [PATCHSET] workqueue: concurrency managed workqueue, take#5

I wrote:
> As a driver maintainer, I would find it helpful if the WQ_flags in
> include/linux/workqueue.h and/or __create_workqueue_key() in
> kernel/workqueue.c (or its wrappers in include/linux/workqueue.h) were
> better documented.

On second thought, and having read your other posts of today, I take
this mostly back. It seems that a great deal of what driver writers
need is provided by one of your three system workqueues, and these are
IMO well documented. Thanks, Tejun.
--
Stefan Richter
-=====-==-=- -==- -====
http://arcgraph.de/sr/

2010-06-16 06:56:23

by Florian Mickler

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hi!

On Tue, 15 Jun 2010 20:25:28 +0200 Tejun Heo <[email protected]> wrote:

> Hello, all.
>
> So, here's the overview I wrote up today. If anything needs more
> clarification, just ask. Thanks.

Nice writeup! I think it is sufficient already and I probably wouldn't
bother, but here are a few little comments if you want to polish it up...

Also, feel free to ignore :)

As a general rule, every abbreviation should be written out at least
once, and if you are going to abbreviate it from then on, the
abbreviation goes in parentheses after that. That helps the reader a
lot.

For example:

>
> == Overview
>
> There are many cases where an execution context is needed and there
> already are several mechanisms for them. The most commonly used one
> is workqueue and there are slow_work, async and a few other. Although

The most commonly used one is workqueue (wq) and there are ...

> workqueue has been serving the kernel for quite some time now, it has
> some limitations.

here you can then already use "wq". That makes it shorter, and if you
use it consistently the reader doesn't wonder if wq and workqueue are
different things.

>
> There are two types of workqueues, single and multi threaded. MT wq

... multi threaded (MT). MT wq keeps a bound ...

> keeps a bound thread for each online CPU, while ST wq uses single

... while single threaded (ST) wq uses single ...


> unbound thread. With the quickly rising number of CPU cores, there
> already are systems in which just booting up saturates the default 32k
> PID space.

CPU and PID are well defined in the kernel, so no need to explain these.

>
> Frustratingly, although MT wqs end up spending a lot of resources, the
> level of concurrency provided is unsatisfactory. The concurrency
> limitation is common to both ST and MT wqs although it's less severe

I don't know what the English rules are for plurals of abbreviated words. But
I would probably just drop the plural s and let the reader add it when
he decodes the abbreviation. (ie replace wqs with wq) Or introduce it
properly: "... workqueues (wqs) ... ", Or don't abbreviate it in the
plural.

> on MT ones. Worker pools of wqs are completely separate from each
> other. A MT wq provides one execution context per CPU while a ST wq
> one for the whole system. This leads to various problems.
>
> One such problem is possible deadlock through dependency on the same
> execution resource. These can be detected quite reliably with lockdep
> these days but in most cases the only solution is to create a
> dedicated wq for one of the parties involved in the deadlock, which
> feeds back into the waste of resources. Also, when creating such
> dedicated wq to avoid deadlock, to avoid wasting large number of
> threads just for that work, ST wqs are often used but in most cases ST
> wqs are suboptimal compared to MT wqs.
>
> The tension between the provided level of concurrency and resource
> usage force its users to make unnecessary tradeoffs like libata
> choosing to use ST wq for polling PIOs and accepting a silly
> limitation that no two polling PIOs can be in progress at the same
> time. As MT wqs don't provide much better concurrency, users which
> require higher level of concurrency, like async or fscache, end up
> having to implement their own worker pool.
>
> cmwq extends workqueue with focus on the following goals.

first mentioning of cmwq as an abbreviation is not nice for the reader.
Better:
Concurrency managed wq (cmwq) ... goals:
Concurrency managed workqueue (cmwq) ... goals:



>
> * Workqueue is already very widely used. Maintain compatibility with
> the current API while removing limitations of the current
> implementation.

* Because the current wq implementation is already very widely used we
maintain compatibility with the API while removing above
mentioned limitations.

>
> * Provide single unified worker pool per cpu which can be shared by
> all users. The worker pool and level of concurrency should be
> regulated automatically so that the API users don't need to worry
> about that.
>
> * Use what's necessary and allocate resources lazily on demand while
> still maintaining forward progress guarantee where necessary.
>
>
> == Unified worklist
>
> There's a single global cwq, or gcwq, per each possible cpu which

... global cwq (gcwq) per each possible cpu

> actually serves out the execution contexts. cpu_workqueues or cwqs of

cpu_workqueues (cwqs)

> each wq are mostly simple frontends to the associated gcwq. Under
> normal operation, when a work is queued, it's queued to the gcwq on
> the same cpu. Each gcwq has its own pool of workers bound to the gcwq
> which will be used to process all the works queued on the cpu. For
> the most part, works don't care to which wqs they're queued to and
> using a unified worklist is pretty straight forward. There are a
> couple of areas where things are a bit more complicated.
>
> First, when queueing works from different wqs on the same queue,
> ordering of works needs special care. Originally, a MT wq allows a
> work to be executed simultaneously on multiple cpus although it
> doesn't allow the same one to execute simultaneously on the same cpu
> (reentrant). A ST wq allows only single work to be executed on any
> cpu which guarantees both non-reentrancy and single-threadedness.
>
> cmwq provides three different ordering modes - reentrant (default),

... (default mode)...

> non-reentrant and single-cpu, where single-cpu can be used to achieve
> single-threadedness and full ordering combined with in-flight work
> limit of 1. The default mode is basically the same as the original

The default mode (reentrant) is basically...

> implementation. The distinction between non-reentrancy and single-cpu
> were made because some ST wq users didn't really need single
> threadedness but just non-reentrancy.


> Another area where things get more involved is workqueue flushing as
> for flushing to which wq a work is queued matters. cmwq tracks this
> using colors. When a work is queued to a cwq, it's assigned a color
> and each cwq maintains counters for each work color. The color
> assignment changes on each wq flush attempt. A cwq can tell that all
> works queued before a certain wq flush attempt have finished by
> waiting for all the colors upto that point to drain. This maintains
> the original workqueue flush semantics without adding unscalable
> overhead.

[nice solution, btw]

>
>
> == Automatically regulated shared worker pool
>
> For any worker pool, managing the concurrency level (how many workers
> are executing simultaneously) is an important issue. cmwq tries to
> keep the concurrency at minimum but sufficient level.
>
> Concurrency management is implemented by hooking into the scheduler.
> gcwq is notified whenever a busy worker wakes up or sleeps and thus

There is only one gcwq?
Then maybe better:

_The_ gcwq is notified...

> can keep track of the current level of concurrency. Works aren't
> supposed to be cpu cycle hogs and maintaining just enough concurrency
> to prevent work processing from stalling due to lack of processing
> context should be optimal. gcwq keeps the number of concurrent active
> workers to minimum but no less.

also:
... The gcwq keeps the number of concurrent ...

> As long as there's one or more
> running workers on the cpu, no new worker is scheduled so that works
> can be processed in batch as much as possible but when the last
> running worker blocks, gcwq immediately schedules new worker so that
> the cpu doesn't sit idle while there are works to be processed.

here too: ..., the gcwq immediately schedules ...

> This allows using minimal number of workers without losing execution
> bandwidth. Keeping idle workers around doesn't cost much other than
> the memory space, so cmwq holds onto idle ones for a while before
> killing them.
>
> As multiple execution contexts are available for each wq, deadlocks
> around execution contexts is much harder to create. The default
> workqueue, system_wq, has maximum concurrency level of 256 and unless
> there is a use case which can result in a dependency loop involving
> more than 254 workers, it won't deadlock.
>
> Such forward progress guarantee relies on that workers can be created
> when more execution contexts are necessary. This is guaranteed by
> using emergency workers. All wqs which can be used in allocation path
> are required to have emergency workers which are reserved for
> execution of that specific workqueue so that allocation needed for
> worker creation doesn't deadlock on workers.
>
>

> == Benefits
>
> * Less to worry about causing deadlocks around execution resources.
>
> * Far fewer number of kthreads.
>
> * More flexibility without runtime overhead.
>
> * As concurrency is no longer a problem, workloads which needed
> separate mechanisms can now use generic workqueue instead. This
> easy access to concurrency also allows stuff which wasn't worth
> implementing a dedicated mechanism for but still needed flexible
> concurrency.

* improved latency for current schedule_work() users, i.e. the work
gets executed in a more timely fashion?




> == Numbers (this is with the third take but nothing which could affect
> performance has changed since then. Eh well, very little has
> changed since then in fact.)
>
> wq workload is generated by perf-wq.c module which is a very simple
> synthetic wq load generator (I'll attach it to this message). A work
> is described by five parameters - burn_usecs, mean_sleep_msecs,
> mean_resched_msecs and factor. It randomly splits burn_usecs into
> two, burns the first part, sleeps for 0 - 2 * mean_sleep_msecs, burns
> what's left of burn_usecs and then reschedules itself in 0 - 2 *
> mean_resched_msecs. factor is used to tune the number of cycles to
> match execution duration.
>
> It issues three types of works - short, medium and long, each with two
> burn durations L and S.
>
> burn/L(us) burn/S(us) mean_sleep(ms) mean_resched(ms) cycles
> short 50 1 1 10 454
> medium 50 2 10 50 125
> long 50 4 100 250 42
>
> And then these works are put into the following workloads. The lower
> numbered workloads have more short/medium works.
>
> workload 0
> * 12 wqs with 4 short works
> * 2 wqs with 2 short and 2 medium works
> * 4 wqs with 2 medium and 1 long works
> * 8 wqs with 1 long work
>
> workload 1
> * 8 wqs with 4 short works
> * 2 wqs with 2 short and 2 medium works
> * 4 wqs with 2 medium and 1 long works
> * 8 wqs with 1 long work
>
> workload 2
> * 4 wqs with 4 short works
> * 2 wqs with 2 short and 2 medium works
> * 4 wqs with 2 medium and 1 long works
> * 8 wqs with 1 long work
>
> workload 3
> * 2 wqs with 4 short works
> * 2 wqs with 2 short and 2 medium works
> * 4 wqs with 2 medium and 1 long works
> * 8 wqs with 1 long work
>
> workload 4
> * 2 wqs with 4 short works
> * 2 wqs with 2 medium works
> * 4 wqs with 2 medium and 1 long works
> * 8 wqs with 1 long work
>
> workload 5
> * 2 wqs with 2 medium works
> * 4 wqs with 2 medium and 1 long works
> * 8 wqs with 1 long work
>
> The above wq loads are run in parallel with mencoder converting 76M
> mjpeg file into mpeg4 which takes 25.59 seconds with standard
> deviation of 0.19 without wq loading. The CPU was intel netburst
> celeron running at 2.66GHz (chosen for its small cache size and
> slowness). wl0 and 1 are only tested for burn/S. Each test case was
> run 11 times and the first run was discarded.
>
> vanilla/L cmwq/L vanilla/S cmwq/S
> wl0 26.18 d0.24 26.27 d0.29
> wl1 26.50 d0.45 26.52 d0.23
> wl2 26.62 d0.35 26.53 d0.23 26.14 d0.22 26.12 d0.32
> wl3 26.30 d0.25 26.29 d0.26 25.94 d0.25 26.17 d0.30
> wl4 26.26 d0.23 25.93 d0.24 25.90 d0.23 25.91 d0.29
> wl5 25.81 d0.33 25.88 d0.25 25.63 d0.27 25.59 d0.26
>
> There is no significant difference between the two. Maybe the code
> overhead and benefits coming from context sharing are canceling each
> other nicely. With longer burns, cmwq looks better but it's nothing
> significant. With shorter burns, other than wl3 spiking up for
> vanilla which probably would go away if the test is repeated, the two
> are performing virtually identically.
>
> The above is exaggerated synthetic test result and the performance
> difference will be even less noticeable in either direction under
> realistic workloads.
>
> cmwq extends workqueue such that it can serve as robust async
> mechanism which can be used (mostly) universally without introducing
> any noticeable performance degradation.
>
> Thanks.
>
> --
> tejun
>
Cheers,
Flo

2010-06-16 12:11:58

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hello,

On 06/15/2010 09:43 PM, Daniel Walker wrote:
> I noticed that you removed the RT workqueue since it's no longer used,
> but it's possible that a user can raise the priority of a given work
> queue thread into real time priorities. So with single threaded, and
> multithreaded workqueues specific to certain areas of the kernel the
> user would have a greater ability to control priorities of those areas.
>
> It looks like with your patches it would remove that level of
> flexibility, effectively making all the work items the same priority with
> no ability to raise or lower .. Is that accurate ?

Yes, that is. With the new cmwq, a wq can't assume association with a
specific kthread and thus can't be used as a simple frontend to kthreads,
but if somebody wants dedicated kthreads instead of shared ones in
units of work, [s]he should be using kthread.

wq does provide nicer tools for synchronization but in general I don't
think using kthread is too hard and there aren't too many cases
anyway. If there are many users && kthread is difficult to use
directly, we can definitely write up a wrapping layer tho. But I
really think using wq as a wrapper around kthreads and manipulating
the worker thread directly is an abuse.
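
For reference, the kind of dedicated thread I mean is just the plain
kthread API, e.g. (sketch only, error handling omitted):

	#include <linux/kthread.h>
	#include <linux/sched.h>

	static struct task_struct *my_task;

	static int my_thread_fn(void *data)
	{
		while (!kthread_should_stop()) {
			/* do the dedicated, prioritizable processing here */
			schedule_timeout_interruptible(HZ);
		}
		return 0;
	}

	/* somewhere in driver init */
	my_task = kthread_run(my_thread_fn, NULL, "mydrv");
	/* the thread's priority can then be adjusted as usual */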

Thanks.

--
tejun

2010-06-16 12:23:42

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hello,

On 06/16/2010 08:55 AM, Florian Mickler wrote:
>> So, here's the overview I wrote up today. If anything needs more
>> clarification, just ask. Thanks.
>
> Nice writeup! I think it is sufficient already and I probably wouldn't
> bother, but here are a little comments if you want to polish it up...
>
> Also, feel free to ignore :)
>
> As a general rule, every abbreviation should be written out at least
> once and if you are going to abbreviate it from then on, the
> abbreviation goes in parenthesis after that. That helps the reader a
> lot.

Ah... all the fond memories of technical writing class are coming back
to me. :-)

> here you can then already use "wq". That makes it shorter, and if you
> use it consistently the reader doesn't wonder if wq and workqueue are
> different things.
>
>>
>> There are two types of workqueues, single and multi threaded. MT wq
>
> ... multi threaded (MT). MT wq keeps a bound ...
>
>> keeps a bound thread for each online CPU, while ST wq uses single
>
> ... while single threaded (ST) wq uses single ...

Updated.

>> Frustratingly, although MT wqs end up spending a lot of resources, the
>> level of concurrency provided is unsatisfactory. The concurrency
>> limitation is common to both ST and MT wqs although it's less severe
>
> I don't know what the english rules for plural of abbreviated word. But
> I would probably just drop the plural s and let the reader add it when
> he decodes the abbreviation. (ie replace wqs with wq) Or introduce it
> properly: "... workqueues (wqs) ... ", Or don't abbreviate it in the
> plural.

Dropped all the 's'es after abbrs.

>> cmwq extends workqueue with focus on the following goals.
>
> first mentioning of cmwq as an abbreviation is not nice for the reader.
> Better:
> Concurrency managed wq (cmwq) ... goals:
> Concurrency managed workqueue (cmwq) ... goals:

Updated.

>> * Workqueue is already very widely used. Maintain compatibility with
>> the current API while removing limitations of the current
>> implementation.
>
> * Because the current wq implementation is already very widely used we
> maintain compatibility with the API while removing above
> mentioned limitations.

Replaced.

>> * Provide single unified worker pool per cpu which can be shared by
>> all users. The worker pool and level of concurrency should be
>> regulated automatically so that the API users don't need to worry
>> about that.
>>
>> * Use what's necessary and allocate resources lazily on demand while
>> still maintaining forward progress guarantee where necessary.
>>
>>
>> == Unified worklist
>>
>> There's a single global cwq, or gcwq, per each possible cpu which
>
> ... global cwq (gcwq) per each possible cpu
>
>> actually serves out the execution contexts. cpu_workqueues or cwqs of
>
> cpu_workqueues (cwqs)

Hmmm.... how about cpu_workqueue's (cwq)?

>> cmwq provides three different ordering modes - reentrant (default),
>
> ... (default mode)...
>
>> non-reentrant and single-cpu, where single-cpu can be used to achieve
>> single-threadedness and full ordering combined with in-flight work
>> limit of 1. The default mode is basically the same as the original
>
> The default mode (reentrant) is basically...
>
>> implementation. The distinction between non-reentrancy and single-cpu
>> were made because some ST wq users didn't really need single
>> threadedness but just non-reentrancy.
>
>> Another area where things get more involved is workqueue flushing as
>> for flushing to which wq a work is queued matters. cmwq tracks this
>> using colors. When a work is queued to a cwq, it's assigned a color
>> and each cwq maintains counters for each work color. The color
>> assignment changes on each wq flush attempt. A cwq can tell that all
>> works queued before a certain wq flush attempt have finished by
>> waiting for all the colors upto that point to drain. This maintains
>> the original workqueue flush semantics without adding unscalable
>> overhead.
>
> [nice solution, btw]

I just wish the implementation were simpler. It's a bit more complex
than I would like. If anyone can simplify it, please go ahead and
give it a shot.

> There is only one gcwq?
> Then maybe better:
>
> _The_ gcwq is notified...
...
> also:
> ... The gcwq keeps the number of concurrent ...
...
> here too: ..., the gcwq immediately schedules ...

Okay.

> * improved latency for current schedule_work() users, i.e. the work
> gets executed in a more timely fashion?

Yeah, added.

I've updated the doc but I'm not really sure what I'm gonna do with
it. I suppose I can include part of it in the head comment or I can
beef it up with use cases and howtos and put it under Documentation/.
Eh... let's see. Anyways, thanks a lot.

--
tejun

2010-06-16 13:27:17

by Daniel Walker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Wed, 2010-06-16 at 14:10 +0200, Tejun Heo wrote:
> Hello,
>
> On 06/15/2010 09:43 PM, Daniel Walker wrote:
> > I noticed that you removed the RT workqueue since it's no longer used,
> > but it's possible that a user can raise the priority of a given work
> > queue thread into real time priorities. So with single threaded, and
> > multithreaded workqueues specific to certain areas of the kernel the
> > user would have a greater ability to control priorities of those areas.
> >
> > It looks like with your patches it would remove that level of
> > flexibility, effectively making all the work items the same priority with
> > no ability to raise or lower .. Is that accurate ?
>
> Yes, that is. With new cmwq, a wq can't assume association with
> specific kthread and thus can't use wq as simple frontend to kthreads,
> but if somebody wants dedicated kthreads instead of shared ones in
> units of work, [s]he should be using kthread.

I'm not talking about coders using workqueues when they should be using
kthreads .. We're talking about currently existing workqueues. Aren't
you converting all _current_ workqueues to your system?

> wq does provide nicer tools for synchronization but in general I don't
> think using kthread is too hard and there aren't too many cases
> anyway. If there are many users && kthread is difficult to use
> directly, we can definitely write up a wrapping layer tho. But I
> really think using wq as wrapper around kthreads and manipulating
> worker thread directly is an abuse.

It would be a hack the user would have to patch onto their kernel in
order to get back functionality you're taking away.

I think from your perspective workqueue threads are all used for
"concurrency management" only, but I don't think that's true. Some will
be used for prioritization (I'm talking about _current_ workqueues).

Could you address or ponder how the work items could be prioritized
under your system?

Daniel

2010-06-16 13:31:18

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On 06/16/2010 03:27 PM, Daniel Walker wrote:
>> Yes, that is. With new cmwq, a wq can't assume association with
>> specific kthread and thus can't use wq as simple frontend to kthreads,
>> but if somebody wants dedicated kthreads instead of shared ones in
>> units of work, [s]he should be using kthread.
>
> I'm not talking about coders using workqueues when they should be using
> kthreads .. We're talking about currently existing workqueues. Aren't
> you converting all _current_ workqueues to your system?

Yes, sure I am, but which current users are you talking about?

>> wq does provide nicer tools for synchronization but in general I don't
>> think using kthread is too hard and there aren't too many cases
>> anyway. If there are many users && kthread is difficult to use
>> directly, we can definitely write up a wrapping layer tho. But I
>> really think using wq as wrapper around kthreads and manipulating
> > worker thread directly is an abuse.
>
> It would be a hack the user would have to patch onto there kernel in
> order to get back functionality your taking away.
>
> I think from your perspective workqueue threads are all used for
> "concurrency management" only, but I don't think that's true. Some will
> be used for prioritization (I'm talking about _current_ workqueues).
>
> Could you address or ponder how the work items could be prioritized
> under your system?

Again, please give me some examples.

Thanks.

--
tejun

2010-06-16 13:38:10

by Johannes Berg

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Tue, 2010-06-15 at 20:25 +0200, Tejun Heo wrote:

> cmwq extends workqueue with focus on the following goals.
>
> * Workqueue is already very widely used. Maintain compatibility with
> the current API while removing limitations of the current
> implementation.

...

> As multiple execution contexts are available for each wq, deadlocks
> around execution contexts is much harder to create. The default
> workqueue, system_wq, has maximum concurrency level of 256 and unless
> there is a use case which can result in a dependency loop involving
> more than 254 workers, it won't deadlock.

I see a lot of stuff about the current limitations etc., but nothing
about code that actually _relies_ on the synchronisation properties of
the current wqs. We talked about that a long time ago, is it still
guaranteed that a single-threaded wq will serialise all work put onto
it? It needs to be, but I don't see you explicitly mentioning it.

johannes

2010-06-16 13:40:36

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On 06/16/2010 03:37 PM, Johannes Berg wrote:
>> As multiple execution contexts are available for each wq, deadlocks
>> around execution contexts is much harder to create. The default
>> workqueue, system_wq, has maximum concurrency level of 256 and unless
>> there is a use case which can result in a dependency loop involving
>> more than 254 workers, it won't deadlock.
>
> I see a lot of stuff about the current limitations etc., but nothing
> about code that actually _relies_ on the synchronisation properties of
> the current wqs. We talked about that a long time ago, is it still
> guaranteed that a single-threaded wq will serialise all work put onto
> it? It needs to be, but I don't see you explicitly mentioning it.

Oh yeah, if you have WQ_SINGLE_CPU + max inflight of 1, works on the
wq are fully ordered.
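
Something like this is what I mean (sketch only; it assumes
create_singlethread_workqueue() still ends up as the WQ_SINGLE_CPU +
max_active == 1 combination in this series):

  /* sketch: strictly ordered execution, one work item at a time */
  #include <linux/module.h>
  #include <linux/workqueue.h>

  static struct workqueue_struct *my_wq;

  static void my_fn(struct work_struct *work)
  {
          /* items queued on my_wq run one at a time, in queueing order */
  }

  static DECLARE_WORK(work_a, my_fn);
  static DECLARE_WORK(work_b, my_fn);

  static int __init my_init(void)
  {
          my_wq = create_singlethread_workqueue("my_ordered_wq");
          if (!my_wq)
                  return -ENOMEM;

          /* work_a is guaranteed to finish before work_b starts */
          queue_work(my_wq, &work_a);
          queue_work(my_wq, &work_b);
          return 0;
  }
  module_init(my_init);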

--
tejun

2010-06-16 13:41:13

by Daniel Walker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Wed, 2010-06-16 at 15:30 +0200, Tejun Heo wrote:
> On 06/16/2010 03:27 PM, Daniel Walker wrote:
> >> Yes, that is. With new cmwq, a wq can't assume association with
> >> specific kthread and thus can't use wq as simple frontend to kthreads,
> >> but if somebody wants dedicated kthreads instead of shared ones in
> >> units of work, [s]he should be using kthread.
> >
> > I'm not talking about coders using workqueues when they should be using
> > kthreads .. We're talking about currently existing workqueues. Aren't
> > you converting all _current_ workqueues to your system?
>
> Yes, sure I am, but which current users are you talking about?

Any workqueue that has a thread which can be prioritized from userspace.
As long as there is a thread it can usually be given a priority from
userspace, so any _current_ workqueue which uses a single thread or
multiple threads is an example of what I'm talking about.
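
For example, something like this is all I mean (purely illustrative;
the pid and priority are made up, you'd look the workqueue thread's
pid up by hand first):

  /* illustrative only: bump a known workqueue kthread to SCHED_FIFO */
  #include <sched.h>
  #include <stdio.h>
  #include <sys/types.h>

  int main(void)
  {
          pid_t wq_pid = 1234;    /* hypothetical pid of the wq thread */
          struct sched_param sp = { .sched_priority = 50 };

          if (sched_setscheduler(wq_pid, SCHED_FIFO, &sp) < 0) {
                  perror("sched_setscheduler");
                  return 1;
          }
          return 0;
  }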

Daniel

2010-06-16 13:42:32

by Johannes Berg

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Wed, 2010-06-16 at 15:39 +0200, Tejun Heo wrote:
> On 06/16/2010 03:37 PM, Johannes Berg wrote:
> >> As multiple execution contexts are available for each wq, deadlocks
> >> around execution contexts is much harder to create. The default
> >> workqueue, system_wq, has maximum concurrency level of 256 and unless
> >> there is a use case which can result in a dependency loop involving
> >> more than 254 workers, it won't deadlock.
> >
> > I see a lot of stuff about the current limitations etc., but nothing
> > about code that actually _relies_ on the synchronisation properties of
> > the current wqs. We talked about that a long time ago, is it still
> > guaranteed that a single-threaded wq will serialise all work put onto
> > it? It needs to be, but I don't see you explicitly mentioning it.
>
> Oh yeah, if you have WQ_SINGLE_CPU + max inflight of 1, works on the
> wq are fully ordered.

Ok, great, thanks. FWIW, that's pretty much all I care about right
now :)

johannes

2010-06-16 13:46:30

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On 06/16/2010 03:41 PM, Daniel Walker wrote:
> Any workqueue that has a thread which can be prioritized from userspace.
> As long as there is a thread it can usually be given a priority from
> userspace, so any _current_ workqueue which uses a single thread or
> multiple threads is an example of what I'm talking about.

Eh... what's the use case for that? That's just so wrong. What do
you do after a suspend/resume cycle? Reprioritize all of them from
suspend/resume hooks?

--
tejun

2010-06-16 14:05:49

by Daniel Walker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Wed, 2010-06-16 at 15:45 +0200, Tejun Heo wrote:
> On 06/16/2010 03:41 PM, Daniel Walker wrote:
> > Any workqueue that has a thread which can be prioritized from userspace.
> > As long as there is a thread it can usually be given a priority from
> > userspace, so any _current_ workqueue which uses a single thread or
> > multiple threads is an example of what I'm talking about.
>
> Eh... what's the use case for that? That's just so wrong. What do
> you do after a suspend/resume cycle? Reprioritize all of them from
> suspend/resume hooks?

The use case is any situation when the user wants to give higher
priority to some set of work items, and there's nothing wrong with that.
In fact there has been a lot of work in the RT kernel related to
workqueue prioritization ..

suspend/resume shouldn't touch the thread priorities unless you're tearing
down the threads and remaking them each suspend/resume cycle from inside
the kernel.

Daniel

2010-06-16 14:16:24

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hello,

On 06/16/2010 04:05 PM, Daniel Walker wrote:
> On Wed, 2010-06-16 at 15:45 +0200, Tejun Heo wrote:
>> On 06/16/2010 03:41 PM, Daniel Walker wrote:
>>> Any workqueue that has a thread which can be prioritized from userspace.
>>> As long as there is a thread it can usually be given a priority from
>>> userspace, so any _current_ workqueue which uses a single thread or
>>> multiple threads is an example of what I'm talking about.
>>
>> Eh... what's the use case for that? That's just so wrong. What do
>> you do after a suspend/resume cycle? Reprioritize all of them from
>> suspend/resume hooks?
>
> The use case is any situation when the user wants to give higher
> priority to some set of work items, and there's nothing wrong with that.

Come on. The user can't even know what's going on each workqueue
thread. Something you can do doesn't make it a good idea. In this
case, it's a very bad idea.

> In fact there has been a lot of work in the RT kernel related to
> workqueue prioritization ..

That frankly I don't have much idea about.

> suspend/resume shouldn't touch the thread priorities unless you're tearing
> down the threads and remaking them each suspend/resume cycle from inside
> the kernel.

And here's a perfect example of why it's a very bad idea. The kernel
is *ALREADY* doing that on every suspend/resume cycle, so if you are
thinking that priorities on kernel workqueue threads were being
maintained over suspend/resume cycles, you're wrong and have been
wrong for a very long time.

Mangling workqueue thread priorities from userland is a fundamentally
broken thing to do. It's not a part of the API/ABI and there's no guarantee
whatsoever that something which works currently in certain way will
keep working that way one release later. If there's something which
wants priority adjustment from userland, the right thing to do is
finding out precisely why it's necessary and implementing a published
interface to do that.

Thanks.

--
tejun

2010-06-16 14:34:45

by Daniel Walker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Wed, 2010-06-16 at 16:15 +0200, Tejun Heo wrote:
> Hello,
>
> On 06/16/2010 04:05 PM, Daniel Walker wrote:
> > On Wed, 2010-06-16 at 15:45 +0200, Tejun Heo wrote:
> >> On 06/16/2010 03:41 PM, Daniel Walker wrote:
> >>> Any workqueue that has a thread which can be prioritized from userspace.
> >>> As long as there is a thread it can usually be given a priority from
> >>> userspace, so any _current_ workqueue which uses a single thread or
> >>> multiple threads is an example of what I'm talking about.
> >>
> >> Eh... what's the use case for that? That's just so wrong. What do
> >> you do after a suspend/resume cycle? Reprioritize all of them from
> >> suspend/resume hooks?
> >
> > The use case is any situation when the user wants to give higher
> > priority to some set of work items, and there's nothing wrong with that.
>
> Come on. The user can't even know what's going on each workqueue
> thread. Something you can do doesn't make it a good idea. In this
> case, it's a very bad idea.

You just don't understand the use case. Let's say I have a high priority
thread in userspace, and I discover through analysis that my thread is
being forced to wait on a workqueue thread (priority inversion), so
then I just increase the workqueue thread priority to overcome the
inversion. That's totally valid, and you don't even need to know exactly
what the thread is doing..

Now let's say the same thing happens under your changes.. Well, then I'm
screwed because your changes turn the workqueues into an opaque thread
cloud which can't be prioritized.

> > In fact there has been a lot of work in the RT kernel related to
> > workqueue prioritization ..
>
> That frankly I don't have much idea about.

Exactly, but I do know about it, which is why we're talking. So you're
saying that use cases that you don't know about don't really matter
right?

> > suspend/resume shouldn't touch the thread priorities unless your tearing
> > down the threads and remaking them each suspend/resume cycle from inside
> > the kernel.
>
> And here's a perfect example of why it's a very bad idea. The kernel
> is *ALREADY* doing that on every suspend/resume cycle, so if you are
> thinking that priorities on kernel workqueue threads were being
> maintained over suspend/resume cycles, you're wrong and have been
> wrong for a very long time.

I'd have to look into that cause that seems odd to me. In terms of what
we're talking about it doesn't really matter tho. The use cases I'm
citing could easily be on systems that don't suspend (or even have that
enabled).

> Mangling workqueue thread priorities from userland is a fundamentally
> broken thing to do. It's not a part of the API/ABI and there's no guarantee
> whatsoever that something which works currently in certain way will
> keep working that way one release later. If there's something which
> wants priority adjustment from userland, the right thing to do is
> finding out precisely why it's necessary and implementing a published
> interface to do that.

You're completely wrong .. Look at the example I gave above .. Bottom line
is that you're removing currently useful functionality, which is bad.. You
can say it's not useful, but again you also say you don't "have much
idea" regarding the cases where this is actually important.

Could you please just entertain the idea that maybe someone somewhere
might want to give priorities to the work items inside your system. I
mean consider the RT workqueue that you removed, why in the world would
we even have that if giving workqueues special priorities was a bad
thing (or not useful).

Daniel

2010-06-16 14:51:14

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hi,

On 06/16/2010 04:34 PM, Daniel Walker wrote:
>> Come on. The user can't even know what's going on each workqueue
>> thread. Something you can do doesn't make it a good idea. In this
>> case, it's a very bad idea.
>
> You just don't understand the use case. Let's say I have a high priority
> thread in userspace , and I discover through analysis that my thread is
> being forced to wait on a workqueue thread (priority inversion) , so
> then I just increase the workqueue thread priority to overcome the
> inversion. That's totally valid, and you don't even need to know exactly
> what the thread is doing..
>
> Now lets say the same thing happens under your changes.. Well, then I'm
> screwed cause your changes turn the workqueues into a opaque thread
> cloud which can't be prioritized.

If you find that some work item is causing priority inversion, you
need to fix the problem instead of working around it in an ad hoc way which
won't be useful to anyone else, so, no, it doesn't sound like a useful
use case.

>>> In fact there has been a lot of work in the RT kernel related to
>>> workqueue prioritization ..
>>
>> That frankly I don't have much idea about.
>
> Exactly, but I do know about it which is why we're talking. So you're
> saying that use cases that you don't know about don't really matter
> right?

Peter brought up the work priority issue previously and Linus was
pretty clear on the issue. They're in the discussions for the first
or second take.

> You're completely wrong .. Look at the example I gave above .. Bottom line
> is that you're removing currently useful functionality, which is bad.. You
> can say it's not useful, but again you also say you don't "have much
> idea" regarding the cases where this is actually important.

I said that I didn't have much idea about the RT work priority thing, not
about setting priorities on wq workers as an ad hoc workaround for whatever.
IIRC, the RT work priority thing Peter was talking about was a
different thing. Sure, more complex workqueue implementation would
complicate implementing work priorities, but if necessary maybe we can
grow different classes of worker pools.

> Could you please just entertain the idea that maybe someone somewhere
> might want to give priorities to the work items inside your system.

If that's necessary, do it properly. Give *work* priorities or at
least give explicit priorities to workqueues. That's a completely
different thing from insisting on a fixed workqueue to kthread mapping so
that three people on the whole planet can set priority on those
kthreads not even knowing what the hell they do.

> I mean consider the RT workqueue that you removed, why in the world
> would we even have that if giving workqueues special priorities was
> a bad thing (or not useful).

Because stop_machine wanted to use wq as a frontend for threads which
I converted to cpu_stop and which actually proves my point not yours,
sorry.

Thanks.

--
tejun

2010-06-16 15:11:18

by Daniel Walker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Wed, 2010-06-16 at 16:50 +0200, Tejun Heo wrote:
> Hi,
>
> On 06/16/2010 04:34 PM, Daniel Walker wrote:
> >> Come on. The user can't even know what's going on each workqueue
> >> thread. Something you can do doesn't make it a good idea. In this
> >> case, it's a very bad idea.
> >
> > You just don't understand the use case. Let say I have a high priority
> > thread in userspace , and I discover through analysis that my thread is
> > being forced to wait on a workqueue thread (priority inversion) , so
> > then I just increase the workqueue thread priority to overcome the
> > inversion. That's totally valid, and you don't even need to know exactly
> > what the thread is doing..
> >
> > Now lets say the same thing happens under your changes.. Well, then I'm
> > screwed cause your changes turn the workqueues into a opaque thread
> > cloud which can't be prioritized.
>
> If you find that some work item is causing priority inversion, you
> need to fix the problem instead of working around in adhoc way which
> won't be useful to anyone else, so, no, it doesn't sound like a useful
> use case.

How do I fix the problem then? Without doing what I've already
suggested.

> >>> In fact there has been a lot of work in the RT kernel related to
> >>> workqueue prioritization ..
> >>
> >> That frankly I don't have much idea about.
> >
> > Exactly, but I do know about it which is why we're talking. So your
> > saying that use cases that you don't know about don't really matter
> > right?
>
> Peter brought up the work priority issue previously and Linus was
> pretty clear on the issue. They're in the discussions for the first
> or second take.

Send us a link to this discussion.

> > You're completely wrong .. Look at the example I gave above .. Bottom line
> > is that you're removing currently useful functionality, which is bad.. You
> > can say it's not useful, but again you also say you don't "have much
> > idea" regarding the cases where this is actually important.
>
> I said that I didn't have much idea about RT work priority thing, not
> setting priority on wq workers for adhoc workaround for whatever.
> IIRC, the RT work priority thing Peter was talking about was a
> different thing. Sure, more complex workqueue implementation would
> complicate implementing work priorities, but if necessary maybe we can
> grow different classes of worker pools.

Ok .. So what would these different classes look like then ? Is that
something I could prioritize from userspace perhaps ?

> > Could you please just entertain the idea that maybe someone somewhere
> > might want to give priorities to the work items inside your system.
>
> If that's necessary, do it properly. Give *work* priorities or at
> least give explicit priorities to workqueues. That's a completely
> different thing from insisting fixed workqueue to kthread mapping so
> that three people on the whole planet can set priority on those
> kthreads not even knowing what the hell they do.

You have no idea how many people are doing this, or in what
circumstances .. Please don't make mass speculation over things you
clearly are not aware of.

I'm not insisting on any fixed mapping, you need to open your mind to
_possibilities_ .. How can the work items be given priorities _inside
your system_? Can you give an interface in which people can set a
priority for a given type of work item, then maybe your system honors those
priorities in _some way_ ..

In fact, I'm only asking that you consider it, ponder it..

> > I mean consider the RT workqueue that you removed, why in the world
> > would we even have that if giving workqueues special priorities was
> > a bad thing (or not useful).
>
> Because stop_machine wanted to use wq as a frontend for threads which
> I converted to cpu_stop and which actually proves my point not yours,
> sorry.

I'm not sure you're proving much here, other than you thought it was
better to use another method.

Daniel

2010-06-16 15:51:53

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hello,

On 06/16/2010 05:11 PM, Daniel Walker wrote:
>> If you find that some work item is causing priority inversion, you
>> need to fix the problem instead of working around in adhoc way which
>> won't be useful to anyone else, so, no, it doesn't sound like a useful
>> use case.
>
> How do I fix the problem then? Without doing what I've already
> suggested.

I don't know. I suppose a high priority thread is trying to flush a
work item or workqueue thus causing priority inversion, right? Maybe
we can add high priority emergency worker which gets triggered through
priority inversion detection or maybe the code shouldn't be flushing
in the critical path anyway.

>> Peter brought up the work priority issue previously and Linus was
>> pretty clear on the issue. They're in the discussions for the first
>> or second take.
>
> Send us a link to this discussion.

http://thread.gmane.org/gmane.linux.kernel/929641

>>> You're completely wrong .. Look at the example I gave above .. Bottom line
>>> is that you're removing currently useful functionality, which is bad.. You
>>> can say it's not useful, but again you also say you don't "have much
>>> idea" regarding the cases where this is actually important.
>>
>> I said that I didn't have much idea about RT work priority thing, not
>> setting priority on wq workers for adhoc workaround for whatever.
>> IIRC, the RT work priority thing Peter was talking about was a
>> different thing. Sure, more complex workqueue implementation would
>> complicate implementing work priorities, but if necessary maybe we can
>> grow different classes of worker pools.
>
> Ok .. So what would these different classes look like then ? Is that
> something I could prioritize from userspace perhaps ?

Maybe it's me not understanding something but I don't really think
exposing workqueue priorities to userland is a good solution at all.

>>> Could you please just entertain the idea that maybe someone somewhere
>>> might want to give priorities to the work items inside your system.
>>
>> If that's necessary, do it properly. Give *work* priorities or at
>> least give explicit priorities to workqueues. That's a completely
>> different thing from insisting fixed workqueue to kthread mapping so
>> that three people on the whole planet can set priority on those
>> kthreads not even knowing what the hell they do.
>
> You have no idea how many people are doing this, or in what
> circumstances .. Please don't make mass speculation over things you
> clearly are not aware of.

Well, then please stop insisting that it is a feature to keep. It's not
a feature.

> I'm not insisting on any fixed mapping, you need to open your mind to
> _possibilities_ .. How can the work items be given priorities _inside
> your system_! Can you give an interface in which people can set a
> priority for a given type of work item, then maybe your system honors those
> priorities in _some way_ ..
>
> In fact, I'm only asking that you consider it, ponder it..

Oh yeah, if you're not insisting on a fixed mapping, then I don't have any
problem with that. As for what to do for priority inversions
involving workqueues, I don't have any concrete idea (it was not in
the mainline, so I didn't have to solve it) but one way would be
reserving/creating temporary high priority workers and using them to
process work items which the high priority thread is blocked on.

But, really, without knowing details of those inversion cases, it
would be pretty difficult to design something which fits. All that I
can say is having shared worker pool isn't exclusive with solving the
problem.

>>> I mean consider the RT workqueue that you removed, why in the world
>>> would we even have that if giving workqueues special priorities was
>>> a bad thing (or not useful).
>>
>> Because stop_machine wanted to use wq as a frontend for threads which
>> I converted to cpu_stop and which actually proves my point not yours,
>> sorry.
>
> I'm not sure you're proving much here, other than you thought it was
> better to use another method.

Eh well, it depends on the view point I guess. :-P

Thanks.

--
tejun

2010-06-16 16:30:41

by Daniel Walker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Wed, 2010-06-16 at 17:50 +0200, Tejun Heo wrote:
> Hello,
>
> On 06/16/2010 05:11 PM, Daniel Walker wrote:
> >> If you find that some work item is causing priority inversion, you
> >> need to fix the problem instead of working around in adhoc way which
> >> won't be useful to anyone else, so, no, it doesn't sound like a useful
> >> use case.
> >
> > How do I fix the problem then? Without doing what I've already
> > suggested.
>
> I don't know. I suppose a high priority thread is trying to flush a
> work item or workqueue thus causing priority inversion, right? Maybe
> we can add high priority emergency worker which gets triggered through
> priority inversion detection or maybe the code shouldn't be flushing
> in the critical path anyway.

It's not a flushing situation .. The high priority thread is a userspace
thread so it, AFAIK, can't flush any workqueues.

> >> Peter brought up the work priority issue previously and Linus was
> >> pretty clear on the issue. They're in the discussions for the first
> >> or second take.
> >
> > Send us a link to this discussion.
>
> http://thread.gmane.org/gmane.linux.kernel/929641

I didn't see anything in there related to this discussion.

> >>> You're completely wrong .. Look at the example I gave above .. Bottom line
> >>> is that you're removing currently useful functionality, which is bad.. You
> >>> can say it's not useful, but again you also say you don't "have much
> >>> idea" regarding the cases where this is actually important.
> >>
> >> I said that I didn't have much idea about RT work priority thing, not
> >> setting priority on wq workers for adhoc workaround for whatever.
> >> IIRC, the RT work priority thing Peter was talking about was a
> >> different thing. Sure, more complex workqueue implementation would
> >> complicate implementing work priorities, but if necessary maybe we can
> >> grow different classes of worker pools.
> >
> > Ok .. So what would these different classes look like then ? Is that
> > something I could prioritize from userspace perhaps ?
>
> Maybe it's me not understanding something but I don't really think
> exposing workqueue priorities to userland is a good solution at all.

Why not? They currently are, with no known ill effects (none that I know
of).

> >>> Could you please just entertain the idea that maybe someone somewhere
> >>> might want to give priorities to the work items inside your system.
> >>
> >> If that's necessary, do it properly. Give *work* priorities or at
> >> least give explicit priorities to workqueues. That's a completely
> >> different thing from insisting fixed workqueue to kthread mapping so
> >> that three people on the whole planet can set priority on those
> >> kthreads not even knowing what the hell they do.
> >
> > You have no idea how many people are doing this, or in what
> > circumstances .. Please don't make mass speculation over things you
> > clearly are not aware of.
>
> Well, then please stop insisting it to be a feature to keep. It's not
> a feature.

It may not have been a feature in the past, but it's used as a feature
now.. So it is a feature even tho you don't want it to be.

> > I'm not insisting any fixed mapping, you need to open you mind to
> > _possibilities_ .. How can the work items be given priorities _inside
> > your system_! Can you give an interface in which people can set a
> > priority to a give type of work item, then maybe you system honors those
> > priorities in _some way_ ..
> >
> > In fact, I'm only asking that you consider it, ponder it..
>
> Oh yeah, if you're not insisting fixed mapping, then I don't have any
> problem with that. As for what to do for priority inversions
> involving workqueues, I don't have any concrete idea (it was not in
> the mainline, so I didn't have to solve it) but one way would be
> reserving/creating temporary high priority workers and use them to
> process work items which the high priority thread is blocked on.

But it is in the mainline, that's why we're talking right now.

What I was thinking is that you could have a debugfs interface which
would list off what workqueues your system is processing and give the
user the ability to pull one or more of those workqueues into individual
threads for processing, just like it currently is. That way I can
prioritize the work items without you having to give priorities through
your entire system.

The alternative is that you would give each work item a settable
priority and your whole system would have to honor that, which would be
a little like re-creating the scheduler.

> But, really, without knowing details of those inversion cases, it
> would be pretty difficult to design something which fits. All that I
> can say is having shared worker pool isn't exclusive with solving the
> problem.

The cases are all in the mainline kernel, you just have to look at the
code in a different way to understand them .. If I have a userspace
thread at a high priority and I'm making calls into the kernel, some of
those calls inevitably will put work items onto workqueues, right? I'm
sure you can think of 100's of ways in which this could happen .. At
that point my thread depends on the workqueue thread, since the
workqueue thread is doing processing which I've, in some way,
requested from userspace.

Daniel

2010-06-16 16:55:48

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hello,

On 06/16/2010 06:30 PM, Daniel Walker wrote:
>> I don't know. I suppose a high priority thread is trying to flush a
>> work item or workqueue thus causing priority inversion, right? Maybe
>> we can add high priority emergency worker which gets triggered through
>> priority inversion detection or maybe the code shouldn't be flushing
>> in the critical path anyway.
>
> It's not a flushing situation .. The high priority thread is a userspace
> thread so it , AFAIK, can't flush any workqueues.

So, how is it stalling? How would I be able to tell anything about
the situation when all you're saying is "doing this and that voodoo
thing made it go away for me"?

>>>> Peter brought up the work priority issue previously and Linus was
>>>> pretty clear on the issue. They're in the discussions for the first
>>>> or second take.
>>>
>>> Send us a link to this discussion.
>>
>> http://thread.gmane.org/gmane.linux.kernel/929641
>
> I didn't see anything in there related to this discussion.

It was about using wq for cpu intensive / RT stuff. Linus said,

So stop arguing about irrelevancies. Nobody uses workqueues for RT
or for CPU-intensive crap. It's not what they were designed for, or
used for.

>> Maybe it's me not understanding something but I don't really think
>> exposing workqueue priorities to userland is a good solution at all.
>
> Why not? They currently are to no known ill effects (none that I know
> of).

Because...

* fragile as hell

* depends heavily on unrelated implementation details

* has extremely limited test coverage

* doesn't help progressing mainline at all

>>> You have no idea how many people are doing this, or in what
>>> circumstances .. Please don't make mass speculation over things you
>>> clearly are not aware of.
>>
>> Well, then please stop insisting it to be a feature to keep. It's not
>> a feature.
>
> It may not have been a feature in the past, but it's used as a feature
> now.. So it is a feature even tho you don't want it to be.

That's exactly like grepping /proc/kallsyms to determine some feature
and claiming it's a feature whether the kernel intends it or not.
Sure, use it all you want. Just don't expect it to be there on the
next release.

>> Oh yeah, if you're not insisting fixed mapping, then I don't have any
>> problem with that. As for what to do for priority inversions
>> involving workqueues, I don't have any concrete idea (it was not in
>> the mainline, so I didn't have to solve it) but one way would be
>> reserving/creating temporary high priority workers and use them to
>> process work items which the high priority thread is blocked on.
>
> But it is in the mainline, that's why we're talking right now.

The problem is in the mainline. Solution is not. It's not a solved
problem. It's also a pretty niche problem.

> What I was thinking is that you could have a debugfs interface which
> would list off what workqueues your system is processing and give the
> user the ability to pull one or more of those workqueues into individual
> threads for processing, just like it currently is. That way I can
> prioritize the work items without you having to give priorities through
> your entire system.

Why would the user want to bother with all that? Shouldn't the kernel
just do the right thing? IIRC, people are working on priority
inheriting userland mutexes. If your problem case can't be solved by
that, please try to fix it in a proper way, not through some random knobs
that nobody really can understand.
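
(To be clear, by priority inheriting userland mutexes I mean something
along these lines -- just a sketch, assuming PTHREAD_PRIO_INHERIT
support in the C library:)

  /* sketch: a userland mutex with priority inheritance */
  #include <pthread.h>

  static pthread_mutex_t lock;

  static void init_pi_lock(void)
  {
          pthread_mutexattr_t attr;

          pthread_mutexattr_init(&attr);
          pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
          pthread_mutex_init(&lock, &attr);
          pthread_mutexattr_destroy(&attr);
          /* a low priority holder of 'lock' now gets boosted while a
           * high priority thread is blocked on it */
  }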

> The alternative is that you would give each work item a settable
> priority and your whole system would have to honor that, which would be
> a little like re-creating the scheduler.
>
>> But, really, without knowing details of those inversion cases, it
>> would be pretty difficult to design something which fits. All that I
>> can say is having shared worker pool isn't exclusive with solving the
>> problem.
>
> The cases are all in the mainline kernel, you just have to look at the
> code in a different way to understand them ..

Yeah, sure, all the problems in the world which haven't been solved
yet are there. The _solution_ isn't there and solutions are what
people conserve when trying to improve things.

> If I have a userspace thread at a high priority and I'm making calls
> into the kernel, some of those calls inevitably will put work items
> onto workqueues, right? I'm sure you can think of 100's of ways in
> which this could happen .. At that point my thread depends on the
> workqueue thread, since the workqueue thread is doing processing for
> which I've , in some way, requested from userspace.

You're basically saying that "I don't know how those priority
inversions are happening but if I turn these magic knobs they seem to
go away so I want those magic knobs". Maybe the RT part of the code
shouldn't be depending on that many random things to begin with? And
if there are actually things which are necessary, it's better idea to
solve it properly through identifying problem points and properly
inheriting priority instead of turning knobs until it somehow works?

If you wanna work on such things, be my guest. I'll be happy to work
with you but please stop talking about setting priorities of
workqueues from userland. That's just nuts.

Thanks.

--
tejun

2010-06-16 18:22:30

by Daniel Walker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Wed, 2010-06-16 at 18:55 +0200, Tejun Heo wrote:
> Hello,
>
> On 06/16/2010 06:30 PM, Daniel Walker wrote:
> >> I don't know. I suppose a high priority thread is trying to flush a
> >> work item or workqueue thus causing priority inversion, right? Maybe
> >> we can add high priority emergency worker which gets triggered through
> >> priority inversion detection or maybe the code shouldn't be flushing
> >> in the critical path anyway.
> >
> > It's not a flushing situation .. The high priority thread is a userspace
> > thread so it , AFAIK, can't flush any workqueues.
>
> So, how is it stalling? How would I be able to tell anything about
> the situation when all you're saying is doing this and that voodoo
> thing made it go away for me?

There are so many different ways that threads can interact .. Can you
imagine a thread waiting in userspace for something to complete in the
kernel? That does actually happen pretty often ;) .

I was just now randomly trolling through drivers and found this one,
drivers/spi/amba-pl022.c ..

It processes some data in the interrupt, but sometimes it offloads the
processing to a workqueue from the interrupt (or tasklet) .. If for
example I'm a userspace thread waiting for that data then I would have
to wait for that workqueue to complete (and its priority plays a major
role in when it completes).
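
The general shape is something like this (simplified sketch, not the
actual amba-pl022 code; all names made up):

  /* irq handler defers the heavy lifting to a workqueue; a reader in
   * userspace then ends up waiting for that work to complete */
  #include <linux/errno.h>
  #include <linux/interrupt.h>
  #include <linux/types.h>
  #include <linux/wait.h>
  #include <linux/workqueue.h>

  static DECLARE_WAIT_QUEUE_HEAD(data_wq);
  static bool data_ready;

  static void process_data_work(struct work_struct *work)
  {
          /* the processing moved out of irq context */
          data_ready = true;
          wake_up(&data_wq);
  }
  static DECLARE_WORK(process_work, process_data_work);

  static irqreturn_t dev_irq(int irq, void *dev)
  {
          schedule_work(&process_work);   /* runs at the worker's priority */
          return IRQ_HANDLED;
  }

  /* read() path: the caller's latency now depends on when
   * process_data_work() actually gets to run */
  static int dev_read_wait(void)
  {
          if (wait_event_interruptible(data_wq, data_ready))
                  return -ERESTARTSYS;
          return 0;
  }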

I'm sure there are plenty of other examples of userspace threads waiting for
workqueues to complete. I would hope that with your patchset you've
investigated some.

> >>>> Peter brought up the work priority issue previously and Linus was
> >>>> pretty clear on the issue. They're in the discussions for the first
> >>>> or second take.
> >>>
> >>> Send us a link to this discussion.
> >>
> >> http://thread.gmane.org/gmane.linux.kernel/929641
> >
> > I didn't see anything in there related to this discussion.
>
> It was about using wq for cpu intensive / RT stuff. Linus said,
>
> So stop arguing about irrelevancies. Nobody uses workqueues for RT
> or for CPU-intensive crap. It's not what they were designed for, or
> used for.

Which is not relevant to this discussion .. We're talking about
re-prioritizing the workqueue threads. We're _not_ talking about
workqueues designed specifically for real time purposes.

> >> Maybe it's me not understanding something but I don't really think
> >> exposing workqueue priorities to userland is a good solution at all.
> >
> > Why not? They currently are to no known ill effects (none that I know
> > of).
>
> Because...
>
> * fragile as hell

Changing the thread priorities shouldn't be fragile; if it is right now
then the threads are broken .. Can you explain in which cases you've
seen it being fragile?

> * depends heavily on unrelated implementation details

I have no idea what this means.

> * has extremely limited test coverage

Simple, just write tests.

> * doesn't help progressing mainline at all

progressing where?

> >>> You have no idea how many people are doing this, or in what
> >>> circumstances .. Please don't make mass speculation over things you
> >>> clearly are not aware of.
> >>
> >> Well, then please stop insisting it to be a feature to keep. It's not
> >> a feature.
> >
> > It may not have been a feature in the past, but it's used as a feature
> > now.. So it is a feature even tho you don't want it to be.
>
> That's exactly like grepping /proc/kallsyms to determine some feature
> and claiming it's a feature whether the kernel intends it or not.
> Sure, use it all you want. Just don't expect it to be there on the
> next release.

You're assuming there's no value in changing the priorities, which is wrong.
You're assuming way too much. Changing the priorities is useful.

> >> Oh yeah, if you're not insisting fixed mapping, then I don't have any
> >> problem with that. As for what to do for priority inversions
> >> involving workqueues, I don't have any concrete idea (it was not in
> >> the mainline, so I didn't have to solve it) but one way would be
> >> reserving/creating temporary high priority workers and use them to
> >> process work items which the high priority thread is blocked on.
> >
> > But it is in the mainline, that's why we're talking right now.
>
> The problem is in the mainline. Solution is not. It's not a solved
> problem. It's also a pretty niche problem.

We're talking about functionality you're removing from the kernel, which
is useful. So you're creating a problem that mainline currently doesn't
have. That's a defect in your patchset ..

> > What I was thinking is that you could have a debugfs interface which
> > would list off what workqueues you system is processing and give the
> > user the ability to pull one or more of those workqueues into individual
> > threads for processing, just like it currently is. That way I can
> > prioritize the work items with out you having to give priorities through
> > your entire system.
>
> Why would the user want to bother with all that? Shouldn't the kernel
> just do the right thing? IIRC, people are working on priority
> inheriting userland mutexes. If your problem case can't be solved by
> that, please try to fix it in proper way not through some random knobs
> that nobody really can understand.

It's funny because you're talking about priority inheritance, yet your
system makes that much more difficult to implement .. So, yes, I would
like priority inheritance for _current_ workqueues in mainline; that
would solve many problems, but it doesn't exist in mainline yet. So I
use the features which _do exist_ .

> > The alternative is that you would give each work item a settable
> > priority and your whole system would have to honor that, which would be
> > a little like re-creating the scheduler.
> >
> >> But, really, without knowing details of those inversion cases, it
> >> would be pretty difficult to design something which fits. All that I
> >> can say is having shared worker pool isn't exclusive with solving the
> >> problem.
> >
> > The cases are all in the mainline kernel, you just have to look at the
> > code in a different way to understand them ..
>
> Yeah, sure, all the problems in the world which haven't been solved
> yet are there. The _solution_ isn't there and solutions are what
> people conserve when trying to improve things.

I have no idea what you're saying here.. You're adding a new problem to
mainline; that's what we're addressing.

> > If I have a userspace thread at a high priority and I'm making calls
> > into the kernel, some of those call inevitably will put work items
> > onto workqueues, right? I'm sure you can think of 100's of ways in
> > which this could happen .. At that point my thread depends on the
> > workqueue thread, since the workqueue thread is doing processing for
> > which I've , in some way, requested from userspace.
>
> You're basically saying that "I don't know how those inheritance
> inversions are happening but if I turn these magic knobs they seem to
> go away so I want those magic knobs". Maybe the RT part of the code
> shouldn't be depending on that many random things to begin with? And
> if there are actually things which are necessary, it's better idea to
> solve it properly through identifying problem points and properly
> inheriting priority instead of turning knobs until it somehow works?

I think you're misinterpreting me .. If I write a thread (in userspace)
which I put at an RT priority, I don't have a lot of control over what
dependencies the kernel may put on my thread. Think from a users
perspective not from a kernel developers perspective.

I'm not saying changing a workqueue priority would be a final solution,
but it is a way to prioritize things immediately and has worked in prior
kernels up till your patches.

> If you wanna work on such things, be my guest. I'll be happy to work
> with you but please stop talking about setting priorities of
> workqueues from userland. That's just nuts.

You just don't understand it.. How can you expect your patches to go
into mainline with this attitude toward usages you just don't
understand?

Daniel

2010-06-16 18:32:28

by Stefan Richter

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Daniel Walker wrote:
[tweak scheduling priority of a worker thread]
> Let's say I have a high priority
> thread in userspace , and I discover through analysis that my thread is
> being forced to wait on a workqueue thread (priority inversion) , so
> then I just increase the workqueue thread priority to overcome the
> inversion. That's totally valid, and you don't even need to know exactly
> what the thread is doing..

I suspect the _actual_ problem to solve here is not that of proper
scheduling priorities but that of having to meet deadlines.
--
Stefan Richter
-=====-==-=- -==- =----
http://arcgraph.de/sr/

2010-06-16 18:41:36

by Daniel Walker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Wed, 2010-06-16 at 20:31 +0200, Stefan Richter wrote:
> Daniel Walker wrote:
> [tweak scheduling priority of a worker thread]
> > Let say I have a high priority
> > thread in userspace , and I discover through analysis that my thread is
> > being forced to wait on a workqueue thread (priority inversion) , so
> > then I just increase the workqueue thread priority to overcome the
> > inversion. That's totally valid, and you don't even need to know exactly
> > what the thread is doing..
>
> I suspect the _actual_ problem to solve here is not that of proper
> scheduling priorities but that of having to meet deadlines.

I think the workqueue would need to be adjusted somehow even in terms of
deadlines. Like you'll miss your deadline unless the workqueue runs.

Daniel

2010-06-16 18:46:43

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hello,

On 06/16/2010 08:22 PM, Daniel Walker wrote:
> There's so many different ways that threads can interact .. Can you
> imagine a thread waiting in userspace for something to complete in the
> kernel, that does actually happen pretty often ;) .
>
> I was just now randomly trolling through drivers and found this one,
> drivers/spi/amba-pl022.c ..
>
> It processes some data in the interrupt, but sometimes it offloads the
> processing to a workqueue from the interrupt (or tasklet) .. If for
> example I'm a userspace thread waiting for that data then I would have
> to wait for that workqueue to complete (and its priority plays a major
> role in when it completes).

Yeah, and it would wait for that by flushing the work, right? If the
waiting part is using a completion or some other event notification,
you'll just need to update the driver so that the kernel can determine
who's waiting for what so that it can bump the waited one's priority.
Otherwise, the problem can't be solved.

>> It was about using wq for cpu intensive / RT stuff. Linus said,
>>
>> So stop arguing about irrelevancies. Nobody uses workqueues for RT
>> or for CPU-intensive crap. It's not what they were designed for, or
>> used for.
>
> Which is not relevant to this discussion .. We're talking about
> re-prioritizing the workqueue threads. We're _not_ talking about
> workqueues designed specifically for real time purposes.

Well, it's somewhat related,

* Don't depend on works or workqueues for RT stuff. It's not designed
for that.

* If you really wanna solve the problem, please go ahead and _solve_
it yourself. (read the rest of the mail)

>> * fragile as hell
>
> Changing the thread priorities shouldn't be fragile , if it is right now
> then the threads are broken .. Can you explain in which cases you've
> seen it being fragile?

Because the workqueue might just go away in the next release or other
unrelated work which shouldn't get high priority might be scheduled
there. Maybe the name of the workqueue changes or it gets merged with
another workqueue. Maybe it gets split. Maybe the system suspends
and resumes and nobody knows that workers die and are created again
over those events. Maybe the backend implementation changes so that
workers are pooled.

>> * depends heavily on unrelated implementation details
>
> I have no idea what this means.

(continued) because all those are implementation details which are NOT
PART OF THE INTERFACE in any way.

>> * has extremely limited test coverage
>
> Simple, just write tests.

Yeah, and test your few configurations with those,

>> * doesn't help progressing mainline at all
>
> progressing where?

(continued) and other people experiencing the same problem will have
to do about the same thing and won't know whether their nice + pidof
will work with the next kernel upgrade.

Gee, I don't know. These are pretty evident problems to me. Aren't
they obvious?

>> That's exactly like grepping /proc/kallsyms to determine some feature
>> and claiming it's a feature whether the kernel intends it or not.
>> Sure, use it all you want. Just don't expect it to be there on the
>> next release.
>
> You're assuming there's no value in changing the priorities, which is wrong.
> You're assuming way too much. Changing the priorities is useful.

And you're assuming grepping /proc/kallsyms is not useful? It's
useful in its adhoc unsupported hacky way.

>> You're basically saying that "I don't know how those inheritance
>> inversions are happening but if I turn these magic knobs they seem to
>> go away so I want those magic knobs". Maybe the RT part of the code
>> shouldn't be depending on that many random things to begin with? And
>> if there are actually things which are necessary, it's better idea to
>> solve it properly through identifying problem points and properly
>> inheriting priority instead of turning knobs until it somehow works?
>
> I think you're misinterpreting me .. If I write a thread (in userspace)
> which I put into RT priorities I don't have a lot of control over what
> dependencies the kernel may put on my thread. Think from a users
> perspective not from a kernel developers perspective.
>
> I'm not saying changing a workqueue priority would be a final solution,
> but it is a way to prioritize things immediately and has worked in prior
> kernels up till your patches.

* Making the kernel or driver or whatever you use in the RT path track
priority is the right thing to do.

* I'm very sorry I'm breaking your hacky workaround but seriously
that's another problem to solve. Let's talk about the problem
itself instead of your hacky workaround. (I think for most cases
not using workqueue in RT path would be the right thing to do.)

>> If you wanna work on such things, be my guest. I'll be happy to work
>> with you but please stop talking about setting priorities of
>> workqueues from userland. That's just nuts.
>
> You just don't understand it.. How can you expect your patches to go
> into mainline with this attitude toward usages you just don't
> understand?

I'll keep your doubts in mind but I really am understanding what you're
saying. You just don't understand that I understand and disagree. :-)

Thanks.

--
tejun

2010-06-16 19:22:43

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On 06/16/2010 08:46 PM, Tejun Heo wrote:
> * I'm very sorry I'm breaking your hacky workaround but seriously
> that's another problem to solve. Let's talk about the problem
> itself instead of your hacky workaround. (I think for most cases
> not using workqueue in RT path would be the right thing to do.)

For example, for the actual case of amba-pl022.c you mentioned, where
interrupt handler sometimes offloads to workqueue, convert
amba-pl022.c to use threaded interrupt handler. That's why it's
there.
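
Roughly (sketch only, names invented), instead of bouncing to a
workqueue from the hard irq, the driver would register a threaded
handler:

  #include <linux/interrupt.h>

  static irqreturn_t dev_hardirq(int irq, void *dev)
  {
          /* ack the hardware, then punt to the irq thread */
          return IRQ_WAKE_THREAD;
  }

  static irqreturn_t dev_irq_thread(int irq, void *dev)
  {
          /* the processing that used to be queued as a work item;
           * it now runs in a dedicated kthread for this device */
          return IRQ_HANDLED;
  }

  static int dev_setup_irq(unsigned int irq, void *dev)
  {
          /* IRQF_ONESHOT keeps the line masked until the thread is done */
          return request_threaded_irq(irq, dev_hardirq, dev_irq_thread,
                                      IRQF_ONESHOT, "my-dev", dev);
  }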

If you actually _solve_ the problem like this, other users wouldn't
experience the problem at all once the update reaches them and you
won't have to worry about your workaround breaking with the next
kernel update or unexpected suspend/resume and we won't be having this
discussion about adjusting workqueue priorities from userland.

There are many wrong things about working around RT latency problems
by setting workqueue priorities from userland. Please think about why
the driver would have a separate workqueue for itself in the first
place. It was to work around the limitation of the workqueue facility, and
you're arguing that, because that workaround allows yet another very
fragile workaround, the property which made the original workaround
necessary in the first place needs to stay. That sounds really
perverse to me.

--
tejun

2010-06-16 19:36:43

by Daniel Walker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Wed, 2010-06-16 at 20:46 +0200, Tejun Heo wrote:
> Hello,
>
> On 06/16/2010 08:22 PM, Daniel Walker wrote:
> > There's so many different ways that threads can interact .. Can you
> > imagine a thread waiting in userspace for something to complete in the
> > kernel, that does actually happen pretty often ;) .
> >
> > I was just now randomly trolling through drivers and found this one,
> > drivers/spi/amba-pl022.c ..
> >
> > It processes some data in the interrupt, but sometimes it offloads the
> > processing to a workqueue from the interrupt (or tasklet) .. If for
> > example I'm a userspace thread waiting for that data then I would have
> > to wait for that workqueue to complete (and it's priority plays a major
> > roll in when it completes).
>
> Yeah, and it would wait that by flushing the work, right? If the
> waiting part is using completion or some other event notification,
> you'll just need to update the driver so that the kernel can determine
> who's waiting for what so that it can bump the waited one's priority.
> Otherwise, the problem can't be solved.

This has nothing to do with flushing .. You keep bringing this back into
the kernel for some reason, we're talking about entirely userspace
threads ..

> >> It was about using wq for cpu intensive / RT stuff. Linus said,
> >>
> >> So stop arguing about irrelevancies. Nobody uses workqueues for RT
> >> or for CPU-intensive crap. It's not what they were designed for, or
> >> used for.
> >
> > Which is not relevant to this discussion .. We're talking about
> > re-prioritizing the workqueue threads. We're _not_ talking about
> > workqueues designed specifically for real time purposes.
>
> Well, it's somewhat related,
>
> * Don't depend on works or workqueues for RT stuff. It's not designed
> for that.

Too bad .. We have a posix OS, and posix has RT priorities .. You can't
control what priorities users give those threads.

> * If you really wanna solve the problem, please go ahead and _solve_
> it yourself. (read the rest of the mail)

You're causing the problem, why should I solve it? My solution would just
be to NAK your patches.

> >> * fragile as hell
> >
> > Changing the thread priorities shouldn't be fragile , if it is right now
> > then the threads are broken .. Can you explain in which cases you've
> > seen it being fragile?
>
> Because the workqueue might just go away in the next release or other
> unrelated work which shouldn't get high priority might be scheduled
> there. Maybe the name of the workqueue changes or it gets merged with
> another workqueue. Maybe it gets split. Maybe the system suspends
> and resumes and nobody knows that workers die and are created again
> over those events. Maybe the backend implementation changes so that
> workers are pooled.

Changing the priorities is not fragile; you're saying that one's ability to
adapt to changes in the kernel makes it hard to know what the workqueue
is actually doing.. Ok, that's fair.. This doesn't make it less useful
since people can discover thread dependencies without looking at the
kernel source.

> >> * depends heavily on unrelated implementation details
> >
> > I have no idea what this means.
>
> (continued) because all those are implementation details which are NOT
> PART OF THE INTERFACE in any way.

Yet they are part of the interface, like it or not. How could you use
threads and think thread priorities are not part of the interface?

In your new system how do you currently prevent thread priorities on
your new workqueue threads from getting modified? Surely you must be
doing that since you don't want those priorities to change right?

> >> * has extremely limited test coverage
> >
> > Simple, just write tests.
>
> Yeah, and test your few configurations with those,
>
> >> * doesn't help progressing mainline at all
> >
> > progressing where?
>
> (continued) and other people experiencing the same problem will have
> to do about the same thing and won't know whether their nice + pidof
> will work with the next kernel upgrade.
>
> Gee, I don't know. These are pretty evident problems to me. Aren't
> they obvious?

You're just looking at the problem through your specific use case glasses
without imagining what else people could be doing with the kernel.

How often do you think workqueues change names anyway? It's not all that
often.

> >> That's exactly like grepping /proc/kallsyms to determine some feature
> >> and claiming it's a feature whether the kernel intends it or not.
> >> Sure, use it all you want. Just don't expect it to be there on the
> >> next release.
> >
> > Your assume there's no value in changing the priorities which is wrong.
> > Your assuming way to much . Changing the priorities is useful.
>
> And you're assuming grepping /proc/kallsyms is not useful? It's
> useful in its adhoc unsupported hacky way.

Well, let's say it's useful and 100k people use that method in its
"hacky" way .. When does it become a feature then?

> >> You're basically saying that "I don't know how those inheritance
> >> inversions are happening but if I turn these magic knobs they seem to
> >> go away so I want those magic knobs". Maybe the RT part of the code
> >> shouldn't be depending on that many random things to begin with? And
> >> if there are actually things which are necessary, it's better idea to
> >> solve it properly through identifying problem points and properly
> >> inheriting priority instead of turning knobs until it somehow works?
> >
> > I think your mis-interpreting me .. If I write a thread (in userspace)
> > which I put into RT priorities I don't have a lot of control over what
> > dependencies the kernel may put on my thread. Think from a users
> > perspective not from a kernel developers perspective.
> >
> > I'm not saying changing a workqueue priority would be a final solution,
> > but it is a way to prioritize things immediately and has worked in prior
> > kernels up till your patches.
>
> * Make the kernel or driver or whatever you use in the RT path track
> priority is the right thing to do.

That's why you're changing the priority of the workqueue.

> * I'm very sorry I'm breaking your hacky workaround but seriously
> that's another problem to solve. Let's talk about the problem
> itself instead of your hacky workaround. (I think for most cases
> not using workqueue in RT path would be the right thing to do.)

You have no control over a workqueue being used in an RT path; like I said, you
can't control which applications might get RT priorities and what
workqueues they could be using..

Bottom line is you have to assume any kernel pathway could have an RT
thread using it. You can't say "This is RT safe in this kernel version, and
this other stuff is not RT safe." This is posix, everything can and
will get used by RT threads.

> >> If you wanna work on such things, be my guest. I'll be happy to work
> >> with you but please stop talking about setting priorities of
> >> workqueues from userland. That's just nuts.
> >
> > You just don't understand it.. How can you expect your patches to go
> > into mainline with this attitude toward usages you just don't
> > understand?
>
> I'll keep your doubts on mind but I'm really understanding what you're
> saying. You just don't understand that I understand and disagree. :-)

So you're totally unwilling to change your patches to correct this
problem? Is that what you're getting at? Agree or disagree isn't relevant;
it's a real problem or I wouldn't have brought it up.

btw, I already gave you a relatively easy way to correct this.

Daniel

2010-06-16 19:46:19

by Daniel Walker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Wed, 2010-06-16 at 21:20 +0200, Tejun Heo wrote:
> On 06/16/2010 08:46 PM, Tejun Heo wrote:
> > * I'm very sorry I'm breaking your hacky workaround but seriously
> > that's another problem to solve. Let's talk about the problem
> > itself instead of your hacky workaround. (I think for most cases
> > not using workqueue in RT path would be the right thing to do.)
>
> For example, for the actual case of amba-pl022.c you mentioned, where
> interrupt handler sometimes offloads to workqueue, convert
> amba-pl022.c to use threaded interrupt handler. That's why it's
> there.
>
> If you actually _solve_ the problem like this, other users wouldn't
> experience the problem at all once the update reaches them and you
> won't have to worry about your workaround breaking with the next
> kernel update or unexpected suspend/resume and we won't be having this
> discussion about adjusting workqueue priorities from userland.

What you're suggesting just means the user has to adjust an interrupt
thread instead of a workqueue thread. That really doesn't change
anything, since it's just another type of kernel thread.

> There are many wrong things about working around RT latency problems
> by setting workqueue priorities from userland. Please think about why
> the driver would have a separate workqueue for itself in the first
> place. It was to work around the limitation of workqueue facility and
> you're arguing that, because that work around allows yet another very
> fragile workaround, the property which made the original work around
> necessary in the first place needs to stay. That sounds really
> perverse to me.

I have no idea what you're trying to say here.. I'm sure there is no one
reason why people use workqueues in their drivers. In fact I'm sure
there are many reasons to use workqueues or not to use them.

Daniel

2010-06-16 19:54:44

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On 06/16/2010 09:36 PM, Daniel Walker wrote:
>> Yeah, and it would wait that by flushing the work, right? If the
>> waiting part is using completion or some other event notification,
>> you'll just need to update the driver so that the kernel can determine
>> who's waiting for what so that it can bump the waited one's priority.
>> Otherwise, the problem can't be solved.
>
> This has nothing to do with flushing .. You keep bringing this back into
> the kernel for some reason, we're talking about entirely userspace
> threads ..

Yeah, sure, how would those userspace threads wait for the event? And
how would the kernel be able to honor latency requirements not knowing
the dependency?

>> Well, it's somewhat related,
>>
>> * Don't depend on works or workqueues for RT stuff. It's not designed
>> for that.
>
> Too bad .. We have a posix OS, and posix has RT priorities .. You can't
> control what priorities user give those threads.

So, you're not talking about the real RT w/ timing guarantees? So no
priority inheritance or whatever? Gees, then any lock can give you
unexpected latencies.

>> * If you really wanna solve the problem, please go ahead and _solve_
>> it yourself. (read the rest of the mail)
>
> You're causing the problem; why should I solve it? My solution would just
> be to NAK your patches.

I don't have any problem with that. I would be almost happy to get
your NACK.

>> Because the workqueue might just go away in the next release or other
>> unrelated work which shouldn't get high priority might be scheduled
>> there. Maybe the name of the workqueue changes or it gets merged with
>> another workqueue. Maybe it gets split. Maybe the system suspends
>> and resumes and nobody knows that workers die and are created again
>> over those events. Maybe the backend implementation changes so that
>> workers are pooled.
>
> Changing the priorities is not fragile, you're saying that one's ability to
> adapt to changes in the kernel makes it hard to know what the workqueue
> is actually doing.. Ok, that's fair.. This doesn't make it less useful
> since people can discover thread dependencies without looking at the
> kernel source.

Sigh, so, yeah, the whole thing is fragile. When did I say nice(1) is
fragile?

>>>> * depends heavily on unrelated implementation details
>>>
>>> I have no idea what this means.
>>
>> (continued) because all those are implementation details which are NOT
>> PART OF THE INTERFACE in any way.
>
> yet they are part of the interface like it or not. How could you use
> threads and think thread priorities are not part of the interface.
>
> In your new system how do you currently prevent thread priorities on
> your new workqueue threads from getting modified? Surely you must be
> doing that since you don't want those priorities to change right?

No, I don't. If root wants to shoot itself in the foot, it can. In the
same vein, you can lower the priority of the migration thread at your own
peril.

>> Gee, I don't know. These are pretty evident problems to me. Aren't
>> they obvious?
>
> You're just looking at the problem through your specific use case glasses
> without imagining what else people could be doing with the kernel.
>
> How often do you think workqueues change names anyway? It's not all that
> often.

So do most symbols in the kernel. What you're saying applies almost
word for word to grepping /proc/kallsyms. Can't you see it?

>> And you're assuming grepping /proc/kallsyms is not useful? It's
>> useful in its adhoc unsupported hacky way.
>
> Well, let's say it's useful and 100k people use that method in its
> "hacky" way .. When does it become a feature then?

If 100k people actually want it, solve the damn problem instead of
holding onto the bandaid which doesn't work anyway.

> So you're totally unwilling to change your patches to correct this
> problem? Is that what you're getting at? Agree or disagree isn't relevant;
> it's a real problem or I wouldn't have brought it up.
>
> btw, I already gave you a relatively easy way to correct this.

I'm sorry but the problem you brought up seems bogus to me. So does
the solution. In debugfs? Is that a debug feature or is it an API? Do
we keep the workqueues stable then? Do we make announcements when
moving one work from one workqueue to another? If it's a debug
feature, why are we talking like this anyway?

Thanks.

--
tejun

2010-06-16 19:59:21

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hello,

On 06/16/2010 09:46 PM, Daniel Walker wrote:
>> If you actually _solve_ the problem like this, other users wouldn't
>> experience the problem at all once the update reaches them and you
>> won't have to worry about your workaround breaking with the next
>> kernel update or unexpected suspend/resume and we won't be having this
>> discussion about adjusting workqueue priorities from userland.
>
> What you're suggesting just means the user has to adjust an interrupt
> thread instead of a workqueue thread. That really doesn't change
> anything, since it's just another type of kernel thread.

It allows writing an interrupt handler which requires process context,
and the PREEMPT_RT kernel will handle priority inheritance correctly
for such handlers, so there's no priority inversion problem.

>> There are many wrong things about working around RT latency problems
>> by setting workqueue priorities from userland. Please think about why
>> the driver would have a separate workqueue for itself in the first
>> place. It was to work around the limitation of workqueue facility and
>> you're arguing that, because that work around allows yet another very
>> fragile workaround, the property which made the original work around
>> necessary in the first place needs to stay. That sounds really
>> perverse to me.
>
> I have no idea what you're trying to say here.. I'm sure there is no one
> reason why people use workqueues in their drivers. In fact I'm sure
> there are many reasons to use workqueues or not to use them.

I mean that if workqueues didn't have the current limitations, there
would be no reason for most drivers which aren't in the allocation path
to use separate workqueues. They could simply use the default system one.
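
In code terms that is just, e.g. (a minimal sketch, names made up):

	static void mydrv_work_fn(struct work_struct *work);
	static DECLARE_WORK(mydrv_work, mydrv_work_fn);

	/* queue on the shared, default system workqueue */
	schedule_work(&mydrv_work);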

Thanks.

--
tejun

2010-06-16 20:20:05

by Daniel Walker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Wed, 2010-06-16 at 21:52 +0200, Tejun Heo wrote:
> On 06/16/2010 09:36 PM, Daniel Walker wrote:
> >> Yeah, and it would wait that by flushing the work, right? If the
> >> waiting part is using completion or some other event notification,
> >> you'll just need to update the driver so that the kernel can determine
> >> who's waiting for what so that it can bump the waited one's priority.
> >> Otherwise, the problem can't be solved.
> >
> > This has nothing to do with flushing .. You keep bringing this back into
> > the kernel for some reason, we're talking about entirely userspace
> > threads ..
>
> Yeah, sure, how would those userspace threads wait for the event? And
> how would the kernel be able to honor latency requirements not knowing
> the dependency?

Let's say a userspace thread calls into the kernel via some method, a
syscall for instance, and while executing the syscall it hits a mutex,
semaphore, completion or any other blocking mechanism. Then the
userspace thread blocks.

> >> Well, it's somewhat related,
> >>
> >> * Don't depend on works or workqueues for RT stuff. It's not designed
> >> for that.
> >
> > Too bad .. We have a posix OS, and posix has RT priorities .. You can't
> > control what priorities user give those threads.
>
> So, you're not talking about the real RT w/ timing guarantees? So no
> priority inheritance or whatever? Gees, then any lock can give you
> unexpected latencies.

I'm talking about normal threads with RT priorities ..

> >> * If you really wanna solve the problem, please go ahead and _solve_
> >> it yourself. (read the rest of the mail)
> >
> > You're causing the problem; why should I solve it? My solution would just
> > be to NAK your patches.
>
> I don't have any problem with that. I would be almost happy to get
> your NACK.

Oh yeah? why is that?

> >> Because the workqueue might just go away in the next release or other
> >> unrelated work which shouldn't get high priority might be scheduled
> >> there. Maybe the name of the workqueue changes or it gets merged with
> >> another workqueue. Maybe it gets split. Maybe the system suspends
> >> and resumes and nobody knows that workers die and are created again
> >> over those events. Maybe the backend implementation changes so that
> >> workers are pooled.
> >
> > Changing the priorities is not fragile, you're saying that one's ability to
> > adapt to changes in the kernel makes it hard to know what the workqueue
> > is actually doing.. Ok, that's fair.. This doesn't make it less useful
> > since people can discover thread dependencies without looking at the
> > kernel source.
>
> Sigh, so, yeah, the whole thing is fragile. When did I say nice(1) is
> fragile?

Like I said, that type of "fragile" doesn't really matter that much,
since you can just discover any new thread dependencies on a new kernel.
Anyone running an RT thread would likely do that anyway, since even
syscalls are "fragile" in this way.

> >> Gee, I don't know. These are pretty evident problems to me. Aren't
> >> they obvious?
> >
> > You're just looking at the problem through your specific use case glasses
> > without imagining what else people could be doing with the kernel.
> >
> > How often do you think workqueues change names anyway? It's not all that
> > often.
>
> So do most symbols in the kernel. What you're saying applies almost
> word for word to grepping /proc/kallsyms. Can't you see it?

No .. I don't follow your grepping thing .. Symbols in the kernel change
pretty often, and generally aren't exposed to userspace in a way that
one can readily see them, like say using "top" or "ps" for
example. /proc/kallsyms is unknown, man; I vaguely knew what that was till
you mentioned it, yet _everyone_ has seen "kblockd".

> >> And you're assuming grepping /proc/kallsyms is not useful? It's
> >> useful in its adhoc unsupported hacky way.
> >
> > Well, let's say it's useful and 100k people use that method in its
> > "hacky" way .. When does it become a feature then?
>
> If 100k people actually want it, solve the damn problem instead of
> holding onto the bandaid which doesn't work anyway.

Again with the mass assumptions ..

> > So you're totally unwilling to change your patches to correct this
> > problem? Is that what you're getting at? Agree or disagree isn't relevant;
> > it's a real problem or I wouldn't have brought it up.
> >
> > btw, I already gave you a relatively easy way to correct this.
>
> I'm sorry but the problem you brought up seems bogus to me. So does
> the solution. In debugfs? Is that a debug feature or is it an API? Do
> we keep the workqueues stable then? Do we make announcements when
> moving one work from one workqueue to another? If it's a debug
> feature, why are we talking like this anyway?

I was suggesting it as a debug feature, but if people screamed loudly
enough then you would have to make it an API .. You need to have feature
parity with current mainline, which you don't have..

I don't know what you mean by "keep the workqueues stable"; you mean
the naming? No you don't .. You don't make announcements either..

I suggested it as a debug feature, yes, and you're the one arguing with
_me_ ..

Daniel

2010-06-16 20:25:42

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On 06/16/2010 10:19 PM, Daniel Walker wrote:
>> I don't have any problem with that. I would be almost happy to get
>> your NACK.
>
> Oh yeah? why is that?

Because the discussion isn't leading anywhere and I've already thrown
almost everything I can think of, so if I haven't convinced you yet
there's very low probability that I'll be able to do so with further
discussion. So, I duly collected your Nacked-by: on the grounds that
cmwq doesn't allow setting priorities from userland.

Thanks and have a good night.

--
tejun

2010-06-16 20:40:16

by Daniel Walker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Wed, 2010-06-16 at 22:24 +0200, Tejun Heo wrote:
> On 06/16/2010 10:19 PM, Daniel Walker wrote:
> >> I don't have any problem with that. I would be almost happy to get
> >> your NACK.
> >
> > Oh yeah? why is that?
>
> Because the discussion isn't leading anywhere and I've already thrown
> almost everything I can think of, so if I haven't convinced you yet
> there's very low probability that I'll be able to do so with further
> discussion. So, I duly collected your Nacked-by: on the grounds that
> cmwq doesn't allow setting priorities from userland.
>
> Thanks and have a good night.

Ok, and please don't try to summarize why there is a NAK'ed-by, if
people want to know they can contact me directly or read this thread.

Daniel

2010-06-16 21:42:33

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On 06/16/2010 10:40 PM, Daniel Walker wrote:
>> Because the discussion isn't leading anywhere and I've already thrown
>> almost everything I can think of, so if I haven't convinced you yet
>> there's very low probability that I'll be able to do so with further
>> discussion. So, I duly collected your Nacked-by: on the grounds that
>> cmwq doesn't allow setting priorities from userland.
>
> Ok, and please don't try to summarize why there is a NAK'ed-by, if
> people want to know they can contact me directly or read this thread.

No, Daniel. You don't get to tell me what I may or may not write. If you
don't agree with the summary, feel free to offer a better one and I
will consider updating mine. I'll provide my summary, arguments and
the link to this discussion when I send a pull request, and you'll be
cc'd. You're free to present your arguments there again if you feel it
necessary.

--
tejun

2010-06-17 05:29:53

by Florian Mickler

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Wed, 16 Jun 2010 21:20:36 +0200
Tejun Heo <[email protected]> wrote:

> On 06/16/2010 08:46 PM, Tejun Heo wrote:
> > * I'm very sorry I'm breaking your hacky workaround but seriously
> > that's another problem to solve. Let's talk about the problem
> > itself instead of your hacky workaround. (I think for most cases
> > not using workqueue in RT path would be the right thing to do.)
>
> For example, for the actual case of amba-pl022.c you mentioned, where
> interrupt handler sometimes offloads to workqueue, convert
> amba-pl022.c to use threaded interrupt handler. That's why it's
> there.
>
> If you actually _solve_ the problem like this, other users wouldn't
> experience the problem at all once the update reaches them and you
> won't have to worry about your workaround breaking with the next
> kernel update or unexpected suspend/resume and we won't be having this
> discussion about adjusting workqueue priorities from userland.
>
> There are many wrong things about working around RT latency problems
> by setting workqueue priorities from userland. Please think about why
> the driver would have a separate workqueue for itself in the first
> place. It was to work around the limitation of workqueue facility and
> you're arguing that, because that work around allows yet another very
> fragile workaround, the property which made the original work around
> necessary in the first place needs to stay. That sounds really
> perverse to me.
>

For what it's worth, IMO the right thing to do would probably be to
propagate the priority through the subsystem into the driver.

Not fumbling with thread priorities. As Tejun said, these are not
really userspace ABI ... (It's like hitting at the side of a vending
machine if the coin is stuck... may work, but definitely not
supported by the manufacturer)

Once you have the priority in the driver you could pass it to the
workqueue subsystem (i.e. set the priority of the work) and the worker
could then assume the priority of its work.

The tricky part is probably to pass the priority from the userspace
thread to the kernel?

Cheers,
Flo

2010-06-17 06:21:52

by Florian Mickler

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Thu, 17 Jun 2010 07:29:20 +0200
Florian Mickler <[email protected]> wrote:

> On Wed, 16 Jun 2010 21:20:36 +0200
> Tejun Heo <[email protected]> wrote:
>
> > On 06/16/2010 08:46 PM, Tejun Heo wrote:
> > > * I'm very sorry I'm breaking your hacky workaround but seriously
> > > that's another problem to solve. Let's talk about the problem
> > > itself instead of your hacky workaround. (I think for most cases
> > > not using workqueue in RT path would be the right thing to do.)
> >
> > For example, for the actual case of amba-pl022.c you mentioned, where
> > interrupt handler sometimes offloads to workqueue, convert
> > amba-pl022.c to use threaded interrupt handler. That's why it's
> > there.
> >
> > If you actually _solve_ the problem like this, other users wouldn't
> > experience the problem at all once the update reaches them and you
> > won't have to worry about your workaround breaking with the next
> > kernel update or unexpected suspend/resume and we won't be having this
> > discussion about adjusting workqueue priorities from userland.
> >
> > There are many wrong things about working around RT latency problems
> > by setting workqueue priorities from userland. Please think about why
> > the driver would have a separate workqueue for itself in the first
> > place. It was to work around the limitation of workqueue facility and
> > you're arguing that, because that work around allows yet another very
> > fragile workaround, the property which made the original work around
> > necessary in the first place needs to stay. That sounds really
> > perverse to me.
> >
>
> For what it's worth, IMO the right thing to do would probably be to
> propagate the priority through the subsystem into the driver.

I was thinking about input devices here... anyway, my point is that
the user of the workqueue interface should pass the priority context to
the workqueue subsystem...

If a userspace application needs a driver to have a specific priority...
can it inform the system about it (thinking sysfs here..)? That would
prevent userspace having to "kick" the kernel if something gets stuck....

Cheers,
Flo

2010-06-17 08:29:40

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hello,

On 06/17/2010 07:29 AM, Florian Mickler wrote:
> Once you have the priority in the driver you could pass it to the
> workqueue subsystem (i.e. set the priority of the work) and the worker
> could then assume the priority of its work.
>
> The tricky part is probably to pass the priority from the userspace
> thread to the kernel?

There are several things to consider. First of all, workqueue
latencies would be much lower with cmwq because works don't need to
wait for other works on the same wq to complete. So, in general, the
latencies will be more predictable and lower.

Secondly, work items don't tend to burn a lot of cpu cycles. Combined
with the first point, there might not be much benefit to doing all the
extra work for prioritizing work items (as what matters is getting
them to start running quickly and cmwq helps a lot in that direction).

Thirdly, on a !RT kernel, RT doesn't have any guarantee anyway. It's
just higher priority. For real RT, if work items should be capable of
being part of a critical path, implementing proper priority inheritance
is necessary so that flush_work() and possibly flush_workqueue() can
bump the waited-on works on behalf of the RT thread waiting on them.
Whether that would be a worthwhile investment, or whether simply not
using work items in the critical path is better, I don't know.

Thanks.

--
tejun

2010-06-17 12:04:52

by Andy Walls

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Wed, 2010-06-16 at 07:05 -0700, Daniel Walker wrote:
> On Wed, 2010-06-16 at 15:45 +0200, Tejun Heo wrote:
> > On 06/16/2010 03:41 PM, Daniel Walker wrote:
> > > Any workqueue that has a thread which can be prioritized from userspace.
> > > As long as there is a thread it can usually be given a priority from
> > > userspace, so any _current_ workqueue which uses a single thread or
> > > multiple threads is an example of what I'm talking about.
> >
> > Eh... what's the use case for that? That's just so wrong. What do
> > you do after a suspend/resume cycle? Reprioritize all of them from
> > suspend/resume hooks?
>
> The use case is any situation when the user wants to give higher
> priority to some set of work items, and there's nothing wrong with that.
> In fact there has been a lot of work in the RT kernel related to
> workqueue prioritization ..

I'm going to agree with Tejun, that tweaking worker thread priorities
seems like an odd thing, since they are meant to handle deferable
actions - things that can be put off until later.

If one needs to support Real Time deadlines on deferable actions,
wouldn't using dedicated kernel threads be more deterministic?
Would the user ever up the priority for a workqueue other than a
single-threaded workqueue?

Regards,
Andy

> Daniel


2010-06-17 16:56:55

by Daniel Walker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Thu, 2010-06-17 at 08:01 -0400, Andy Walls wrote:
> On Wed, 2010-06-16 at 07:05 -0700, Daniel Walker wrote:
> > On Wed, 2010-06-16 at 15:45 +0200, Tejun Heo wrote:
> > > On 06/16/2010 03:41 PM, Daniel Walker wrote:
> > > > Any workqueue that has a thread which can be prioritized from userspace.
> > > > As long as there is a thread it can usually be given a priority from
> > > > userspace, so any _current_ workqueue which uses a single thread or
> > > > multiple threads is an example of what I'm talking about.
> > >
> > > Eh... what's the use case for that? That's just so wrong. What do
> > > you do after a suspend/resume cycle? Reprioritize all of them from
> > > suspend/resume hooks?
> >
> > The use case is any situation when the user wants to give higher
> > priority to some set of work items, and there's nothing wrong with that.
> > In fact there has been a lot of work in the RT kernel related to
> > workqueue prioritization ..
>
> I'm going to agree with Tejun, that tweaking worker thread priorities
> seems like an odd thing, since they are meant to handle deferable
> actions - things that can be put off until later.

Running RT threads at all can be thought of as odd, but it's an
available feature. In order to effectively run RT threads it's best if
you're able to prioritize the system, and that is available with current
workqueues, period ..

The problem is Tejun is removing that; ok, so maybe this feature of
workqueues is an odd thing to use (just like using RT threads). However,
people are still doing it, it is useful, and it is available in mainline
(and has been for a long time) ..

All I'm asking Tejun to do is have feature parity with current mainline,
and it's not even that hard to do that.

> If one needs to support Real Time deadlines on deferable actions,
> wouldn't using dedicated kernel threads be more deterministic?
> Would the user ever up the priority for a workqueue other than a
> single-threaded workqueue?

It's in a thread and you can prioritize it, so it's only deferable if
the user defines it that way. What you're suggesting with dedicated
threads _might_ be something you would do in the long run, but to
satisfy my current of-the-moment needs I would rather re-prioritize the
workqueue instead of investing a significant amount of time re-writing
the driver when it's unknown if that would even help..

Daniel

2010-06-17 18:03:57

by Daniel Walker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Thu, 2010-06-17 at 07:29 +0200, Florian Mickler wrote:
>
>
> For what its worth, IMO the right thing to do would probably be to
> propagate the priority through the subsystem into the driver.
>
> Not fumbling with thread priorities. As Tejun said, these are not
> really userspace ABI ... (It's like hitting at the side of a vending
> machine if the coin is stuck... may work, but definitely not
> supported by the manufacturer)

I don't agree with your analogy here .. It's more like you have two
items in the vending machine item A and item B. Tejun is saying he likes
item A, his friends like item A so that must mean item B is not
interesting so he removes it (without knowing how many people want it).
So what happens? People use another vending machine. (Linux is the
vending machine) ..

From my perspective this is like using Linux only for throughput, which
is what Tejun is maximizing. In addition Tejun is blowing up the
alternative, which is to prioritize the work items, and sticking you with
strictly a throughput-based system.

Do you think maybe it's possible that the work items aren't all created
equal? Maybe one item _is_ more important than another one. Maybe on a
given system Tejun's workqueues run 1000 useless, pointless work items
before he gets to the valuable one .. Yet the user is powerless to
dictate what is or is not important.

> Once you have the priority in the driver you could pass it to the
> workqueue subsystem (i.e. set the priority of the work) and the worker
> could then assume the priority of its work.
>
> The tricky part is probably to pass the priority from the userspace
> thread to the kernel?

Threads are designed to have priorities tho, and threads are pervasive
throughout Linux .. It seems like setting priorities on drivers would
be like re-inventing the wheel ..

If you were to, say, make the driver use kthreads instead of workqueues,
then you could set the priority of the kthreads .. However, you said
this isn't part of the ABI and so you're back to your original argument
which is that you shouldn't be setting priorities in the first place.

Daniel

2010-06-17 22:29:04

by Daniel Walker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Wed, 2010-06-16 at 15:45 +0200, Tejun Heo wrote:
> On 06/16/2010 03:41 PM, Daniel Walker wrote:
> > Any workqueue that has a thread which can be prioritized from userspace.
> > As long as there is a thread it can usually be given a priority from
> > userspace, so any _current_ workqueue which uses a single thread or
> > multiple threads is an example of what I'm talking about.
>
> Eh... what's the use case for that? That's just so wrong. What do
> you do after a suspend/resume cycle? Reprioritize all of them from
> suspend/resume hooks?


I tested your assertion about suspend/resume, and it doesn't seem to be
true.. I tested workqueues with nice levels and real-time priorities on
a random laptop using 2.6.31; both held their priorities across suspend
to RAM and suspend to disk .. Was this change added after 2.6.31?

Daniel

2010-06-17 23:16:22

by Andrew Morton

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Tue, 15 Jun 2010 20:25:28 +0200
Tejun Heo <[email protected]> wrote:

> Hello, all.

Thanks for doing this. It helps. And look at all the interest and
helpful suggestions!

> So, here's the overview I wrote up today. If anything needs more
> clarification, just ask. Thanks.
>
> == Overview
>
> There are many cases where an execution context is needed and there
> already are several mechanisms for them. The most commonly used one
> is workqueue and there are slow_work, async and a few other. Although
> workqueue has been serving the kernel for quite some time now, it has
> some limitations.
>
> There are two types of workqueues, single and multi threaded. MT wq
> keeps a bound thread for each online CPU, while ST wq uses single
> unbound thread. With the quickly rising number of CPU cores, there
> already are systems in which just booting up saturates the default 32k
> PID space.
>
> Frustratingly, although MT wqs end up spending a lot of resources, the
> level of concurrency provided is unsatisfactory. The concurrency
> limitation is common to both ST and MT wqs although it's less severe
> on MT ones. Worker pools of wqs are completely separate from each
> other. A MT wq provides one execution context per CPU while a ST wq
> one for the whole system. This leads to various problems.
>
> One such problem is possible deadlock through dependency on the same
> execution resource. These can be detected quite reliably with lockdep
> these days but in most cases the only solution is to create a
> dedicated wq for one of the parties involved in the deadlock, which
> feeds back into the waste of resources. Also, when creating such
> dedicated wq to avoid deadlock, to avoid wasting large number of
> threads just for that work, ST wqs are often used but in most cases ST
> wqs are suboptimal compared to MT wqs.

Does this approach actually *solve* the deadlocks due to work
dependencies? Or does it just make the deadlocks harder to hit by
throwing more threads at the problem?

ah, from reading on I see it's the make-them-harder-to-hit approach.

Does lockdep still tell us that we're in a potentially deadlockable
situation?

> The tension between the provided level of concurrency and resource
> usage force its users to make unnecessary tradeoffs like libata
> choosing to use ST wq for polling PIOs and accepting a silly
> limitation that no two polling PIOs can be in progress at the same
> time. As MT wqs don't provide much better concurrency, users which
> require higher level of concurrency, like async or fscache, end up
> having to implement their own worker pool.
>
> cmwq extends workqueue with focus on the following goals.
>
> * Workqueue is already very widely used. Maintain compatibility with
> the current API while removing limitations of the current
> implementation.
>
> * Provide single unified worker pool per cpu which can be shared by
> all users. The worker pool and level of concurrency should be
> regulated automatically so that the API users don't need to worry
> about that.
>
> * Use what's necessary and allocate resources lazily on demand while
> still maintaining forward progress guarantee where necessary.

There are places where code creates workqueue threads and then fiddles
with those threads' scheduling priority or scheduling policy or
whatever. I'll address that in a different email.

>
> == Unified worklist
>
> There's a single global cwq, or gcwq, per each possible cpu which
> actually serves out the execution contexts. cpu_workqueues or cwqs of
> each wq are mostly simple frontends to the associated gcwq. Under
> normal operation, when a work is queued, it's queued to the gcwq on
> the same cpu. Each gcwq has its own pool of workers bound to the gcwq
> which will be used to process all the works queued on the cpu. For
> the most part, works don't care to which wqs they're queued to and
> using a unified worklist is pretty straight forward. There are a
> couple of areas where things are a bit more complicated.
>
> First, when queueing works from different wqs on the same queue,
> ordering of works needs special care. Originally, a MT wq allows a
> work to be executed simultaneously on multiple cpus although it
> doesn't allow the same one to execute simultaneously on the same cpu
> (reentrant). A ST wq allows only single work to be executed on any
> cpu which guarantees both non-reentrancy and single-threadedness.
>
> cmwq provides three different ordering modes - reentrant (default),
> non-reentrant and single-cpu, where single-cpu can be used to achieve
> single-threadedness and full ordering combined with in-flight work
> limit of 1. The default mode is basically the same as the original
> implementation. The distinction between non-reentrancy and single-cpu
> were made because some ST wq users didn't really need single
> threadedness but just non-reentrancy.
>
> Another area where things get more involved is workqueue flushing as
> for flushing to which wq a work is queued matters. cmwq tracks this
> using colors. When a work is queued to a cwq, it's assigned a color
> and each cwq maintains counters for each work color. The color
> assignment changes on each wq flush attempt. A cwq can tell that all
> works queued before a certain wq flush attempt have finished by
> waiting for all the colors upto that point to drain. This maintains
> the original workqueue flush semantics without adding unscalable
> overhead.

flush_workqueue() sucks. It's a stupid, accidental,
internal-implementation-dependent interface. We should deprecate it
and try to get rid of it, migrating to the eminently more sensible
flush_work().

I guess the first step is to add a dont-do-that checkpatch warning when
people try to add new flush_workqueue() calls.

165 instances tree-wide, sigh.

>
> == Automatically regulated shared worker pool
>
> For any worker pool, managing the concurrency level (how many workers
> are executing simultaneously) is an important issue.

Why? What are we trying to avoid here?

> cmwq tries to
> keep the concurrency at minimum but sufficient level.

I don't have a hope of remembering what all the new three-letter and
four-letter acronyms mean :(

> Concurrency management is implemented by hooking into the scheduler.
> gcwq is notified whenever a busy worker wakes up or sleeps and thus

<tries to work out what gcwq means, and not just "what it expands to">

> can keep track of the current level of concurrency. Works aren't
> supposed to be cpu cycle hogs and maintaining just enough concurrency
> to prevent work processing from stalling due to lack of processing
> context should be optimal. gcwq keeps the number of concurrent active
> workers to minimum but no less.

Is that "the number of concurrent active workers per cpu"?

> As long as there's one or more
> running workers on the cpu, no new worker is scheduled so that works
> can be processed in batch as much as possible but when the last
> running worker blocks, gcwq immediately schedules new worker so that
> the cpu doesn't sit idle while there are works to be processed.

"immediately schedules": I assume that this means that the thread is
made runnable, but isn't necessarily immediately executed?

If it _is_ immediately given the CPU then it sounds locky uppy?

> This allows using minimal number of workers without losing execution
> bandwidth. Keeping idle workers around doesn't cost much other than
> the memory space, so cmwq holds onto idle ones for a while before
> killing them.
>
> As multiple execution contexts are available for each wq, deadlocks
> around execution contexts is much harder to create. The default
> workqueue, system_wq, has maximum concurrency level of 256 and unless
> there is a use case which can result in a dependency loop involving
> more than 254 workers, it won't deadlock.

ah, there we go.

hm.

> Such forward progress guarantee relies on that workers can be created
> when more execution contexts are necessary. This is guaranteed by
> using emergency workers. All wqs which can be used in allocation path

allocation of what?

> are required to have emergency workers which are reserved for
> execution of that specific workqueue so that allocation needed for
> worker creation doesn't deadlock on workers.
>
>
> == Benefits
>
> * Less to worry about causing deadlocks around execution resources.
>
> * Far fewer number of kthreads.
>
> * More flexibility without runtime overhead.
>
> * As concurrency is no longer a problem, workloads which needed
> separate mechanisms can now use generic workqueue instead. This
> easy access to concurrency also allows stuff which wasn't worth
> implementing a dedicated mechanism for but still needed flexible
> concurrency.
>
>
> == Numbers (this is with the third take but nothing which could affect
> performance has changed since then. Eh well, very little has
> changed since then in fact.)

yes, it's hard to see how any of these changes could affect CPU
consumption in any way. Perhaps something like padata might care. Did
you look at padata much?


2010-06-17 23:16:40

by Andrew Morton

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Wed, 16 Jun 2010 18:55:05 +0200
Tejun Heo <[email protected]> wrote:

> It was about using wq for cpu intensive / RT stuff. Linus said,
>
> So stop arguing about irrelevancies. Nobody uses workqueues for RT
> or for CPU-intensive crap. It's not what they were designed for, or
> used for.

kernel/padata.c uses workqueues for cpu-intensive work, as I understand
it.

I share Daniel's concerns here. Being able to set a worker thread's
priority or policy isn't a crazy thing. Also one might want to specify
that a work item be executed on one of a node's CPUs, or within a
cpuset's CPUs, maybe other stuff. I have vague feelings that there's
already code in the kernel somewhere which does some of these things.

(Please remind me what your patches did about create_rt_workqueue and
stop_machine?)

(Please note that drivers/media/video/ivtv/ivtv-irq.c is currently
running sched_setscheduler() against a workqueue thread of its own
creation, so we have precedent).
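
(For reference, the pattern being referred to is roughly the following;
a simplified sketch of the approach, not the actual ivtv code:

#include <linux/sched.h>
#include <linux/workqueue.h>

static void my_irq_work_handler(struct work_struct *work)
{
	/* simplified: a real driver would track this with a flag bit in
	   its device structure rather than a static */
	static bool prio_set;

	if (!prio_set) {
		struct sched_param param = { .sched_priority = 99 };

		/* promote whichever thread is serving this workqueue */
		sched_setscheduler(current, SCHED_FIFO, &param);
		prio_set = true;
	}

	/* ... process the deferred interrupt work ... */
}

Note that this only does what the driver intends while the work always
runs on the driver's own dedicated worker thread, which is exactly the
property under discussion.)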

If someone wants realtime service for a work item then at present, the
way to do that is to create your own kernel threads, set their policy
and start feeding them work items. That sounds like a sensible
requirement and implementation to me. But how does it translate into
the new implementation?

The priority/policy logically attaches to the work itself, not to the
thread which serves it. So one would want to be able to provide that
info at queue_work()-time. Could the workqueue core then find a thread,
set its policy/priority, schedule it and then let the CPU scheduler do
its usual thing with it?

That doesn't sound too bad? Add policy/priority/etc fields to the
work_struct?
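
A purely hypothetical sketch of that direction (no such interface exists;
the name and signature below are made up only to illustrate the idea):

/* hypothetical: queue @work on @wq and ask that whichever worker ends up
 * executing it runs with the given scheduling policy/priority */
int queue_work_prio(struct workqueue_struct *wq, struct work_struct *work,
		    int policy, int rt_priority);

	/* e.g. request SCHED_FIFO priority 50 for this particular item */
	queue_work_prio(my_wq, &my_work, SCHED_FIFO, 50);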

2010-06-17 23:17:29

by Andrew Morton

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Thu, 17 Jun 2010 08:01:06 -0400
Andy Walls <[email protected]> wrote:

> I'm going to agree with Tejun, that tweaking worker thread priorities
> seems like an odd thing, since they are meant to handle deferable
> actions - things that can be put off until later.

Disagree. If you're in an interrupt handler and have some work which
you want done in process context and you want it done RIGHT NOW then
handing that work off to a realtime-policy worker thread is a fine way of
doing that.

2010-06-17 23:28:32

by Joel Becker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Thu, Jun 17, 2010 at 04:14:12PM -0700, Andrew Morton wrote:
> flush_workqueue() sucks. It's a stupid, accidental,
> internal-implementation-dependent interface. We should deprecate it
> and try to get rid of it, migrating to the eminently more sensible
> flush_work().
>
> I guess the first step is to add a dont-do-that checkpatch warning when
> people try to add new flush_workqueue() calls.
>
> 165 instances tree-wide, sigh.

What would the API be for "I want this workqueue emptied before
I shut this thing down?" It seems silly to have everyone open-code a
loop trying to flush_work() every item in the queue.

Joel

--

Life's Little Instruction Book #20

"Be forgiving of yourself and others."

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2010-06-17 23:59:19

by Andrew Morton

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Thu, 17 Jun 2010 16:25:03 -0700
Joel Becker <[email protected]> wrote:

> On Thu, Jun 17, 2010 at 04:14:12PM -0700, Andrew Morton wrote:
> > flush_workqueue() sucks. It's a stupid, accidental,
> > internal-implementation-dependent interface. We should deprecate it
> > and try to get rid of it, migrating to the eminently more sensible
> > flush_work().
> >
> > I guess the first step is to add a dont-do-that checkpatch warning when
> > people try to add new flush_workqueue() calls.
> >
> > 165 instances tree-wide, sigh.
>
> What would the API be for "I want this workqueue emptied before
> I shut this thing down?"

Um, yeah. flush_workqueue() is legitimate. I was thinking of
flush_scheduled_work() - the one which operates on the keventd queue.

2010-06-18 06:37:06

by Florian Mickler

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Thu, 17 Jun 2010 11:03:49 -0700
Daniel Walker <[email protected]> wrote:

>
> I don't agree with your analogy here .. It's more like you have two
> items in the vending machine item A and item B. Tejun is saying he likes
> item A, his friends like item A so that must mean item B is not
> interesting so he removes it (without knowing how many people want it).
> So what happens? People use another vending machine. (Linux is the
> vending machine) ..

fair enough. let's stop the analogies here, i'm already sorry I
started with it. :)

> From my perspective this is like using Linux only for throughput, which
> is what Tejun is maximizing. In addition Tejun is blowing up the
> alternative, which is to prioritize the work items, and sticking you with
> strictly a throughput-based system.

No, the fact that multiple workers work on a workqueue decreases
latency as well... (worst-case latency would be the same if there were
a minimum limit for worker threads per queue. I'm not sure if this is
implemented.. )

>
> Do you think maybe it's possible that the work items aren't all created
> equal? Maybe one item _is_ more important than another one. Maybe on a
> given system Tejun's workqueues run 1000 useless, pointless work items
> before he gets to the valuable one .. Yet the user is powerless to
> dictate what is or is not important.

you have a point here. I think the priority should go with the
work-item though. Or do you think that is not a good solution?

one priority-inheritance approach would be to increase the priority of all
work-items queued before the "valuable" work item. another way is to
just have enough worker-threads so that all work-items are executed
as fast as possible (if necessary summoning new workers). a third is to
have a separate worker-pool and workqueue for high-priority work. and
i'm guessing there are more solutions...

> > Once you have the priority in the driver you could pass it to the
> > workqueue subsystem (i.e. set the priority of the work) and the worker
> > could then assume the priority of its work.
> >
> > The tricky part is probably to pass the priority from the userspace
> > thread to the kernel?
>
> Threads are designed to have priorities tho, and threads are pervasive
> throughout Linux .. It seems like setting priorities on drivers would
> be like re-inventing the wheel ..

No, no. Not setting priorities on drivers. What I wanted to get at is
that the one who schedules the work has to/can decide what priority that
work should run as. I.e. the priority has to go with the work.

Because you are upping the priority of a thread not for the thread's
sake but for the work the thread is going to execute/executing?

>
> If you were to, say, make the driver use kthreads instead of workqueues,
> then you could set the priority of the kthreads .. However, you said
> this isn't part of the ABI and so you're back to your original argument
> which is that you shouldn't be setting priorities in the first place.

I don't follow. What I thought about was that the "workqueue"
interface is defined in workqueue.h.

There is no "increase_priority_of_workqueue()" or
"increase_work_priority()" at the moment in the interface description.

Fumbling with the threads is using _implementation knowledge_ that should
be hidden by the interface.
I'm not saying this is a binary true/false kind of "fact", just one
viewpoint.

Also I agree that some ability to prioritize work
items has to be enabled. And using the scheduler for that is the only
sane alternative. But I think the workqueue-implementation should do
it, so it has the freedom to dispatch threads at will.

>
> Daniel
>

2010-06-18 07:15:57

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On 06/18/2010 01:56 AM, Andrew Morton wrote:
> On Thu, 17 Jun 2010 16:25:03 -0700
> Joel Becker <[email protected]> wrote:
>
>> On Thu, Jun 17, 2010 at 04:14:12PM -0700, Andrew Morton wrote:
>>> flush_workqueue() sucks. It's a stupid, accidental,
>>> internal-implementation-dependent interface. We should deprecate it
>>> and try to get rid of it, migrating to the eminently more sensible
>>> flush_work().
>>>
>>> I guess the first step is to add a dont-do-that checkpatch warning when
>>> people try to add new flush_workqueue() calls.
>>>
>>> 165 instances tree-wide, sigh.
>>
>> What would the API be for "I want this workqueue emptied before
>> I shut this thing down?"
>
> Um, yeah. flush_workqueue() is legitimate. I was thinking of
> flush_scheduled_work() - the one which operates on the keventd queue.

With cmwq, each wq now costs much less, so we should be able to
convert them to use their own workqueues without too much problem and
deprecate flushing of system workqueues.
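
For most callers the conversion is mechanical; a minimal sketch (driver
name and work function made up):

	static void mydrv_work_fn(struct work_struct *work);
	static DECLARE_WORK(mydrv_work, mydrv_work_fn);
	static struct workqueue_struct *mydrv_wq;

	/* init: private queue instead of the shared default one */
	mydrv_wq = create_workqueue("mydrv");

	/* was: schedule_work(&mydrv_work); */
	queue_work(mydrv_wq, &mydrv_work);

	/* teardown -- was: flush_scheduled_work(); */
	flush_workqueue(mydrv_wq);
	destroy_workqueue(mydrv_wq);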

Thanks.

--
tejun

2010-06-18 07:16:51

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On 06/18/2010 01:16 AM, Andrew Morton wrote:
> On Thu, 17 Jun 2010 08:01:06 -0400
> Andy Walls <[email protected]> wrote:
>
>> I'm going to agree with Tejun, that tweaking worker thread priorities
>> seems like an odd thing, since they are meant to handle deferable
>> actions - things that can be put off until later.
>
> Disagree. If you're in an interrupt handler and have some work which
> you want done in process context and you want it done RIGHT NOW then
> handing that work off to a realtime-policy worker thread is a fine way of
> doing that.

In that case, the right thing to do would be using a threaded interrupt
handler. It's not only easier but also provides enough context such
that the RT kernel can do the right thing.

Thanks.

--
tejun

2010-06-18 07:32:35

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hello,

On 06/18/2010 01:14 AM, Andrew Morton wrote:
> Thanks for doing this. It helps. And look at all the interest and
> helpful suggestions!

Yay!

>> One such problem is possible deadlock through dependency on the same
>> execution resource. These can be detected quite reliably with lockdep
>> these days but in most cases the only solution is to create a
>> dedicated wq for one of the parties involved in the deadlock, which
>> feeds back into the waste of resources. Also, when creating such
>> dedicated wq to avoid deadlock, to avoid wasting large number of
>> threads just for that work, ST wqs are often used but in most cases ST
>> wqs are suboptimal compared to MT wqs.
>
> Does this approach actually *solve* the deadlocks due to work
> dependencies? Or does it just make the deadlocks harder to hit by
> throwing more threads at the problem?
>
> ah, from reading on I see it's the make-them-harder-to-hit approach.

Yeah, the latter, much harder.

> Does lockdep still tell us that we're in a potentially deadlockable
> situation?

Lockdep wouldn't apply as-is. I _think_ it's possible to calculate
the possibility of simultaneous works hitting the limit by extending
lockdep but given the use cases we currently have (all are very
shallow dependency chains, most of them being 2), I don't think it's
urgent.

> There are places where code creates workqueue threads and then fiddles
> with those threads' scheduling priority or scheduling policy or
> whatever. I'll address that in a different email.

Alright.

> flush_workqueue() sucks. It's a stupid, accidental,
> internal-implementation-dependent interface. We should deprecate it
> and try to get rid of it, migrating to the eminently more sensible
> flush_work().
>
> I guess the first step is to add a dont-do-that checkpatch warning when
> people try to add new flush_workqueue() calls.
>
> 165 instances tree-wide, sigh.

I would prefer a sweeping fix followed by deprecation of the function.
Gradual changes sound nice but in most cases they just result in
postponing what needs to be done anyway.

>> == Automatically regulated shared worker pool
>>
>> For any worker pool, managing the concurrency level (how many workers
>> are executing simultaneously) is an important issue.
>
> Why? What are we trying to avoid here?

Unnecessary heuristics which may sometimes schedule too many workers,
wasting resources and polluting cachelines, while at other times
scheduling too few, introducing unnecessary latencies.

>> cmwq tries to
>> keep the concurrency at minimum but sufficient level.
>
> I don't have a hope of remembering what all the new three-letter and
> four-letter acronyms mean :(

It stands for Concurrency Managed WorkQueue. Eh well, as long as it
works as an identifier.

>> Concurrency management is implemented by hooking into the scheduler.
>> gcwq is notified whenever a busy worker wakes up or sleeps and thus
>
> <tries to work out what gcwq means, and not just "what it expands to">

Global cpu workqueue. It's the actual percpu workqueue which does all
the hard work. Workqueues and their associated cpu workqueues work
as frontends to gcwqs.

>> can keep track of the current level of concurrency. Works aren't
>> supposed to be cpu cycle hogs and maintaining just enough concurrency
>> to prevent work processing from stalling due to lack of processing
>> context should be optimal. gcwq keeps the number of concurrent active
>> workers to minimum but no less.
>
> Is that "the number of concurrent active workers per cpu"?

I don't really understand your question.

>> As long as there's one or more
>> running workers on the cpu, no new worker is scheduled so that works
>> can be processed in batch as much as possible but when the last
>> running worker blocks, gcwq immediately schedules new worker so that
>> the cpu doesn't sit idle while there are works to be processed.
>
> "immediately schedules": I assume that this means that the thread is
> made runnable, but isn't necessarily immediately executed?
>
> If it _is_ immediately given the CPU then it sounds locky uppy?

It's made runnable.

>> This allows using minimal number of workers without losing execution
>> bandwidth. Keeping idle workers around doesn't cost much other than
>> the memory space, so cmwq holds onto idle ones for a while before
>> killing them.
>>
>> As multiple execution contexts are available for each wq, deadlocks
>> around execution contexts is much harder to create. The default
>> workqueue, system_wq, has maximum concurrency level of 256 and unless
>> there is a use case which can result in a dependency loop involving
>> more than 254 workers, it won't deadlock.
>
> ah, there we go.
>
> hm.
>
>> Such forward progress guarantee relies on that workers can be created
>> when more execution contexts are necessary. This is guaranteed by
>> using emergency workers. All wqs which can be used in allocation path
>
> allocation of what?

Memory to create new kthreads.

>> == Numbers (this is with the third take but nothing which could affect
>> performance has changed since then. Eh well, very little has
>> changed since then in fact.)
>
> yes, it's hard to see how any of these changes could affect CPU
> consumption in any way. Perhaps something like padata might care. Did
> you look at padata much?

I've read about it. Haven't read the code yet tho. Accommodating it
isn't difficult. We just need an interface which works used by padata
can call to tell wq not to track concurrency for the worker, as it's
serving a cpu intensive job.

Thanks.

--
tejun

2010-06-18 07:32:47

by Andrew Morton

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Fri, 18 Jun 2010 09:16:15 +0200 Tejun Heo <[email protected]> wrote:

> On 06/18/2010 01:16 AM, Andrew Morton wrote:
> > On Thu, 17 Jun 2010 08:01:06 -0400
> > Andy Walls <[email protected]> wrote:
> >
> >> I'm going to agree with Tejun, that tweaking worker thread priorities
> >> seems like an odd thing, since they are meant to handle deferable
> >> actions - things that can be put off until later.
> >
> > Disagree. If you're in an interrupt handler and have some work which
> > you want done in process context and you want it done RIGHT NOW then
> > handing that work off to a realtime-policy worker thread is a fine way of
> > doing that.
>
> In that case, the right thing to do would be using a threaded interrupt
> handler. It's not only easier but also provides enough context such
> that the RT kernel can do the right thing.

Nope. Consider a simple byte-at-a-time rx handler. The ISR grabs the
byte, stashes it away, bangs on the hardware a bit then signals
userspace to promptly start processing that byte. Very simple,
legitimate and a valid thing to do.

Also the "interrupt" code might be running from a timer handler. Or it
might just be in process context, buried in a forest of locks and wants
to punt further processing into a separate process.

2010-06-18 08:04:35

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hello,

On 06/18/2010 01:15 AM, Andrew Morton wrote:
> On Wed, 16 Jun 2010 18:55:05 +0200
> Tejun Heo <[email protected]> wrote:
>
>> It was about using wq for cpu intensive / RT stuff. Linus said,
>>
>> So stop arguing about irrelevancies. Nobody uses workqueues for RT
>> or for CPU-intensive crap. It's not what they were designed for, or
>> used for.
>
> kernel/padata.c uses workqueues for cpu-intensive work, as I understand
> it.

Replied in the other mail, but supporting padata isn't hard and I think
padata is actually the right way to support cpu intensive workloads.
wq works as a (conceptually) simple concurrency provider and another
core layer can manage its priority and re-export it as necessary.

> I share Daniel's concerns here. Being able to set a worker thread's
> priority or policy isn't a crazy thing.

Well, priority itself isn't, but doing that from userland is, and most
of the conversation was about cmwq taking away the ability to do that
from userland.

> Also one might want to specify that a work item be executed on one
> of a node's CPUs, or within a cpuset's CPUs, maybe other stuff. I
> have vague feelings that there's already code in the kernel
> somewhere which does some of these things.

There was a virtual driver which wanted to put workers into cpusets.
I'll talk about it below w/ ivtv.

> (Please remind me what your patches did about create_rt_workqueue and
> stop_machine?)

stop_machine was using wq as a frontend to threads and repeatedly
creating and destroying them on demand, which caused scalability issues
on machines with a lot of cpus. The scheduler had per-cpu persistent RT
threads which were multiplexed in an ad-hoc way to serve other purposes
too. cpu_stop implements per-cpu persistent RT workers with a proper
interface, and now both the scheduler and stop_machine use them.
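
A minimal usage sketch of that interface (the callback is made up):

#include <linux/stop_machine.h>

static int my_stop_fn(void *arg)
{
	/* runs from the per-cpu stopper (RT) thread on the target cpu
	   with preemption disabled; must not sleep */
	return 0;
}

	/* run my_stop_fn on cpu 3 and wait for it to finish */
	ret = stop_one_cpu(3, my_stop_fn, NULL);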

> (Please note that drivers/media/video/ivtv/ivtv-irq.c is currently
> running sched_setscheduler() against a workqueue thread of its own
> creation, so we have precedent).

Oooh... missed that.

> If someone wants realtime service for a work item then at present, the
> way to do that is to create your own kernel threads, set their policy
> and start feeding them work items. That sounds like a sensible
> requirement and implementation to me. But how does it translate into
> the new implementation?
>
> The priority/policy logically attaches to the work itself, not to the
> thread which serves it. So one would want to be able to provide that
> info at queue_work()-time. Could the workqueue core then find a thread,
> set its policy/priority, schedule it and then let the CPU scheduler do
> its usual thing with it?
>
> That doesn't sound too bad? Add policy/priority/etc fields to the
> work_struct?

Yeah, sure, we can do that but I think it would be over-engineering.
The vast majority of use cases use workqueues as a simple execution
context provider and work much better with the worker sharing
implemented by cmwq (generally lower latency, far fewer restrictions).

Cases where special per-thread attribute adjustments are necessary can
be served better and more flexibly by making kthread easier to use.
Priority is one thing, but if someone wants cpuset affinity, there's
no way to do that with shared workers, and it's silly not to share
workers at all just for those few exceptions.

ST wq essentially worked as a simple thread wrapper and it grew a few
of those usages, but they can be counted on one hand in the whole
kernel. Converting to kthread is usually okay to do, but getting the
kthread_stop() and memory barriers right can be a pain in the ass, so
having an easier wrapper there would be pretty helpful.

Thanks.

--
tejun

2010-06-18 08:10:50

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hello,

On 06/18/2010 09:31 AM, Andrew Morton wrote:
> On Fri, 18 Jun 2010 09:16:15 +0200 Tejun Heo <[email protected]> wrote:
> Nope. Consider a simple byte-at-a-time rx handler. The ISR grabs the
> byte, stashes it away, bangs on the hardware a bit then signals
> userspace to promptly start processing that byte. Very simple,
> legitimate and a valid thing to do.
>
> Also the "interrupt" code might be running from a timer handler. Or it
> might just be in process context, buried in a forest of locks and wants
> to punt further processing into a separate process.

Sure, there'll be cases which would be better served that way, but
things which fit neither the traditional interrupt handler nor the
threaded one are a very small minority. I think having niche
solutions for those niche problems would be far better than trying to
engineer a generic async mechanism to serve all of them.

Thanks.

--
tejun

2010-06-18 08:23:51

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On 06/18/2010 10:03 AM, Tejun Heo wrote:
> Converting to kthread is usually okay to do but getting the
> kthread_stop() and memory barriers right can be pain in the ass, so
> having a easier wrapper there would be pretty helpful.

Thinking more about it, the interface could be pretty similar to wq.
The only qualm I have w/ wq is that it requires allowing works to be
freed once execution starts, which is sometimes convenient but a major
pain to implement correctly w/ flushing, requeueing and all. Such
complexities end up visible to the users too, through quirkiness in
flush semantics. But, other than that, wrapping kthread in a prettier
outfit for cases which require a dedicated thread and don't wanna
bother with kthread directly should only take a few hundred lines of
code.

Thanks.

--
tejun

2010-06-18 16:38:51

by Daniel Walker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Fri, 2010-06-18 at 08:36 +0200, Florian Mickler wrote:
> On Thu, 17 Jun 2010 11:03:49 -0700
> Daniel Walker <[email protected]> wrote:
>
> >
> > I don't agree with your analogy here .. It's more like you have two
> > items in the vending machine item A and item B. Tejun is saying he likes
> > item A, his friends like item A so that must mean item B is not
> > interesting so he removes it (without knowing how many people want it).
> > So what happens? People use another vending machine. (Linux is the
> > vending machine) ..
>
> fair enough. let's stop the analogies here, i'm already sorry I
> started with it. :)

Glad that's over ;)

> > >From my perspective this is like using Linux only for throughput which
> > is what Tejun is maximizing. In addition Tejun is blowing up the
> > alternative which is to prioritize the work items and sticking you with
> > strictly a throughput based system.
>
> No, the fact that multiple workers work on a workqueue decreases
> latency as well... (worst case latency would be the same, if there were
> a minimum limit for worker threads per queue. I'm not shure if this is
> implemented.. )

You're right, it would reduce system-wide latency .. The issue isn't
latency in the entire system tho, the issue is latency that specifically
matters to me as a user. Tejun's patches maximize throughput across all
work items regardless of my given system priorities ..

> > Do you thing maybe it's possible that the work items aren't all created
> > equal? Maybe one item _is_ more important than another one. Maybe on a
> > given system Tejun's workqueues runs a 1000 useless pointless work items
> > before he gets to the valuable one .. Yet the user is powerless to
> > dictate what is or is not important.
>
> you have a point here. I think the priority should go with the
> work-item though. Or do you think that is not a good solution?

I do agree that we would want the work items to continue having
priorities .. The current system does that by using specific threads.

> one priority-inheritance would be to increase the priority of all
> work-items queued before the "valuable" work item. another way is to
> just have enough worker-threads so that all work-items are executed
> as fast as possible (if necessary summoning new workers). a third is to
> have a separate worker-pool and workqueue for high-priority work. and
> i'm guessing there are more solutions...

I don't think there is an easy way to add that into Tejun's system tho.
He could raise and lower the thread priorities of the worker threads
depending on the work item, but you would also have to sort the work
items by priority ..

> > > Once you have the priority in the driver you could pass it to the
> > > workqueue subsystem (i.e. set the priority of the work) and the worker
> > > could then assume the priority of its work.
> > >
> > > The tricky part is probably to pass the priority from the userspace
> > > thread to the kernel?
> >
> > Threads are designed to have priorities tho, and threads are pervasive
> > throughout Linux .. It seems like setting a priorities to drivers would
> > be like re-inventing the wheel ..
>
> No, no. Not setting priorities to drivers. What I wanted to get at, is
> that the one who shedules the work has to/can decide what priority that
> work should run as. I.e. the priority has to go with the work.
>
> Because you are upping the priority of a thread not for the thread's
> sake but for the work the thread is going to execute/executing?

Sounds like priority inheritance which is ideal I think, but it's not
part of mainline. I've been trying to avoid talking about features not
already part of the mainline implementation ..

Daniel

--
Sent by a consultant of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.

2010-06-18 17:03:37

by Andrew Morton

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Fri, 18 Jun 2010 10:09:35 +0200 Tejun Heo <[email protected]> wrote:

> Hello,
>
> On 06/18/2010 09:31 AM, Andrew Morton wrote:
> > On Fri, 18 Jun 2010 09:16:15 +0200 Tejun Heo <[email protected]> wrote:
> > Nope. Consider a simple byte-at-a-time rx handler. The ISR grabs the
> > byte, stashes it away, bangs on the hardware a bit then signals
> > userspace to promptly start processing that byte. Very simple,
> > legitimate and a valid thing to do.
> >
> > Also the "interrupt" code might be running from a timer handler. Or it
> > might just be in process context, buried in a forest of locks and wants
> > to punt further processing into a separate process.
>
> Sure, there'll be cases which would be better served that way but
> things which fit neither the traditional interrupt handler nor the
> threaded one are in very small minority. I think having niche
> solutions for those niche problems would be far better than trying to
> engineer generic async mechanism to serve all of them.

um. We've *already* engineered a mechanism which serves the
requirements which we've been discussing. Then Tejun came along,
called it "niche" and busted it!

Oh well. Kernel threads should not be running with RT policy anyway.
RT is a userspace feature, and whenever a kernel thread uses RT it
degrades userspace RT qos. But I expect that using RT in kernel
threads is sometimes the best tradeoff, so let's not pretend that we're
getting something for nothing here!

2010-06-18 17:29:29

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hello,

On 06/18/2010 07:02 PM, Andrew Morton wrote:
> um. We've *already* engineered a mechanism which serves the
> requirements which we've been discussing. Then Tejun came along,
> called it "niche" and busted it!

Andrew, can we turn down the dramatization a little bit? I'll try to
explain my viewpoint again, hopefully in a clearer manner.

Workqueue is primarily an async mechanism. It's designed to be that
way and most of its current users don't care that much about execution
priority. Like everything else in the kernel, it started simple and
continued to evolve. Up until now, the workqueue - thread association
has been largely fixed and it naturally grew some number of users which
depended on that property. Whether you like the term "niche" or not,
they are proportionally few.

Please note that such association isn't something workqueue strives to
achieve. For example, trying to exploit the association on an MT
workqueue will be cumbersome over suspend/resume cycles and there's no
provision to manage such events.

Workqueues have become quite popular and have grown a lot of users.
Compounded by the increasing number of CPU cores, they're also showing
various limitations, some of which affect ease of use while others
cause scalability problems or unexpectedly long latencies. Many of
these issues can, for most workqueue users, be addressed by sharing
workers and managing work execution centrally, which is what cmwq does.

Sharing workers does cause problems for users which have assumed a
fixed workqueue - thread association, but as mentioned earlier, they're
few, and converting them to use kthread directly or a (yet non-existent)
facility which guarantees fixed thread association wouldn't be
difficult.

So, it seems logical to move those few cases over to kthread or a new
facility and improve workqueue for the majority of users. I'm not
planning on "busting" anything. All such users of course will be
converted so that they keep working properly.

> Oh well. Kernel threads should not be running with RT policy anyway.
> RT is a userspace feature, and whenever a kernel thread uses RT it
> degrades userspace RT qos. But I expect that using RT in kernel
> threads is sometimes the best tradeoff, so let's not pretend that we're
> getting something for nothing here!

Such use is limited to ST workqueues and I can write up a simple
wrapper around kthread in a few hundred lines if necessary. So, yes,
we're losing one way of using ST workqueues, but it sure wasn't a very
popular way to use them and it can easily be replaced.

Thanks.

--
tejun

2010-06-18 17:29:36

by Daniel Walker

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On Fri, 2010-06-18 at 10:03 +0200, Tejun Heo wrote:

> > I share Daniel's concerns here. Being able to set a worker thread's
> > priority or policy isn't a crazy thing.
>
> Well, priority itself isn't but doing that from userland is and most
> of the conversation was about cmwq taking away the ability to do that
> from userland.

I did a little test of this on v2.6.31 with my laptop.

I used this job file with "fio",

[random-read]
rw=randread
size=128m
directory=/tmp/

and I got these results,

clat (usec): min=196, max=185962, avg=11820.81, stdev=5961.96
bw (KB/s) : min= 67, max= 665, per=100.20%, avg=337.68, stdev=36.55

then I raised the priority of kblockd to FIFO priority 50, with the
following results,

clat (usec): min=198, max=118528, avg=11749.48, stdev=5280.79
bw (KB/s) : min= 184, max= 696, per=100.20%, avg=339.66, stdev=32.98

We ended up with a 36% max latency reduction, a slight increase in the
min latency (~2%) and a slight decrease in the average latency. The
latency became more deterministic ..

Now for the bandwidth we have a ~5% maximum bandwidth increase, a huge
increase in minimum bandwidth (174%, am I calculating that right?), and a
slight increase in the average (0.5%) ..

These results are just a one-off, so please re-test and check up on me.
I'm not very familiar with "fio" or kblockd in general.

However, these results seem far from crazy to me .. In fact I think
people who really care about this sort of stuff might want to look into
this type of tuning .. You get more deterministic latency, and you get
slightly better performance on average but way better performance in
the corner cases.
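
For reference, raising kblockd's priority boils down to a
sched_setscheduler() call on the kthread's pid; a rough sketch of such
a helper (the pid is just taken from the command line here) looks like
this:

#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
        struct sched_param sp = { .sched_priority = 50 };
        pid_t pid;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <kblockd pid>\n", argv[0]);
                return 1;
        }
        pid = (pid_t)atoi(argv[1]);

        /* switch the given kernel thread to SCHED_FIFO priority 50 */
        if (sched_setscheduler(pid, SCHED_FIFO, &sp)) {
                perror("sched_setscheduler");
                return 1;
        }
        return 0;
}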

Daniel

--
Sent by a consultant of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.

2010-06-19 08:38:08

by Andi Kleen

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Andy Walls <[email protected]> writes:
>
> I'm going to agree with Tejun, that tweaking worker thread priorities
> seems like an odd thing, since they are meant to handle deferable
> actions - things that can be put off until later.

> If one needs to support Real Time deadlines on deferable actions,
> wouldn't using dedicated kernel threads be more deterministic?
> Would the user ever up the priority for a workqueue other than a
> single-threaded workqueue?

One exceptional case here is high priority error handling,
which is rare.

For example you get an MCE that tells you some of your
memory got corrupted and you should handle it ASAP.
Better give it high priority then.

But it's still a rare event, so you don't want dedicated
threads hanging around for it all the time
(that's what we currently have and it causes all sorts
of problems).

So yes I think having a priority mechanism for work items
is useful.

-Andi

--
[email protected] -- Speaking for myself only.

2010-06-19 08:41:25

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On 06/19/2010 10:38 AM, Andi Kleen wrote:
> Andy Walls <[email protected]> writes:
>>
>> I'm going to agree with Tejun, that tweaking worker thread priorities
>> seems like an odd thing, since they are meant to handle deferable
>> actions - things that can be put off until later.
>
>> If one needs to support Real Time deadlines on deferable actions,
>> wouldn't using dedicated kernel threads be more deterministic?
>> Would the user ever up the priority for a workqueue other than a
>> single-threaded workqueue?
>
> One exceptional case here are things like high priority error handling
> which is rare.
>
> For example you get an MCE that tells you some of your
> memory got corrupted and you should handle it ASAP.
> Better give it high priority then.
>
> But it's still a rare event so you don't want dedicated
> threads hanging around for it all time
> (that's what we currently have and it causes all sorts
> of problems)
>
> So yes I think having a priority mechanism for work items
> is useful.

Wouldn't that be better served by cpu_stop?

Thanks.

--
tejun

2010-06-19 08:55:44

by Andi Kleen

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Tejun Heo <[email protected]> writes:
>
> Wouldn't that be better served by cpu_stop?

No, the error handling has to be able to sleep to take VM
locks. That's the whole point of handing it off to a
workqueue. Otherwise it could be done directly.

-Andi

--
[email protected] -- Speaking for myself only.

2010-06-19 09:02:59

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hello,

On 06/19/2010 10:55 AM, Andi Kleen wrote:
> Tejun Heo <[email protected]> writes:
>>
>> Wouldn't that be better served by cpu_stop?
>
> No the error handling has to be able to sleep to take VM
> locks. That's the whole point its handed off to a
> workqueue. Otherwise it could be done directly.

I see. The thing is that if you have "as soon as possible" + "high
priority", you're basically required to have a dedicated worker or a
dedicated pool of them. Making cmwq support some level of priority
is definitely possible (multiple prioritized queues or pushing work to
the front of the queue at the simplest) but for such emergency work it
doesn't make sense to share the usual worker pool, as resource pressure
can easily make any work wait regardless of where it is in the queue.

If there are multiple such use cases, it would make sense to create
prioritized worker pools along with prioritized per-cpu queues, but
if there are only a few of them, I think it makes more sense to use
dedicated threads for them. Do those threads need to be per-cpu?

Thanks.

--
tejun

2010-06-19 09:08:54

by Andi Kleen

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

> I see. The thing is that if you have "as soon as possible" + "high
> priority", you're basically required to have a dedicated worker or
> dedicated pool of them. Making cmwq to support some level of priority
> definitely is possible (multiple prioritized queues or pushing work at
> the front at the simplest) but for such emergency works it doesn't
> make sense to share the usual worker pool, as resource pressure can
> easily make any work wait regardless of where they're in the queue.

I think it's reasonable to just put it at the front. The individual
items shouldn't take that long, right?

(in fact I have an older patch for work queues which implemented
that)

> If there are multiple of such use cases, it would make sense to create
> a prioritized worker pools along with prioritized per-cpu queues but
> if there are only a few of them, I think it makes more sense to use
> dedicated threads for them. Do those threads need to be per-cpu?

Not strictly, although it might be useful during an error flood when
a whole DIMM goes bad.

-Andi

--
[email protected] -- Speaking for myself only.

2010-06-19 09:14:10

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hello,

On 06/19/2010 11:08 AM, Andi Kleen wrote:
> I think it's reasonable to just put on front. The individual
> items shouldn't take that long, right?
>
> (in fact I have an older patch for work queues which implemented
> that)

Well, in general, queueing-to-execution latency should be fairly low,
especially if the work is put at the front of the queue, but there's
no kind of guarantee attached to it.

>> If there are multiple of such use cases, it would make sense to create
>> a prioritized worker pools along with prioritized per-cpu queues but
>> if there are only a few of them, I think it makes more sense to use
>> dedicated threads for them. Do those threads need to be per-cpu?
>
> Not strictly, although it might be useful on a error flood when
> a whole DIMM goes bad.

I'm currently writing a kthread wrapper which basically provides a
similar interface to wq but guarantees binding to a specific thread,
which can of course be RT. If single-threadedness is acceptable, I
think this would give better behavior. What do you think?

Thanks.

--
tejun

2010-06-19 09:15:22

by Andi Kleen

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

> Well, in general, queueing to execution latency should be fairly low
> especially if it's put at the front of the queue but well it's nothing
> with any kind of guarantee.

This is not a hard real time situation with a hard deadline,
just "ASAP".

> I'm currently writing a kthread wrapper which basically provides
> similar interface to wq but guarantees binding to a specific thread
> which can be RT of course. If single threadedness is acceptable, I
> think this would render better behavior. What do you think?

I think I would prefer simply a high priority but otherwise normal work item.

Otherwise we have the thread hanging around all the time
and on a large system it's still only a single one, so
it'll never scale.

-Andi

--
[email protected] -- Speaking for myself only.

2010-06-19 09:18:30

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hello,

On 06/19/2010 11:15 AM, Andi Kleen wrote:
>> Well, in general, queueing to execution latency should be fairly low
>> especially if it's put at the front of the queue but well it's nothing
>> with any kind of guarantee.
>
> This is not a hard real time situation with a hard deadline,
> just "ASAP"
>
>> I'm currently writing a kthread wrapper which basically provides
>> similar interface to wq but guarantees binding to a specific thread
>> which can be RT of course. If single threadedness is acceptable, I
>> think this would render better behavior. What do you think?
>
> I think I would prefer simply high priority, but normal work item.
>
> Otherwise we have the thread hanging around all the time
> and on a large system it's still only a single one, so
> it'll never scale.

Hmmm... yeah, adding it isn't hard. I'm just a bit skeptical about how
useful it would be. Having only a single user would be a bit silly.
Can you think of anything else which could benefit from high priority
queueing?

Thanks.

--
tejun

2010-06-19 09:28:02

by Andi Kleen

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

> Hmmm... yeah, adding it isn't hard. I'm just a bit skeptical how
> useful it would be. Having only single user would be a bit silly.

Why? If it's an important user...

> Can you think of anything else which could benefit from high priority
> queueing?

Over time we'll get more error handling, e.g. advanced NMI handling.
Maybe it could be useful for thermal handling too which is a similar
situation.

To be honest I would prefer if there aren't that many more users,
the more users the less useful it becomes.

-Andi
--
[email protected] -- Speaking for myself only.

2010-06-19 09:44:10

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

Hello,

On 06/19/2010 11:27 AM, Andi Kleen wrote:
>> Can you think of anything else which could benefit from high priority
>> queueing?
>
> Over time we'll get more error handling, e.g. advanced NMI handling.
> Maybe it could be useful for thermal handling too which is a similar
> situation.
>
> To be honest I would prefer if there aren't that many more users,
> the more users the less useful it becomes.

As long as the actual frequency is low, the number of users should be
okay. Okay, just one more question before adding it to the todo list.
Do you think it would really benefit from the scalability provided by
multiple workers?

* Do machines ever report that many MCE errors? The usual rate seems
like one per week or month even when they're frequent.

* If a machine is actually reporting enough errors to overwhelm a single
error handling thread, does it even matter what we do?

Thanks.

--
tejun

2010-06-19 12:20:18

by Andi Kleen

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

> * Do machines ever report that many MCE errors? The usual rate seems
> like one per weeks or months even when they're frequent.

Yes, you can even get a flood when the system or parts
of the system are dying. There's some throttling in the error handling,
but yes, in some cases you can get high rates.

> * If a machine is actually reporting enough errors to overwhelm single
> error handling thread, does it even matter what we do?

There are cases where you can have many errors but still survive.

-Andi
--
[email protected] -- Speaking for myself only.

2010-06-19 12:49:06

by Tejun Heo

[permalink] [raw]
Subject: Re: Overview of concurrency managed workqueue

On 06/19/2010 02:20 PM, Andi Kleen wrote:
>> * Do machines ever report that many MCE errors? The usual rate seems
>> like one per weeks or months even when they're frequent.
>
> Yes you can even get a flood when the system or parts
> of the system is dying. There's some throttling in the error handling,
> but yes in some cases you can get high rates.
>
>> * If a machine is actually reporting enough errors to overwhelm single
>> error handling thread, does it even matter what we do?
>
> There are cases where you can have many errors but still survive.

Alright, adding it to the todo list.

Thanks.

--
tejun

2010-06-19 15:54:58

by Tejun Heo

[permalink] [raw]
Subject: [PATCH] kthread: implement kthread_worker

Implement a simple work processor for kthread. This is to ease using
kthreads. Single threaded workqueues used to be used for things like
this but workqueue won't guarantee fixed kthread association anymore
in order to enable worker sharing.

This can be used in cases where a specific kthread association is
necessary, for example, when it should have RT priority or be assigned
to a certain cgroup.

Signed-off-by: Tejun Heo <[email protected]>
---
So, basically something like this. This doesn't support all the
features of workqueue but should be enough for most cases and we can
always add missing bits as necessary. This is basically the same
logic that I wrote for vhost (vhost people cc'd) and vhost should be
able to switch easily. It's simple but efficient and makes using
kthread a lot easier.

Andrew, does this look like a reasonable compromise? If so, I'll put
it in the series, convert ivtv and integrate padata and post another
round.

Thanks.

include/linux/kthread.h | 63 +++++++++++++++++++++++++
kernel/kthread.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 181 insertions(+)

Index: work/kernel/kthread.c
===================================================================
--- work.orig/kernel/kthread.c
+++ work/kernel/kthread.c
@@ -14,6 +14,8 @@
#include <linux/file.h>
#include <linux/module.h>
#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/freezer.h>
#include <trace/events/sched.h>

static DEFINE_SPINLOCK(kthread_create_lock);
@@ -247,3 +249,119 @@ int kthreadd(void *unused)

return 0;
}
+
+/**
+ * kthread_worker_fn - kthread function to process kthread_worker
+ * @data: pointer to initialized kthread_worker
+ *
+ * This function can be used as @threadfn to kthread_create() or
+ * kthread_run() with @data argument pointing to an initialized
+ * kthread_worker. The started kthread will process work_list until
+ * it is stopped with kthread_stop(). A kthread can also call
+ * this function directly after extra initialization.
+ *
+ * Different kthreads can be used for the same kthread_worker as long
+ * as there's only one kthread attached to it at any given time. A
+ * kthread_worker without an attached kthread simply collects queued
+ * kthread_works.
+ */
+int kthread_worker_fn(void *worker_ptr)
+{
+ struct kthread_worker *worker = worker_ptr;
+ struct kthread_work *work;
+
+ WARN_ON(worker->task);
+ worker->task = current;
+repeat:
+ set_current_state(TASK_INTERRUPTIBLE); /* mb paired w/ kthread_stop */
+
+ if (kthread_should_stop()) {
+ __set_current_state(TASK_RUNNING);
+ spin_lock_irq(&worker->lock);
+ worker->task = NULL;
+ spin_unlock_irq(&worker->lock);
+ return 0;
+ }
+
+ work = NULL;
+ spin_lock_irq(&worker->lock);
+ if (!list_empty(&worker->work_list)) {
+ work = list_first_entry(&worker->work_list,
+ struct kthread_work, node);
+ list_del_init(&work->node);
+ }
+ spin_unlock_irq(&worker->lock);
+
+ if (work) {
+ __set_current_state(TASK_RUNNING);
+ work->func(work);
+ smp_wmb(); /* wmb worker-b0 paired with flush-b1 */
+ work->done_seq = work->queue_seq;
+ smp_mb(); /* mb worker-b1 paired with flush-b0 */
+ if (atomic_read(&work->flushing))
+ wake_up_all(&work->done);
+ } else if (!freezing(current))
+ schedule();
+
+ try_to_freeze();
+ goto repeat;
+}
+EXPORT_SYMBOL_GPL(kthread_worker_fn);
+
+/**
+ * queue_kthread_work - queue a kthread_work
+ * @worker: target kthread_worker
+ * @work: kthread_work to queue
+ *
+ * Queue @work to work processor @worker for async execution. @worker
+ * must have been initialized with init_kthread_worker(). Returns %true
+ * if @work was successfully queued, %false if it was already pending.
+ */
+bool queue_kthread_work(struct kthread_worker *worker,
+ struct kthread_work *work)
+{
+ bool ret = false;
+ unsigned long flags;
+
+ spin_lock_irqsave(&worker->lock, flags);
+ if (list_empty(&work->node)) {
+ list_add_tail(&work->node, &worker->work_list);
+ work->queue_seq++;
+ if (likely(worker->task))
+ wake_up_process(worker->task);
+ ret = true;
+ }
+ spin_unlock_irqrestore(&worker->lock, flags);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(queue_kthread_work);
+
+/**
+ * flush_kthread_work - flush a kthread_work
+ * @work: work to flush
+ *
+ * If @work is queued or executing, wait for it to finish execution.
+ */
+void flush_kthread_work(struct kthread_work *work)
+{
+ int seq = work->queue_seq;
+
+ atomic_inc(&work->flushing);
+
+ /*
+ * mb flush-b0 paired with worker-b1, to make sure either
+ * worker sees the above increment or we see done_seq update.
+ */
+ smp_mb__after_atomic_inc();
+
+ /* A - B <= 0 tests whether B is in front of A regardless of overflow */
+ wait_event(work->done, seq - work->done_seq <= 0);
+ atomic_dec(&work->flushing);
+
+ /*
+ * rmb flush-b1 paired with worker-b0, to make sure our caller
+ * sees every change made by work->func().
+ */
+ smp_mb__after_atomic_dec();
+}
+EXPORT_SYMBOL_GPL(flush_kthread_work);
Index: work/include/linux/kthread.h
===================================================================
--- work.orig/include/linux/kthread.h
+++ work/include/linux/kthread.h
@@ -34,4 +34,67 @@ int kthread_should_stop(void);
int kthreadd(void *unused);
extern struct task_struct *kthreadd_task;

+/*
+ * Simple work processor based on kthread.
+ *
+ * This provides easier way to make use of kthreads. A kthread_work
+ * can be queued and flushed using queue/flush_kthread_work()
+ * respectively. Queued kthread_works are processed by a kthread
+ * running kthread_worker_fn().
+ *
+ * A kthread_work can't be freed while it is executing.
+ */
+struct kthread_work;
+typedef void (*kthread_work_func_t)(struct kthread_work *work);
+
+struct kthread_worker {
+ spinlock_t lock;
+ struct list_head work_list;
+ struct task_struct *task;
+};
+
+struct kthread_work {
+ struct list_head node;
+ kthread_work_func_t func;
+ wait_queue_head_t done;
+ atomic_t flushing;
+ int queue_seq;
+ int done_seq;
+};
+
+#define KTHREAD_WORKER_INIT(worker) { \
+ .lock = SPIN_LOCK_UNLOCKED, \
+ .work_list = LIST_HEAD_INIT((worker).work_list), \
+ }
+
+#define KTHREAD_WORK_INIT(work, fn) { \
+ .node = LIST_HEAD_INIT((work).node), \
+ .func = (fn), \
+ .done = __WAIT_QUEUE_HEAD_INITIALIZER((work).done), \
+ .flushing = ATOMIC_INIT(0), \
+ }
+
+#define DEFINE_KTHREAD_WORKER(worker) \
+ struct kthread_worker worker = KTHREAD_WORKER_INIT(worker)
+
+#define DEFINE_KTHREAD_WORK(work, fn) \
+ struct kthread_work work = KTHREAD_WORK_INIT(work, fn)
+
+static inline void init_kthread_worker(struct kthread_worker *worker)
+{
+ *worker = (struct kthread_worker)KTHREAD_WORKER_INIT(*worker);
+}
+
+static inline void init_kthread_work(struct kthread_work *work,
+ kthread_work_func_t fn)
+{
+ *work = (struct kthread_work)KTHREAD_WORK_INIT(*work, fn);
+}
+
+int kthread_worker_fn(void *worker_ptr);
+
+bool queue_kthread_work(struct kthread_worker *worker,
+ struct kthread_work *work);
+void flush_kthread_work(struct kthread_work *work);
+
#endif /* _LINUX_KTHREAD_H */
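
And for completeness, a minimal usage sketch of the interface above;
the work function, init function and thread name are made up, but the
calls are the ones defined in this patch. A dedicated thread is
created with kthread_run(), can be given RT priority (or a cgroup /
cpuset) since it isn't shared, and work items are queued and flushed
against it:

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/err.h>
#include <linux/init.h>

static struct kthread_worker my_worker;
static struct kthread_work my_work;

static void my_work_fn(struct kthread_work *work)
{
        /* runs in the dedicated kthread; may sleep */
}

static int __init my_init(void)
{
        struct sched_param param = { .sched_priority = 50 };
        struct task_struct *task;

        init_kthread_worker(&my_worker);
        init_kthread_work(&my_work, my_work_fn);

        /* the worker thread; kthread_stop() on it shuts the worker down */
        task = kthread_run(kthread_worker_fn, &my_worker, "my_worker");
        if (IS_ERR(task))
                return PTR_ERR(task);

        /* the thread isn't shared, so adjusting its priority is fine */
        sched_setscheduler(task, SCHED_FIFO, &param);

        queue_kthread_work(&my_worker, &my_work);
        flush_kthread_work(&my_work);   /* wait for it to finish */
        return 0;
}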

2010-06-21 20:35:53

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH] kthread: implement kthread_worker

On Sat, 19 Jun 2010 17:53:55 +0200 Tejun Heo wrote:

> include/linux/kthread.h | 63 +++++++++++++++++++++++++
> kernel/kthread.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 181 insertions(+)
>
> Index: work/kernel/kthread.c
> ===================================================================
> --- work.orig/kernel/kthread.c
> +++ work/kernel/kthread.c

> @@ -247,3 +249,119 @@ int kthreadd(void *unused)
>
> return 0;
> }
> +
> +/**
> + * kthread_worker_fn - kthread function to process kthread_worker
> + * @data: pointer to initialized kthread_worker

s/data/worker_ptr/

> + *
> + * This function can be used as @threadfn to kthread_create() or
> + * kthread_run() with @data argument pointing to an initialized

ditto.

> + * kthread_worker. The started kthread will process work_list until
> + * the it is stopped with kthread_stop(). A kthread can also call
> + * this function directly after extra initialization.
> + *
> + * Different kthreads can be used for the same kthread_worker as long
> + * as there's only one kthread attached to it at any given time. A
> + * kthread_worker without an attached kthread simply collects queued
> + * kthread_works.
> + */
> +int kthread_worker_fn(void *worker_ptr)
> +{
> + struct kthread_worker *worker = worker_ptr;
> + struct kthread_work *work;
> +
> + WARN_ON(worker->task);
> + worker->task = current;
> +repeat:


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

2010-06-22 07:32:52

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] kthread: implement kthread_worker

Hello,

On 06/21/2010 10:33 PM, Randy Dunlap wrote:
> On Sat, 19 Jun 2010 17:53:55 +0200 Tejun Heo wrote:
>
>> include/linux/kthread.h | 63 +++++++++++++++++++++++++
>> kernel/kthread.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 181 insertions(+)
>>
>> Index: work/kernel/kthread.c
>> ===================================================================
>> --- work.orig/kernel/kthread.c
>> +++ work/kernel/kthread.c
>
>> @@ -247,3 +249,119 @@ int kthreadd(void *unused)
>>
>> return 0;
>> }
>> +
>> +/**
>> + * kthread_worker_fn - kthread function to process kthread_worker
>> + * @data: pointer to initialized kthread_worker
>
> s/data/worker_ptr/
>
>> + *
>> + * This function can be used as @threadfn to kthread_create() or
>> + * kthread_run() with @data argument pointing to an initialized
>
> ditto.

Yeah, that would be a much better idea. Updated.

Thanks.

--
tejun