2009-10-01 08:10:02

by Tejun Heo

Subject: [RFC PATCHSET] workqueue: implement concurrency managed workqueue

Hello, all.

This rather large patchset implements concurrency managed workqueue.
It's not complete yet: singlethread workqueue handling needs more
work, workqueue users need to be audited and simplified, and async
and slow-work should be reimplemented in terms of workqueue. Although
this patchset currently adds ~2000 lines of code, I'm fairly
optimistic that once the whole conversion is done it will be a net
decrease in lines of code.

This patchset reimplements workqueue such that it auto-regulates
concurrency and thus relieves its users of that duty. It works by
managing a single shared pool of per-cpu workers and hooking into the
scheduler to get notified when workers go to sleep and wake up. Using
this mechanism, the workqueue implementation keeps track of the
current level of concurrency and schedules only as many workers as
necessary to keep the cpu occupied.
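
To make that concrete, here is a rough sketch of what the sleep-side
scheduler hook boils down to. This is illustration only; the struct
and helper names below (global_cwq, nr_running, first_idle_worker())
are made up for the example and the real code is in the last patch.

#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/sched.h>

bool sched_workqueue_wake_up_process(struct task_struct *p); /* patch 0003 */

/* illustrative types, not the actual ones */
struct global_cwq {
        int                     nr_running;     /* workers currently burning cpu */
        struct list_head        worklist;       /* pending works */
        struct list_head        idle_list;      /* idle workers */
};

struct worker {
        struct task_struct      *task;
        struct global_cwq       *gcwq;
        struct list_head        entry;          /* on gcwq->idle_list when idle */
};

static struct worker *first_idle_worker(struct global_cwq *gcwq)
{
        if (list_empty(&gcwq->idle_list))
                return NULL;
        return list_first_entry(&gcwq->idle_list, struct worker, entry);
}

/* called by the scheduler when a worker is about to sleep (patch 0003) */
void sched_workqueue_worker_sleep(struct task_struct *task)
{
        struct worker *worker = kthread_data(task);     /* patch 0005 */
        struct global_cwq *gcwq = worker->gcwq;

        /* the worker blocks, so it stops counting toward concurrency */
        if (!--gcwq->nr_running && !list_empty(&gcwq->worklist)) {
                /*
                 * The last running worker is going to sleep with work
                 * still pending.  Wake up an idle worker so the cpu
                 * stays occupied.
                 */
                struct worker *idle = first_idle_worker(gcwq);

                if (idle)
                        sched_workqueue_wake_up_process(idle->task);
        }
}

The wakeup-side hook does the reverse bookkeeping, so concurrency is
replenished only while workers actually block.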

Concurrency managed workqueue has the following benefits.

* Workqueue users no longer have to worry about managing concurrency
and, in most cases, deadlocks. The workqueue manages concurrency
automatically, and unless a deadlock chain involves very many
(currently 127) works, deadlocks won't happen.

* There's a single shared pool of workers per cpu plus one rescuer
for each workqueue which requires it, so there are far fewer
kthreads.

* More efficient. Although it adds a considerable amount of code,
the code added to the hot path is small, and works are executed on
the local cpu and in batch as much as possible using a minimal number
of kthreads, leading to fewer task switches and a lower cache
footprint. <NEED SOME BACKING NUMBERS>

* As concurrency is no longer a problem, most types of asynchronous
jobs can be done using the generic workqueue, and the other async
mechanisms (slow-work, async and ad-hoc per-subsystem ones) can be
removed. IOW, it can serve as the unified async thread pool
mechanism.

Please read the patch description of the last patch for more details.

This patchset contains the following 19 patches; most of them are
not signed off yet.

0001-freezer-don-t-get-over-anxious-while-waiting.patch
0002-scheduler-implement-sched_class_equal.patch
0003-scheduler-implement-workqueue-scheduler-class.patch
0004-scheduler-implement-force_cpus_allowed_ptr.patch
0005-kthread-implement-kthread_data.patch
0006-acpi-use-queue_work_on-instead-of-binding-workqueue-.patch
0007-stop_machine-reimplement-without-using-workqueue.patch
0008-workqueue-misc-cosmetic-updates.patch
0009-workqueue-merge-feature-parametesr-into-flags.patch
0010-workqueue-update-cwq-alignement-and-make-one-more-fl.patch
0011-workqueue-define-both-bit-position-and-mask-for-work.patch
0012-workqueue-separate-out-process_one_work.patch
0013-workqueue-temporarily-disable-workqueue-tracing.patch
0014-workqueue-TEMPORARY-kill-singlethread-variant.patch
0015-workqueue-reimplement-workqueue-flushing-using-color.patch
0016-workqueue-introduce-worker.patch
0017-workqueue-reimplement-work-flushing-using-linked-wor.patch
0018-workqueue-reimplement-workqueue-freeze-using-cwq-fro.patch
0019-workqueue-implement-concurrency-managed-workqueue.patch

0001 makes the freezer less anxious. It's a mostly unrelated change I
did while looking at the code.

0002-0004 implement scheduler callbacks for workqueue.

0005 implements kthread_data() which will be used to get back at the
workqueue worker data structure from scheduler callbacks.

0006-0007 kill exotic usages of workqueue.

0008-0010 do misc preparations in the workqueue code.

0011-0014 strip out some features from workqueue to prepare for the
new implementation. Please note that 0014 is incorrect and I'm
currently working on a proper solution.

0015-0018 reimplement flushing and freezing in a way which is friendly
to multiple workers per cpu.

0019 implements concurrency managed workqueue.

Please note that none of the workqueue users has been updated yet.
Everything will still work fine, but most workqueues which don't need
a rescuer will end up creating one anyway.

I'm attaching the test-wq source and Makefile which I used to verify
each aspect of the new workqueue. It's pretty easy to write new test
scenarios using the module, so if you're interested in how this
actually works, it is quite useful.
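
If you just want to see the shape of such a test scenario without the
attachment, something like the following trivial module (illustrative
only, this is not the attached test-wq source) is enough to pile
works onto one cpu and watch how they get executed:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/delay.h>
#include <linux/smp.h>

static struct workqueue_struct *test_wq;
static struct work_struct test_works[16];

static void test_work_fn(struct work_struct *work)
{
        /* sleep so that several works are in flight at once */
        msleep(100);
        printk(KERN_INFO "test-wq: work %d done on cpu %d\n",
               (int)(work - test_works), raw_smp_processor_id());
}

static int __init test_init(void)
{
        int i;

        test_wq = create_workqueue("test-wq");
        if (!test_wq)
                return -ENOMEM;

        for (i = 0; i < ARRAY_SIZE(test_works); i++) {
                INIT_WORK(&test_works[i], test_work_fn);
                /* pile everything on cpu 0 */
                queue_work_on(0, test_wq, &test_works[i]);
        }
        flush_workqueue(test_wq);
        return 0;
}

static void __exit test_exit(void)
{
        destroy_workqueue(test_wq);
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");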

This patchset is on top of the current linus#master
(817b33d38f81c8736d39283c35c886ae4668f1af) and available in the
following git tree.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git concurrency-managed-workqueue

and contains the following changes.

drivers/acpi/osl.c | 41
include/linux/kthread.h | 1
include/linux/sched.h | 9
include/linux/stop_machine.h | 6
include/linux/workqueue.h | 63 -
init/main.c | 2
kernel/kthread.c | 15
kernel/power/process.c | 34
kernel/sched.c | 105 +
kernel/sched_fair.c | 62 -
kernel/sched_idletask.c | 1
kernel/sched_rt.c | 1
kernel/sched_workqueue.c | 130 ++
kernel/sched_workqueue.h | 17
kernel/stop_machine.c | 151 ++
kernel/trace/Kconfig | 4
kernel/workqueue.c | 2268 ++++++++++++++++++++++++++++++++++++-------
17 files changed, 2417 insertions(+), 493 deletions(-)

So, what do you guys think?

PS. I'm still working on it and will probably write up the overall
design under Documentation/.

Thanks.

--
tejun


2009-10-01 08:10:21

by Tejun Heo

Subject: [PATCH 01/19] freezer: don't get over-anxious while waiting

Freezing isn't exactly the most latency-sensitive operation and
there's no reason to burn cpu cycles and power waiting for it to
complete. Use msleep(10) instead of yield(). This should improve
the reliability of emergency hibernation.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/power/process.c | 13 +++++++++----
1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/power/process.c b/kernel/power/process.c
index cc2e553..9d26a0a 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -41,7 +41,7 @@ static int try_to_freeze_tasks(bool sig_only)
do_gettimeofday(&start);

end_time = jiffies + TIMEOUT;
- do {
+ while (true) {
todo = 0;
read_lock(&tasklist_lock);
do_each_thread(g, p) {
@@ -62,10 +62,15 @@ static int try_to_freeze_tasks(bool sig_only)
todo++;
} while_each_thread(g, p);
read_unlock(&tasklist_lock);
- yield(); /* Yield is okay here */
- if (time_after(jiffies, end_time))
+ if (!todo || time_after(jiffies, end_time))
break;
- } while (todo);
+
+ /*
+ * We need to retry. There's no reason to be
+ * over-anxious about it and waste power.
+ */
+ msleep(10);
+ }

do_gettimeofday(&end);
elapsed_csecs64 = timeval_to_ns(&end) - timeval_to_ns(&start);
--
1.6.4.2

2009-10-01 08:16:46

by Tejun Heo

Subject: [PATCH 02/19] scheduler: implement sched_class_equal()

Add ->identity to sched_class and implement and use
sched_class_equal(), which compares that field to test for equality.
This allows sub-classing scheduler classes so that parts of a class
can be overridden while maintaining most of the original behavior.
Please note that __setscheduler() only switches sched_class if the
new sched_class's identity is different from the current one.

NOT_SIGNED_OFF_YET
---
include/linux/sched.h | 1 +
kernel/sched.c | 16 +++++++++++-----
kernel/sched_fair.c | 5 +++--
kernel/sched_idletask.c | 1 +
kernel/sched_rt.c | 1 +
5 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75e6e60..02f505d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1065,6 +1065,7 @@ struct sched_domain;

struct sched_class {
const struct sched_class *next;
+ const struct sched_class *identity;

void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
diff --git a/kernel/sched.c b/kernel/sched.c
index ee61f45..66d918a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1813,6 +1813,8 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)

static void calc_load_account_active(struct rq *this_rq);

+#define sched_class_equal(a, b) ((a)->identity == (b)->identity)
+
#include "sched_stats.h"
#include "sched_idletask.c"
#include "sched_fair.c"
@@ -1987,7 +1989,7 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,
const struct sched_class *prev_class,
int oldprio, int running)
{
- if (prev_class != p->sched_class) {
+ if (!sched_class_equal(prev_class, p->sched_class)) {
if (prev_class->switched_from)
prev_class->switched_from(rq, p, running);
p->sched_class->switched_to(rq, p, running);
@@ -2012,7 +2014,7 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
&p->se == cfs_rq_of(&p->se)->last))
return 1;

- if (p->sched_class != &fair_sched_class)
+ if (!sched_class_equal(p->sched_class, &fair_sched_class))
return 0;

if (sysctl_sched_migration_cost == -1)
@@ -6139,6 +6141,8 @@ static struct task_struct *find_process_by_pid(pid_t pid)
static void
__setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
{
+ const struct sched_class *new_class = NULL;
+
BUG_ON(p->se.on_rq);

p->policy = policy;
@@ -6146,13 +6150,15 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
case SCHED_NORMAL:
case SCHED_BATCH:
case SCHED_IDLE:
- p->sched_class = &fair_sched_class;
+ new_class = &fair_sched_class;
break;
case SCHED_FIFO:
case SCHED_RR:
- p->sched_class = &rt_sched_class;
+ new_class = &rt_sched_class;
break;
}
+ if (!sched_class_equal(p->sched_class, new_class))
+ p->sched_class = new_class;

p->rt_priority = prio;
p->normal_prio = normal_prio(p);
@@ -10384,7 +10390,7 @@ cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
return -EINVAL;
#else
/* We don't support RT-tasks being in separate groups */
- if (tsk->sched_class != &fair_sched_class)
+ if (!sched_class_equal(tsk->sched_class, &fair_sched_class))
return -EINVAL;
#endif
return 0;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 4e777b4..a12d1bd 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -961,7 +961,7 @@ static void hrtick_update(struct rq *rq)
{
struct task_struct *curr = rq->curr;

- if (curr->sched_class != &fair_sched_class)
+ if (!sched_class_equal(curr->sched_class, &fair_sched_class))
return;

if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
@@ -1576,7 +1576,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
return;
}

- if (unlikely(p->sched_class != &fair_sched_class))
+ if (unlikely(!sched_class_equal(p->sched_class, &fair_sched_class)))
return;

if (unlikely(se == pse))
@@ -1962,6 +1962,7 @@ unsigned int get_rr_interval_fair(struct task_struct *task)
* All the scheduling class methods:
*/
static const struct sched_class fair_sched_class = {
+ .identity = &fair_sched_class,
.next = &idle_sched_class,
.enqueue_task = enqueue_task_fair,
.dequeue_task = dequeue_task_fair,
diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index b133a28..57a0c4b 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -106,6 +106,7 @@ unsigned int get_rr_interval_idle(struct task_struct *task)
* Simple, special scheduling class for the per-CPU idle tasks:
*/
static const struct sched_class idle_sched_class = {
+ .identity = &idle_sched_class,
/* .next is NULL */
/* no enqueue/yield_task for idle tasks */

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index a4d790c..06a8106 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1746,6 +1746,7 @@ unsigned int get_rr_interval_rt(struct task_struct *task)
}

static const struct sched_class rt_sched_class = {
+ .identity = &rt_sched_class,
.next = &fair_sched_class,
.enqueue_task = enqueue_task_rt,
.dequeue_task = dequeue_task_rt,
--
1.6.4.2

2009-10-01 08:13:43

by Tejun Heo

Subject: [PATCH 03/19] scheduler: implement workqueue scheduler class

Implement the workqueue scheduler class. The workqueue sched_class
inherits the fair sched_class and behaves exactly the same as the
fair class except that it has two callback functions which get called
when a task is put to sleep and when it wakes up, and that it doesn't
allow switching to a different scheduler class.

The workqueue sched_class can only be selected by calling
switch_sched_workqueue() while the current sched_class is fair.
workqueue is updated to select the workqueue sched_class for all
workers. Both scheduler callbacks are noops for now. They'll be used
to implement concurrency managed workqueue.

This patch also updates current_is_keventd() to check for the
scheduler class instead of directly matching the keventd workers, so
the function will return true for any workqueue worker. For the
current users, this shouldn't be a problem.

NOT_SIGNED_OFF_YET
---
include/linux/sched.h | 1 +
kernel/sched.c | 1 +
kernel/sched_fair.c | 59 +++++++++++++--------
kernel/sched_workqueue.c | 130 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched_workqueue.h | 17 ++++++
kernel/workqueue.c | 34 ++++++++----
6 files changed, 207 insertions(+), 35 deletions(-)
create mode 100644 kernel/sched_workqueue.c
create mode 100644 kernel/sched_workqueue.h

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 02f505d..cbebadf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1952,6 +1952,7 @@ extern int idle_cpu(int cpu);
extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
extern int sched_setscheduler_nocheck(struct task_struct *, int,
struct sched_param *);
+extern void sched_setscheduler_workqueue(struct task_struct *p);
extern struct task_struct *idle_task(int cpu);
extern struct task_struct *curr_task(int cpu);
extern void set_curr_task(int cpu, struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index 66d918a..4e3e789 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1819,6 +1819,7 @@ static void calc_load_account_active(struct rq *this_rq);
#include "sched_idletask.c"
#include "sched_fair.c"
#include "sched_rt.c"
+#include "sched_workqueue.c"
#ifdef CONFIG_SCHED_DEBUG
# include "sched_debug.c"
#endif
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a12d1bd..eb116f0 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1961,37 +1961,50 @@ unsigned int get_rr_interval_fair(struct task_struct *task)
/*
* All the scheduling class methods:
*/
-static const struct sched_class fair_sched_class = {
- .identity = &fair_sched_class,
- .next = &idle_sched_class,
- .enqueue_task = enqueue_task_fair,
- .dequeue_task = dequeue_task_fair,
- .yield_task = yield_task_fair,
-
- .check_preempt_curr = check_preempt_wakeup,
-
- .pick_next_task = pick_next_task_fair,
- .put_prev_task = put_prev_task_fair,
+#define FAIR_SCHED_CLASS_INIT_BASE \
+ .identity = &fair_sched_class, \
+ .next = &idle_sched_class, \
+ .enqueue_task = enqueue_task_fair, \
+ .dequeue_task = dequeue_task_fair, \
+ .yield_task = yield_task_fair, \
+ \
+ .check_preempt_curr = check_preempt_wakeup, \
+ \
+ .pick_next_task = pick_next_task_fair, \
+ .put_prev_task = put_prev_task_fair, \
+ \
+ .set_curr_task = set_curr_task_fair, \
+ .task_tick = task_tick_fair, \
+ .task_new = task_new_fair, \
+ \
+ .prio_changed = prio_changed_fair, \
+ .switched_to = switched_to_fair, \
+ \
+ .get_rr_interval = get_rr_interval_fair,

#ifdef CONFIG_SMP
- .select_task_rq = select_task_rq_fair,
-
- .load_balance = load_balance_fair,
+#define FAIR_SCHED_CLASS_INIT_SMP \
+ .select_task_rq = select_task_rq_fair, \
+ .load_balance = load_balance_fair, \
.move_one_task = move_one_task_fair,
+#else
+#define FAIR_SCHED_CLASS_INIT_SMP
#endif

- .set_curr_task = set_curr_task_fair,
- .task_tick = task_tick_fair,
- .task_new = task_new_fair,
-
- .prio_changed = prio_changed_fair,
- .switched_to = switched_to_fair,
-
- .get_rr_interval = get_rr_interval_fair,
-
#ifdef CONFIG_FAIR_GROUP_SCHED
+#define FAIR_SCHED_CLASS_INIT_GROUP \
.moved_group = moved_group_fair,
+#else
+#define FAIR_SCHED_CLASS_INIT_GROUP
#endif
+
+#define FAIR_SCHED_CLASS_INIT \
+ FAIR_SCHED_CLASS_INIT_BASE \
+ FAIR_SCHED_CLASS_INIT_SMP \
+ FAIR_SCHED_CLASS_INIT_GROUP
+
+static const struct sched_class fair_sched_class = {
+ FAIR_SCHED_CLASS_INIT
};

#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched_workqueue.c b/kernel/sched_workqueue.c
new file mode 100644
index 0000000..d8d6cb2
--- /dev/null
+++ b/kernel/sched_workqueue.c
@@ -0,0 +1,130 @@
+/*
+ * kernel/sched_workqueue.c - workqueue scheduler class
+ *
+ * Copyright (C) 2009 SUSE Linux Products GmbH
+ * Copyright (C) 2009 Tejun Heo <[email protected]>
+ *
+ * This file is released under the GPLv2.
+ *
+ * This scheduler class wraps the fair class and provides scheduling
+ * hints to workqueue to help it maintain proper level of concurrency.
+ * Other than calling workqueue hook functions and disallowing
+ * switching to other classes, this scheduler class is identical to
+ * the fair class.
+ */
+#include "sched_workqueue.h"
+
+static void enqueue_task_wq(struct rq *rq, struct task_struct *p, int wakeup)
+{
+ if (wakeup)
+ sched_workqueue_worker_wakeup(p);
+
+ return enqueue_task_fair(rq, p, wakeup);
+}
+
+static void dequeue_task_wq(struct rq *rq, struct task_struct *p, int sleep)
+{
+ if (sleep)
+ sched_workqueue_worker_sleep(p);
+
+ return dequeue_task_fair(rq, p, sleep);
+}
+
+static void switched_from_wq(struct rq *this_rq, struct task_struct *task,
+ int running)
+{
+ BUG(); /* no can do */
+}
+
+/*
+ * If you want to add more override methods, please check sched.c
+ * _CAREFULLY_ before doing so. There are several places where fair
+ * sched specific optimizations are made and the overrides might not
+ * work as expected.
+ */
+static const struct sched_class workqueue_sched_class = {
+ FAIR_SCHED_CLASS_INIT
+ .enqueue_task = enqueue_task_wq,
+ .dequeue_task = dequeue_task_wq,
+ .switched_from = switched_from_wq,
+};
+
+/**
+ * sched_workqueue_wake_up_process - wake up a process from sched callbacks
+ * @p: task to wake up
+ *
+ * Wake up @p. This function can only be called from workqueue
+ * scheduler callbacks and can only wake up tasks which are bound to
+ * the cpu in question.
+ *
+ * CONTEXT:
+ * workqueue scheduler callbacks.
+ *
+ * RETURNS:
+ * true if @p was woken up, false if @p was already awake.
+ */
+bool sched_workqueue_wake_up_process(struct task_struct *p)
+{
+ struct rq *rq = task_rq(p);
+ bool success = false;
+
+ assert_spin_locked(&rq->lock);
+
+ if (!p->se.on_rq) {
+ schedstat_inc(p, se.nr_wakeups);
+ schedstat_inc(p, se.nr_wakeups_local);
+ activate_task(rq, p, 1);
+ success = true;
+ }
+
+ trace_sched_wakeup(rq, p, success);
+ p->state = TASK_RUNNING;
+#ifdef CONFIG_SMP
+ if (p->sched_class->task_wake_up)
+ p->sched_class->task_wake_up(rq, p);
+#endif
+ return success;
+}
+
+/**
+ * switch_sched_workqueue - switch workqueue scheduler class
+ * @p: target task
+ * @enable: enable or disable sched_workqueue
+ *
+ * Switch @p to or from workqueue scheduler class. @p is assumed to
+ * have either fair or one of its alias classes on entry.
+ *
+ * CONTEXT:
+ * !in_interrupt().
+ */
+void switch_sched_workqueue(struct task_struct *p, bool enable)
+{
+ struct sched_param sched_param = { .sched_priority = 0 };
+ struct rq *rq;
+ unsigned long flags;
+
+ rq = task_rq_lock(p, &flags);
+ BUG_ON(!sched_class_equal(p->sched_class, &fair_sched_class));
+ p->sched_class = enable ? &workqueue_sched_class : &fair_sched_class;
+ task_rq_unlock(rq, &flags);
+
+ BUG_ON(sched_setscheduler_nocheck(p, SCHED_NORMAL, &sched_param));
+}
+
+/**
+ * is_sched_workqueue - test whether a task is in workqueue scheduler class
+ * @p: target task
+ *
+ * Tests whether @p is in workqueue scheduler class.
+ *
+ * CONTEXT:
+ * The caller is responsible for ensuring that @p doesn't go away or
+ * change scheduler class.
+ *
+ * RETURNS:
+ * true if @p is in workqueue scheduler class, false otherwise.
+ */
+bool is_sched_workqueue(struct task_struct *p)
+{
+ return p->sched_class == &workqueue_sched_class;
+}
diff --git a/kernel/sched_workqueue.h b/kernel/sched_workqueue.h
new file mode 100644
index 0000000..5a52a4f
--- /dev/null
+++ b/kernel/sched_workqueue.h
@@ -0,0 +1,17 @@
+/*
+ * kernel/sched_workqueue.h - workqueue scheduler class interface
+ *
+ * Copyright (C) 2009 SUSE Linux Products GmbH
+ * Copyright (C) 2009 Tejun Heo <[email protected]>
+ *
+ * This file is released under the GPLv2.
+ *
+ * This is interface between sched_workqueue and workqueue. Read
+ * comments in sched_workqueue.c and workqueue.c for details.
+ */
+void sched_workqueue_worker_wakeup(struct task_struct *task);
+void sched_workqueue_worker_sleep(struct task_struct *task);
+
+bool sched_workqueue_wake_up_process(struct task_struct *p);
+void switch_sched_workqueue(struct task_struct *p, bool enable);
+bool is_sched_workqueue(struct task_struct *p);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index addfe2d..b56737b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -36,6 +36,8 @@
#define CREATE_TRACE_POINTS
#include <trace/events/workqueue.h>

+#include "sched_workqueue.h"
+
/*
* The per-CPU workqueue (if single thread, we always use the first
* possible cpu).
@@ -125,6 +127,22 @@ struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
return (void *) (atomic_long_read(&work->data) & WORK_STRUCT_WQ_DATA_MASK);
}

+/*
+ * Scheduler callbacks. These functions are called during schedule()
+ * with rq lock held. Don't try to acquire any lock and only access
+ * fields which are safe with preemption disabled from local cpu.
+ */
+
+/* called when a worker task wakes up from sleep */
+void sched_workqueue_worker_wakeup(struct task_struct *task)
+{
+}
+
+/* called when a worker task goes into sleep */
+void sched_workqueue_worker_sleep(struct task_struct *task)
+{
+}
+
static void insert_work(struct cpu_workqueue_struct *cwq,
struct work_struct *work, struct list_head *head)
{
@@ -314,6 +332,9 @@ static int worker_thread(void *__cwq)
struct cpu_workqueue_struct *cwq = __cwq;
DEFINE_WAIT(wait);

+ /* set workqueue scheduler */
+ switch_sched_workqueue(current, true);
+
if (cwq->wq->freezeable)
set_freezable();

@@ -726,18 +747,7 @@ int keventd_up(void)

int current_is_keventd(void)
{
- struct cpu_workqueue_struct *cwq;
- int cpu = raw_smp_processor_id(); /* preempt-safe: keventd is per-cpu */
- int ret = 0;
-
- BUG_ON(!keventd_wq);
-
- cwq = per_cpu_ptr(keventd_wq->cpu_wq, cpu);
- if (current == cwq->thread)
- ret = 1;
-
- return ret;
-
+ return is_sched_workqueue(current);
}

static struct cpu_workqueue_struct *
--
1.6.4.2

2009-10-01 08:10:24

by Tejun Heo

Subject: [PATCH 04/19] scheduler: implement force_cpus_allowed_ptr()

Implement force_cpus_allowed_ptr(), which is similar to
set_cpus_allowed_ptr() but bypasses the PF_THREAD_BOUND check and
ignores cpu_active() status as long as the target cpu is online.
This will be used for concurrency managed workqueue.

NOT_SIGNED_OFF_YET
---
include/linux/sched.h | 7 ++++
kernel/sched.c | 88 +++++++++++++++++++++++++++++++++----------------
2 files changed, 66 insertions(+), 29 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cbebadf..5fe60bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1810,6 +1810,8 @@ static inline void rcu_copy_process(struct task_struct *p)
#ifdef CONFIG_SMP
extern int set_cpus_allowed_ptr(struct task_struct *p,
const struct cpumask *new_mask);
+extern int force_cpus_allowed_ptr(struct task_struct *p,
+ const struct cpumask *new_mask);
#else
static inline int set_cpus_allowed_ptr(struct task_struct *p,
const struct cpumask *new_mask)
@@ -1818,6 +1820,11 @@ static inline int set_cpus_allowed_ptr(struct task_struct *p,
return -EINVAL;
return 0;
}
+static inline int force_cpus_allowed_ptr(struct task_struct *p,
+ const struct cpumask *new_mask)
+{
+ return set_cpus_allowed_ptr(p, new_mask);
+}
#endif

#ifndef CONFIG_CPUMASK_OFFSTACK
diff --git a/kernel/sched.c b/kernel/sched.c
index 4e3e789..02f07b2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2070,6 +2070,7 @@ struct migration_req {

struct task_struct *task;
int dest_cpu;
+ bool force;

struct completion done;
};
@@ -2078,8 +2079,8 @@ struct migration_req {
* The task's runqueue lock must be held.
* Returns true if you have to wait for migration thread.
*/
-static int
-migrate_task(struct task_struct *p, int dest_cpu, struct migration_req *req)
+static int migrate_task(struct task_struct *p, int dest_cpu,
+ struct migration_req *req, bool force)
{
struct rq *rq = task_rq(p);

@@ -2095,6 +2096,7 @@ migrate_task(struct task_struct *p, int dest_cpu, struct migration_req *req)
init_completion(&req->done);
req->task = p;
req->dest_cpu = dest_cpu;
+ req->force = force;
list_add(&req->list, &rq->migration_queue);

return 1;
@@ -3096,7 +3098,7 @@ static void sched_migrate_task(struct task_struct *p, int dest_cpu)
goto out;

/* force the process onto the specified CPU */
- if (migrate_task(p, dest_cpu, &req)) {
+ if (migrate_task(p, dest_cpu, &req, false)) {
/* Need to wait for migration thread (might exit: take ref). */
struct task_struct *mt = rq->migration_thread;

@@ -7015,34 +7017,19 @@ static inline void sched_init_granularity(void)
* 7) we wake up and the migration is done.
*/

-/*
- * Change a given task's CPU affinity. Migrate the thread to a
- * proper CPU and schedule it away if the CPU it's executing on
- * is removed from the allowed bitmask.
- *
- * NOTE: the caller must have a valid reference to the task, the
- * task must not exit() & deallocate itself prematurely. The
- * call is not atomic; no spinlocks may be held.
- */
-int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
+static inline int __set_cpus_allowed_ptr(struct task_struct *p,
+ const struct cpumask *new_mask,
+ struct rq *rq, unsigned long *flags,
+ bool force)
{
struct migration_req req;
- unsigned long flags;
- struct rq *rq;
int ret = 0;

- rq = task_rq_lock(p, &flags);
if (!cpumask_intersects(new_mask, cpu_online_mask)) {
ret = -EINVAL;
goto out;
}

- if (unlikely((p->flags & PF_THREAD_BOUND) && p != current &&
- !cpumask_equal(&p->cpus_allowed, new_mask))) {
- ret = -EINVAL;
- goto out;
- }
-
if (p->sched_class->set_cpus_allowed)
p->sched_class->set_cpus_allowed(p, new_mask);
else {
@@ -7054,12 +7041,13 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
if (cpumask_test_cpu(task_cpu(p), new_mask))
goto out;

- if (migrate_task(p, cpumask_any_and(cpu_online_mask, new_mask), &req)) {
+ if (migrate_task(p, cpumask_any_and(cpu_online_mask, new_mask), &req,
+ force)) {
/* Need help from migration thread: drop lock and wait. */
struct task_struct *mt = rq->migration_thread;

get_task_struct(mt);
- task_rq_unlock(rq, &flags);
+ task_rq_unlock(rq, flags);
wake_up_process(rq->migration_thread);
put_task_struct(mt);
wait_for_completion(&req.done);
@@ -7067,13 +7055,53 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
return 0;
}
out:
- task_rq_unlock(rq, &flags);
+ task_rq_unlock(rq, flags);

return ret;
}
+
+/*
+ * Change a given task's CPU affinity. Migrate the thread to a
+ * proper CPU and schedule it away if the CPU it's executing on
+ * is removed from the allowed bitmask.
+ *
+ * NOTE: the caller must have a valid reference to the task, the
+ * task must not exit() & deallocate itself prematurely. The
+ * call is not atomic; no spinlocks may be held.
+ */
+int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
+{
+ unsigned long flags;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &flags);
+
+ if (unlikely((p->flags & PF_THREAD_BOUND) && p != current &&
+ !cpumask_equal(&p->cpus_allowed, new_mask))) {
+ task_rq_unlock(rq, &flags);
+ return -EINVAL;
+ }
+
+ return __set_cpus_allowed_ptr(p, new_mask, rq, &flags, false);
+}
EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);

/*
+ * Similar to set_cpus_allowed_ptr() but bypasses PF_THREAD_BOUND
+ * check and ignores cpu_active() status as long as the cpu is online.
+ * The caller is responsible for ensuring things don't go bonkers.
+ */
+int force_cpus_allowed_ptr(struct task_struct *p,
+ const struct cpumask *new_mask)
+{
+ unsigned long flags;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &flags);
+ return __set_cpus_allowed_ptr(p, new_mask, rq, &flags, true);
+}
+
+/*
* Move (not current) task off this cpu, onto dest cpu. We're doing
* this because either it can't run here any more (set_cpus_allowed()
* away from this CPU, or CPU going down), or because we're
@@ -7084,12 +7112,13 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
*
* Returns non-zero if task was successfully migrated.
*/
-static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
+static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu,
+ bool force)
{
struct rq *rq_dest, *rq_src;
int ret = 0, on_rq;

- if (unlikely(!cpu_active(dest_cpu)))
+ if (!force && unlikely(!cpu_active(dest_cpu)))
return ret;

rq_src = cpu_rq(src_cpu);
@@ -7168,7 +7197,8 @@ static int migration_thread(void *data)

if (req->task != NULL) {
spin_unlock(&rq->lock);
- __migrate_task(req->task, cpu, req->dest_cpu);
+ __migrate_task(req->task, cpu, req->dest_cpu,
+ req->force);
} else if (likely(cpu == (badcpu = smp_processor_id()))) {
req->dest_cpu = RCU_MIGRATION_GOT_QS;
spin_unlock(&rq->lock);
@@ -7193,7 +7223,7 @@ static int __migrate_task_irq(struct task_struct *p, int src_cpu, int dest_cpu)
int ret;

local_irq_disable();
- ret = __migrate_task(p, src_cpu, dest_cpu);
+ ret = __migrate_task(p, src_cpu, dest_cpu, false);
local_irq_enable();
return ret;
}
--
1.6.4.2

2009-10-01 08:10:34

by Tejun Heo

Subject: [PATCH 05/19] kthread: implement kthread_data()

Implement kthread_data() which takes @task pointing to a kthread and
returns @data specified when creating the kthread. The caller is
responsible for ensuring the validity of @task when calling this
function.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/kthread.h | 1 +
kernel/kthread.c | 15 +++++++++++++++
2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index aabc8a1..14f63e8 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -30,6 +30,7 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
void kthread_bind(struct task_struct *k, unsigned int cpu);
int kthread_stop(struct task_struct *k);
int kthread_should_stop(void);
+void *kthread_data(struct task_struct *k);

int kthreadd(void *unused);
extern struct task_struct *kthreadd_task;
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 5fe7099..bd4cb7f 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -35,6 +35,7 @@ struct kthread_create_info

struct kthread {
int should_stop;
+ void *data;
struct completion exited;
};

@@ -54,6 +55,19 @@ int kthread_should_stop(void)
}
EXPORT_SYMBOL(kthread_should_stop);

+/**
+ * kthread_data - return data value specified on kthread creation
+ * @task: kthread task in question
+ *
+ * Return the data value specified when kthread @task was created.
+ * The caller is responsible for ensuring the validity of @task when
+ * calling this function.
+ */
+void *kthread_data(struct task_struct *task)
+{
+ return to_kthread(task)->data;
+}
+
static int kthread(void *_create)
{
/* Copy data: it's on kthread's stack */
@@ -64,6 +78,7 @@ static int kthread(void *_create)
int ret;

self.should_stop = 0;
+ self.data = data;
init_completion(&self.exited);
current->vfork_done = &self.exited;

--
1.6.4.2
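
Just to illustrate how kthread_data() pairs with kthread_create()
(everything below is a made-up example, not code from this series):

#include <linux/err.h>
#include <linux/kernel.h>
#include <linux/kthread.h>
#include <linux/sched.h>

/* illustrative payload attached to the kthread */
struct example_worker {
        int id;
};

static int example_thread_fn(void *data)
{
        /* inside the kthread, the creation @data can be recovered
         * from the task itself */
        struct example_worker *ew = kthread_data(current);

        WARN_ON(ew != data);

        set_current_state(TASK_INTERRUPTIBLE);
        while (!kthread_should_stop()) {
                schedule();
                set_current_state(TASK_INTERRUPTIBLE);
        }
        __set_current_state(TASK_RUNNING);
        return 0;
}

static struct task_struct *spawn_example_worker(struct example_worker *ew,
                                                unsigned int cpu)
{
        struct task_struct *task;

        task = kthread_create(example_thread_fn, ew, "example/%u", cpu);
        if (IS_ERR(task))
                return task;

        kthread_bind(task, cpu);
        wake_up_process(task);
        return task;
}

The point for this series is the other direction: given only a
task_struct (e.g. from a scheduler callback), kthread_data() hands
back the worker structure that was passed at creation time.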

2009-10-01 08:16:00

by Tejun Heo

Subject: [PATCH 06/19] acpi: use queue_work_on() instead of binding workqueue worker to cpu0

ACPI works need to be executed on cpu0, and acpi/osl.c achieves this
by creating singlethread workqueues and then binding them to cpu0
from a work item, which is quite unorthodox. Make it create regular
workqueues and use queue_work_on() instead. This is in preparation
for concurrency managed workqueue; the extra workers won't be a
problem once it is implemented.

NOT_SIGNED_OFF_YET
---
drivers/acpi/osl.c | 41 ++++++++++++-----------------------------
1 files changed, 12 insertions(+), 29 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 7c1c59e..f742b7b 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -191,36 +191,11 @@ acpi_status __init acpi_os_initialize(void)
return AE_OK;
}

-static void bind_to_cpu0(struct work_struct *work)
-{
- set_cpus_allowed_ptr(current, cpumask_of(0));
- kfree(work);
-}
-
-static void bind_workqueue(struct workqueue_struct *wq)
-{
- struct work_struct *work;
-
- work = kzalloc(sizeof(struct work_struct), GFP_KERNEL);
- INIT_WORK(work, bind_to_cpu0);
- queue_work(wq, work);
-}
-
acpi_status acpi_os_initialize1(void)
{
- /*
- * On some machines, a software-initiated SMI causes corruption unless
- * the SMI runs on CPU 0. An SMI can be initiated by any AML, but
- * typically it's done in GPE-related methods that are run via
- * workqueues, so we can avoid the known corruption cases by binding
- * the workqueues to CPU 0.
- */
- kacpid_wq = create_singlethread_workqueue("kacpid");
- bind_workqueue(kacpid_wq);
- kacpi_notify_wq = create_singlethread_workqueue("kacpi_notify");
- bind_workqueue(kacpi_notify_wq);
- kacpi_hotplug_wq = create_singlethread_workqueue("kacpi_hotplug");
- bind_workqueue(kacpi_hotplug_wq);
+ kacpid_wq = create_workqueue("kacpid");
+ kacpi_notify_wq = create_workqueue("kacpi_notify");
+ kacpi_hotplug_wq = create_workqueue("kacpi_hotplug");
BUG_ON(!kacpid_wq);
BUG_ON(!kacpi_notify_wq);
BUG_ON(!kacpi_hotplug_wq);
@@ -759,7 +734,15 @@ static acpi_status __acpi_os_execute(acpi_execute_type type,
(type == OSL_NOTIFY_HANDLER ? kacpi_notify_wq : kacpid_wq);
dpc->wait = hp ? 1 : 0;
INIT_WORK(&dpc->work, acpi_os_execute_deferred);
- ret = queue_work(queue, &dpc->work);
+
+ /*
+ * On some machines, a software-initiated SMI causes corruption unless
+ * the SMI runs on CPU 0. An SMI can be initiated by any AML, but
+ * typically it's done in GPE-related methods that are run via
+ * workqueues, so we can avoid the known corruption cases by always
+ * queueing on CPU 0.
+ */
+ ret = queue_work_on(0, queue, &dpc->work);

if (!ret) {
printk(KERN_ERR PREFIX
--
1.6.4.2

2009-10-01 08:10:30

by Tejun Heo

Subject: [PATCH 07/19] stop_machine: reimplement without using workqueue

stop_machine() is the only user of the RT workqueue. Reimplement it
using kthreads directly and rip RT support out of workqueue. This is
in preparation for concurrency managed workqueue.

NOT_SIGNED_OFF_YET
---
include/linux/stop_machine.h | 6 ++
include/linux/workqueue.h | 20 +++---
init/main.c | 2 +
kernel/stop_machine.c | 151 ++++++++++++++++++++++++++++++++++-------
kernel/workqueue.c | 6 --
5 files changed, 142 insertions(+), 43 deletions(-)

diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index baba3a2..2d32e06 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -53,6 +53,11 @@ int stop_machine_create(void);
*/
void stop_machine_destroy(void);

+/**
+ * init_stop_machine: initialize stop_machine during boot
+ */
+void init_stop_machine(void);
+
#else

static inline int stop_machine(int (*fn)(void *), void *data,
@@ -67,6 +72,7 @@ static inline int stop_machine(int (*fn)(void *), void *data,

static inline int stop_machine_create(void) { return 0; }
static inline void stop_machine_destroy(void) { }
+static inline void init_stop_machine(void) { }

#endif /* CONFIG_SMP */
#endif /* _LINUX_STOP_MACHINE */
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 7ef0c7b..05f0998 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -165,12 +165,11 @@ struct execute_work {


extern struct workqueue_struct *
-__create_workqueue_key(const char *name, int singlethread,
- int freezeable, int rt, struct lock_class_key *key,
- const char *lock_name);
+__create_workqueue_key(const char *name, int singlethread, int freezeable,
+ struct lock_class_key *key, const char *lock_name);

#ifdef CONFIG_LOCKDEP
-#define __create_workqueue(name, singlethread, freezeable, rt) \
+#define __create_workqueue(name, singlethread, freezeable) \
({ \
static struct lock_class_key __key; \
const char *__lock_name; \
@@ -181,19 +180,18 @@ __create_workqueue_key(const char *name, int singlethread,
__lock_name = #name; \
\
__create_workqueue_key((name), (singlethread), \
- (freezeable), (rt), &__key, \
+ (freezeable), &__key, \
__lock_name); \
})
#else
-#define __create_workqueue(name, singlethread, freezeable, rt) \
- __create_workqueue_key((name), (singlethread), (freezeable), (rt), \
+#define __create_workqueue(name, singlethread, freezeable) \
+ __create_workqueue_key((name), (singlethread), (freezeable), \
NULL, NULL)
#endif

-#define create_workqueue(name) __create_workqueue((name), 0, 0, 0)
-#define create_rt_workqueue(name) __create_workqueue((name), 0, 0, 1)
-#define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1, 0)
-#define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0, 0)
+#define create_workqueue(name) __create_workqueue((name), 0, 0)
+#define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1)
+#define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0)

extern void destroy_workqueue(struct workqueue_struct *wq);

diff --git a/init/main.c b/init/main.c
index 7449819..5ae8d76 100644
--- a/init/main.c
+++ b/init/main.c
@@ -34,6 +34,7 @@
#include <linux/security.h>
#include <linux/smp.h>
#include <linux/workqueue.h>
+#include <linux/stop_machine.h>
#include <linux/profile.h>
#include <linux/rcupdate.h>
#include <linux/moduleparam.h>
@@ -780,6 +781,7 @@ static void __init do_basic_setup(void)
{
rcu_init_sched(); /* needed by module_init stage. */
init_workqueues();
+ init_stop_machine();
cpuset_init_smp();
usermodehelper_init();
init_tmpfs();
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 912823e..671a4ac 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -25,6 +25,8 @@ enum stopmachine_state {
STOPMACHINE_RUN,
/* Exit */
STOPMACHINE_EXIT,
+ /* Done */
+ STOPMACHINE_DONE,
};
static enum stopmachine_state state;

@@ -42,10 +44,9 @@ static DEFINE_MUTEX(lock);
static DEFINE_MUTEX(setup_lock);
/* Users of stop_machine. */
static int refcount;
-static struct workqueue_struct *stop_machine_wq;
+static struct task_struct **stop_machine_threads;
static struct stop_machine_data active, idle;
static const struct cpumask *active_cpus;
-static void *stop_machine_work;

static void set_state(enum stopmachine_state newstate)
{
@@ -63,14 +64,31 @@ static void ack_state(void)
}

/* This is the actual function which stops the CPU. It runs
- * in the context of a dedicated stopmachine workqueue. */
-static void stop_cpu(struct work_struct *unused)
+ * on dedicated per-cpu kthreads. */
+static int stop_cpu(void *unused)
{
enum stopmachine_state curstate = STOPMACHINE_NONE;
- struct stop_machine_data *smdata = &idle;
+ struct stop_machine_data *smdata;
int cpu = smp_processor_id();
int err;

+repeat:
+ /* Wait for __stop_machine() to initiate */
+ while (true) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ /* <- kthread_stop() and __stop_machine()::smp_wmb() */
+ if (kthread_should_stop()) {
+ __set_current_state(TASK_RUNNING);
+ return 0;
+ }
+ if (state == STOPMACHINE_PREPARE)
+ break;
+ schedule();
+ }
+ smp_rmb(); /* <- __stop_machine()::set_state() */
+
+ /* Okay, let's go */
+ smdata = &idle;
if (!active_cpus) {
if (cpu == cpumask_first(cpu_online_mask))
smdata = &active;
@@ -104,6 +122,7 @@ static void stop_cpu(struct work_struct *unused)
} while (curstate != STOPMACHINE_EXIT);

local_irq_enable();
+ goto repeat;
}

/* Callback for CPUs which aren't supposed to do anything. */
@@ -112,46 +131,122 @@ static int chill(void *unused)
return 0;
}

+static int create_stop_machine_thread(unsigned int cpu)
+{
+ struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+ struct task_struct **pp = per_cpu_ptr(stop_machine_threads, cpu);
+ struct task_struct *p;
+
+ if (*pp)
+ return -EBUSY;
+
+ p = kthread_create(stop_cpu, NULL, "kstop/%u", cpu);
+ if (IS_ERR(p))
+ return PTR_ERR(p);
+
+ sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
+ *pp = p;
+ return 0;
+}
+
+/* Should be called with cpu hotplug disabled and setup_lock held */
+static void kill_stop_machine_threads(void)
+{
+ unsigned int cpu;
+
+ if (!stop_machine_threads)
+ return;
+
+ for_each_online_cpu(cpu) {
+ struct task_struct *p = *per_cpu_ptr(stop_machine_threads, cpu);
+ if (p)
+ kthread_stop(p);
+ }
+ free_percpu(stop_machine_threads);
+ stop_machine_threads = NULL;
+}
+
int stop_machine_create(void)
{
+ unsigned int cpu;
+
+ get_online_cpus();
mutex_lock(&setup_lock);
if (refcount)
goto done;
- stop_machine_wq = create_rt_workqueue("kstop");
- if (!stop_machine_wq)
- goto err_out;
- stop_machine_work = alloc_percpu(struct work_struct);
- if (!stop_machine_work)
+
+ stop_machine_threads = alloc_percpu(struct task_struct *);
+ if (!stop_machine_threads)
goto err_out;
+
+ /*
+ * cpu hotplug is disabled, create only for online cpus,
+ * cpu_callback() will handle cpu hot [un]plugs.
+ */
+ for_each_online_cpu(cpu) {
+ if (create_stop_machine_thread(cpu))
+ goto err_out;
+ kthread_bind(*per_cpu_ptr(stop_machine_threads, cpu), cpu);
+ }
done:
refcount++;
mutex_unlock(&setup_lock);
+ put_online_cpus();
return 0;

err_out:
- if (stop_machine_wq)
- destroy_workqueue(stop_machine_wq);
+ kill_stop_machine_threads();
mutex_unlock(&setup_lock);
+ put_online_cpus();
return -ENOMEM;
}
EXPORT_SYMBOL_GPL(stop_machine_create);

void stop_machine_destroy(void)
{
+ get_online_cpus();
mutex_lock(&setup_lock);
- refcount--;
- if (refcount)
- goto done;
- destroy_workqueue(stop_machine_wq);
- free_percpu(stop_machine_work);
-done:
+ if (!--refcount)
+ kill_stop_machine_threads();
mutex_unlock(&setup_lock);
+ put_online_cpus();
}
EXPORT_SYMBOL_GPL(stop_machine_destroy);

+static int __cpuinit stop_machine_cpu_callback(struct notifier_block *nfb,
+ unsigned long action, void *hcpu)
+{
+ unsigned int cpu = (unsigned long)hcpu;
+ struct task_struct **pp = per_cpu_ptr(stop_machine_threads, cpu);
+
+ /* Hotplug exclusion is enough, no need to worry about setup_lock */
+ if (!stop_machine_threads)
+ return NOTIFY_OK;
+
+ switch (action & ~CPU_TASKS_FROZEN) {
+ case CPU_UP_PREPARE:
+ if (create_stop_machine_thread(cpu)) {
+ printk(KERN_ERR "failed to create stop machine "
+ "thread for %u\n", cpu);
+ return NOTIFY_BAD;
+ }
+ break;
+
+ case CPU_ONLINE:
+ kthread_bind(*pp, cpu);
+ break;
+
+ case CPU_UP_CANCELED:
+ case CPU_POST_DEAD:
+ kthread_stop(*pp);
+ *pp = NULL;
+ break;
+ }
+ return NOTIFY_OK;
+}
+
int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
{
- struct work_struct *sm_work;
int i, ret;

/* Set up initial state. */
@@ -164,19 +259,18 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
idle.fn = chill;
idle.data = NULL;

- set_state(STOPMACHINE_PREPARE);
+ set_state(STOPMACHINE_PREPARE); /* -> stop_cpu()::smp_rmb() */
+ smp_wmb(); /* -> stop_cpu()::set_current_state() */

/* Schedule the stop_cpu work on all cpus: hold this CPU so one
* doesn't hit this CPU until we're ready. */
get_cpu();
- for_each_online_cpu(i) {
- sm_work = per_cpu_ptr(stop_machine_work, i);
- INIT_WORK(sm_work, stop_cpu);
- queue_work_on(i, stop_machine_wq, sm_work);
- }
+ for_each_online_cpu(i)
+ wake_up_process(*per_cpu_ptr(stop_machine_threads, i));
/* This will release the thread on our CPU. */
put_cpu();
- flush_workqueue(stop_machine_wq);
+ while (state < STOPMACHINE_DONE)
+ yield();
ret = active.fnret;
mutex_unlock(&lock);
return ret;
@@ -197,3 +291,8 @@ int stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
return ret;
}
EXPORT_SYMBOL_GPL(stop_machine);
+
+void __init init_stop_machine(void)
+{
+ hotcpu_notifier(stop_machine_cpu_callback, 0);
+}
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b56737b..a8a35a9 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -64,7 +64,6 @@ struct workqueue_struct {
const char *name;
int singlethread;
int freezeable; /* Freeze threads during suspend */
- int rt;
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
@@ -765,7 +764,6 @@ init_cpu_workqueue(struct workqueue_struct *wq, int cpu)

static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
{
- struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
struct workqueue_struct *wq = cwq->wq;
const char *fmt = is_wq_single_threaded(wq) ? "%s" : "%s/%d";
struct task_struct *p;
@@ -781,8 +779,6 @@ static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
*/
if (IS_ERR(p))
return PTR_ERR(p);
- if (cwq->wq->rt)
- sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
cwq->thread = p;

trace_workqueue_creation(cwq->thread, cpu);
@@ -804,7 +800,6 @@ static void start_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
struct workqueue_struct *__create_workqueue_key(const char *name,
int singlethread,
int freezeable,
- int rt,
struct lock_class_key *key,
const char *lock_name)
{
@@ -826,7 +821,6 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
wq->singlethread = singlethread;
wq->freezeable = freezeable;
- wq->rt = rt;
INIT_LIST_HEAD(&wq->list);

if (singlethread) {
--
1.6.4.2

2009-10-01 08:14:18

by Tejun Heo

Subject: [PATCH 08/19] workqueue: misc/cosmetic updates

Make the following updates in preparation for concurrency managed
workqueue. None of these changes causes any visible behavior
difference.

* Add comments and adjust indentations to data structures and several
functions.

* Rename wq_per_cpu() to get_cwq() and swap the position of two
parameters for consistency. Convert a direct per_cpu_ptr() access
to wq->cpu_wq to get_cwq().

* Update set_wq_data() such that it sets the flags part to
WORK_STRUCT_PENDING | @extra_flags.

* Move the sanity check on work->entry emptiness from queue_work_on()
to __queue_work(), which all queueing paths share.

* Make __queue_work() take @cpu and @wq instead of @cwq.

* Restructure flush_work() and __create_workqueue_key() to make them
easier to modify.

NOT_SIGNED_OFF_YET
---
kernel/workqueue.c | 122 ++++++++++++++++++++++++++++++++++------------------
1 files changed, 80 insertions(+), 42 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a8a35a9..10c0cc4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -39,6 +39,16 @@
#include "sched_workqueue.h"

/*
+ * Structure fields follow one of the following exclusion rules.
+ *
+ * I: Set during initialization and read-only afterwards.
+ *
+ * L: cwq->lock protected. Access with cwq->lock held.
+ *
+ * W: workqueue_lock protected.
+ */
+
+/*
* The per-CPU workqueue (if single thread, we always use the first
* possible cpu).
*/
@@ -50,8 +60,8 @@ struct cpu_workqueue_struct {
wait_queue_head_t more_work;
struct work_struct *current_work;

- struct workqueue_struct *wq;
- struct task_struct *thread;
+ struct workqueue_struct *wq; /* I: the owning workqueue */
+ struct task_struct *thread;
} ____cacheline_aligned;

/*
@@ -59,13 +69,13 @@ struct cpu_workqueue_struct {
* per-CPU workqueues:
*/
struct workqueue_struct {
- struct cpu_workqueue_struct *cpu_wq;
- struct list_head list;
- const char *name;
+ struct cpu_workqueue_struct *cpu_wq; /* I: cwq's */
+ struct list_head list; /* W: list of all workqueues */
+ const char *name; /* I: workqueue name */
int singlethread;
int freezeable; /* Freeze threads during suspend */
#ifdef CONFIG_LOCKDEP
- struct lockdep_map lockdep_map;
+ struct lockdep_map lockdep_map;
#endif
};

@@ -96,8 +106,8 @@ static const struct cpumask *wq_cpu_map(struct workqueue_struct *wq)
? cpu_singlethread_map : cpu_populated_map;
}

-static
-struct cpu_workqueue_struct *wq_per_cpu(struct workqueue_struct *wq, int cpu)
+static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
+ struct workqueue_struct *wq)
{
if (unlikely(is_wq_single_threaded(wq)))
cpu = singlethread_cpu;
@@ -109,15 +119,12 @@ struct cpu_workqueue_struct *wq_per_cpu(struct workqueue_struct *wq, int cpu)
* - Must *only* be called if the pending flag is set
*/
static inline void set_wq_data(struct work_struct *work,
- struct cpu_workqueue_struct *cwq)
+ struct cpu_workqueue_struct *cwq,
+ unsigned long extra_flags)
{
- unsigned long new;
-
BUG_ON(!work_pending(work));
-
- new = (unsigned long) cwq | (1UL << WORK_STRUCT_PENDING);
- new |= WORK_STRUCT_FLAG_MASK & *work_data_bits(work);
- atomic_long_set(&work->data, new);
+ atomic_long_set(&work->data, (unsigned long)cwq |
+ (1UL << WORK_STRUCT_PENDING) | extra_flags);
}

static inline
@@ -142,28 +149,46 @@ void sched_workqueue_worker_sleep(struct task_struct *task)
{
}

+/**
+ * insert_work - insert a work into cwq
+ * @cwq: cwq @work belongs to
+ * @work: work to insert
+ * @head: insertion point
+ * @extra_flags: extra WORK_STRUCT_* flags to set
+ *
+ * Insert @work into @cwq after @head.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
static void insert_work(struct cpu_workqueue_struct *cwq,
- struct work_struct *work, struct list_head *head)
+ struct work_struct *work, struct list_head *head,
+ unsigned int extra_flags)
{
trace_workqueue_insertion(cwq->thread, work);

- set_wq_data(work, cwq);
+ /* we own @work, set data and link */
+ set_wq_data(work, cwq, extra_flags);
+
/*
* Ensure that we get the right work->data if we see the
* result of list_add() below, see try_to_grab_pending().
*/
smp_wmb();
+
list_add_tail(&work->entry, head);
wake_up(&cwq->more_work);
}

-static void __queue_work(struct cpu_workqueue_struct *cwq,
+static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
struct work_struct *work)
{
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
unsigned long flags;

spin_lock_irqsave(&cwq->lock, flags);
- insert_work(cwq, work, &cwq->worklist);
+ BUG_ON(!list_empty(&work->entry));
+ insert_work(cwq, work, &cwq->worklist, 0);
spin_unlock_irqrestore(&cwq->lock, flags);
}

@@ -205,8 +230,7 @@ queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work)
int ret = 0;

if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
- BUG_ON(!list_empty(&work->entry));
- __queue_work(wq_per_cpu(wq, cpu), work);
+ __queue_work(cpu, wq, work);
ret = 1;
}
return ret;
@@ -217,9 +241,8 @@ static void delayed_work_timer_fn(unsigned long __data)
{
struct delayed_work *dwork = (struct delayed_work *)__data;
struct cpu_workqueue_struct *cwq = get_wq_data(&dwork->work);
- struct workqueue_struct *wq = cwq->wq;

- __queue_work(wq_per_cpu(wq, smp_processor_id()), &dwork->work);
+ __queue_work(smp_processor_id(), cwq->wq, &dwork->work);
}

/**
@@ -263,7 +286,7 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
timer_stats_timer_set_start_info(&dwork->timer);

/* This stores cwq for the moment, for the timer_fn */
- set_wq_data(work, wq_per_cpu(wq, raw_smp_processor_id()));
+ set_wq_data(work, get_cwq(raw_smp_processor_id(), wq), 0);
timer->expires = jiffies + delay;
timer->data = (unsigned long)dwork;
timer->function = delayed_work_timer_fn;
@@ -326,6 +349,12 @@ static void run_workqueue(struct cpu_workqueue_struct *cwq)
spin_unlock_irq(&cwq->lock);
}

+/**
+ * worker_thread - the worker thread function
+ * @__cwq: cwq to serve
+ *
+ * The cwq worker thread function.
+ */
static int worker_thread(void *__cwq)
{
struct cpu_workqueue_struct *cwq = __cwq;
@@ -367,15 +396,25 @@ static void wq_barrier_func(struct work_struct *work)
complete(&barr->done);
}

+/**
+ * insert_wq_barrier - insert a barrier work
+ * @cwq: cwq to insert barrier into
+ * @barr: wq_barrier to insert
+ * @head: insertion point
+ *
+ * Insert barrier @barr into @cwq before @head.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
struct wq_barrier *barr, struct list_head *head)
{
INIT_WORK(&barr->work, wq_barrier_func);
__set_bit(WORK_STRUCT_PENDING, work_data_bits(&barr->work));
-
init_completion(&barr->done);

- insert_work(cwq, &barr->work, head);
+ insert_work(cwq, &barr->work, head, 0);
}

static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
@@ -407,9 +446,6 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
*
* We sleep until all works which were queued on entry have been handled,
* but we are not livelocked by new incoming ones.
- *
- * This function used to run the workqueues itself. Now we just wait for the
- * helper threads to do it.
*/
void flush_workqueue(struct workqueue_struct *wq)
{
@@ -448,7 +484,6 @@ int flush_work(struct work_struct *work)
lock_map_acquire(&cwq->wq->lockdep_map);
lock_map_release(&cwq->wq->lockdep_map);

- prev = NULL;
spin_lock_irq(&cwq->lock);
if (!list_empty(&work->entry)) {
/*
@@ -457,21 +492,20 @@ int flush_work(struct work_struct *work)
*/
smp_rmb();
if (unlikely(cwq != get_wq_data(work)))
- goto out;
+ goto already_gone;
prev = &work->entry;
} else {
if (cwq->current_work != work)
- goto out;
+ goto already_gone;
prev = &cwq->worklist;
}
insert_wq_barrier(cwq, &barr, prev->next);
-out:
spin_unlock_irq(&cwq->lock);
- if (!prev)
- return 0;
-
wait_for_completion(&barr.done);
return 1;
+already_gone:
+ spin_unlock_irq(&cwq->lock);
+ return 0;
}
EXPORT_SYMBOL_GPL(flush_work);

@@ -551,7 +585,7 @@ static void wait_on_work(struct work_struct *work)
cpu_map = wq_cpu_map(wq);

for_each_cpu(cpu, cpu_map)
- wait_on_cpu_work(per_cpu_ptr(wq->cpu_wq, cpu), work);
+ wait_on_cpu_work(get_cwq(cpu, wq), work);
}

static int __cancel_work_timer(struct work_struct *work,
@@ -809,13 +843,11 @@ struct workqueue_struct *__create_workqueue_key(const char *name,

wq = kzalloc(sizeof(*wq), GFP_KERNEL);
if (!wq)
- return NULL;
+ goto err;

wq->cpu_wq = alloc_percpu(struct cpu_workqueue_struct);
- if (!wq->cpu_wq) {
- kfree(wq);
- return NULL;
- }
+ if (!wq->cpu_wq)
+ goto err;

wq->name = name;
lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
@@ -859,6 +891,12 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
wq = NULL;
}
return wq;
+err:
+ if (wq) {
+ free_percpu(wq->cpu_wq);
+ kfree(wq);
+ }
+ return NULL;
}
EXPORT_SYMBOL_GPL(__create_workqueue_key);

--
1.6.4.2

2009-10-01 08:10:43

by Tejun Heo

Subject: [PATCH 09/19] workqueue: merge feature parameters into flags

Currently, __create_workqueue_key() takes @singlethread and
@freezeable parameters and stores them separately in
workqueue_struct. Merge them into a single flags parameter and field
and use WQ_FREEZEABLE and WQ_SINGLE_THREAD.

NOT_SIGNED_OFF_YET
---
include/linux/workqueue.h | 25 +++++++++++++++----------
kernel/workqueue.c | 17 +++++++----------
2 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 05f0998..d3334a0 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -163,13 +163,17 @@ struct execute_work {
#define work_clear_pending(work) \
clear_bit(WORK_STRUCT_PENDING, work_data_bits(work))

+enum {
+ WQ_FREEZEABLE = 1 << 0, /* freeze during suspend */
+ WQ_SINGLE_THREAD = 1 << 1, /* no per-cpu worker */
+};

extern struct workqueue_struct *
-__create_workqueue_key(const char *name, int singlethread, int freezeable,
+__create_workqueue_key(const char *name, unsigned int flags,
struct lock_class_key *key, const char *lock_name);

#ifdef CONFIG_LOCKDEP
-#define __create_workqueue(name, singlethread, freezeable) \
+#define __create_workqueue(name, flags) \
({ \
static struct lock_class_key __key; \
const char *__lock_name; \
@@ -179,19 +183,20 @@ __create_workqueue_key(const char *name, int singlethread, int freezeable,
else \
__lock_name = #name; \
\
- __create_workqueue_key((name), (singlethread), \
- (freezeable), &__key, \
+ __create_workqueue_key((name), (flags), &__key, \
__lock_name); \
})
#else
-#define __create_workqueue(name, singlethread, freezeable) \
- __create_workqueue_key((name), (singlethread), (freezeable), \
- NULL, NULL)
+#define __create_workqueue(name, flags) \
+ __create_workqueue_key((name), (flags), NULL, NULL)
#endif

-#define create_workqueue(name) __create_workqueue((name), 0, 0)
-#define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1)
-#define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0)
+#define create_workqueue(name) \
+ __create_workqueue((name), 0)
+#define create_freezeable_workqueue(name) \
+ __create_workqueue((name), WQ_FREEZEABLE | WQ_SINGLE_THREAD)
+#define create_singlethread_workqueue(name) \
+ __create_workqueue((name), WQ_SINGLE_THREAD)

extern void destroy_workqueue(struct workqueue_struct *wq);

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 10c0cc4..befda6c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -69,11 +69,10 @@ struct cpu_workqueue_struct {
* per-CPU workqueues:
*/
struct workqueue_struct {
+ unsigned int flags; /* I: WQ_* flags */
struct cpu_workqueue_struct *cpu_wq; /* I: cwq's */
struct list_head list; /* W: list of all workqueues */
const char *name; /* I: workqueue name */
- int singlethread;
- int freezeable; /* Freeze threads during suspend */
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
@@ -95,9 +94,9 @@ static const struct cpumask *cpu_singlethread_map __read_mostly;
static cpumask_var_t cpu_populated_map __read_mostly;

/* If it's single threaded, it isn't in the list of workqueues. */
-static inline int is_wq_single_threaded(struct workqueue_struct *wq)
+static inline bool is_wq_single_threaded(struct workqueue_struct *wq)
{
- return wq->singlethread;
+ return wq->flags & WQ_SINGLE_THREAD;
}

static const struct cpumask *wq_cpu_map(struct workqueue_struct *wq)
@@ -363,7 +362,7 @@ static int worker_thread(void *__cwq)
/* set workqueue scheduler */
switch_sched_workqueue(current, true);

- if (cwq->wq->freezeable)
+ if (cwq->wq->flags & WQ_FREEZEABLE)
set_freezable();

for (;;) {
@@ -832,8 +831,7 @@ static void start_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
}

struct workqueue_struct *__create_workqueue_key(const char *name,
- int singlethread,
- int freezeable,
+ unsigned int flags,
struct lock_class_key *key,
const char *lock_name)
{
@@ -849,13 +847,12 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
if (!wq->cpu_wq)
goto err;

+ wq->flags = flags;
wq->name = name;
lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
- wq->singlethread = singlethread;
- wq->freezeable = freezeable;
INIT_LIST_HEAD(&wq->list);

- if (singlethread) {
+ if (flags & WQ_SINGLE_THREAD) {
cwq = init_cpu_workqueue(wq, singlethread_cpu);
err = create_workqueue_thread(cwq, singlethread_cpu);
start_workqueue_thread(cwq, -1);
--
1.6.4.2

2009-10-01 08:10:39

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 10/19] workqueue: update cwq alignment and make one more flag bit available

Currently cwqs are aligned to a cacheline and the lower two bits of
work_struct->data are considered available and used as flags. Make the
alignment requirement official by defining WORK_STRUCT_FLAG_BITS and
aligning cwqs to two to the power of it.

This is in preparation for concurrency managed workqueue, where
aligning cwqs to a cacheline won't matter as much. While at it, this
patch reserves one more bit for work flags and makes sure the resulting
alignment is at least equal to that of long long.
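
As a rough userspace sketch of why the alignment matters (illustrative
only, not kernel code; fake_cwq and the FLAG_* macros are made up):
aligning the structure to 1 << 3 guarantees that the low three address
bits are zero, so they can carry flags and still be masked off to
recover the pointer, which is what set_wq_data()/get_wq_data() rely on.

  #include <stdio.h>

  #define FLAG_BITS 3
  #define FLAG_MASK ((1UL << FLAG_BITS) - 1)
  #define PTR_MASK  (~FLAG_MASK)

  /* aligned(1 << FLAG_BITS) guarantees the low 3 address bits are zero */
  struct fake_cwq {
          int dummy;
  } __attribute__((aligned(1 << FLAG_BITS)));

  int main(void)
  {
          struct fake_cwq cwq;
          unsigned long data;

          /* pack pointer and a flag into one word, as set_wq_data() does */
          data = (unsigned long)&cwq | 0x1;       /* 0x1 ~ PENDING */

          printf("flags: %#lx, pointer recovered: %d\n", data & FLAG_MASK,
                 (struct fake_cwq *)(data & PTR_MASK) == &cwq);
          return 0;
  }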

NOT_SIGNED_OFF_YET
---
include/linux/workqueue.h | 17 ++++++++++++++---
kernel/workqueue.c | 15 +++++++++++++--
2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index d3334a0..3d11ce3 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -22,11 +22,22 @@ typedef void (*work_func_t)(struct work_struct *work);
*/
#define work_data_bits(work) ((unsigned long *)(&(work)->data))

+enum {
+ WORK_STRUCT_PENDING = 0, /* work item is pending execution */
+
+ /*
+ * Reserve 3bits off of cwq pointer. This is enough and
+ * provides acceptable alignment on both 32 and 64bit
+ * machines.
+ */
+ WORK_STRUCT_FLAG_BITS = 3,
+
+ WORK_STRUCT_FLAG_MASK = (1UL << WORK_STRUCT_FLAG_BITS) - 1,
+ WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
+};
+
struct work_struct {
atomic_long_t data;
-#define WORK_STRUCT_PENDING 0 /* T if work item pending execution */
-#define WORK_STRUCT_FLAG_MASK (3UL)
-#define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)
struct list_head entry;
work_func_t func;
#ifdef CONFIG_LOCKDEP
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index befda6c..28de966 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -50,7 +50,9 @@

/*
* The per-CPU workqueue (if single thread, we always use the first
- * possible cpu).
+ * possible cpu). The lower WORK_STRUCT_FLAG_BITS of
+ * work_struct->data are used for flags and thus cwqs need to be
+ * aligned at two's power of the number of flag bits.
*/
struct cpu_workqueue_struct {

@@ -62,7 +64,7 @@ struct cpu_workqueue_struct {

struct workqueue_struct *wq; /* I: the owning workqueue */
struct task_struct *thread;
-} ____cacheline_aligned;
+} __attribute__((aligned(1 << WORK_STRUCT_FLAG_BITS)));

/*
* The externally visible workqueue abstraction is an array of
@@ -1049,6 +1051,15 @@ EXPORT_SYMBOL_GPL(work_on_cpu);

void __init init_workqueues(void)
{
+ /*
+ * cwqs are forced aligned according to WORK_STRUCT_FLAG_BITS.
+ * Make sure that the alignment isn't lower than that of
+ * unsigned long long in case this code survives for longer
+ * than twenty years. :-P
+ */
+ BUILD_BUG_ON(__alignof__(struct cpu_workqueue_struct) <
+ __alignof__(unsigned long long));
+
alloc_cpumask_var(&cpu_populated_map, GFP_KERNEL);

cpumask_copy(cpu_populated_map, cpu_online_mask);
--
1.6.4.2

2009-10-01 08:13:12

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 11/19] workqueue: define both bit position and mask for work flags

Work flags are about to see more traditional mask handling. Define
WORK_STRUCT_*_BIT as bit position constants and redefine
WORK_STRUCT_* as bit masks.
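
A small userspace sketch of why both forms are wanted (illustrative
only): the atomic bitops such as test_and_set_bit() take a bit number,
while composing work->data in a single store wants a ready-made mask
that can be ORed together with the cwq pointer and other flags.

  #include <stdio.h>

  enum {
          WORK_STRUCT_PENDING_BIT = 0,
          WORK_STRUCT_PENDING     = 1 << WORK_STRUCT_PENDING_BIT,
  };

  int main(void)
  {
          /* bitop style: operate on a bit position, as test_and_set_bit() does */
          unsigned long data = 0;
          data |= 1UL << WORK_STRUCT_PENDING_BIT;

          /* mask style: OR the flag into a word together with a (fake) pointer
           * whose low bits are clear, as set_wq_data() does */
          unsigned long word = 0xdeadbee8UL | WORK_STRUCT_PENDING;

          printf("%d %d\n", !!(data & WORK_STRUCT_PENDING),
                 !!(word & WORK_STRUCT_PENDING));
          return 0;
  }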

NOT_SIGNED_OFF_YET
---
include/linux/workqueue.h | 8 +++++---
kernel/workqueue.c | 12 ++++++------
2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 3d11ce3..541c5eb 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -23,7 +23,9 @@ typedef void (*work_func_t)(struct work_struct *work);
#define work_data_bits(work) ((unsigned long *)(&(work)->data))

enum {
- WORK_STRUCT_PENDING = 0, /* work item is pending execution */
+ WORK_STRUCT_PENDING_BIT = 0, /* work item is pending execution */
+
+ WORK_STRUCT_PENDING = 1 << WORK_STRUCT_PENDING_BIT,

/*
* Reserve 3bits off of cwq pointer. This is enough and
@@ -157,7 +159,7 @@ struct execute_work {
* @work: The work item in question
*/
#define work_pending(work) \
- test_bit(WORK_STRUCT_PENDING, work_data_bits(work))
+ test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))

/**
* delayed_work_pending - Find out whether a delayable work item is currently
@@ -172,7 +174,7 @@ struct execute_work {
* @work: The work item in question
*/
#define work_clear_pending(work) \
- clear_bit(WORK_STRUCT_PENDING, work_data_bits(work))
+ clear_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))

enum {
WQ_FREEZEABLE = 1 << 0, /* freeze during suspend */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 28de966..1678dd1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -124,8 +124,8 @@ static inline void set_wq_data(struct work_struct *work,
unsigned long extra_flags)
{
BUG_ON(!work_pending(work));
- atomic_long_set(&work->data, (unsigned long)cwq |
- (1UL << WORK_STRUCT_PENDING) | extra_flags);
+ atomic_long_set(&work->data,
+ (unsigned long)cwq | WORK_STRUCT_PENDING | extra_flags);
}

static inline
@@ -230,7 +230,7 @@ queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work)
{
int ret = 0;

- if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
+ if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
__queue_work(cpu, wq, work);
ret = 1;
}
@@ -280,7 +280,7 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
struct timer_list *timer = &dwork->timer;
struct work_struct *work = &dwork->work;

- if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
+ if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
BUG_ON(timer_pending(timer));
BUG_ON(!list_empty(&work->entry));

@@ -412,7 +412,7 @@ static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
struct wq_barrier *barr, struct list_head *head)
{
INIT_WORK(&barr->work, wq_barrier_func);
- __set_bit(WORK_STRUCT_PENDING, work_data_bits(&barr->work));
+ __set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&barr->work));
init_completion(&barr->done);

insert_work(cwq, &barr->work, head, 0);
@@ -519,7 +519,7 @@ static int try_to_grab_pending(struct work_struct *work)
struct cpu_workqueue_struct *cwq;
int ret = -1;

- if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work)))
+ if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))
return 0;

/*
--
1.6.4.2

2009-10-01 08:14:52

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 12/19] workqueue: separate out process_one_work()

Separate process_one_work() out of run_workqueue(). This patch
doesn't cause any behavior change.
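
For reference, the pattern process_one_work() encapsulates is "pop a
work under cwq->lock, drop the lock around the callback, regrab it
afterwards". A rough userspace sketch of just that pattern, with a
pthread mutex standing in for cwq->lock (illustrative only; fake_work
and process_one() are made up):

  #include <pthread.h>
  #include <stdio.h>

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;    /* ~ cwq->lock */

  struct fake_work {
          void (*func)(struct fake_work *);
  };

  /* called with the lock held; released and regrabbed around the callback */
  static void process_one(struct fake_work *work)
  {
          pthread_mutex_unlock(&lock);
          work->func(work);               /* may sleep, take other locks, ... */
          pthread_mutex_lock(&lock);
  }

  static void hello(struct fake_work *w) { printf("work %p ran\n", (void *)w); }

  int main(void)
  {
          struct fake_work w = { .func = hello };

          pthread_mutex_lock(&lock);
          process_one(&w);
          pthread_mutex_unlock(&lock);
          return 0;
  }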

NOT_SIGNED_OFF_YET
---
kernel/workqueue.c | 98 ++++++++++++++++++++++++++++++++--------------------
1 files changed, 60 insertions(+), 38 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1678dd1..0f94eb5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -302,50 +302,72 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
}
EXPORT_SYMBOL_GPL(queue_delayed_work_on);

+/**
+ * process_one_work - process single work
+ * @cwq: cwq to process work for
+ * @work: work to process
+ *
+ * Process @work. This function contains all the logics necessary to
+ * process a single work including synchronization against and
+ * interaction with other workers on the same cpu, queueing and
+ * flushing. As long as context requirement is met, any worker can
+ * call this function to process a work.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock) which is released and regrabbed.
+ */
+static void process_one_work(struct cpu_workqueue_struct *cwq,
+ struct work_struct *work)
+{
+ work_func_t f = work->func;
+#ifdef CONFIG_LOCKDEP
+ /*
+ * It is permissible to free the struct work_struct from
+ * inside the function that is called from it, this we need to
+ * take into account for lockdep too. To avoid bogus "held
+ * lock freed" warnings as well as problems when looking into
+ * work->lockdep_map, make a copy and use that here.
+ */
+ struct lockdep_map lockdep_map = work->lockdep_map;
+#endif
+ /* claim and process */
+ trace_workqueue_execution(cwq->thread, work);
+ cwq->current_work = work;
+ list_del_init(&work->entry);
+
+ spin_unlock_irq(&cwq->lock);
+
+ BUG_ON(get_wq_data(work) != cwq);
+ work_clear_pending(work);
+ lock_map_acquire(&cwq->wq->lockdep_map);
+ lock_map_acquire(&lockdep_map);
+ f(work);
+ lock_map_release(&lockdep_map);
+ lock_map_release(&cwq->wq->lockdep_map);
+
+ if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
+ printk(KERN_ERR "BUG: workqueue leaked lock or atomic: "
+ "%s/0x%08x/%d\n",
+ current->comm, preempt_count(), task_pid_nr(current));
+ printk(KERN_ERR " last function: ");
+ print_symbol("%s\n", (unsigned long)f);
+ debug_show_held_locks(current);
+ dump_stack();
+ }
+
+ spin_lock_irq(&cwq->lock);
+
+ /* we're done with it, release */
+ cwq->current_work = NULL;
+}
+
static void run_workqueue(struct cpu_workqueue_struct *cwq)
{
spin_lock_irq(&cwq->lock);
while (!list_empty(&cwq->worklist)) {
struct work_struct *work = list_entry(cwq->worklist.next,
struct work_struct, entry);
- work_func_t f = work->func;
-#ifdef CONFIG_LOCKDEP
- /*
- * It is permissible to free the struct work_struct
- * from inside the function that is called from it,
- * this we need to take into account for lockdep too.
- * To avoid bogus "held lock freed" warnings as well
- * as problems when looking into work->lockdep_map,
- * make a copy and use that here.
- */
- struct lockdep_map lockdep_map = work->lockdep_map;
-#endif
- trace_workqueue_execution(cwq->thread, work);
- cwq->current_work = work;
- list_del_init(cwq->worklist.next);
- spin_unlock_irq(&cwq->lock);
-
- BUG_ON(get_wq_data(work) != cwq);
- work_clear_pending(work);
- lock_map_acquire(&cwq->wq->lockdep_map);
- lock_map_acquire(&lockdep_map);
- f(work);
- lock_map_release(&lockdep_map);
- lock_map_release(&cwq->wq->lockdep_map);
-
- if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
- printk(KERN_ERR "BUG: workqueue leaked lock or atomic: "
- "%s/0x%08x/%d\n",
- current->comm, preempt_count(),
- task_pid_nr(current));
- printk(KERN_ERR " last function: ");
- print_symbol("%s\n", (unsigned long)f);
- debug_show_held_locks(current);
- dump_stack();
- }
-
- spin_lock_irq(&cwq->lock);
- cwq->current_work = NULL;
+ process_one_work(cwq, work);
}
spin_unlock_irq(&cwq->lock);
}
--
1.6.4.2

2009-10-01 08:14:49

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 13/19] workqueue: temporarily disable workqueue tracing

Strip the tracing code from workqueue and disable workqueue tracing.
This is a temporary measure until concurrency managed workqueue is
complete.

NOT_SIGNED_OFF_YET
---
kernel/trace/Kconfig | 4 +++-
kernel/workqueue.c | 14 +++-----------
2 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index b416512..0e14ecf 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -393,7 +393,9 @@ config KMEMTRACE
If unsure, say N.

config WORKQUEUE_TRACER
- bool "Trace workqueues"
+# Temporarily disabled during workqueue reimplementation
+# bool "Trace workqueues"
+ def_bool n
select GENERIC_TRACER
help
The workqueue tracer provides some statistical informations
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0f94eb5..39a04ec 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -33,8 +33,6 @@
#include <linux/kallsyms.h>
#include <linux/debug_locks.h>
#include <linux/lockdep.h>
-#define CREATE_TRACE_POINTS
-#include <trace/events/workqueue.h>

#include "sched_workqueue.h"

@@ -128,10 +126,10 @@ static inline void set_wq_data(struct work_struct *work,
(unsigned long)cwq | WORK_STRUCT_PENDING | extra_flags);
}

-static inline
-struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
+static inline struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
{
- return (void *) (atomic_long_read(&work->data) & WORK_STRUCT_WQ_DATA_MASK);
+ return (void *)(atomic_long_read(&work->data) &
+ WORK_STRUCT_WQ_DATA_MASK);
}

/*
@@ -166,8 +164,6 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
struct work_struct *work, struct list_head *head,
unsigned int extra_flags)
{
- trace_workqueue_insertion(cwq->thread, work);
-
/* we own @work, set data and link */
set_wq_data(work, cwq, extra_flags);

@@ -331,7 +327,6 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
struct lockdep_map lockdep_map = work->lockdep_map;
#endif
/* claim and process */
- trace_workqueue_execution(cwq->thread, work);
cwq->current_work = work;
list_del_init(&work->entry);

@@ -838,8 +833,6 @@ static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
return PTR_ERR(p);
cwq->thread = p;

- trace_workqueue_creation(cwq->thread, cpu);
-
return 0;
}

@@ -944,7 +937,6 @@ static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq)
* checks list_empty(), and a "normal" queue_work() can't use
* a dead CPU.
*/
- trace_workqueue_destruction(cwq->thread);
kthread_stop(cwq->thread);
cwq->thread = NULL;
}
--
1.6.4.2

2009-10-01 08:10:39

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 14/19] workqueue: (TEMPORARY) kill singlethread variant

This is incorrect. There are workqueue users which depend on a single
thread for synchronization purposes. I'm working on a proper solution.

NOT_SIGNED_OFF
---
include/linux/workqueue.h | 5 +-
kernel/workqueue.c | 128 ++++++++++++---------------------------------
2 files changed, 36 insertions(+), 97 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 541c5eb..5aa0e15 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -178,7 +178,6 @@ struct execute_work {

enum {
WQ_FREEZEABLE = 1 << 0, /* freeze during suspend */
- WQ_SINGLE_THREAD = 1 << 1, /* no per-cpu worker */
};

extern struct workqueue_struct *
@@ -207,9 +206,9 @@ __create_workqueue_key(const char *name, unsigned int flags,
#define create_workqueue(name) \
__create_workqueue((name), 0)
#define create_freezeable_workqueue(name) \
- __create_workqueue((name), WQ_FREEZEABLE | WQ_SINGLE_THREAD)
+ __create_workqueue((name), WQ_FREEZEABLE)
#define create_singlethread_workqueue(name) \
- __create_workqueue((name), WQ_SINGLE_THREAD)
+ __create_workqueue((name), 0)

extern void destroy_workqueue(struct workqueue_struct *wq);

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 39a04ec..6370c9b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -47,8 +47,7 @@
*/

/*
- * The per-CPU workqueue (if single thread, we always use the first
- * possible cpu). The lower WORK_STRUCT_FLAG_BITS of
+ * The per-CPU workqueue. The lower WORK_STRUCT_FLAG_BITS of
* work_struct->data are used for flags and thus cwqs need to be
* aligned at two's power of the number of flag bits.
*/
@@ -82,34 +81,9 @@ struct workqueue_struct {
static DEFINE_SPINLOCK(workqueue_lock);
static LIST_HEAD(workqueues);

-static int singlethread_cpu __read_mostly;
-static const struct cpumask *cpu_singlethread_map __read_mostly;
-/*
- * _cpu_down() first removes CPU from cpu_online_map, then CPU_DEAD
- * flushes cwq->worklist. This means that flush_workqueue/wait_on_work
- * which comes in between can't use for_each_online_cpu(). We could
- * use cpu_possible_map, the cpumask below is more a documentation
- * than optimization.
- */
-static cpumask_var_t cpu_populated_map __read_mostly;
-
-/* If it's single threaded, it isn't in the list of workqueues. */
-static inline bool is_wq_single_threaded(struct workqueue_struct *wq)
-{
- return wq->flags & WQ_SINGLE_THREAD;
-}
-
-static const struct cpumask *wq_cpu_map(struct workqueue_struct *wq)
-{
- return is_wq_single_threaded(wq)
- ? cpu_singlethread_map : cpu_populated_map;
-}
-
static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
struct workqueue_struct *wq)
{
- if (unlikely(is_wq_single_threaded(wq)))
- cpu = singlethread_cpu;
return per_cpu_ptr(wq->cpu_wq, cpu);
}

@@ -467,13 +441,12 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
*/
void flush_workqueue(struct workqueue_struct *wq)
{
- const struct cpumask *cpu_map = wq_cpu_map(wq);
int cpu;

might_sleep();
lock_map_acquire(&wq->lockdep_map);
lock_map_release(&wq->lockdep_map);
- for_each_cpu(cpu, cpu_map)
+ for_each_possible_cpu(cpu)
flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
}
EXPORT_SYMBOL_GPL(flush_workqueue);
@@ -587,7 +560,6 @@ static void wait_on_work(struct work_struct *work)
{
struct cpu_workqueue_struct *cwq;
struct workqueue_struct *wq;
- const struct cpumask *cpu_map;
int cpu;

might_sleep();
@@ -600,9 +572,8 @@ static void wait_on_work(struct work_struct *work)
return;

wq = cwq->wq;
- cpu_map = wq_cpu_map(wq);

- for_each_cpu(cpu, cpu_map)
+ for_each_possible_cpu(cpu)
wait_on_cpu_work(get_cwq(cpu, wq), work);
}

@@ -801,26 +772,12 @@ int current_is_keventd(void)
return is_sched_workqueue(current);
}

-static struct cpu_workqueue_struct *
-init_cpu_workqueue(struct workqueue_struct *wq, int cpu)
-{
- struct cpu_workqueue_struct *cwq = per_cpu_ptr(wq->cpu_wq, cpu);
-
- cwq->wq = wq;
- spin_lock_init(&cwq->lock);
- INIT_LIST_HEAD(&cwq->worklist);
- init_waitqueue_head(&cwq->more_work);
-
- return cwq;
-}
-
static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
{
struct workqueue_struct *wq = cwq->wq;
- const char *fmt = is_wq_single_threaded(wq) ? "%s" : "%s/%d";
struct task_struct *p;

- p = kthread_create(worker_thread, cwq, fmt, wq->name, cpu);
+ p = kthread_create(worker_thread, cwq, "%s/%d", wq->name, cpu);
/*
* Nobody can add the work_struct to this cwq,
* if (caller is __create_workqueue)
@@ -853,7 +810,6 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
const char *lock_name)
{
struct workqueue_struct *wq;
- struct cpu_workqueue_struct *cwq;
int err = 0, cpu;

wq = kzalloc(sizeof(*wq), GFP_KERNEL);
@@ -869,36 +825,36 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
INIT_LIST_HEAD(&wq->list);

- if (flags & WQ_SINGLE_THREAD) {
- cwq = init_cpu_workqueue(wq, singlethread_cpu);
- err = create_workqueue_thread(cwq, singlethread_cpu);
- start_workqueue_thread(cwq, -1);
- } else {
- cpu_maps_update_begin();
- /*
- * We must place this wq on list even if the code below fails.
- * cpu_down(cpu) can remove cpu from cpu_populated_map before
- * destroy_workqueue() takes the lock, in that case we leak
- * cwq[cpu]->thread.
- */
- spin_lock(&workqueue_lock);
- list_add(&wq->list, &workqueues);
- spin_unlock(&workqueue_lock);
- /*
- * We must initialize cwqs for each possible cpu even if we
- * are going to call destroy_workqueue() finally. Otherwise
- * cpu_up() can hit the uninitialized cwq once we drop the
- * lock.
- */
- for_each_possible_cpu(cpu) {
- cwq = init_cpu_workqueue(wq, cpu);
- if (err || !cpu_online(cpu))
- continue;
- err = create_workqueue_thread(cwq, cpu);
- start_workqueue_thread(cwq, cpu);
- }
- cpu_maps_update_done();
+ cpu_maps_update_begin();
+ /*
+ * We must place this wq on list even if the code below fails.
+ * cpu_down(cpu) can remove cpu from cpu_populated_map before
+ * destroy_workqueue() takes the lock, in that case we leak
+ * cwq[cpu]->thread.
+ */
+ spin_lock(&workqueue_lock);
+ list_add(&wq->list, &workqueues);
+ spin_unlock(&workqueue_lock);
+ /*
+ * We must initialize cwqs for each possible cpu even if we
+ * are going to call destroy_workqueue() finally. Otherwise
+ * cpu_up() can hit the uninitialized cwq once we drop the
+ * lock.
+ */
+ for_each_possible_cpu(cpu) {
+ struct cpu_workqueue_struct *cwq = per_cpu_ptr(wq->cpu_wq, cpu);
+
+ cwq->wq = wq;
+ spin_lock_init(&cwq->lock);
+ INIT_LIST_HEAD(&cwq->worklist);
+ init_waitqueue_head(&cwq->more_work);
+
+ if (err || !cpu_online(cpu))
+ continue;
+ err = create_workqueue_thread(cwq, cpu);
+ start_workqueue_thread(cwq, cpu);
}
+ cpu_maps_update_done();

if (err) {
destroy_workqueue(wq);
@@ -949,7 +905,6 @@ static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq)
*/
void destroy_workqueue(struct workqueue_struct *wq)
{
- const struct cpumask *cpu_map = wq_cpu_map(wq);
int cpu;

cpu_maps_update_begin();
@@ -957,7 +912,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
list_del(&wq->list);
spin_unlock(&workqueue_lock);

- for_each_cpu(cpu, cpu_map)
+ for_each_possible_cpu(cpu)
cleanup_workqueue_thread(per_cpu_ptr(wq->cpu_wq, cpu));
cpu_maps_update_done();

@@ -977,10 +932,6 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,

action &= ~CPU_TASKS_FROZEN;

- switch (action) {
- case CPU_UP_PREPARE:
- cpumask_set_cpu(cpu, cpu_populated_map);
- }
undo:
list_for_each_entry(wq, &workqueues, list) {
cwq = per_cpu_ptr(wq->cpu_wq, cpu);
@@ -1007,12 +958,6 @@ undo:
}
}

- switch (action) {
- case CPU_UP_CANCELED:
- case CPU_POST_DEAD:
- cpumask_clear_cpu(cpu, cpu_populated_map);
- }
-
return ret;
}

@@ -1074,11 +1019,6 @@ void __init init_workqueues(void)
BUILD_BUG_ON(__alignof__(struct cpu_workqueue_struct) <
__alignof__(unsigned long long));

- alloc_cpumask_var(&cpu_populated_map, GFP_KERNEL);
-
- cpumask_copy(cpu_populated_map, cpu_online_mask);
- singlethread_cpu = cpumask_first(cpu_possible_mask);
- cpu_singlethread_map = cpumask_of(singlethread_cpu);
hotcpu_notifier(workqueue_cpu_callback, 0);
keventd_wq = create_workqueue("events");
BUG_ON(!keventd_wq);
--
1.6.4.2

2009-10-01 08:10:48

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 15/19] workqueue: reimplement workqueue flushing using color coded works

Reimplement workqueue flushing using color coded works. There are two
colors and each cwq has a current color which is painted on the works
being issued via the cwq. Flushing a workqueue is achieved by flipping
the current color of each cwq and waiting for the works which carry
the old color to drain. This new implementation is to allow having and
sharing multiple workers per cpu. One restriction of this
implementation is that only a single workqueue flush can be in
progress at any given time. If one is in progress, others have to
wait for their turn.

This new flush implementation leaves only cleanup_workqueue_thread()
as the user of flush_cpu_workqueue(). Just make its users use
flush_workqueue() and kthread_stop() directly and kill
cleanup_workqueue_thread(). As workqueue flushing doesn't use barrier
requests anymore, the comment describing the complex synchronization
around it in cleanup_workqueue_thread() is removed together with the
function.
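
To make the color scheme concrete, here is a sequential userspace
model (illustrative only; it covers a single cwq and leaves out
locking, nr_cwqs_to_flush and the completion): works remember the
color they were queued with, a flush flips the cwq color and counts
how many old-color works are still in flight, and the flush is done
once that count drops to zero.

  #include <stdio.h>

  struct fake_cwq {
          unsigned int flush_color;       /* color painted on newly queued works */
          int nr_in_flight;               /* queued or running works */
          int flush_cnt;                  /* old-color works the flush waits for */
  };

  static unsigned int queue(struct fake_cwq *cwq)
  {
          cwq->nr_in_flight++;
          return cwq->flush_color;        /* color recorded in work->data */
  }

  static void retire(struct fake_cwq *cwq, unsigned int work_color)
  {
          cwq->nr_in_flight--;
          /* only works carrying the old color count towards the flush */
          if (cwq->flush_cnt && (work_color ^ cwq->flush_color))
                  if (!--cwq->flush_cnt)
                          printf("flush complete\n");
  }

  int main(void)
  {
          struct fake_cwq cwq = { 0 };
          unsigned int c1 = queue(&cwq), c2 = queue(&cwq);

          /* flush_workqueue(): flip the color, remember how many to wait for */
          cwq.flush_color ^= 1;
          cwq.flush_cnt = cwq.nr_in_flight;

          unsigned int c3 = queue(&cwq);  /* new color, doesn't delay the flush */
          retire(&cwq, c1);
          retire(&cwq, c2);               /* "flush complete" printed here */
          retire(&cwq, c3);
          return 0;
  }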

NOT_SIGNED_OFF_YET
---
include/linux/workqueue.h | 2 +
kernel/workqueue.c | 151 ++++++++++++++++++++++++++++-----------------
2 files changed, 97 insertions(+), 56 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 5aa0e15..78fd6eb 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -24,8 +24,10 @@ typedef void (*work_func_t)(struct work_struct *work);

enum {
WORK_STRUCT_PENDING_BIT = 0, /* work item is pending execution */
+ WORK_STRUCT_COLOR_BIT = 1, /* color for workqueue flushing */

WORK_STRUCT_PENDING = 1 << WORK_STRUCT_PENDING_BIT,
+ WORK_STRUCT_COLOR = 1 << WORK_STRUCT_COLOR_BIT,

/*
* Reserve 3bits off of cwq pointer. This is enough and
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6370c9b..269f6c5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -59,6 +59,9 @@ struct cpu_workqueue_struct {
wait_queue_head_t more_work;
struct work_struct *current_work;

+ int nr_in_flight; /* L: nr of in_flight works */
+ unsigned int flush_color; /* L: current flush color */
+ int flush_cnt; /* L: in-progress flush count */
struct workqueue_struct *wq; /* I: the owning workqueue */
struct task_struct *thread;
} __attribute__((aligned(1 << WORK_STRUCT_FLAG_BITS)));
@@ -71,6 +74,11 @@ struct workqueue_struct {
unsigned int flags; /* I: WQ_* flags */
struct cpu_workqueue_struct *cpu_wq; /* I: cwq's */
struct list_head list; /* W: list of all workqueues */
+
+ struct mutex flush_mutex; /* single flush at a time */
+ atomic_t nr_cwqs_to_flush; /* flush in progress */
+ struct completion *flush_done; /* flush done */
+
const char *name; /* I: workqueue name */
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
@@ -138,8 +146,10 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
struct work_struct *work, struct list_head *head,
unsigned int extra_flags)
{
+ cwq->nr_in_flight++;
+
/* we own @work, set data and link */
- set_wq_data(work, cwq, extra_flags);
+ set_wq_data(work, cwq, cwq->flush_color | extra_flags);

/*
* Ensure that we get the right work->data if we see the
@@ -273,6 +283,28 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
EXPORT_SYMBOL_GPL(queue_delayed_work_on);

/**
+ * cwq_dec_nr_in_flight - decrement cwq's nr_in_flight
+ * @cwq: cwq of interest
+ * @work_color: color of work which left the queue
+ *
+ * A work either has completed or is removed from pending queue,
+ * decrement nr_in_flight of its cwq and handle workqueue flushing.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
+static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq,
+ unsigned int work_color)
+{
+ cwq->nr_in_flight--;
+ if (unlikely(cwq->flush_cnt)) {
+ if (work_color ^ cwq->flush_color && !--cwq->flush_cnt &&
+ atomic_dec_and_test(&cwq->wq->nr_cwqs_to_flush))
+ complete(cwq->wq->flush_done);
+ }
+}
+
+/**
* process_one_work - process single work
* @cwq: cwq to process work for
* @work: work to process
@@ -290,6 +322,7 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
struct work_struct *work)
{
work_func_t f = work->func;
+ unsigned int work_color;
#ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the struct work_struct from
@@ -302,6 +335,7 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
#endif
/* claim and process */
cwq->current_work = work;
+ work_color = *work_data_bits(work) & WORK_STRUCT_COLOR;
list_del_init(&work->entry);

spin_unlock_irq(&cwq->lock);
@@ -328,6 +362,7 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,

/* we're done with it, release */
cwq->current_work = NULL;
+ cwq_dec_nr_in_flight(cwq, work_color);
}

static void run_workqueue(struct cpu_workqueue_struct *cwq)
@@ -409,26 +444,6 @@ static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
insert_work(cwq, &barr->work, head, 0);
}

-static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
-{
- int active = 0;
- struct wq_barrier barr;
-
- WARN_ON(cwq->thread == current);
-
- spin_lock_irq(&cwq->lock);
- if (!list_empty(&cwq->worklist) || cwq->current_work != NULL) {
- insert_wq_barrier(cwq, &barr, &cwq->worklist);
- active = 1;
- }
- spin_unlock_irq(&cwq->lock);
-
- if (active)
- wait_for_completion(&barr.done);
-
- return active;
-}
-
/**
* flush_workqueue - ensure that any scheduled work has run to completion.
* @wq: workqueue to flush
@@ -441,13 +456,44 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
*/
void flush_workqueue(struct workqueue_struct *wq)
{
- int cpu;
+ DECLARE_COMPLETION_ONSTACK(flush_done);
+ bool wait = false;
+ unsigned int cpu;

- might_sleep();
lock_map_acquire(&wq->lockdep_map);
lock_map_release(&wq->lockdep_map);
- for_each_possible_cpu(cpu)
- flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
+
+ /* only single flush can be in progress at any given time */
+ mutex_lock(&wq->flush_mutex);
+
+ BUG_ON(atomic_read(&wq->nr_cwqs_to_flush) || wq->flush_done);
+
+ wq->flush_done = &flush_done;
+
+ for_each_possible_cpu(cpu) {
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+ spin_lock_irq(&cwq->lock);
+
+ BUG_ON(cwq->flush_cnt);
+
+ cwq->flush_color ^= WORK_STRUCT_COLOR;
+ cwq->flush_cnt = cwq->nr_in_flight;
+
+ if (cwq->flush_cnt) {
+ atomic_inc(&wq->nr_cwqs_to_flush);
+ wait = true;
+ }
+
+ spin_unlock_irq(&cwq->lock);
+ }
+
+ if (wait)
+ wait_for_completion(&flush_done);
+
+ wq->flush_done = NULL;
+
+ mutex_unlock(&wq->flush_mutex);
}
EXPORT_SYMBOL_GPL(flush_workqueue);

@@ -531,6 +577,8 @@ static int try_to_grab_pending(struct work_struct *work)
smp_rmb();
if (cwq == get_wq_data(work)) {
list_del_init(&work->entry);
+ cwq_dec_nr_in_flight(cwq,
+ *work_data_bits(work) & WORK_STRUCT_COLOR);
ret = 1;
}
}
@@ -821,6 +869,8 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
goto err;

wq->flags = flags;
+ mutex_init(&wq->flush_mutex);
+ atomic_set(&wq->nr_cwqs_to_flush, 0);
wq->name = name;
lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
INIT_LIST_HEAD(&wq->list);
@@ -842,7 +892,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
* lock.
*/
for_each_possible_cpu(cpu) {
- struct cpu_workqueue_struct *cwq = per_cpu_ptr(wq->cpu_wq, cpu);
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);

cwq->wq = wq;
spin_lock_init(&cwq->lock);
@@ -870,33 +920,6 @@ err:
}
EXPORT_SYMBOL_GPL(__create_workqueue_key);

-static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq)
-{
- /*
- * Our caller is either destroy_workqueue() or CPU_POST_DEAD,
- * cpu_add_remove_lock protects cwq->thread.
- */
- if (cwq->thread == NULL)
- return;
-
- lock_map_acquire(&cwq->wq->lockdep_map);
- lock_map_release(&cwq->wq->lockdep_map);
-
- flush_cpu_workqueue(cwq);
- /*
- * If the caller is CPU_POST_DEAD and cwq->worklist was not empty,
- * a concurrent flush_workqueue() can insert a barrier after us.
- * However, in that case run_workqueue() won't return and check
- * kthread_should_stop() until it flushes all work_struct's.
- * When ->worklist becomes empty it is safe to exit because no
- * more work_structs can be queued on this cwq: flush_workqueue
- * checks list_empty(), and a "normal" queue_work() can't use
- * a dead CPU.
- */
- kthread_stop(cwq->thread);
- cwq->thread = NULL;
-}
-
/**
* destroy_workqueue - safely terminate a workqueue
* @wq: target workqueue
@@ -912,8 +935,19 @@ void destroy_workqueue(struct workqueue_struct *wq)
list_del(&wq->list);
spin_unlock(&workqueue_lock);

- for_each_possible_cpu(cpu)
- cleanup_workqueue_thread(per_cpu_ptr(wq->cpu_wq, cpu));
+ flush_workqueue(wq);
+
+ for_each_possible_cpu(cpu) {
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+ /* cpu_add_remove_lock protects cwq->thread */
+ if (cwq->thread) {
+ kthread_stop(cwq->thread);
+ cwq->thread = NULL;
+ }
+ BUG_ON(cwq->nr_in_flight);
+ }
+
cpu_maps_update_done();

free_percpu(wq->cpu_wq);
@@ -953,7 +987,12 @@ undo:
case CPU_UP_CANCELED:
start_workqueue_thread(cwq, -1);
case CPU_POST_DEAD:
- cleanup_workqueue_thread(cwq);
+ flush_workqueue(wq);
+ /* cpu_add_remove_lock protects cwq->thread */
+ if (cwq->thread) {
+ kthread_stop(cwq->thread);
+ cwq->thread = NULL;
+ }
break;
}
}
--
1.6.4.2

2009-10-01 08:13:40

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 16/19] workqueue: introduce worker

Separate out worker thread related information from struct
cpu_workqueue_struct into a new struct worker and implement helper
functions to deal with it. This patch only shifts things around
without any actual behavior change. This is in preparation for
concurrency managed workqueue, where multiple shared workers will be
available per cpu.
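
As a rough userspace sketch of the lifecycle the new helpers provide
(illustrative only, not the kernel API; a pthread and an atomic flag
stand in for the kthread and kthread_stop()/kthread_should_stop(), and
the fake_* names are made up):

  #include <pthread.h>
  #include <sched.h>
  #include <stdatomic.h>
  #include <stdio.h>
  #include <stdlib.h>

  struct fake_worker {
          pthread_t task;                 /* ~ worker->task */
          atomic_bool should_stop;        /* ~ kthread_should_stop() */
  };

  static void *worker_fn(void *arg)
  {
          struct fake_worker *worker = arg;

          while (!atomic_load(&worker->should_stop))
                  sched_yield();          /* ~ sleep on cwq->more_work, run works */
          return NULL;
  }

  /* ~ alloc_worker() + create_worker() + start_worker() rolled into one */
  static struct fake_worker *fake_create_worker(void)
  {
          struct fake_worker *worker = calloc(1, sizeof(*worker));

          if (worker && pthread_create(&worker->task, NULL, worker_fn, worker)) {
                  free(worker);
                  return NULL;
          }
          return worker;
  }

  static void fake_destroy_worker(struct fake_worker *worker)
  {
          atomic_store(&worker->should_stop, true);
          pthread_join(worker->task, NULL);       /* ~ kthread_stop() */
          free(worker);
  }

  int main(void)
  {
          struct fake_worker *worker = fake_create_worker();

          if (worker)
                  fake_destroy_worker(worker);
          return 0;
  }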

NOT_SIGNED_OFF_YET
---
kernel/workqueue.c | 186 +++++++++++++++++++++++++++++++++++-----------------
1 files changed, 125 insertions(+), 61 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 269f6c5..f10fe4a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -46,6 +46,14 @@
* W: workqueue_lock protected.
*/

+struct cpu_workqueue_struct;
+
+struct worker {
+ struct work_struct *current_work; /* L: work being processed */
+ struct task_struct *task; /* I: worker task */
+ struct cpu_workqueue_struct *cwq; /* I: the associated cwq */
+};
+
/*
* The per-CPU workqueue. The lower WORK_STRUCT_FLAG_BITS of
* work_struct->data are used for flags and thus cwqs need to be
@@ -57,13 +65,13 @@ struct cpu_workqueue_struct {

struct list_head worklist;
wait_queue_head_t more_work;
- struct work_struct *current_work;
+ unsigned int cpu;
+ struct worker *worker;

int nr_in_flight; /* L: nr of in_flight works */
unsigned int flush_color; /* L: current flush color */
int flush_cnt; /* L: in-progress flush count */
struct workqueue_struct *wq; /* I: the owning workqueue */
- struct task_struct *thread;
} __attribute__((aligned(1 << WORK_STRUCT_FLAG_BITS)));

/*
@@ -89,6 +97,8 @@ struct workqueue_struct {
static DEFINE_SPINLOCK(workqueue_lock);
static LIST_HEAD(workqueues);

+static int worker_thread(void *__worker);
+
static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
struct workqueue_struct *wq)
{
@@ -282,6 +292,82 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
}
EXPORT_SYMBOL_GPL(queue_delayed_work_on);

+static struct worker *alloc_worker(void)
+{
+ struct worker *worker;
+
+ worker = kzalloc(sizeof(*worker), GFP_KERNEL);
+ return worker;
+}
+
+/**
+ * create_worker - create a new workqueue worker
+ * @cwq: cwq the new worker will belong to
+ * @bind: whether to set affinity to @cpu or not
+ *
+ * Create a new worker which is bound to @cwq. The returned worker
+ * can be started by calling start_worker() or destroyed using
+ * destroy_worker().
+ *
+ * CONTEXT:
+ * Might sleep. Does GFP_KERNEL allocations.
+ *
+ * RETURNS:
+ * Pointer to the newly created worker.
+ */
+static struct worker *create_worker(struct cpu_workqueue_struct *cwq, bool bind)
+{
+ struct worker *worker;
+
+ worker = alloc_worker();
+ if (!worker)
+ goto fail;
+
+ worker->cwq = cwq;
+
+ worker->task = kthread_create(worker_thread, worker, "kworker/%u",
+ cwq->cpu);
+ if (IS_ERR(worker->task))
+ goto fail;
+
+ if (bind)
+ kthread_bind(worker->task, cwq->cpu);
+
+ return worker;
+fail:
+ kfree(worker);
+ return NULL;
+}
+
+/**
+ * start_worker - start a newly created worker
+ * @worker: worker to start
+ *
+ * Start @worker.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
+static void start_worker(struct worker *worker)
+{
+ wake_up_process(worker->task);
+}
+
+/**
+ * destroy_worker - destroy a workqueue worker
+ * @worker: worker to be destroyed
+ *
+ * Destroy @worker.
+ */
+static void destroy_worker(struct worker *worker)
+{
+ /* sanity check frenzy */
+ BUG_ON(worker->current_work);
+
+ kthread_stop(worker->task);
+ kfree(worker);
+}
+
/**
* cwq_dec_nr_in_flight - decrement cwq's nr_in_flight
* @cwq: cwq of interest
@@ -306,7 +392,7 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq,

/**
* process_one_work - process single work
- * @cwq: cwq to process work for
+ * @worker: self
* @work: work to process
*
* Process @work. This function contains all the logics necessary to
@@ -318,9 +404,9 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq,
* CONTEXT:
* spin_lock_irq(cwq->lock) which is released and regrabbed.
*/
-static void process_one_work(struct cpu_workqueue_struct *cwq,
- struct work_struct *work)
+static void process_one_work(struct worker *worker, struct work_struct *work)
{
+ struct cpu_workqueue_struct *cwq = worker->cwq;
work_func_t f = work->func;
unsigned int work_color;
#ifdef CONFIG_LOCKDEP
@@ -334,7 +420,7 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
struct lockdep_map lockdep_map = work->lockdep_map;
#endif
/* claim and process */
- cwq->current_work = work;
+ worker->current_work = work;
work_color = *work_data_bits(work) & WORK_STRUCT_COLOR;
list_del_init(&work->entry);

@@ -361,30 +447,33 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
spin_lock_irq(&cwq->lock);

/* we're done with it, release */
- cwq->current_work = NULL;
+ worker->current_work = NULL;
cwq_dec_nr_in_flight(cwq, work_color);
}

-static void run_workqueue(struct cpu_workqueue_struct *cwq)
+static void run_workqueue(struct worker *worker)
{
+ struct cpu_workqueue_struct *cwq = worker->cwq;
+
spin_lock_irq(&cwq->lock);
while (!list_empty(&cwq->worklist)) {
struct work_struct *work = list_entry(cwq->worklist.next,
struct work_struct, entry);
- process_one_work(cwq, work);
+ process_one_work(worker, work);
}
spin_unlock_irq(&cwq->lock);
}

/**
* worker_thread - the worker thread function
- * @__cwq: cwq to serve
+ * @__worker: self
*
* The cwq worker thread function.
*/
-static int worker_thread(void *__cwq)
+static int worker_thread(void *__worker)
{
- struct cpu_workqueue_struct *cwq = __cwq;
+ struct worker *worker = __worker;
+ struct cpu_workqueue_struct *cwq = worker->cwq;
DEFINE_WAIT(wait);

/* set workqueue scheduler */
@@ -406,7 +495,7 @@ static int worker_thread(void *__cwq)
if (kthread_should_stop())
break;

- run_workqueue(cwq);
+ run_workqueue(worker);
}

return 0;
@@ -532,7 +621,7 @@ int flush_work(struct work_struct *work)
goto already_gone;
prev = &work->entry;
} else {
- if (cwq->current_work != work)
+ if (!cwq->worker || cwq->worker->current_work != work)
goto already_gone;
prev = &cwq->worklist;
}
@@ -594,7 +683,7 @@ static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq,
int running = 0;

spin_lock_irq(&cwq->lock);
- if (unlikely(cwq->current_work == work)) {
+ if (unlikely(cwq->worker && cwq->worker->current_work == work)) {
insert_wq_barrier(cwq, &barr, cwq->worklist.next);
running = 1;
}
@@ -820,45 +909,14 @@ int current_is_keventd(void)
return is_sched_workqueue(current);
}

-static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
-{
- struct workqueue_struct *wq = cwq->wq;
- struct task_struct *p;
-
- p = kthread_create(worker_thread, cwq, "%s/%d", wq->name, cpu);
- /*
- * Nobody can add the work_struct to this cwq,
- * if (caller is __create_workqueue)
- * nobody should see this wq
- * else // caller is CPU_UP_PREPARE
- * cpu is not on cpu_online_map
- * so we can abort safely.
- */
- if (IS_ERR(p))
- return PTR_ERR(p);
- cwq->thread = p;
-
- return 0;
-}
-
-static void start_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
-{
- struct task_struct *p = cwq->thread;
-
- if (p != NULL) {
- if (cpu >= 0)
- kthread_bind(p, cpu);
- wake_up_process(p);
- }
-}
-
struct workqueue_struct *__create_workqueue_key(const char *name,
unsigned int flags,
struct lock_class_key *key,
const char *lock_name)
{
struct workqueue_struct *wq;
- int err = 0, cpu;
+ bool failed = false;
+ unsigned int cpu;

wq = kzalloc(sizeof(*wq), GFP_KERNEL);
if (!wq)
@@ -894,19 +952,23 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
for_each_possible_cpu(cpu) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);

+ cwq->cpu = cpu;
cwq->wq = wq;
spin_lock_init(&cwq->lock);
INIT_LIST_HEAD(&cwq->worklist);
init_waitqueue_head(&cwq->more_work);

- if (err || !cpu_online(cpu))
+ if (failed || !cpu_online(cpu))
continue;
- err = create_workqueue_thread(cwq, cpu);
- start_workqueue_thread(cwq, cpu);
+ cwq->worker = create_worker(cwq, true);
+ if (cwq->worker)
+ start_worker(cwq->worker);
+ else
+ failed = true;
}
cpu_maps_update_done();

- if (err) {
+ if (failed) {
destroy_workqueue(wq);
wq = NULL;
}
@@ -941,9 +1003,9 @@ void destroy_workqueue(struct workqueue_struct *wq)
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);

/* cpu_add_remove_lock protects cwq->thread */
- if (cwq->thread) {
- kthread_stop(cwq->thread);
- cwq->thread = NULL;
+ if (cwq->worker) {
+ destroy_worker(cwq->worker);
+ cwq->worker = NULL;
}
BUG_ON(cwq->nr_in_flight);
}
@@ -972,7 +1034,8 @@ undo:

switch (action) {
case CPU_UP_PREPARE:
- if (!create_workqueue_thread(cwq, cpu))
+ cwq->worker = create_worker(cwq, false);
+ if (cwq->worker)
break;
printk(KERN_ERR "workqueue [%s] for %i failed\n",
wq->name, cpu);
@@ -981,17 +1044,18 @@ undo:
goto undo;

case CPU_ONLINE:
- start_workqueue_thread(cwq, cpu);
+ kthread_bind(cwq->worker->task, cpu);
+ start_worker(cwq->worker);
break;

case CPU_UP_CANCELED:
- start_workqueue_thread(cwq, -1);
+ start_worker(cwq->worker);
case CPU_POST_DEAD:
flush_workqueue(wq);
/* cpu_add_remove_lock protects cwq->thread */
- if (cwq->thread) {
- kthread_stop(cwq->thread);
- cwq->thread = NULL;
+ if (cwq->worker) {
+ destroy_worker(cwq->worker);
+ cwq->worker = NULL;
}
break;
}
--
1.6.4.2

2009-10-01 08:12:04

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 17/19] workqueue: reimplement work flushing using linked works

A work is linked to the next one by having the WORK_STRUCT_LINKED bit
set, and these links can be chained. When a linked work is dispatched
to a worker, all linked works are dispatched to the worker's newly
added ->scheduled queue and processed back-to-back.

Currently, as there's only a single worker per cwq, having linked works
doesn't make any visible behavior difference. This change is to
prepare for multiple shared workers per cpu.
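
A minimal userspace sketch of the chaining rule (illustrative only; an
array stands in for the worklist): a chain starts at the dispatched
work and keeps going as long as the predecessor has WORK_STRUCT_LINKED
set, which is exactly what schedule_work_to_worker() walks.

  #include <stdbool.h>
  #include <stdio.h>

  struct fake_work {
          const char *name;
          bool linked;    /* WORK_STRUCT_LINKED: next work belongs to this chain */
  };

  int main(void)
  {
          /* w0 has LINKED set, so w1 is part of the same chain; w2 is not */
          struct fake_work worklist[] = {
                  { "w0", true }, { "w1", false }, { "w2", false },
          };
          int i = 0;

          /* move w0 and every follower whose predecessor had LINKED set */
          do {
                  printf("moved %s to worker->scheduled\n", worklist[i].name);
          } while (worklist[i++].linked);

          printf("next worklist entry: %s\n", worklist[i].name);
          return 0;
  }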

NOT_SIGNED_OFF_YET
---
include/linux/workqueue.h | 2 +
kernel/workqueue.c | 152 ++++++++++++++++++++++++++++++++++++++-------
2 files changed, 131 insertions(+), 23 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 78fd6eb..a6136ca 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -25,9 +25,11 @@ typedef void (*work_func_t)(struct work_struct *work);
enum {
WORK_STRUCT_PENDING_BIT = 0, /* work item is pending execution */
WORK_STRUCT_COLOR_BIT = 1, /* color for workqueue flushing */
+ WORK_STRUCT_LINKED_BIT = 2, /* next work is linked to this one */

WORK_STRUCT_PENDING = 1 << WORK_STRUCT_PENDING_BIT,
WORK_STRUCT_COLOR = 1 << WORK_STRUCT_COLOR_BIT,
+ WORK_STRUCT_LINKED = 1 << WORK_STRUCT_LINKED_BIT,

/*
* Reserve 3bits off of cwq pointer. This is enough and
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f10fe4a..e234604 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -50,6 +50,7 @@ struct cpu_workqueue_struct;

struct worker {
struct work_struct *current_work; /* L: work being processed */
+ struct list_head scheduled; /* L: scheduled works */
struct task_struct *task; /* I: worker task */
struct cpu_workqueue_struct *cwq; /* I: the associated cwq */
};
@@ -297,6 +298,8 @@ static struct worker *alloc_worker(void)
struct worker *worker;

worker = kzalloc(sizeof(*worker), GFP_KERNEL);
+ if (worker)
+ INIT_LIST_HEAD(&worker->scheduled);
return worker;
}

@@ -363,12 +366,56 @@ static void destroy_worker(struct worker *worker)
{
/* sanity check frenzy */
BUG_ON(worker->current_work);
+ BUG_ON(!list_empty(&worker->scheduled));

kthread_stop(worker->task);
kfree(worker);
}

/**
+ * schedule_work_to_worker - schedule linked works to a worker
+ * @worker: target worker
+ * @work: start of series of works to be scheduled
+ * @nextp: out paramter for nested worklist walking
+ *
+ * Schedule linked works starting from @work to @worker. Work series
+ * to be scheduled starts at @work and includes any consecutive work
+ * with WORK_STRUCT_LINKED set in its predecessor.
+ *
+ * If @nextp is not NULL, it's updated to point to the next work of
+ * the last scheduled work. This allows schedule_work_to_worker() to
+ * be nested inside outer list_for_each_entry_safe().
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
+static void schedule_work_to_worker(struct worker *worker,
+ struct work_struct *work,
+ struct work_struct **nextp)
+{
+ struct work_struct *n;
+
+ /*
+ * Linked worklist will always end before the end of the list,
+ * use NULL for list head.
+ */
+ work = list_entry(work->entry.prev, struct work_struct, entry);
+ list_for_each_entry_safe_continue(work, n, NULL, entry) {
+ list_move_tail(&work->entry, &worker->scheduled);
+ if (!(*work_data_bits(work) & WORK_STRUCT_LINKED))
+ break;
+ }
+
+ /*
+ * If we're already inside safe list traversal and have moved
+ * multiple works to the scheduled queue, the next position
+ * needs to be updated.
+ */
+ if (nextp)
+ *nextp = n;
+}
+
+/**
* cwq_dec_nr_in_flight - decrement cwq's nr_in_flight
* @cwq: cwq of interest
* @work_color: color of work which left the queue
@@ -451,17 +498,25 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
cwq_dec_nr_in_flight(cwq, work_color);
}

-static void run_workqueue(struct worker *worker)
+/**
+ * process_scheduled_works - process scheduled works
+ * @worker: self
+ *
+ * Process all scheduled works. Please note that the scheduled list
+ * may change while processing a work, so this function repeatedly
+ * fetches a work from the top and executes it.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock) which may be released and regrabbed
+ * multiple times.
+ */
+static void process_scheduled_works(struct worker *worker)
{
- struct cpu_workqueue_struct *cwq = worker->cwq;
-
- spin_lock_irq(&cwq->lock);
- while (!list_empty(&cwq->worklist)) {
- struct work_struct *work = list_entry(cwq->worklist.next,
+ while (!list_empty(&worker->scheduled)) {
+ struct work_struct *work = list_first_entry(&worker->scheduled,
struct work_struct, entry);
process_one_work(worker, work);
}
- spin_unlock_irq(&cwq->lock);
}

/**
@@ -495,7 +550,26 @@ static int worker_thread(void *__worker)
if (kthread_should_stop())
break;

- run_workqueue(worker);
+ spin_lock_irq(&cwq->lock);
+
+ while (!list_empty(&cwq->worklist)) {
+ struct work_struct *work =
+ list_first_entry(&cwq->worklist,
+ struct work_struct, entry);
+
+ if (likely(!(*work_data_bits(work) &
+ WORK_STRUCT_LINKED))) {
+ /* optimization path, not strictly necessary */
+ process_one_work(worker, work);
+ if (unlikely(!list_empty(&worker->scheduled)))
+ process_scheduled_works(worker);
+ } else {
+ schedule_work_to_worker(worker, work, NULL);
+ process_scheduled_works(worker);
+ }
+ }
+
+ spin_unlock_irq(&cwq->lock);
}

return 0;
@@ -514,23 +588,51 @@ static void wq_barrier_func(struct work_struct *work)

/**
* insert_wq_barrier - insert a barrier work
- * @cwq: cwq to insert barrier into
* @barr: wq_barrier to insert
- * @head: insertion point
+ * @target: target work to attach @barr to
+ * @worker: worker currently executing @target, NULL if @target is not executing
+ *
+ * @barr is linked to @target such that @barr is completed only after
+ * @target finishes execution. Please note that the ordering
+ * guarantee is observed only with respect to @target and on the local
+ * cpu.
*
- * Insert barrier @barr into @cwq before @head.
+ * Currently, a queued barrier can't be canceled. This is because
+ * try_to_grab_pending() can't determine whether the work to be
+ * grabbed is at the head of the queue and thus can't clear LINKED
+ * flag of the previous work while there must be a valid next work
+ * after a work with LINKED flag set.
*
* CONTEXT:
* spin_lock_irq(cwq->lock).
*/
-static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
- struct wq_barrier *barr, struct list_head *head)
+static void insert_wq_barrier(struct wq_barrier *barr,
+ struct work_struct *target, struct worker *worker)
{
+ struct cpu_workqueue_struct *cwq = get_wq_data(target);
+ struct list_head *head;
+ unsigned int linked = 0;
+
INIT_WORK(&barr->work, wq_barrier_func);
__set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&barr->work));
init_completion(&barr->done);

- insert_work(cwq, &barr->work, head, 0);
+ /*
+ * If @target is currently being executed, schedule the
+ * barrier to the worker; otherwise, put it after @target.
+ */
+ if (worker)
+ head = worker->scheduled.next;
+ else {
+ unsigned long *bits = work_data_bits(target);
+
+ head = target->entry.next;
+ /* there can already be other linked works, inherit and set */
+ linked = *bits & WORK_STRUCT_LINKED;
+ *bits |= WORK_STRUCT_LINKED;
+ }
+
+ insert_work(cwq, &barr->work, head, linked);
}

/**
@@ -598,8 +700,8 @@ EXPORT_SYMBOL_GPL(flush_workqueue);
*/
int flush_work(struct work_struct *work)
{
+ struct worker *worker = NULL;
struct cpu_workqueue_struct *cwq;
- struct list_head *prev;
struct wq_barrier barr;

might_sleep();
@@ -619,13 +721,14 @@ int flush_work(struct work_struct *work)
smp_rmb();
if (unlikely(cwq != get_wq_data(work)))
goto already_gone;
- prev = &work->entry;
} else {
- if (!cwq->worker || cwq->worker->current_work != work)
+ if (cwq->worker && cwq->worker->current_work == work)
+ worker = cwq->worker;
+ if (!worker)
goto already_gone;
- prev = &cwq->worklist;
}
- insert_wq_barrier(cwq, &barr, prev->next);
+
+ insert_wq_barrier(&barr, work, worker);
spin_unlock_irq(&cwq->lock);
wait_for_completion(&barr.done);
return 1;
@@ -680,16 +783,19 @@ static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq,
struct work_struct *work)
{
struct wq_barrier barr;
- int running = 0;
+ struct worker *worker;

spin_lock_irq(&cwq->lock);
+
+ worker = NULL;
if (unlikely(cwq->worker && cwq->worker->current_work == work)) {
- insert_wq_barrier(cwq, &barr, cwq->worklist.next);
- running = 1;
+ worker = cwq->worker;
+ insert_wq_barrier(&barr, work, worker);
}
+
spin_unlock_irq(&cwq->lock);

- if (unlikely(running))
+ if (unlikely(worker))
wait_for_completion(&barr.done);
}

--
1.6.4.2

2009-10-01 08:10:44

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 18/19] workqueue: reimplement workqueue freeze using cwq->frozen_works queue

Currently, workqueue freezing is implemented by marking the worker
freezeable and calling try_to_freeze() from the dispatch loop.
Reimplement it so that the workqueue is frozen instead of the worker.

* cwq->cur_worklist and cwq->frozen_works are added. During normal
operation cwq->cur_worklist points to cwq->worklist.

* When freezing starts, cwq->cur_worklist is switched to
cwq->frozen_works so that new works are stored in cwq->frozen_works
instead of being processed.

* Freezing is complete when cwq->nr_in_flight equals the number of
works on cwq->frozen_works for all cwqs of all freezeable
workqueues.

* Thawing is done by restoring cwq->cur_worklist to cwq->worklist and
splicing cwq->frozen_works to cwq->worklist.

This new implementation allows having multiple shared workers per cpu.
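
As a rough userspace sketch of the cur_worklist switch described above
(illustrative only; two counters stand in for the worklist and
frozen_works lists, and locking and the per-workqueue iteration are
omitted):

  #include <stdio.h>

  struct fake_cwq {
          int worklist;           /* works visible to the worker */
          int frozen_works;       /* works parked while frozen */
          int *cur_worklist;      /* where queue_work() currently appends */
  };

  static void fake_queue_work(struct fake_cwq *cwq)
  {
          (*cwq->cur_worklist)++;
  }

  int main(void)
  {
          struct fake_cwq cwq = { .cur_worklist = &cwq.worklist };

          fake_queue_work(&cwq);                  /* normal operation */

          cwq.cur_worklist = &cwq.frozen_works;   /* freeze_workqueues_begin() */
          fake_queue_work(&cwq);                  /* parked, not executed */

          /* thaw_workqueues(): restore queueing and splice frozen works back */
          cwq.cur_worklist = &cwq.worklist;
          cwq.worklist += cwq.frozen_works;
          cwq.frozen_works = 0;

          printf("runnable works after thaw: %d\n", cwq.worklist);
          return 0;
  }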

NOT_SIGNED_OFF_YET
---
include/linux/workqueue.h | 7 ++
kernel/power/process.c | 21 +++++-
kernel/workqueue.c | 173 +++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 194 insertions(+), 7 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index a6136ca..351466d 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -299,4 +299,11 @@ static inline long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg)
#else
long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg);
#endif /* CONFIG_SMP */
+
+#ifdef CONFIG_FREEZER
+extern void freeze_workqueues_begin(void);
+extern bool freeze_workqueues_busy(void);
+extern void thaw_workqueues(void);
+#endif /* CONFIG_FREEZER */
+
#endif
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 9d26a0a..18d4835 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -14,6 +14,7 @@
#include <linux/module.h>
#include <linux/syscalls.h>
#include <linux/freezer.h>
+#include <linux/workqueue.h>

/*
* Timeout for stopping processes
@@ -34,6 +35,7 @@ static int try_to_freeze_tasks(bool sig_only)
struct task_struct *g, *p;
unsigned long end_time;
unsigned int todo;
+ bool wq_busy = false;
struct timeval start, end;
u64 elapsed_csecs64;
unsigned int elapsed_csecs;
@@ -41,6 +43,10 @@ static int try_to_freeze_tasks(bool sig_only)
do_gettimeofday(&start);

end_time = jiffies + TIMEOUT;
+
+ if (!sig_only)
+ freeze_workqueues_begin();
+
while (true) {
todo = 0;
read_lock(&tasklist_lock);
@@ -62,6 +68,12 @@ static int try_to_freeze_tasks(bool sig_only)
todo++;
} while_each_thread(g, p);
read_unlock(&tasklist_lock);
+
+ if (!sig_only) {
+ wq_busy = freeze_workqueues_busy();
+ todo += wq_busy;
+ }
+
if (!todo || time_after(jiffies, end_time))
break;

@@ -85,9 +97,13 @@ static int try_to_freeze_tasks(bool sig_only)
*/
printk("\n");
printk(KERN_ERR "Freezing of tasks failed after %d.%02d seconds "
- "(%d tasks refusing to freeze):\n",
- elapsed_csecs / 100, elapsed_csecs % 100, todo);
+ "(%d tasks refusing to freeze, wq_busy=%d):\n",
+ elapsed_csecs / 100, elapsed_csecs % 100,
+ todo - wq_busy, wq_busy);
show_state();
+
+ thaw_workqueues();
+
read_lock(&tasklist_lock);
do_each_thread(g, p) {
task_lock(p);
@@ -157,6 +173,7 @@ void thaw_processes(void)
oom_killer_enable();

printk("Restarting tasks ... ");
+ thaw_workqueues();
thaw_tasks(true);
thaw_tasks(false);
schedule();
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e234604..097da97 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -44,6 +44,10 @@
* L: cwq->lock protected. Access with cwq->lock held.
*
* W: workqueue_lock protected.
+ *
+ * V: Similar to L except that operation is limited to only one
+ * direction if workqueues are frozen (ie. can be added but can't
+ * be removed).
*/

struct cpu_workqueue_struct;
@@ -69,10 +73,12 @@ struct cpu_workqueue_struct {
unsigned int cpu;
struct worker *worker;

+ struct list_head *cur_worklist; /* L: current worklist */
int nr_in_flight; /* L: nr of in_flight works */
unsigned int flush_color; /* L: current flush color */
int flush_cnt; /* L: in-progress flush count */
struct workqueue_struct *wq; /* I: the owning workqueue */
+ struct list_head frozen_works; /* V: used while frozen */
} __attribute__((aligned(1 << WORK_STRUCT_FLAG_BITS)));

/*
@@ -97,6 +103,7 @@ struct workqueue_struct {
/* Serializes the accesses to the list of workqueues. */
static DEFINE_SPINLOCK(workqueue_lock);
static LIST_HEAD(workqueues);
+static bool workqueue_frozen;

static int worker_thread(void *__worker);

@@ -180,7 +187,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,

spin_lock_irqsave(&cwq->lock, flags);
BUG_ON(!list_empty(&work->entry));
- insert_work(cwq, work, &cwq->worklist, 0);
+ insert_work(cwq, work, cwq->cur_worklist, 0);
spin_unlock_irqrestore(&cwq->lock, flags);
}

@@ -545,8 +552,6 @@ static int worker_thread(void *__worker)
schedule();
finish_wait(&cwq->more_work, &wait);

- try_to_freeze();
-
if (kthread_should_stop())
break;

@@ -1048,6 +1053,14 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
*/
spin_lock(&workqueue_lock);
list_add(&wq->list, &workqueues);
+ for_each_possible_cpu(cpu) {
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+ if (workqueue_frozen && wq->flags & WQ_FREEZEABLE)
+ cwq->cur_worklist = &cwq->frozen_works;
+ else
+ cwq->cur_worklist = &cwq->worklist;
+ }
spin_unlock(&workqueue_lock);
/*
* We must initialize cwqs for each possible cpu even if we
@@ -1063,6 +1076,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
spin_lock_init(&cwq->lock);
INIT_LIST_HEAD(&cwq->worklist);
init_waitqueue_head(&cwq->more_work);
+ INIT_LIST_HEAD(&cwq->frozen_works);

if (failed || !cpu_online(cpu))
continue;
@@ -1099,12 +1113,17 @@ void destroy_workqueue(struct workqueue_struct *wq)
int cpu;

cpu_maps_update_begin();
+
+ flush_workqueue(wq);
+
+ /*
+ * wq list is used to freeze wq, remove from list after
+ * flushing is complete in case freeze races us.
+ */
spin_lock(&workqueue_lock);
list_del(&wq->list);
spin_unlock(&workqueue_lock);

- flush_workqueue(wq);
-
for_each_possible_cpu(cpu) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);

@@ -1114,6 +1133,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
cwq->worker = NULL;
}
BUG_ON(cwq->nr_in_flight);
+ BUG_ON(!list_empty(&cwq->frozen_works));
}

cpu_maps_update_done();
@@ -1217,6 +1237,149 @@ long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg)
EXPORT_SYMBOL_GPL(work_on_cpu);
#endif /* CONFIG_SMP */

+#ifdef CONFIG_FREEZER
+/**
+ * freeze_workqueues_begin - begin freezing workqueues
+ *
+ * Start freezing workqueues. After this function returns, all
+ * freezeable workqueues will queue new works to their frozen_works
+ * list instead of the cwq ones.
+ *
+ * CONTEXT:
+ * Grabs and releases workqueue_lock and cwq->lock's.
+ */
+void freeze_workqueues_begin(void)
+{
+ struct workqueue_struct *wq;
+ unsigned int cpu;
+
+ spin_lock(&workqueue_lock);
+
+ BUG_ON(workqueue_frozen);
+ workqueue_frozen = true;
+
+ for_each_possible_cpu(cpu) {
+ list_for_each_entry(wq, &workqueues, list) {
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+ if (!(wq->flags & WQ_FREEZEABLE))
+ continue;
+
+ spin_lock_irq(&cwq->lock);
+
+ BUG_ON(cwq->cur_worklist != &cwq->worklist);
+ BUG_ON(!list_empty(&cwq->frozen_works));
+
+ cwq->cur_worklist = &cwq->frozen_works;
+
+ spin_unlock_irq(&cwq->lock);
+ }
+ }
+ spin_unlock(&workqueue_lock);
+}
+
+/**
+ * freeze_workqueues_busy - are freezeable workqueues still busy?
+ *
+ * Check whether freezing is complete. This function must be called
+ * between freeze_workqueues_begin() and thaw_workqueues().
+ *
+ * CONTEXT:
+ * Grabs and releases workqueue_lock.
+ *
+ * RETURNS:
+ * %true if some freezeable workqueues are still busy. %false if
+ * freezing is complete.
+ */
+bool freeze_workqueues_busy(void)
+{
+ struct workqueue_struct *wq;
+ unsigned int cpu;
+ bool busy = false;
+
+ spin_lock(&workqueue_lock);
+
+ BUG_ON(!workqueue_frozen);
+
+ for_each_possible_cpu(cpu) {
+ list_for_each_entry(wq, &workqueues, list) {
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+ struct work_struct *work;
+ int nr_in_flight;
+
+ if (!(wq->flags & WQ_FREEZEABLE))
+ continue;
+
+ spin_lock_irq(&cwq->lock);
+
+ BUG_ON(cwq->cur_worklist != &cwq->frozen_works);
+
+ nr_in_flight = cwq->nr_in_flight;
+ list_for_each_entry(work, &cwq->frozen_works, entry)
+ nr_in_flight--;
+
+ spin_unlock_irq(&cwq->lock);
+
+ BUG_ON(nr_in_flight < 0);
+ if (nr_in_flight) {
+ busy = true;
+ break;
+ }
+ }
+ if (busy)
+ break;
+ }
+ spin_unlock(&workqueue_lock);
+ return busy;
+}
+
+/**
+ * thaw_workqueues - thaw workqueues
+ *
+ * Thaw workqueues. Normal queueing is restored and all collected
+ * frozen works are transferred to their respective cwq worklists.
+ *
+ * CONTEXT:
+ * Grabs and releases workqueue_lock and cwq->lock's.
+ */
+void thaw_workqueues(void)
+{
+ struct workqueue_struct *wq;
+ unsigned int cpu;
+
+ spin_lock(&workqueue_lock);
+
+ if (!workqueue_frozen)
+ goto out_unlock;
+
+ workqueue_frozen = false;
+
+ for_each_possible_cpu(cpu) {
+ list_for_each_entry(wq, &workqueues, list) {
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+ if (!(wq->flags & WQ_FREEZEABLE))
+ continue;
+
+ spin_lock_irq(&cwq->lock);
+
+ /* switch to normal work queueing */
+ BUG_ON(cwq->cur_worklist != &cwq->frozen_works);
+ cwq->cur_worklist = &cwq->worklist;
+
+ /* transfer frozen tasks to cwq worklist */
+ list_splice_tail(&cwq->frozen_works, &cwq->worklist);
+ INIT_LIST_HEAD(&cwq->frozen_works);
+ wake_up(&cwq->more_work);
+
+ spin_unlock_irq(&cwq->lock);
+ }
+ }
+out_unlock:
+ spin_unlock(&workqueue_lock);
+}
+#endif /* CONFIG_FREEZER */
+
void __init init_workqueues(void)
{
/*
--
1.6.4.2

2009-10-01 08:11:05

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 19/19] workqueue: implement concurrency managed workqueue

Currently each workqueue has its own dedicated worker pool. This
causes the following problems.

* Works which are dependent on each other can cause a deadlock by
depending on the same execution resource. This is bad because this
type of dependency is quite difficult to find.

* Works which may sleep and take a long time to finish need to have
separate workqueues so that they don't block other works. Similarly,
works which want to be executed in a timely manner often need to
create their own custom workqueue to avoid being blocked by long
running ones. This leads to a large number of workqueues and thus
many workers.

* The static one-per-cpu worker isn't good enough for jobs which
require a higher level of concurrency, necessitating other worker
pool mechanisms. slow-work and async are good examples and there are
also some custom implementations buried in subsystems.

* Combined, the above factors lead to many workqueues with a large
number of dedicated and mostly unused workers. This also makes work
processing less optimal as the dedicated workers end up switching
among themselves, costing scheduling overhead and wasting cache
footprint for their stacks, and as the system gets busy, these
workers end up competing with each other.

To solve the above issues, this patch implements concurrency-managed
workqueue.

There is a single global cpu workqueue (gcwq) for each cpu which
serves all the workqueues. gcwq maintains a single pool of workers
which is shared by all cwqs on the cpu.
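
A condensed view of the new core data structure looks like the
following (the full definition, including the timers, trustee fields
and flags, is in the patch below):

	struct global_cwq {
		spinlock_t		lock;		/* protects the whole gcwq */
		struct list_head	worklist;	/* all pending works on this cpu */
		unsigned int		cpu;
		int			nr_workers;	/* total number of workers */
		int			nr_idle;	/* currently idle ones */
		struct list_head	idle_list;	/* idle workers, LIFO */
		struct hlist_head	busy_hash[BUSY_WORKER_HASH_SIZE];
	};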

gcwq keeps the number of concurrently active workers to the minimum
necessary but no less. As long as there is at least one running
worker on the cpu, no new worker is scheduled so that works can be
processed in batch as much as possible, but when the last running
worker blocks, gcwq immediately schedules a new worker so that the
cpu doesn't sit idle while there are works to be processed.
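
In rough terms, the policy reduces to the checks below (a simplified
sketch of need_more_worker() and the scheduler sleep callback from
the patch; flag handling and locking are elided):

	/* pending works but nobody running -> an idle worker should be woken */
	static bool need_more_worker(struct global_cwq *gcwq)
	{
		return !list_empty(&gcwq->worklist) &&
		       !atomic_read(get_gcwq_nr_running(gcwq->cpu));
	}

	/* scheduler callback: a running worker is about to block */
	void sched_workqueue_worker_sleep(struct task_struct *task)
	{
		struct worker *worker = kthread_data(task);
		struct global_cwq *gcwq = worker->gcwq;

		/* if it was the last running one, kick the next idle worker */
		if (atomic_dec_and_test(get_gcwq_nr_running(gcwq->cpu)) &&
		    !list_empty(&gcwq->worklist))
			sched_wake_up_worker(gcwq);
	}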

gcwq always keeps at least one idle worker around. When a new worker
is necessary and the current worker is the last idle one, that worker
assumes the role of "manager" and manages the worker pool -
ie. creates another worker. Forward progress is guaranteed by having
dedicated rescue workers for workqueues whose works may be necessary
while creating a new worker. When the manager has trouble creating a
new worker, the mayday timer activates and rescue workers are
summoned to the cpu to execute the works which may be necessary to
create new workers.
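
The mayday path is roughly the following (simplified from
gcwq_mayday_timeout() and send_mayday() in the patch below):

	/* the manager couldn't create a worker in time - call for help */
	static void gcwq_mayday_timeout(unsigned long __gcwq)
	{
		struct global_cwq *gcwq = (void *)__gcwq;
		struct work_struct *work;

		if (need_to_create_worker(gcwq))
			list_for_each_entry(work, &gcwq->worklist, entry)
				send_mayday(work);	/* wakes wq->rescuer if WQ_RESCUER */

		mod_timer(&gcwq->mayday_timer, jiffies + MAYDAY_INTERVAL);
	}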

To keep track of which worker is executing which work, gcwq uses a
hash table. This is necessary because works may be destroyed once
they start executing, so flushing has to be implemented by tracking
whether any worker is currently executing the work.
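
For example, flush_work() now finds the executing worker through the
busy hash instead of checking a per-cwq worker pointer (simplified):

	worker = find_worker_executing_work(gcwq, work);
	if (worker) {
		insert_wq_barrier(&barr, work, worker);
		spin_unlock_irq(&gcwq->lock);
		wait_for_completion(&barr.done);
	}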

cpu hotplug implementation is more complex than before because there
are multiple workers and the workqueue is now capable of hosting long
running works. cpu offlining is implemented by creating a "trustee"
kthread which runs the gcwq as if the cpu were still online until all
works are drained. As soon as the trustee takes over the gcwq, the
cpu hotunplug operation can proceed without waiting for workqueues to
be drained. Onlining is the reverse. If the trustee is still trying
to drain the gcwq from the previous offlining, it puts all workers
back on the cpu and lets the gcwq run as if the cpu had been online
the whole time.
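
In terms of the cpu callback, offlining roughly boils down to the
following (a simplified sketch; the exact sequencing, locking and the
other hotplug events are in workqueue_cpu_callback() and
wait_trustee_state() in the patch below):

	case CPU_DOWN_PREPARE:
		/* start a trustee bound to the dying cpu and wait for it
		 * to take over the gcwq before letting the cpu go down */
		new_trustee = kthread_create(trustee_thread, gcwq,
					     "workqueue_trustee/%d\n", cpu);
		kthread_bind(new_trustee, cpu);
		wait_trustee_state(gcwq, TRUSTEE_IN_CHARGE);
		break;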

The new implementation has the following benefits.

* Workqueue users no longer have to worry about managing concurrency
and, in most cases, deadlocks. The workqueue will manage it
automatically and unless the deadlock chain involves more than 127
works, it won't happen.

* There's one single shared pool of workers per cpu and one rescuer
for each workqueue which requires it, so there are far fewer
kthreads.

* More efficient. Although it adds a considerable amount of code, the
code added to the hot path isn't big and works will be executed on
the local cpu and in batch as much as possible using a minimal number
of kthreads, leading to fewer task switches and lower cache
footprint. <NEED SOME BACKING NUMBERS>

* As concurrency is no longer a problem, most types of asynchronous
jobs can be done using generic workqueue and other async mechanisms
can be removed.

NOT_SIGNED_OFF_YET
---
include/linux/workqueue.h | 7 +-
kernel/workqueue.c | 1566 +++++++++++++++++++++++++++++++++++++++------
2 files changed, 1378 insertions(+), 195 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 351466d..9dbdbc2 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -182,6 +182,7 @@ struct execute_work {

enum {
WQ_FREEZEABLE = 1 << 0, /* freeze during suspend */
+ WQ_RESCUER = 1 << 1, /* has a rescue worker */
};

extern struct workqueue_struct *
@@ -208,11 +209,11 @@ __create_workqueue_key(const char *name, unsigned int flags,
#endif

#define create_workqueue(name) \
- __create_workqueue((name), 0)
+ __create_workqueue((name), WQ_RESCUER)
#define create_freezeable_workqueue(name) \
- __create_workqueue((name), WQ_FREEZEABLE)
+ __create_workqueue((name), WQ_FREEZEABLE | WQ_RESCUER)
#define create_singlethread_workqueue(name) \
- __create_workqueue((name), 0)
+ __create_workqueue((name), WQ_RESCUER)

extern void destroy_workqueue(struct workqueue_struct *wq);

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 097da97..67cb3a1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -29,19 +29,72 @@
#include <linux/kthread.h>
#include <linux/hardirq.h>
#include <linux/mempolicy.h>
-#include <linux/freezer.h>
#include <linux/kallsyms.h>
#include <linux/debug_locks.h>
#include <linux/lockdep.h>
+#include <linux/wait.h>

#include "sched_workqueue.h"

+enum {
+ /* worker flags */
+ WORKER_STARTED = 1 << 0, /* started */
+ WORKER_DIE = 1 << 1, /* die die die */
+ WORKER_IDLE = 1 << 2, /* is idle */
+ WORKER_MANAGER = 1 << 3, /* I'm the manager */
+ WORKER_RESCUER = 1 << 4, /* I'm a rescuer */
+ WORKER_ROGUE = 1 << 5, /* not bound to any cpu */
+
+ WORKER_IGN_RUNNING = WORKER_IDLE | WORKER_MANAGER | WORKER_ROGUE,
+
+ /* global_cwq flags */
+ GCWQ_MANAGE_WORKERS = 1 << 0, /* need to manage workers */
+ GCWQ_MANAGING_WORKERS = 1 << 1, /* managing workers */
+ GCWQ_DISASSOCIATED = 1 << 2, /* cpu can't serve workers */
+
+ /* gcwq->trustee_state */
+ TRUSTEE_START = 0, /* start */
+ TRUSTEE_IN_CHARGE = 1, /* trustee in charge of gcwq */
+ TRUSTEE_BUTCHER = 2, /* butcher workers */
+ TRUSTEE_RELEASE = 3, /* release workers */
+ TRUSTEE_DONE = 4, /* trustee is done */
+
+ MAX_CPU_WORKERS_ORDER = 7, /* 128 */
+ MAX_WORKERS_PER_CPU = 1 << MAX_CPU_WORKERS_ORDER,
+
+ BUSY_WORKER_HASH_ORDER = 4, /* 16 pointers */
+ BUSY_WORKER_HASH_SIZE = 1 << BUSY_WORKER_HASH_ORDER,
+ BUSY_WORKER_HASH_MASK = BUSY_WORKER_HASH_SIZE - 1,
+
+ MAX_IDLE_WORKERS_RATIO = 4, /* 1/4 of busy can be idle */
+ IDLE_WORKER_TIMEOUT = 300 * HZ, /* keep idle ones for 5 mins */
+
+ MAYDAY_INITIAL_TIMEOUT = HZ / 100, /* call for help after 10ms */
+ MAYDAY_INTERVAL = HZ / 10, /* and then every 100ms */
+ CREATE_COOLDOWN = HZ, /* time to breathe after fail */
+ TRUSTEE_COOLDOWN = HZ / 10, /* for trustee draining */
+
+ /*
+ * Rescue workers are used only on emergencies and shared by
+ * all cpus. Give -20.
+ */
+ RESCUER_NICE_LEVEL = -20,
+};
+
/*
* Structure fields follow one of the following exclusion rules.
*
* I: Set during initialization and read-only afterwards.
*
- * L: cwq->lock protected. Access with cwq->lock held.
+ * P: Preemption protected. Disabling preemption is enough and should
+ * only be modified and accessed from the local cpu.
+ *
+ * L: gcwq->lock protected. Access with gcwq->lock held.
+ *
+ * X: During normal operation, modification requires gcwq->lock and
+ * should be done only from local cpu. Either disabling preemption
+ * on local cpu or grabbing gcwq->lock is enough for read access.
+ * While trustee is in charge, it's identical to L.
*
* W: workqueue_lock protected.
*
@@ -50,14 +103,56 @@
* be removed).
*/

-struct cpu_workqueue_struct;
+struct global_cwq;

+/*
+ * The poor guys doing the actual heavy lifting. All on-duty workers
+ * are either serving the manager role, on idle list or on busy hash.
+ */
struct worker {
+ /* on idle list while idle, on busy hash table while busy */
+ union {
+ struct list_head entry; /* L: while idle */
+ struct hlist_node hentry; /* L: while busy */
+ };
+
struct work_struct *current_work; /* L: work being processed */
struct list_head scheduled; /* L: scheduled works */
struct task_struct *task; /* I: worker task */
- struct cpu_workqueue_struct *cwq; /* I: the associated cwq */
-};
+ struct global_cwq *gcwq; /* I: the associated gcwq */
+ unsigned long last_active; /* L: last active timestamp */
+ /* 64 bytes boundary on 64bit, 32 on 32bit */
+ bool running; /* ?: is this worker running? */
+ unsigned int flags; /* ?: flags */
+} ____cacheline_aligned_in_smp;
+
+/*
+ * Global per-cpu workqueue. There's one and only one for each cpu
+ * and all works are queued and processed here regardless of their
+ * target workqueues.
+ */
+struct global_cwq {
+ spinlock_t lock; /* the gcwq lock */
+ struct list_head worklist; /* L: list of pending works */
+ unsigned int cpu; /* I: the associated cpu */
+ unsigned int flags; /* L: GCWQ_* flags */
+
+ int nr_workers; /* L: total number of workers */
+ int nr_idle; /* L: currently idle ones */
+
+ /* workers are chained either in the idle_list or busy_hash */
+ struct list_head idle_list; /* ?: list of idle workers */
+ struct hlist_head busy_hash[BUSY_WORKER_HASH_SIZE];
+ /* L: hash of busy workers */
+
+ struct timer_list idle_timer; /* L: worker idle timeout */
+ struct timer_list mayday_timer; /* L: SOS timer for dworkers */
+
+ struct task_struct *trustee; /* L: for gcwq shutdown */
+ unsigned int trustee_state; /* L: trustee state */
+ wait_queue_head_t trustee_wait; /* trustee wait */
+ struct worker *first_idle; /* L: first idle worker */
+} ____cacheline_aligned_in_smp;

/*
* The per-CPU workqueue. The lower WORK_STRUCT_FLAG_BITS of
@@ -65,14 +160,7 @@ struct worker {
* aligned at two's power of the number of flag bits.
*/
struct cpu_workqueue_struct {
-
- spinlock_t lock;
-
- struct list_head worklist;
- wait_queue_head_t more_work;
- unsigned int cpu;
- struct worker *worker;
-
+ struct global_cwq *gcwq; /* I: the associated gcwq */
struct list_head *cur_worklist; /* L: current worklist */
int nr_in_flight; /* L: nr of in_flight works */
unsigned int flush_color; /* L: current flush color */
@@ -94,6 +182,9 @@ struct workqueue_struct {
atomic_t nr_cwqs_to_flush; /* flush in progress */
struct completion *flush_done; /* flush done */

+ cpumask_var_t mayday_mask; /* cpus requesting rescue */
+ struct worker *rescuer; /* I: rescue worker */
+
const char *name; /* I: workqueue name */
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
@@ -105,8 +196,27 @@ static DEFINE_SPINLOCK(workqueue_lock);
static LIST_HEAD(workqueues);
static bool workqueue_frozen;

+/*
+ * The almighty global cpu workqueues. nr_running is the only field
+ * which is expected to be used frequently by other cpus by
+ * try_to_wake_up() which ends up incrementing it. Put it in a
+ * separate cacheline.
+ */
+static DEFINE_PER_CPU(struct global_cwq, global_cwq);
+static DEFINE_PER_CPU_SHARED_ALIGNED(atomic_t, gcwq_nr_running);
+
static int worker_thread(void *__worker);

+static struct global_cwq *get_gcwq(unsigned int cpu)
+{
+ return &per_cpu(global_cwq, cpu);
+}
+
+static atomic_t *get_gcwq_nr_running(unsigned int cpu)
+{
+ return &per_cpu(gcwq_nr_running, cpu);
+}
+
static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
struct workqueue_struct *wq)
{
@@ -133,6 +243,106 @@ static inline struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
}

/*
+ * Policy functions. These define the policies on how the global
+ * worker pool is managed. Unless noted otherwise, these functions
+ * assume that they're being called with gcwq->lock held.
+ */
+
+/*
+ * Need to wake up a worker? Called from anything but currently
+ * running workers.
+ */
+static bool need_more_worker(struct global_cwq *gcwq)
+{
+ atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+
+ return !list_empty(&gcwq->worklist) && !atomic_read(nr_running);
+}
+
+/* Can I start working? Called from busy but !running workers. */
+static bool may_start_working(struct global_cwq *gcwq)
+{
+ return gcwq->nr_idle;
+}
+
+/* Do I need to keep working? Called from currently running workers. */
+static bool keep_working(struct global_cwq *gcwq)
+{
+ atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+
+ return !list_empty(&gcwq->worklist) && atomic_read(nr_running) <= 1;
+}
+
+/* Do we need a new worker? Called from manager. */
+static bool need_to_create_worker(struct global_cwq *gcwq)
+{
+ return need_more_worker(gcwq) && !may_start_working(gcwq);
+}
+
+/* Do I need to be the manager? */
+static bool need_to_manage_workers(struct global_cwq *gcwq)
+{
+ return need_to_create_worker(gcwq) || gcwq->flags & GCWQ_MANAGE_WORKERS;
+}
+
+/* Do we have too many workers and should some go away? */
+static bool too_many_workers(struct global_cwq *gcwq)
+{
+ bool managing = gcwq->flags & GCWQ_MANAGING_WORKERS;
+ int nr_idle = gcwq->nr_idle + managing; /* manager is considered idle */
+ int nr_busy = gcwq->nr_workers - nr_idle;
+
+ return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
+}
+
+/*
+ * Wake up functions.
+ */
+
+/* Return the first worker. Safe with preemption disabled */
+static struct worker *first_worker(struct global_cwq *gcwq)
+{
+ if (unlikely(list_empty(&gcwq->idle_list)))
+ return NULL;
+
+ return list_first_entry(&gcwq->idle_list, struct worker, entry);
+}
+
+/**
+ * wake_up_worker - wake up an idle worker
+ * @gcwq: gcwq to wake worker for
+ *
+ * Wake up the first idle worker of @gcwq.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ */
+static void wake_up_worker(struct global_cwq *gcwq)
+{
+ struct worker *worker = first_worker(gcwq);
+
+ if (likely(worker))
+ wake_up_process(worker->task);
+}
+
+/**
+ * sched_wake_up_worker - wake up an idle worker from a scheduler callback
+ * @gcwq: gcwq to wake worker for
+ *
+ * Wake up the first idle worker of @gcwq.
+ *
+ * CONTEXT:
+ * Scheduler callback. DO NOT call from anywhere else.
+ */
+static void sched_wake_up_worker(struct global_cwq *gcwq)
+{
+ struct worker *worker = first_worker(gcwq);
+
+ if (likely(worker))
+ sched_workqueue_wake_up_process(worker->task);
+}
+
+/*
* Scheduler callbacks. These functions are called during schedule()
* with rq lock held. Don't try to acquire any lock and only access
* fields which are safe with preemption disabled from local cpu.
@@ -141,29 +351,144 @@ static inline struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
/* called when a worker task wakes up from sleep */
void sched_workqueue_worker_wakeup(struct task_struct *task)
{
+ struct worker *worker = kthread_data(task);
+ struct global_cwq *gcwq = worker->gcwq;
+ atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+
+ if (unlikely(worker->flags & WORKER_IGN_RUNNING))
+ return;
+
+ if (likely(!worker->running)) {
+ worker->running = true;
+ atomic_inc(nr_running);
+ }
}

/* called when a worker task goes into sleep */
void sched_workqueue_worker_sleep(struct task_struct *task)
{
+ struct worker *worker = kthread_data(task);
+ struct global_cwq *gcwq = worker->gcwq;
+ atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+
+ if (unlikely(worker->flags & WORKER_IGN_RUNNING))
+ return;
+
+ /* this can only happen on the local cpu */
+ BUG_ON(gcwq->cpu != raw_smp_processor_id());
+
+ if (likely(worker->running)) {
+ worker->running = false;
+ /*
+ * The counterpart of the following dec_and_test,
+ * implied mb, worklist not empty test sequence is in
+ * insert_work(). Please read comment there.
+ */
+ if (atomic_dec_and_test(nr_running) &&
+ !list_empty(&gcwq->worklist))
+ sched_wake_up_worker(gcwq);
+ }
+}
+
+/**
+ * busy_worker_head - return the busy hash head for a work
+ * @gcwq: gcwq of interest
+ * @work: work to be hashed
+ *
+ * Return hash head of @gcwq for @work.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ *
+ * RETURNS:
+ * Pointer to the hash head.
+ */
+static struct hlist_head *busy_worker_head(struct global_cwq *gcwq,
+ struct work_struct *work)
+{
+ const int base_shift = ilog2(sizeof(struct work_struct));
+ unsigned long v = (unsigned long)work;
+
+ /* simple shift and fold hash, do we need something better? */
+ v >>= base_shift;
+ v += v >> BUSY_WORKER_HASH_ORDER;
+ v &= BUSY_WORKER_HASH_MASK;
+
+ return &gcwq->busy_hash[v];
}

/**
- * insert_work - insert a work into cwq
+ * __find_worker_executing_work - find worker which is executing a work
+ * @gcwq: gcwq of interest
+ * @bwh: hash head as returned by busy_worker_head()
+ * @work: work to find worker for
+ *
+ * Find a worker which is executing @work on @gcwq. @bwh should be
+ * the hash head obtained by calling busy_worker_head() with the same
+ * work.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ *
+ * RETURNS:
+ * Pointer to worker which is executing @work if found, NULL
+ * otherwise.
+ */
+static struct worker *__find_worker_executing_work(struct global_cwq *gcwq,
+ struct hlist_head *bwh,
+ struct work_struct *work)
+{
+ struct worker *worker;
+ struct hlist_node *tmp;
+
+ hlist_for_each_entry(worker, tmp, bwh, hentry)
+ if (worker->current_work == work)
+ return worker;
+ return NULL;
+}
+
+/**
+ * find_worker_executing_work - find worker which is executing a work
+ * @gcwq: gcwq of interest
+ * @work: work to find worker for
+ *
+ * Find a worker which is executing @work on @gcwq. This function is
+ * identical to __find_worker_executing_work() except that this
+ * function calculates @bwh itself.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ *
+ * RETURNS:
+ * Pointer to worker which is executing @work if found, NULL
+ * otherwise.
+ */
+static struct worker *find_worker_executing_work(struct global_cwq *gcwq,
+ struct work_struct *work)
+{
+ return __find_worker_executing_work(gcwq, busy_worker_head(gcwq, work),
+ work);
+}
+
+/**
+ * insert_work - insert a work into gcwq
* @cwq: cwq @work belongs to
* @work: work to insert
* @head: insertion point
* @extra_flags: extra WORK_STRUCT_* flags to set
*
- * Insert @work into @cwq after @head.
+ * Insert @work which belongs to @cwq into @gcwq after @head.
+ * @extra_flags is ORd to WORK_STRUCT flags.
*
* CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
*/
static void insert_work(struct cpu_workqueue_struct *cwq,
struct work_struct *work, struct list_head *head,
unsigned int extra_flags)
{
+ struct global_cwq *gcwq = cwq->gcwq;
+
cwq->nr_in_flight++;

/* we own @work, set data and link */
@@ -176,19 +501,29 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
smp_wmb();

list_add_tail(&work->entry, head);
- wake_up(&cwq->more_work);
+
+ /*
+ * Ensure either sched_workqueue_worker_sleep() sees the above
+ * list_add_tail() or we see zero nr_running to avoid workers
+ * lying around lazily while there are works to be processed.
+ */
+ smp_mb();
+
+ if (!atomic_read(get_gcwq_nr_running(gcwq->cpu)))
+ wake_up_worker(gcwq);
}

static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
struct work_struct *work)
{
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+ struct global_cwq *gcwq = cwq->gcwq;
unsigned long flags;

- spin_lock_irqsave(&cwq->lock, flags);
+ spin_lock_irqsave(&gcwq->lock, flags);
BUG_ON(!list_empty(&work->entry));
insert_work(cwq, work, cwq->cur_worklist, 0);
- spin_unlock_irqrestore(&cwq->lock, flags);
+ spin_unlock_irqrestore(&gcwq->lock, flags);
}

/**
@@ -300,22 +635,77 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
}
EXPORT_SYMBOL_GPL(queue_delayed_work_on);

+/**
+ * worker_enter_idle - enter idle state
+ * @worker: worker which is entering idle state
+ *
+ * @worker is entering idle state. Update stats and idle timer if
+ * necessary.
+ *
+ * LOCKING:
+ * spin_lock_irq(gcwq->lock).
+ */
+static void worker_enter_idle(struct worker *worker)
+{
+ struct global_cwq *gcwq = worker->gcwq;
+
+ BUG_ON(worker->flags & WORKER_IDLE);
+ BUG_ON(!list_empty(&worker->entry) &&
+ (worker->hentry.next || worker->hentry.pprev));
+
+ worker->flags |= WORKER_IDLE;
+ gcwq->nr_idle++;
+ worker->last_active = jiffies;
+
+ /* idle_list is LIFO */
+ list_add(&worker->entry, &gcwq->idle_list);
+
+ if (likely(!(worker->flags & WORKER_ROGUE))) {
+ if (too_many_workers(gcwq) && !timer_pending(&gcwq->idle_timer))
+ mod_timer(&gcwq->idle_timer,
+ jiffies + IDLE_WORKER_TIMEOUT);
+ } else
+ wake_up_all(&gcwq->trustee_wait);
+}
+
+/**
+ * worker_leave_idle - leave idle state
+ * @worker: worker which is leaving idle state
+ *
+ * @worker is leaving idle state. Update stats.
+ *
+ * LOCKING:
+ * spin_lock_irq(gcwq->lock).
+ */
+static void worker_leave_idle(struct worker *worker)
+{
+ struct global_cwq *gcwq = worker->gcwq;
+
+ BUG_ON(!(worker->flags & WORKER_IDLE));
+ worker->flags &= ~WORKER_IDLE;
+ gcwq->nr_idle--;
+ list_del_init(&worker->entry);
+}
+
static struct worker *alloc_worker(void)
{
struct worker *worker;

worker = kzalloc(sizeof(*worker), GFP_KERNEL);
- if (worker)
+ if (worker) {
+ INIT_LIST_HEAD(&worker->entry);
INIT_LIST_HEAD(&worker->scheduled);
+ /* on creation a worker is not idle */
+ }
return worker;
}

/**
* create_worker - create a new workqueue worker
- * @cwq: cwq the new worker will belong to
+ * @gcwq: gcwq the new worker will belong to
* @bind: whether to set affinity to @cpu or not
*
- * Create a new worker which is bound to @cwq. The returned worker
+ * Create a new worker which is bound to @gcwq. The returned worker
* can be started by calling start_worker() or destroyed using
* destroy_worker().
*
@@ -325,23 +715,30 @@ static struct worker *alloc_worker(void)
* RETURNS:
* Pointer to the newly created worker.
*/
-static struct worker *create_worker(struct cpu_workqueue_struct *cwq, bool bind)
+static struct worker *create_worker(struct global_cwq *gcwq, bool bind)
{
- struct worker *worker;
+ struct worker *worker = NULL;

worker = alloc_worker();
if (!worker)
goto fail;

- worker->cwq = cwq;
+ worker->gcwq = gcwq;

worker->task = kthread_create(worker_thread, worker, "kworker/%u",
- cwq->cpu);
+ gcwq->cpu);
if (IS_ERR(worker->task))
goto fail;

+ /*
+ * A rogue worker will become a regular one if CPU comes
+ * online later on. Make sure every worker has
+ * PF_THREAD_BOUND set.
+ */
if (bind)
- kthread_bind(worker->task, cwq->cpu);
+ kthread_bind(worker->task, gcwq->cpu);
+ else
+ worker->task->flags |= PF_THREAD_BOUND;

return worker;
fail:
@@ -353,13 +750,16 @@ fail:
* start_worker - start a newly created worker
* @worker: worker to start
*
- * Start @worker.
+ * Make the gcwq aware of @worker and start it.
*
* CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
*/
static void start_worker(struct worker *worker)
{
+ worker->flags |= WORKER_STARTED;
+ worker->gcwq->nr_workers++;
+ worker_enter_idle(worker);
wake_up_process(worker->task);
}

@@ -367,16 +767,263 @@ static void start_worker(struct worker *worker)
* destroy_worker - destroy a workqueue worker
* @worker: worker to be destroyed
*
- * Destroy @worker.
+ * Destroy @worker and adjust @gcwq stats accordingly.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which is released and regrabbed.
*/
static void destroy_worker(struct worker *worker)
{
+ struct global_cwq *gcwq = worker->gcwq;
+
/* sanity check frenzy */
BUG_ON(worker->current_work);
BUG_ON(!list_empty(&worker->scheduled));
+ BUG_ON(worker->running);
+
+ if (worker->flags & WORKER_STARTED)
+ gcwq->nr_workers--;
+ if (worker->flags & WORKER_IDLE)
+ gcwq->nr_idle--;
+
+ list_del_init(&worker->entry);
+ worker->flags |= WORKER_DIE;
+
+ spin_unlock_irq(&gcwq->lock);

kthread_stop(worker->task);
kfree(worker);
+
+ spin_lock_irq(&gcwq->lock);
+}
+
+static void idle_worker_timeout(unsigned long __gcwq)
+{
+ struct global_cwq *gcwq = (void *)__gcwq;
+
+ spin_lock_irq(&gcwq->lock);
+
+ if (too_many_workers(gcwq)) {
+ struct worker *worker;
+ unsigned long expires;
+
+ /* idle_list is kept in LIFO order, check the last one */
+ worker = list_entry(gcwq->idle_list.prev, struct worker, entry);
+ expires = worker->last_active + IDLE_WORKER_TIMEOUT;
+
+ if (time_before(jiffies, expires))
+ mod_timer(&gcwq->idle_timer, expires);
+ else {
+ /* it's been idle for too long, wake up manager */
+ gcwq->flags |= GCWQ_MANAGE_WORKERS;
+ wake_up_worker(gcwq);
+ }
+ }
+
+ spin_unlock_irq(&gcwq->lock);
+}
+
+static bool send_mayday(struct work_struct *work)
+{
+ struct cpu_workqueue_struct *cwq = get_wq_data(work);
+ struct workqueue_struct *wq = cwq->wq;
+
+ if (!(wq->flags & WQ_RESCUER))
+ return false;
+
+ /* mayday mayday mayday */
+ if (!cpumask_test_and_set_cpu(cwq->gcwq->cpu, wq->mayday_mask))
+ wake_up_process(wq->rescuer->task);
+ return true;
+}
+
+static void gcwq_mayday_timeout(unsigned long __gcwq)
+{
+ struct global_cwq *gcwq = (void *)__gcwq;
+ struct work_struct *work;
+
+ spin_lock_irq(&gcwq->lock);
+
+ if (need_to_create_worker(gcwq)) {
+ /*
+ * We've been trying to create a new worker but
+ * haven't been successful. We might be hitting an
+ * allocation deadlock. Send distress calls to
+ * rescuers.
+ */
+ list_for_each_entry(work, &gcwq->worklist, entry)
+ send_mayday(work);
+ }
+
+ spin_unlock_irq(&gcwq->lock);
+
+ mod_timer(&gcwq->mayday_timer, jiffies + MAYDAY_INTERVAL);
+}
+
+/**
+ * maybe_create_worker - create a new worker if necessary
+ * @gcwq: gcwq to create a new worker for
+ *
+ * Create a new worker for @gcwq if necessary. @gcwq is guaranteed to
+ * have at least one idle worker on return from this function. If
+ * creating a new worker takes longer than MAYDAY_INTERVAL, mayday is
+ * sent to all rescuers with works scheduled on @gcwq to resolve
+ * possible allocation deadlock.
+ *
+ * On return, need_to_create_worker() is guaranteed to be false and
+ * may_start_working() true.
+ *
+ * LOCKING:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times. Does GFP_KERNEL allocations. Called only from
+ * manager.
+ *
+ * RETURNS:
+ * false if no action was taken and gcwq->lock stayed locked, true
+ * otherwise.
+ */
+static bool maybe_create_worker(struct global_cwq *gcwq)
+{
+ if (!need_to_create_worker(gcwq))
+ return false;
+restart:
+ /* if we don't make progress in MAYDAY_INITIAL_TIMEOUT, call for help */
+ mod_timer(&gcwq->mayday_timer, jiffies + MAYDAY_INITIAL_TIMEOUT);
+
+ while (true) {
+ struct worker *worker;
+
+ if (gcwq->nr_workers >= MAX_WORKERS_PER_CPU) {
+ if (printk_ratelimit())
+ printk(KERN_WARNING "workqueue: too many "
+ "workers (%d) on cpu %d, can't create "
+ "new ones\n",
+ gcwq->nr_workers, gcwq->cpu);
+ goto cooldown;
+ }
+
+ spin_unlock_irq(&gcwq->lock);
+
+ worker = create_worker(gcwq, true);
+ if (worker) {
+ del_timer_sync(&gcwq->mayday_timer);
+ spin_lock_irq(&gcwq->lock);
+ start_worker(worker);
+ BUG_ON(need_to_create_worker(gcwq));
+ return true;
+ }
+
+ if (!need_to_create_worker(gcwq))
+ break;
+ cooldown:
+ spin_unlock_irq(&gcwq->lock);
+ __set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(CREATE_COOLDOWN);
+ spin_lock_irq(&gcwq->lock);
+ if (!need_to_create_worker(gcwq))
+ break;
+ }
+
+ spin_unlock_irq(&gcwq->lock);
+ del_timer_sync(&gcwq->mayday_timer);
+ spin_lock_irq(&gcwq->lock);
+ if (need_to_create_worker(gcwq))
+ goto restart;
+ return true;
+}
+
+/**
+ * maybe_destroy_worker - destroy workers which have been idle for a while
+ * @gcwq: gcwq to destroy workers for
+ *
+ * Destroy @gcwq workers which have been idle for longer than
+ * IDLE_WORKER_TIMEOUT.
+ *
+ * LOCKING:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times. Called only from manager.
+ *
+ * RETURNS:
+ * false if no action was taken and gcwq->lock stayed locked, true
+ * otherwise.
+ */
+static bool maybe_destroy_workers(struct global_cwq *gcwq)
+{
+ bool ret = false;
+
+ while (too_many_workers(gcwq)) {
+ struct worker *worker;
+ unsigned long expires;
+
+ worker = list_entry(gcwq->idle_list.prev, struct worker, entry);
+ expires = worker->last_active + IDLE_WORKER_TIMEOUT;
+
+ if (time_before(jiffies, expires)) {
+ mod_timer(&gcwq->idle_timer, expires);
+ break;
+ }
+
+ destroy_worker(worker);
+ ret = true;
+ }
+
+ return ret;
+}
+
+/**
+ * manage_workers - manage worker pool
+ * @worker: self
+ *
+ * Assume the manager role and manage gcwq worker pool @worker belongs
+ * to. At any given time, there can be only zero or one manager per
+ * gcwq. The exclusion is handled automatically by this function.
+ *
+ * The caller can safely start processing works on false return. On
+ * true return, it's guaranteed that need_to_create_worker() is false
+ * and may_start_working() is true.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times. Does GFP_KERNEL allocations.
+ *
+ * RETURNS:
+ * false if no action was taken and gcwq->lock stayed locked, true if
+ * some action was taken.
+ */
+static bool manage_workers(struct worker *worker)
+{
+ struct global_cwq *gcwq = worker->gcwq;
+ bool ret = false;
+
+ if (gcwq->flags & GCWQ_MANAGING_WORKERS)
+ return ret;
+
+ gcwq->flags &= ~GCWQ_MANAGE_WORKERS;
+ gcwq->flags |= GCWQ_MANAGING_WORKERS;
+
+ /* manager should never be accounted as running */
+ BUG_ON(worker->running);
+ worker->flags |= WORKER_MANAGER;
+
+ /*
+ * Destroy and then create so that may_start_working() is true
+ * on return.
+ */
+ ret |= maybe_destroy_workers(gcwq);
+ ret |= maybe_create_worker(gcwq);
+
+ gcwq->flags &= ~GCWQ_MANAGING_WORKERS;
+ worker->flags &= ~WORKER_MANAGER;
+ BUG_ON(worker->running);
+
+ /*
+ * The trustee might be waiting to take over the manager
+ * position, tell it we're done.
+ */
+ if (unlikely(gcwq->trustee))
+ wake_up_all(&gcwq->trustee_wait);
+
+ return ret;
}

/**
@@ -394,7 +1041,7 @@ static void destroy_worker(struct worker *worker)
* be nested inside outer list_for_each_entry_safe().
*
* CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
*/
static void schedule_work_to_worker(struct worker *worker,
struct work_struct *work,
@@ -431,7 +1078,7 @@ static void schedule_work_to_worker(struct worker *worker,
* decrement nr_in_flight of its cwq and handle workqueue flushing.
*
* CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
*/
static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq,
unsigned int work_color)
@@ -456,13 +1103,16 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq,
* call this function to process a work.
*
* CONTEXT:
- * spin_lock_irq(cwq->lock) which is released and regrabbed.
+ * spin_lock_irq(gcwq->lock) which is released and regrabbed.
*/
static void process_one_work(struct worker *worker, struct work_struct *work)
{
- struct cpu_workqueue_struct *cwq = worker->cwq;
+ struct cpu_workqueue_struct *cwq = get_wq_data(work);
+ struct global_cwq *gcwq = cwq->gcwq;
+ struct hlist_head *bwh = busy_worker_head(gcwq, work);
work_func_t f = work->func;
unsigned int work_color;
+ struct worker *collision;
#ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the struct work_struct from
@@ -473,14 +1123,26 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
*/
struct lockdep_map lockdep_map = work->lockdep_map;
#endif
+ /*
+ * A single work shouldn't be executed concurrently by
+ * multiple workers on a single cpu. Check whether anyone is
+ * already processing the work. If so, defer the work to the
+ * currently executing one.
+ */
+ collision = __find_worker_executing_work(gcwq, bwh, work);
+ if (unlikely(collision)) {
+ schedule_work_to_worker(collision, work, NULL);
+ return;
+ }
+
/* claim and process */
+ hlist_add_head(&worker->hentry, bwh);
worker->current_work = work;
work_color = *work_data_bits(work) & WORK_STRUCT_COLOR;
list_del_init(&work->entry);

- spin_unlock_irq(&cwq->lock);
+ spin_unlock_irq(&gcwq->lock);

- BUG_ON(get_wq_data(work) != cwq);
work_clear_pending(work);
lock_map_acquire(&cwq->wq->lockdep_map);
lock_map_acquire(&lockdep_map);
@@ -498,9 +1160,10 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
dump_stack();
}

- spin_lock_irq(&cwq->lock);
+ spin_lock_irq(&gcwq->lock);

/* we're done with it, release */
+ hlist_del_init(&worker->hentry);
worker->current_work = NULL;
cwq_dec_nr_in_flight(cwq, work_color);
}
@@ -514,7 +1177,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
* fetches a work from the top and executes it.
*
* CONTEXT:
- * spin_lock_irq(cwq->lock) which may be released and regrabbed
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
* multiple times.
*/
static void process_scheduled_works(struct worker *worker)
@@ -530,36 +1193,60 @@ static void process_scheduled_works(struct worker *worker)
* worker_thread - the worker thread function
* @__worker: self
*
- * The cwq worker thread function.
+ * The gcwq worker thread function. There's a single dynamic pool of
+ * these per each cpu. These workers process all works regardless of
+ * their specific target workqueue. The only exception is works which
+ * belong to workqueues with a rescuer which will be explained in
+ * rescuer_thread().
*/
static int worker_thread(void *__worker)
{
struct worker *worker = __worker;
- struct cpu_workqueue_struct *cwq = worker->cwq;
- DEFINE_WAIT(wait);
-
- /* set workqueue scheduler */
- switch_sched_workqueue(current, true);
-
- if (cwq->wq->flags & WQ_FREEZEABLE)
- set_freezable();
+ struct global_cwq *gcwq = worker->gcwq;
+ atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+
+ /* set workqueue scheduler and adjust nice level */
+ switch_sched_workqueue(worker->task, true);
+woke_up:
+ spin_lock_irq(&gcwq->lock);
+
+ /* DIE can be set only while we're idle, checking here is enough */
+ if (worker->flags & WORKER_DIE) {
+ spin_unlock_irq(&gcwq->lock);
+ switch_sched_workqueue(worker->task, false);
+ return 0;
+ }

- for (;;) {
- prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
- if (!freezing(current) &&
- !kthread_should_stop() &&
- list_empty(&cwq->worklist))
- schedule();
- finish_wait(&cwq->more_work, &wait);
+ worker_leave_idle(worker);
+repeat:
+ if (need_more_worker(gcwq)) {
+ if (unlikely(!may_start_working(gcwq)) &&
+ manage_workers(worker))
+ goto repeat;

- if (kthread_should_stop())
- break;
+ /*
+ * ->scheduled list can only be filled while a worker
+ * is preparing to process a work or actually
+ * processing it. Make sure nobody diddled with it
+ * while I was sleeping. Also, nobody should have set
+ * running till this point.
+ */
+ BUG_ON(!list_empty(&worker->scheduled));
+ BUG_ON(worker->running);

- spin_lock_irq(&cwq->lock);
+ /*
+ * When control reaches this point, we're guaranteed
+ * to have at least one idle worker or that someone
+ * else has already assumed the manager role.
+ */
+ if (likely(!(worker->flags & WORKER_IGN_RUNNING))) {
+ worker->running = true;
+ atomic_inc(nr_running);
+ }

- while (!list_empty(&cwq->worklist)) {
+ do {
struct work_struct *work =
- list_first_entry(&cwq->worklist,
+ list_first_entry(&gcwq->worklist,
struct work_struct, entry);

if (likely(!(*work_data_bits(work) &
@@ -572,12 +1259,145 @@ static int worker_thread(void *__worker)
schedule_work_to_worker(worker, work, NULL);
process_scheduled_works(worker);
}
+ } while (keep_working(gcwq));
+
+ if (likely(worker->running)) {
+ worker->running = false;
+ atomic_dec(nr_running);
}
+ }
+
+ if (unlikely(need_to_manage_workers(gcwq)) && manage_workers(worker))
+ goto repeat;
+
+ /*
+ * gcwq->lock is held and there's no work to process and no
+ * need to manage, sleep. Workers are woken up only while
+ * holding gcwq->lock or from local cpu, so setting the
+ * current state before releasing gcwq->lock is enough to
+ * prevent losing any event.
+ */
+ worker_enter_idle(worker);
+ __set_current_state(TASK_INTERRUPTIBLE);
+ spin_unlock_irq(&gcwq->lock);
+ schedule();
+ goto woke_up;
+}
+
+/**
+ * worker_maybe_bind_and_lock - bind worker to its cpu if possible and lock gcwq
+ * @worker: target worker
+ *
+ * Works which are scheduled while the cpu is online must at least be
+ * scheduled to a worker which is bound to the cpu so that if they are
+ * flushed from cpu callbacks while cpu is going down, they are
+ * guaranteed to execute on the cpu.
+ *
+ * This function is to be used to bind rescuers and new rogue workers
+ * to the target cpu and may race with cpu going down or coming
+ * online. kthread_bind() can't be used because it may put the worker
+ * to already dead cpu and force_cpus_allowed_ptr() can't be used
+ * verbatim as it's best effort and blocking and gcwq may be
+ * [dis]associated in the meantime.
+ *
+ * This function tries force_cpus_allowed_ptr() and locks gcwq and
+ * verifies the binding against GCWQ_DISASSOCIATED which is set during
+ * CPU_DYING and cleared during CPU_ONLINE, so if the worker enters
+ * idle state or fetches works without dropping lock, it can guarantee
+ * the scheduling requirement described in the first paragraph.
+ *
+ * CONTEXT:
+ * Might sleep. Called without any lock but returns with gcwq->lock
+ * held.
+ */
+static void worker_maybe_bind_and_lock(struct worker *worker)
+{
+ struct global_cwq *gcwq = worker->gcwq;
+ struct task_struct *task = worker->task;

- spin_unlock_irq(&cwq->lock);
+ while (true) {
+ /*
+ * The following call may fail, succeed or succeed
+ * without actually migrating the task to the cpu if
+ * it races with cpu hotunplug operation. Verify
+ * against GCWQ_DISASSOCIATED.
+ */
+ force_cpus_allowed_ptr(task, get_cpu_mask(gcwq->cpu));
+
+ spin_lock_irq(&gcwq->lock);
+ if (gcwq->flags & GCWQ_DISASSOCIATED)
+ return;
+ if (task_cpu(task) == gcwq->cpu &&
+ cpumask_equal(&current->cpus_allowed,
+ get_cpu_mask(gcwq->cpu)))
+ return;
+ spin_unlock_irq(&gcwq->lock);
+
+ /* CPU has come up inbetween, retry migration */
+ cpu_relax();
}
+}

- return 0;
+/**
+ * rescuer_thread - the rescuer thread function
+ * @__wq: the associated workqueue
+ *
+ * Workqueue rescuer thread function. There's one rescuer for each
+ * workqueue which has WQ_RESCUER set.
+ *
+ * Regular work processing on a gcwq may block trying to create a new
+ * worker which uses GFP_KERNEL allocation which has slight chance of
+ * developing into deadlock if some works currently on the same queue
+ * need to be processed to satisfy the GFP_KERNEL allocation. This is
+ * the problem rescuer solves.
+ *
+ * When such condition is possible, the gcwq summons rescuers of all
+ * workqueues which have works queued on the gcwq and let them process
+ * those works so that forward progress can be guaranteed.
+ *
+ * This should happen rarely.
+ */
+static int rescuer_thread(void *__wq)
+{
+ struct workqueue_struct *wq = __wq;
+ struct worker *rescuer = wq->rescuer;
+ unsigned int cpu;
+
+ rescuer->flags |= WORKER_RESCUER;
+ set_user_nice(current, RESCUER_NICE_LEVEL);
+repeat:
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ if (kthread_should_stop())
+ return 0;
+
+ for_each_cpu(cpu, wq->mayday_mask) {
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+ struct global_cwq *gcwq = cwq->gcwq;
+ struct work_struct *work, *n;
+
+ __set_current_state(TASK_RUNNING);
+ cpumask_clear_cpu(cpu, wq->mayday_mask);
+
+ /* migrate to the target cpu if possible */
+ rescuer->gcwq = gcwq;
+ worker_maybe_bind_and_lock(rescuer);
+
+ /*
+ * Slurp in all works issued via this workqueue and
+ * process'em.
+ */
+ BUG_ON(!list_empty(&rescuer->scheduled));
+ list_for_each_entry_safe(work, n, &gcwq->worklist, entry)
+ if (get_wq_data(work) == cwq)
+ schedule_work_to_worker(rescuer, work, &n);
+
+ process_scheduled_works(rescuer);
+ spin_unlock_irq(&gcwq->lock);
+ }
+
+ schedule();
+ goto repeat;
}

struct wq_barrier {
@@ -609,7 +1429,7 @@ static void wq_barrier_func(struct work_struct *work)
* after a work with LINKED flag set.
*
* CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
*/
static void insert_wq_barrier(struct wq_barrier *barr,
struct work_struct *target, struct worker *worker)
@@ -668,8 +1488,9 @@ void flush_workqueue(struct workqueue_struct *wq)

for_each_possible_cpu(cpu) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+ struct global_cwq *gcwq = cwq->gcwq;

- spin_lock_irq(&cwq->lock);
+ spin_lock_irq(&gcwq->lock);

BUG_ON(cwq->flush_cnt);

@@ -681,7 +1502,7 @@ void flush_workqueue(struct workqueue_struct *wq)
wait = true;
}

- spin_unlock_irq(&cwq->lock);
+ spin_unlock_irq(&gcwq->lock);
}

if (wait)
@@ -707,17 +1528,19 @@ int flush_work(struct work_struct *work)
{
struct worker *worker = NULL;
struct cpu_workqueue_struct *cwq;
+ struct global_cwq *gcwq;
struct wq_barrier barr;

might_sleep();
cwq = get_wq_data(work);
if (!cwq)
return 0;
+ gcwq = cwq->gcwq;

lock_map_acquire(&cwq->wq->lockdep_map);
lock_map_release(&cwq->wq->lockdep_map);

- spin_lock_irq(&cwq->lock);
+ spin_lock_irq(&gcwq->lock);
if (!list_empty(&work->entry)) {
/*
* See the comment near try_to_grab_pending()->smp_rmb().
@@ -727,18 +1550,17 @@ int flush_work(struct work_struct *work)
if (unlikely(cwq != get_wq_data(work)))
goto already_gone;
} else {
- if (cwq->worker && cwq->worker->current_work == work)
- worker = cwq->worker;
+ worker = find_worker_executing_work(gcwq, work);
if (!worker)
goto already_gone;
}

insert_wq_barrier(&barr, work, worker);
- spin_unlock_irq(&cwq->lock);
+ spin_unlock_irq(&gcwq->lock);
wait_for_completion(&barr.done);
return 1;
already_gone:
- spin_unlock_irq(&cwq->lock);
+ spin_unlock_irq(&gcwq->lock);
return 0;
}
EXPORT_SYMBOL_GPL(flush_work);
@@ -749,6 +1571,7 @@ EXPORT_SYMBOL_GPL(flush_work);
*/
static int try_to_grab_pending(struct work_struct *work)
{
+ struct global_cwq *gcwq;
struct cpu_workqueue_struct *cwq;
int ret = -1;

@@ -763,8 +1586,9 @@ static int try_to_grab_pending(struct work_struct *work)
cwq = get_wq_data(work);
if (!cwq)
return ret;
+ gcwq = cwq->gcwq;

- spin_lock_irq(&cwq->lock);
+ spin_lock_irq(&gcwq->lock);
if (!list_empty(&work->entry)) {
/*
* This work is queued, but perhaps we locked the wrong cwq.
@@ -779,7 +1603,7 @@ static int try_to_grab_pending(struct work_struct *work)
ret = 1;
}
}
- spin_unlock_irq(&cwq->lock);
+ spin_unlock_irq(&gcwq->lock);

return ret;
}
@@ -787,18 +1611,17 @@ static int try_to_grab_pending(struct work_struct *work)
static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq,
struct work_struct *work)
{
+ struct global_cwq *gcwq = cwq->gcwq;
struct wq_barrier barr;
struct worker *worker;

- spin_lock_irq(&cwq->lock);
+ spin_lock_irq(&gcwq->lock);

- worker = NULL;
- if (unlikely(cwq->worker && cwq->worker->current_work == work)) {
- worker = cwq->worker;
+ worker = find_worker_executing_work(gcwq, work);
+ if (unlikely(worker))
insert_wq_barrier(&barr, work, worker);
- }

- spin_unlock_irq(&cwq->lock);
+ spin_unlock_irq(&gcwq->lock);

if (unlikely(worker))
wait_for_completion(&barr.done);
@@ -1026,7 +1849,6 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
const char *lock_name)
{
struct workqueue_struct *wq;
- bool failed = false;
unsigned int cpu;

wq = kzalloc(sizeof(*wq), GFP_KERNEL);
@@ -1044,58 +1866,55 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
INIT_LIST_HEAD(&wq->list);

- cpu_maps_update_begin();
- /*
- * We must place this wq on list even if the code below fails.
- * cpu_down(cpu) can remove cpu from cpu_populated_map before
- * destroy_workqueue() takes the lock, in that case we leak
- * cwq[cpu]->thread.
- */
- spin_lock(&workqueue_lock);
- list_add(&wq->list, &workqueues);
for_each_possible_cpu(cpu) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+ struct global_cwq *gcwq = get_gcwq(cpu);

- if (workqueue_frozen && wq->flags & WQ_FREEZEABLE)
- cwq->cur_worklist = &cwq->frozen_works;
- else
- cwq->cur_worklist = &cwq->worklist;
- }
- spin_unlock(&workqueue_lock);
- /*
- * We must initialize cwqs for each possible cpu even if we
- * are going to call destroy_workqueue() finally. Otherwise
- * cpu_up() can hit the uninitialized cwq once we drop the
- * lock.
- */
- for_each_possible_cpu(cpu) {
- struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
-
- cwq->cpu = cpu;
+ BUG_ON((unsigned long)cwq & WORK_STRUCT_FLAG_MASK);
+ cwq->gcwq = gcwq;
+ cwq->cur_worklist = &gcwq->worklist;
cwq->wq = wq;
- spin_lock_init(&cwq->lock);
- INIT_LIST_HEAD(&cwq->worklist);
- init_waitqueue_head(&cwq->more_work);
INIT_LIST_HEAD(&cwq->frozen_works);
-
- if (failed || !cpu_online(cpu))
- continue;
- cwq->worker = create_worker(cwq, true);
- if (cwq->worker)
- start_worker(cwq->worker);
- else
- failed = true;
}
- cpu_maps_update_done();

- if (failed) {
- destroy_workqueue(wq);
- wq = NULL;
+ if (flags & WQ_RESCUER) {
+ struct worker *rescuer;
+
+ if (!alloc_cpumask_var(&wq->mayday_mask, GFP_KERNEL))
+ goto err;
+
+ wq->rescuer = rescuer = alloc_worker();
+ if (!rescuer)
+ goto err;
+
+ rescuer->task = kthread_create(rescuer_thread, wq, "%s", name);
+ if (IS_ERR(rescuer->task))
+ goto err;
+
+ wq->rescuer = rescuer;
+ rescuer->task->flags |= PF_THREAD_BOUND;
+ wake_up_process(rescuer->task);
}
+
+ /*
+ * Works can't be queued before we return. Add to workqueue
+ * list and set cur_worklist to frozen_works if frozen.
+ */
+ spin_lock(&workqueue_lock);
+ list_add(&wq->list, &workqueues);
+ if (workqueue_frozen && wq->flags & WQ_FREEZEABLE)
+ for_each_possible_cpu(cpu) {
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+ cwq->cur_worklist = &cwq->frozen_works;
+ }
+ spin_unlock(&workqueue_lock);
+
return wq;
err:
if (wq) {
free_percpu(wq->cpu_wq);
+ free_cpumask_var(wq->mayday_mask);
+ kfree(wq->rescuer);
kfree(wq);
}
return NULL;
@@ -1110,9 +1929,7 @@ EXPORT_SYMBOL_GPL(__create_workqueue_key);
*/
void destroy_workqueue(struct workqueue_struct *wq)
{
- int cpu;
-
- cpu_maps_update_begin();
+ unsigned int cpu;

flush_workqueue(wq);

@@ -1124,70 +1941,400 @@ void destroy_workqueue(struct workqueue_struct *wq)
list_del(&wq->list);
spin_unlock(&workqueue_lock);

+ /* sanity check */
for_each_possible_cpu(cpu) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
-
- /* cpu_add_remove_lock protects cwq->thread */
- if (cwq->worker) {
- destroy_worker(cwq->worker);
- cwq->worker = NULL;
- }
BUG_ON(cwq->nr_in_flight);
BUG_ON(!list_empty(&cwq->frozen_works));
}

- cpu_maps_update_done();
+ if (wq->flags & WQ_RESCUER) {
+ kthread_stop(wq->rescuer->task);
+ free_cpumask_var(wq->mayday_mask);
+ }

free_percpu(wq->cpu_wq);
kfree(wq);
}
EXPORT_SYMBOL_GPL(destroy_workqueue);

-static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
- unsigned long action,
- void *hcpu)
+/*
+ * CPU hotplug.
+ *
+ * There are two challenges in supporting CPU hotplug. Firstly, there
+ * are a lot of assumptions on strong associations among work, cwq and
+ * gcwq which make migrating pending and scheduled works very
+ * difficult to implement without impacting hot paths. Secondly,
+ * gcwqs serve mix of short, long and very long running works making
+ * blocked draining impractical.
+ *
+ * This is solved by allowing a gcwq to be detached from CPU, running
+ * it with unbound (rogue) workers and allowing it to be reattached
+ * later if the cpu comes back online. A separate thread is created
+ * to govern a gcwq in such state and is called the trustee of the
+ * gcwq.
+ *
+ * Trustee states and their descriptions.
+ *
+ * START Command state used on startup. On CPU_DOWN_PREPARE, a
+ * new trustee is started with this state.
+ *
+ * IN_CHARGE Once started, trustee will enter this state after
+ * assuming the manager role and making all existing
+ * workers rogue. DOWN_PREPARE waits for trustee to
+ * enter this state. After reaching IN_CHARGE, trustee
+ * tries to execute the pending worklist until it's empty
+ * and the state is set to BUTCHER, or the state is set
+ * to RELEASE.
+ *
+ * BUTCHER Command state which is set by the cpu callback after
+ * the cpu has went down. Once this state is set trustee
+ * knows that there will be no new works on the worklist
+ * and once the worklist is empty it can proceed to
+ * killing idle workers.
+ *
+ * RELEASE Command state which is set by the cpu callback if the
+ * cpu down has been canceled or it has come online
+ * again. After recognizing this state, trustee stops
+ * trying to drain or butcher and clears ROGUE, rebinds
+ * all remaining workers back to the cpu and releases
+ * manager role.
+ *
+ * DONE Trustee will enter this state after BUTCHER or RELEASE
+ * is complete.
+ *
+ * trustee CPU draining
+ * took over down complete
+ * START -----------> IN_CHARGE -----------> BUTCHER -----------> DONE
+ * | | ^
+ * | CPU is back online v return workers |
+ * ----------------> RELEASE --------------
+ */
+
+#define for_each_busy_worker(worker, i, pos, gcwq) \
+ for (i = 0; i < BUSY_WORKER_HASH_SIZE; i++) \
+ hlist_for_each_entry(worker, pos, &gcwq->busy_hash[i], hentry) \
+ if (!(worker->flags & WORKER_RESCUER))
+
+/**
+ * trustee_wait_event_timeout - timed event wait for trustee
+ * @cond: condition to wait for
+ * @timeout: timeout in jiffies
+ *
+ * wait_event_timeout() for trustee to use. Handles locking and
+ * checks for RELEASE request.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times. To be used by trustee.
+ *
+ * RETURNS:
+ * Positive indicating left time if @cond is satisfied, 0 if timed
+ * out, -1 if canceled.
+ */
+#define trustee_wait_event_timeout(cond, timeout) ({ \
+ long __ret = (timeout); \
+ while (!((cond) || (gcwq->trustee_state == TRUSTEE_RELEASE)) && \
+ __ret) { \
+ spin_unlock_irq(&gcwq->lock); \
+ __wait_event_timeout(gcwq->trustee_wait, (cond) || \
+ (gcwq->trustee_state == TRUSTEE_RELEASE), \
+ __ret); \
+ spin_lock_irq(&gcwq->lock); \
+ } \
+ gcwq->trustee_state == TRUSTEE_RELEASE ? -1 : (__ret); \
+})
+
+/**
+ * trustee_wait_event - event wait for trustee
+ * @cond: condition to wait for
+ *
+ * wait_event() for trustee to use. Automatically handles locking and
+ * checks for CANCEL request.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times. To be used by trustee.
+ *
+ * RETURNS:
+ * 0 if @cond is satisfied, -1 if canceled.
+ */
+#define trustee_wait_event(cond) ({ \
+ long __ret1; \
+ __ret1 = trustee_wait_event_timeout(cond, MAX_SCHEDULE_TIMEOUT);\
+ __ret1 < 0 ? -1 : 0; \
+})
+
+static bool __cpuinit trustee_unset_rogue(struct worker *worker)
{
- unsigned int cpu = (unsigned long)hcpu;
- struct cpu_workqueue_struct *cwq;
- struct workqueue_struct *wq;
- int ret = NOTIFY_OK;
+ struct global_cwq *gcwq = worker->gcwq;

- action &= ~CPU_TASKS_FROZEN;
+ if (!(worker->flags & WORKER_ROGUE))
+ return false;
+
+ spin_unlock_irq(&gcwq->lock);
+ BUG_ON(force_cpus_allowed_ptr(worker->task, get_cpu_mask(gcwq->cpu)));
+ spin_lock_irq(&gcwq->lock);
+ worker->flags &= ~WORKER_ROGUE;
+ return true;
+}
+
+static int __cpuinit trustee_thread(void *__gcwq)
+{
+ struct global_cwq *gcwq = __gcwq;
+ atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+ struct worker *worker;
+ struct work_struct *work;
+ struct hlist_node *pos;
+ int i;
+
+ BUG_ON(gcwq->cpu != smp_processor_id());
+
+ spin_lock_irq(&gcwq->lock);
+ /*
+ * Claim the manager position and make all workers rogue.
+ * Trustee must be bound to the target cpu and can't be
+ * cancelled.
+ */
+ BUG_ON(gcwq->cpu != smp_processor_id());
+ BUG_ON(trustee_wait_event(!(gcwq->flags & GCWQ_MANAGING_WORKERS)) < 0);
+ gcwq->flags |= GCWQ_MANAGING_WORKERS;
+
+ list_for_each_entry(worker, &gcwq->idle_list, entry)
+ worker->flags |= WORKER_ROGUE;
+
+ for_each_busy_worker(worker, i, pos, gcwq)
+ worker->flags |= WORKER_ROGUE;
+
+ /*
+ * Call schedule() so that we cross rq->lock and thus can
+ * guarantee sched callbacks see the rogue flag. This is
+ * necessary as scheduler callbacks may be invoked from other
+ * cpus.
+ */
+ spin_unlock_irq(&gcwq->lock);
+ schedule();
+ spin_lock_irq(&gcwq->lock);
+
+ /*
+ * Sched callbacks are disabled now. Clear running and adjust
+ * nr_running accordingly. After this, gcwq->nr_running stays
+ * zero and need_more_worker() and keep_working() are always
+ * true as long as the worklist is not empty.
+ */
+ for_each_busy_worker(worker, i, pos, gcwq)
+ if (worker->running) {
+ worker->running = false;
+ atomic_dec(nr_running);
+ }
+ WARN_ON(atomic_read(nr_running));
+
+ spin_unlock_irq(&gcwq->lock);
+ del_timer_sync(&gcwq->idle_timer);
+ spin_lock_irq(&gcwq->lock);
+
+ /*
+ * We're now in charge. Notify and proceed to drain. We need
+ * to keep the gcwq running during the whole CPU down
+ * procedure as other cpu hotunplug callbacks may need to
+ * flush currently running tasks.
+ */
+ gcwq->trustee_state = TRUSTEE_IN_CHARGE;
+ wake_up_all(&gcwq->trustee_wait);
+
+ /*
+ * The original cpu is in the process of dying and may go away
+ * anytime now. When that happens, we and all workers would
+ * be migrated to other cpus. Try draining any left work. We
+ * want to get it over with ASAP - spam rescuers, wake up as
+ * many idlers as necessary and create new ones till the
+ * worklist is empty.
+ */
+ while (!list_empty(&gcwq->worklist) ||
+ gcwq->trustee_state == TRUSTEE_IN_CHARGE) {
+ int nr_works = 0;

-undo:
- list_for_each_entry(wq, &workqueues, list) {
- cwq = per_cpu_ptr(wq->cpu_wq, cpu);
+ list_for_each_entry(work, &gcwq->worklist, entry) {
+ send_mayday(work);
+ nr_works++;
+ }

- switch (action) {
- case CPU_UP_PREPARE:
- cwq->worker = create_worker(cwq, false);
- if (cwq->worker)
+ list_for_each_entry(worker, &gcwq->idle_list, entry) {
+ if (!nr_works--)
break;
- printk(KERN_ERR "workqueue [%s] for %i failed\n",
- wq->name, cpu);
- action = CPU_UP_CANCELED;
- ret = NOTIFY_BAD;
- goto undo;
-
- case CPU_ONLINE:
- kthread_bind(cwq->worker->task, cpu);
- start_worker(cwq->worker);
+ wake_up_process(worker->task);
+ }
+
+ if (need_to_create_worker(gcwq) &&
+ gcwq->nr_workers < MAX_WORKERS_PER_CPU) {
+ spin_unlock_irq(&gcwq->lock);
+ worker = create_worker(gcwq, false);
+ if (worker) {
+ worker_maybe_bind_and_lock(worker);
+ worker->flags |= WORKER_ROGUE;
+ start_worker(worker);
+ } else
+ spin_lock_irq(&gcwq->lock);
+ }
+
+ /* give a breather */
+ if (trustee_wait_event_timeout(false, TRUSTEE_COOLDOWN) < 0)
break;
+ }

- case CPU_UP_CANCELED:
- start_worker(cwq->worker);
- case CPU_POST_DEAD:
- flush_workqueue(wq);
- /* cpu_add_remove_lock protects cwq->thread */
- if (cwq->worker) {
- destroy_worker(cwq->worker);
- cwq->worker = NULL;
- }
+ /*
+ * Either all works have been scheduled and cpu is down, or
+ * cpu down has already been canceled. Wait for and butcher
+ * all workers till we're canceled.
+ */
+ while (gcwq->nr_workers) {
+ if (trustee_wait_event(!list_empty(&gcwq->idle_list)) < 0)
break;
+
+ while (!list_empty(&gcwq->idle_list)) {
+ worker = list_first_entry(&gcwq->idle_list,
+ struct worker, entry);
+ destroy_worker(worker);
}
}

- return ret;
+ /*
+ * At this point, either draining has completed and no worker
+ * is left, or cpu down has been canceled or the cpu is being
+ * brought back up. Clear ROGUE from and rebind all left
+ * workers. Unsetting ROGUE and rebinding require dropping
+ * gcwq->lock. Restart loop after each successful release.
+ */
+recheck:
+ list_for_each_entry(worker, &gcwq->idle_list, entry)
+ if (trustee_unset_rogue(worker))
+ goto recheck;
+
+ for_each_busy_worker(worker, i, pos, gcwq)
+ if (trustee_unset_rogue(worker))
+ goto recheck;
+
+ /* relinquish manager role */
+ gcwq->flags &= ~GCWQ_MANAGING_WORKERS;
+
+ /* notify completion */
+ gcwq->trustee = NULL;
+ gcwq->trustee_state = TRUSTEE_DONE;
+ wake_up_all(&gcwq->trustee_wait);
+ spin_unlock_irq(&gcwq->lock);
+ return 0;
+}
+
+/**
+ * wait_trustee_state - wait for trustee to enter the specified state
+ * @gcwq: gcwq the trustee of interest belongs to
+ * @state: target state to wait for
+ *
+ * Wait for the trustee to reach @state. DONE is already matched.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times. To be used by cpu_callback.
+ */
+static void __cpuinit wait_trustee_state(struct global_cwq *gcwq, int state)
+{
+ if (!(gcwq->trustee_state == state ||
+ gcwq->trustee_state == TRUSTEE_DONE)) {
+ spin_unlock_irq(&gcwq->lock);
+ __wait_event(gcwq->trustee_wait,
+ gcwq->trustee_state == state ||
+ gcwq->trustee_state == TRUSTEE_DONE);
+ spin_lock_irq(&gcwq->lock);
+ }
+}
+
+static int __cpuinit workqueue_cpu_callback(struct notifier_block *nfb,
+ unsigned long action,
+ void *hcpu)
+{
+ unsigned int cpu = (unsigned long)hcpu;
+ struct global_cwq *gcwq = get_gcwq(cpu);
+ struct task_struct *new_trustee = NULL;
+ struct worker *uninitialized_var(new_worker);
+
+ action &= ~CPU_TASKS_FROZEN;
+
+ switch (action) {
+ case CPU_DOWN_PREPARE:
+ new_trustee = kthread_create(trustee_thread, gcwq,
+ "workqueue_trustee/%d\n", cpu);
+ if (IS_ERR(new_trustee))
+ return NOTIFY_BAD;
+ kthread_bind(new_trustee, cpu);
+ /* fall through */
+ case CPU_UP_PREPARE:
+ BUG_ON(gcwq->first_idle);
+ new_worker = create_worker(gcwq, false);
+ if (!new_worker) {
+ if (new_trustee)
+ kthread_stop(new_trustee);
+ return NOTIFY_BAD;
+ }
+ }
+
+ spin_lock_irq(&gcwq->lock);
+
+ switch (action) {
+ case CPU_DOWN_PREPARE:
+ /* initialize trustee and tell it to acquire the gcwq */
+ BUG_ON(gcwq->trustee || gcwq->trustee_state != TRUSTEE_DONE);
+ gcwq->trustee = new_trustee;
+ gcwq->trustee_state = TRUSTEE_START;
+ wake_up_process(gcwq->trustee);
+ wait_trustee_state(gcwq, TRUSTEE_IN_CHARGE);
+ /* fall through */
+ case CPU_UP_PREPARE:
+ BUG_ON(gcwq->first_idle);
+ gcwq->first_idle = new_worker;
+ break;
+
+ case CPU_DYING:
+ /*
+ * Before this, the trustee and all workers must have
+ * stayed on the cpu. After this, they'll all be
+ * diasporas.
+ */
+ gcwq->flags |= GCWQ_DISASSOCIATED;
+ break;
+
+ case CPU_POST_DEAD:
+ gcwq->trustee_state = TRUSTEE_BUTCHER;
+ /* fall through */
+ case CPU_UP_CANCELED:
+ destroy_worker(gcwq->first_idle);
+ gcwq->first_idle = NULL;
+ break;
+
+ case CPU_DOWN_FAILED:
+ case CPU_ONLINE:
+ gcwq->flags &= ~GCWQ_DISASSOCIATED;
+ if (gcwq->trustee_state != TRUSTEE_DONE) {
+ gcwq->trustee_state = TRUSTEE_RELEASE;
+ wake_up_process(gcwq->trustee);
+ wait_trustee_state(gcwq, TRUSTEE_DONE);
+ }
+ /*
+ * Trustee is done and there might be no worker left.
+ * Put the first_idle in and request a real manager to
+ * take a look.
+ */
+ spin_unlock_irq(&gcwq->lock);
+ kthread_bind(gcwq->first_idle->task, cpu);
+ spin_lock_irq(&gcwq->lock);
+ gcwq->flags |= GCWQ_MANAGE_WORKERS;
+ start_worker(gcwq->first_idle);
+ gcwq->first_idle = NULL;
+ break;
+ }
+
+ spin_unlock_irq(&gcwq->lock);
+
+ return NOTIFY_OK;
}

#ifdef CONFIG_SMP
@@ -1243,10 +2390,10 @@ EXPORT_SYMBOL_GPL(work_on_cpu);
*
* Start freezing workqueues. After this function returns, all
* freezeable workqueues will queue new works to their frozen_works
- * list instead of the cwq ones.
+ * list instead of the gcwq ones.
*
* CONTEXT:
- * Grabs and releases workqueue_lock and cwq->lock's.
+ * Grabs and releases workqueue_lock and gcwq->lock's.
*/
void freeze_workqueues_begin(void)
{
@@ -1259,21 +2406,22 @@ void freeze_workqueues_begin(void)
workqueue_frozen = true;

for_each_possible_cpu(cpu) {
+ struct global_cwq *gcwq = get_gcwq(cpu);
+
+ spin_lock_irq(&gcwq->lock);
+
list_for_each_entry(wq, &workqueues, list) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);

if (!(wq->flags & WQ_FREEZEABLE))
continue;

- spin_lock_irq(&cwq->lock);
-
- BUG_ON(cwq->cur_worklist != &cwq->worklist);
+ BUG_ON(cwq->cur_worklist != &gcwq->worklist);
BUG_ON(!list_empty(&cwq->frozen_works));

cwq->cur_worklist = &cwq->frozen_works;
-
- spin_unlock_irq(&cwq->lock);
}
+ spin_unlock_irq(&gcwq->lock);
}
spin_unlock(&workqueue_lock);
}
@@ -1302,6 +2450,10 @@ bool freeze_workqueues_busy(void)
BUG_ON(!workqueue_frozen);

for_each_possible_cpu(cpu) {
+ struct global_cwq *gcwq = get_gcwq(cpu);
+
+ spin_lock_irq(&gcwq->lock);
+
list_for_each_entry(wq, &workqueues, list) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
struct work_struct *work;
@@ -1310,22 +2462,19 @@ bool freeze_workqueues_busy(void)
if (!(wq->flags & WQ_FREEZEABLE))
continue;

- spin_lock_irq(&cwq->lock);
-
BUG_ON(cwq->cur_worklist != &cwq->frozen_works);

nr_in_flight = cwq->nr_in_flight;
list_for_each_entry(work, &cwq->frozen_works, entry)
nr_in_flight--;

- spin_unlock_irq(&cwq->lock);
-
BUG_ON(nr_in_flight < 0);
if (nr_in_flight) {
busy = true;
break;
}
}
+ spin_unlock_irq(&gcwq->lock);
if (busy)
break;
}
@@ -1337,10 +2486,10 @@ bool freeze_workqueues_busy(void)
* thaw_workqueues - thaw workqueues
*
* Thaw workqueues. Normal queueing is restored and all collected
- * frozen works are transferred to their respective cwq worklists.
+ * frozen works are transferred to their respective gcwq worklists.
*
* CONTEXT:
- * Grabs and releases workqueue_lock and cwq->lock's.
+ * Grabs and releases workqueue_lock and gcwq->lock's.
*/
void thaw_workqueues(void)
{
@@ -1355,25 +2504,28 @@ void thaw_workqueues(void)
workqueue_frozen = false;

for_each_possible_cpu(cpu) {
+ struct global_cwq *gcwq = get_gcwq(cpu);
+
+ spin_lock_irq(&gcwq->lock);
+
list_for_each_entry(wq, &workqueues, list) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);

if (!(wq->flags & WQ_FREEZEABLE))
continue;

- spin_lock_irq(&cwq->lock);
-
/* switch to normal work queueing */
BUG_ON(cwq->cur_worklist != &cwq->frozen_works);
- cwq->cur_worklist = &cwq->worklist;
+ cwq->cur_worklist = &gcwq->worklist;

- /* transfer frozen tasks to cwq worklist */
- list_splice_tail(&cwq->frozen_works, &cwq->worklist);
+ /* transfer frozen tasks to gcwq worklist */
+ list_splice_tail(&cwq->frozen_works, &gcwq->worklist);
INIT_LIST_HEAD(&cwq->frozen_works);
- wake_up(&cwq->more_work);
-
- spin_unlock_irq(&cwq->lock);
}
+
+ wake_up_worker(gcwq);
+
+ spin_unlock_irq(&gcwq->lock);
}
out_unlock:
spin_unlock(&workqueue_lock);
@@ -1382,16 +2534,46 @@ out_unlock:

void __init init_workqueues(void)
{
- /*
- * cwqs are forced aligned according to WORK_STRUCT_FLAG_BITS.
- * Make sure that the alignment isn't lower than that of
- * unsigned long long in case this code survives for longer
- * than twenty years. :-P
- */
- BUILD_BUG_ON(__alignof__(struct cpu_workqueue_struct) <
- __alignof__(unsigned long long));
+ unsigned int cpu;
+ int i;

hotcpu_notifier(workqueue_cpu_callback, 0);
- keventd_wq = create_workqueue("events");
+
+ /* initialize gcwqs */
+ for_each_possible_cpu(cpu) {
+ struct global_cwq *gcwq = get_gcwq(cpu);
+
+ spin_lock_init(&gcwq->lock);
+ INIT_LIST_HEAD(&gcwq->worklist);
+ gcwq->cpu = cpu;
+
+ INIT_LIST_HEAD(&gcwq->idle_list);
+ for (i = 0; i < BUSY_WORKER_HASH_SIZE; i++)
+ INIT_HLIST_HEAD(&gcwq->busy_hash[i]);
+
+ init_timer_deferrable(&gcwq->idle_timer);
+ gcwq->idle_timer.function = idle_worker_timeout;
+ gcwq->idle_timer.data = (unsigned long)gcwq;
+
+ setup_timer(&gcwq->mayday_timer, gcwq_mayday_timeout,
+ (unsigned long)gcwq);
+
+ gcwq->trustee_state = TRUSTEE_DONE;
+ init_waitqueue_head(&gcwq->trustee_wait);
+ }
+
+ /* create the initial worker */
+ for_each_online_cpu(cpu) {
+ struct global_cwq *gcwq = get_gcwq(cpu);
+ struct worker *worker;
+
+ worker = create_worker(gcwq, true);
+ BUG_ON(!worker);
+ spin_lock_irq(&gcwq->lock);
+ start_worker(worker);
+ spin_unlock_irq(&gcwq->lock);
+ }
+
+ keventd_wq = __create_workqueue("events", 0);
BUG_ON(!keventd_wq);
}
--
1.6.4.2

2009-10-01 08:24:10

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue

On Thu, Oct 01 2009, Tejun Heo wrote:
> Hello, all.
>
> This rather large patchset implements concurrency managed workqueue.
> It's not complete yet. Singlethread workqueue handling needs more
> work and workqueue users need to be audited and simplified and async
> and slow-work should be reimplemented in terms of workqueue. Although
> this patchset currently adds ~2000 lines of code, I'm fairly
> optimistic that after the whole conversion is done, it would be a net
> decrease in lines of code.
>
> This patchset reimplements workqueue such that it auto-regulates
> concurrency and thus relieves its users from the managing duty. It
> works by managing single shared pool of per-cpu workers and hooking
> into the scheduler to get notifications about workers going to sleep
> and waking up. Using the mechanism, workqueue implementation keeps
> track of the current level of concurrency and schedules only the
> necessary number of workers to keep the cpu occupied.
>
> Concurrency managed workqueue has the following benefits.
>
> * Workqueue users no longer have to worry about managing concurrency
> and, in most cases, deadlocks. The workqueue will manage it
> automatically and unless the deadlock chain involves many (currently
> 127) works, it won't happen.
>
> * There's one single shared pool of workers per cpu and one rescuer
> for each workqueue which requires it, so there are far fewer number
> of kthreads.
>
> * More efficient. Although it adds considerable amount of code, the
> code added to hot path isn't big and works will be executed on the
> local cpu and in batch as much as possible using minimal number of
> kthreads leading to fewer task switches and lower cache
> footprint. <NEED SOME BACKING NUMBERS>
>
> * As concurrency is no longer a problem, most types of asynchronous
> jobs can be done using generic workqueue and other async mechanisms,
> including slow-work, async and adhoc subsystem custom ones, can be
> removed. ie. It can serve as the unified async thread pool
> mechanism.

Awesome work so far Tejun, I have high hopes for this patchset! I'll
take some time to review this when I have it, just consider this so far
a big encouragement to crank away on this. It's always annoyed me that
we have various methods for doing async work, this promises to unify
that very nicely.

--
Jens Axboe

2009-10-01 08:24:37

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue

On Thu, Oct 01 2009, Tejun Heo wrote:
> I'm attaching test-wq source and Makefile which I used to verify each
> aspect of the new workqueue. It's pretty easy to write new test
> scenarios using the module so if you're interested in how this
> actually work, it is quite useful.

BTW, you forgot to attach the test-wq source.

--
Jens Axboe

2009-10-01 08:41:40

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue


* Tejun Heo <[email protected]> wrote:

> Hello, all.
>
> This rather large patchset implements concurrency managed workqueue.
> It's not complete yet. Singlethread workqueue handling needs more
> work and workqueue users need to be audited and simplified and async
> and slow-work should be reimplemented in terms of workqueue. Although
> this patchset currently adds ~2000 lines of code, I'm fairly
> optimistic that after the whole conversion is done, it would be a net
> decrease in lines of code.
>
> This patchset reimplements workqueue such that it auto-regulates
> concurrency and thus relieves its users from the managing duty. It
> works by managing single shared pool of per-cpu workers and hooking
> into the scheduler to get notifications about workers going to sleep
> and waking up. Using the mechanism, workqueue implementation keeps
> track of the current level of concurrency and schedules only the
> necessary number of workers to keep the cpu occupied.

Ok, this looks fairly interesting - and the way you reused scheduler
classes to auto-regulate with no impact on regular performance is quite
an ingenious idea as well. (KVM's preempt notifiers should probably use
this trick too, instead of an ugly notifier in the scheduler hotpath)

This mechanism could be used to implement threadlets/syslets too btw.,
and other forms of asynchronous IO.

My main worry is that in practice workqueues arent all that performance
critical - so we are shooting to optimize something that doesnt
necessarily use all the potential goodness inherent in this approach.

Ingo

2009-10-01 08:47:31

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue

On Thu, Oct 01 2009, Ingo Molnar wrote:
> My main worry is that in practice workqueues arent all that performance
> critical - so we are shooting to optimize something that doesnt
> necessarily use all the potential goodness inherent in this approach.

Well, the main problem with the current code is that per-cpu workqueues
are way abused. I don't look at this patchset from a performance point
of view, but rather a way to limit this huge number of idle and
pointless threads. Another issue are specific workqueues to deal with
slowly executing work, or dependency issues. The end result is just
hundreds of pointless kernel threads. I have two 64-thread boxes here, I
dare not think what the 128 or 256 thread (and even bigger) boxes look
like when booted. The latter would be safely into the thousands of
useless threads.

If we get this right, we'll have a leaner kernel and far fewer threads.
On the source side, we'll potentially get rid of specialized workqueue
akin implementations.

It's a win-win as far as I'm concerned.

--
Jens Axboe

2009-10-01 09:02:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue


* Jens Axboe <[email protected]> wrote:

> On Thu, Oct 01 2009, Ingo Molnar wrote:
> > My main worry is that in practice workqueues arent all that performance
> > critical - so we are shooting to optimize something that doesnt
> > necessarily use all the potential goodness inherent in this approach.
>
> Well, the main problem with the current code is that per-cpu
> workqueues are way abused. I don't look at this patchset from a
> performance point of view, but rather a way to limit this huge number
> of idle and pointless threads. [...]

I do look at it as a potentially (and primarily) big performance feature
- if only it was utilized in a place where the performance aspect
mattered.

Sure, the memory savings are nice too.

Ingo

2009-10-01 09:05:31

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue

On Thu, Oct 01 2009, Ingo Molnar wrote:
>
> * Jens Axboe <[email protected]> wrote:
>
> > On Thu, Oct 01 2009, Ingo Molnar wrote:
> > > My main worry is that in practice workqueues arent all that performance
> > > critical - so we are shooting to optimize something that doesnt
> > > necessarily use all the potential goodness inherent in this approach.
> >
> > Well, the main problem with the current code is that per-cpu
> > workqueues are way abused. I don't look at this patchset from a
> > performance point of view, but rather a way to limit this huge number
> > of idle and pointless threads. [...]
>
> I do look at it as a potentially (and primarily) big performance feature
> - if only it was utilized in a place where the performance aspect
> mattered.

That just makes it win-win-win :-)

> Sure, the memory savings are nice too.

As is the unified approach to async work handling, compared with what we
have now (which is several flavors of workqueues, async work, slow work,
etc).

--
Jens Axboe

2009-10-01 09:15:30

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue

On 10/01/2009 10:40 AM, Ingo Molnar wrote:
> Ok, this looks fairly interesting - and the way you reused scheduler
> classes to auto-regulate with no impact on regular performance is quite
> an ingenious idea as well. (KVM's preempt notifiers should probably use
> this trick too, instead of an ugly notifier in the scheduler hotpath)
>
> This mechanism could be used to implement threadlets/syslets too btw.,
> and other forms of asynchronous IO.
>

In fact I've thought of implementing threadlets and concurrency-managed
workqueues with preempt notifiers ;)

Isn't a scheduling class overkill for two existing callbacks? Note we
can easily use a thread flag and __switch_to_xtra() to avoid the overhead.

For kvm, we don't want to force a specific scheduling class for vcpu
threads, so we'd need infrastructure to create a new scheduling class
out of an existing one to hook the two callbacks. Seems like quite a
lot of work, for something that is orthogonal to scheduling.

Tejun, would preempt notifiers work for your workqueues? see bottom of
include/linux/preempt.h.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2009-10-01 09:27:01

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue

On 10/01/2009 11:11 AM, Avi Kivity wrote:
>
> In fact I've thought of implementing threadlets and
> concurrency-managed workqueues with preempt notifiers ;)

Oh, and use them to make kernel_fpu_begin() preemptible.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2009-10-01 12:57:16

by David Howells

[permalink] [raw]
Subject: Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue


Sounds interesting as a replacement for slow-work. Some thoughts for you:

The most important features of slow-work are:

(1) Work items are not re-entered whilst they are executing.

(2) The slow-work facility keeps references on its work items by asking the
client to get and put on the client's refcount.

(3) The slow-work facility can create a lot more threads than the number of
CPUs on a system, and the system doesn't grind to a halt if they're all
taken up with long term I/O (performing several mkdirs for example).

I think you have (1) and (3) covered, but I'm unsure about (2).
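
(For concreteness, the get/put contract in (2) looks roughly like the
following -- an approximate sketch rather than the exact slow-work header,
so treat the names as illustrative.)

struct slow_work;

struct slow_work_ops {
	/* take a reference on the item before it is enqueued/executed */
	int (*get_ref)(struct slow_work *work);
	/* drop the reference once the facility is done with the item */
	void (*put_ref)(struct slow_work *work);
	/* run the item; a reference is held across this call */
	void (*execute)(struct slow_work *work);
};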

Also, does it actually make sense to bind threads that are performing
long-term I/O to particular CPUs? Threads that are going to spend a lot more
time sleeping on disk I/O than actually running on a CPU?

David

2009-10-01 13:00:48

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 06/19] acpi: use queue_work_on() instead of binding workqueue worker to cpu0

Tejun Heo <[email protected]> wrote:

> - kacpid_wq = create_singlethread_workqueue("kacpid");
> - bind_workqueue(kacpid_wq);
> - kacpi_notify_wq = create_singlethread_workqueue("kacpi_notify");
> - bind_workqueue(kacpi_notify_wq);
> - kacpi_hotplug_wq = create_singlethread_workqueue("kacpi_hotplug");
> - bind_workqueue(kacpi_hotplug_wq);
> + kacpid_wq = create_workqueue("kacpid");
> + kacpi_notify_wq = create_workqueue("kacpi_notify");
> + kacpi_hotplug_wq = create_workqueue("kacpi_hotplug");

Doesn't that then create one worker thread per CPU and then eschew all but
those attached to CPU 0? Sounds excessive, but presumably you deal with that
in later patches.

David

2009-10-01 13:08:38

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available

Tejun Heo <[email protected]> wrote:

> +enum {
> + WORK_STRUCT_PENDING = 0, /* work item is pending execution */
> +
> + /*
> + * Reserve 3bits off of cwq pointer. This is enough and
> + * provides acceptable alignment on both 32 and 64bit
> + * machines.
> + */
> + WORK_STRUCT_FLAG_BITS = 3,
> +
> + WORK_STRUCT_FLAG_MASK = (1UL << WORK_STRUCT_FLAG_BITS) - 1,
> + WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
> +};

There's some great enum abuse going on here:-)

David

2009-10-01 13:18:48

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue

Tejun Heo <[email protected]> wrote:

> gcwq always keeps at least single idle worker around. When a new
> worker is necessary and the worker is the last idle one, the worker
> assumes the role of "manager" and manages the worker pool -
> ie. creates another worker. Forward-progress is guaranteed by having
> dedicated rescue workers for workqueues which may be necessary while
> creating a new worker. When the manager is having problem creating a
> new worker, mayday timer activates and rescue workers are summoned to
> the cpu and execute works which may be necessary to create new
> workers.

I take it that means that the rescue-workers are donated to the pool to become
worker threads (perhaps on a temporary basis) in the event that forking fails
due to ENOMEM, such that resources can be freed up for fork() to use.

David

2009-10-01 14:51:36

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue

On Thu, 1 Oct 2009 17:09:18 +0900 Tejun Heo <[email protected]> wrote:

> To solve the above issues, this patch implements concurrency-managed
> workqueue.

Seems reasonable.

This approach would appear to rule out the option of setting a work
thread's state (scheduling policy, scheduling priority, uid, etc) to
anything other than some default.

I guess that's unlikely to be a problem if we haven't yet had a need to
do that, but I'd be a bit surprised to discover that nobody has done
that sort of thing yet? Nobody has niced up their workqueue threads?

2009-10-01 15:12:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue


* Andrew Morton <[email protected]> wrote:

> On Thu, 1 Oct 2009 17:09:18 +0900 Tejun Heo <[email protected]> wrote:
>
> > To solve the above issues, this patch implements concurrency-managed
> > workqueue.
>
> Seems reasonable.
>
> This approach would appear to rule out the option of setting a work
> thread's state (scheduling policy, scheduling priority, uid, etc) to
> anything other than some default.
>
> I guess that's unlikely to be a problem if we haven't yet had a need
> to do that, but I'd be a bit surprised to discover that nobody has
> done that sort of thing yet? Nobody has niced up their workqueue
> threads?

tuna in -rt might do that perhaps?

Ingo

2009-10-01 16:15:47

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available

On 10/01/2009 09:05 AM, David Howells wrote:
> Tejun Heo<[email protected]> wrote:
>
>> +enum {
>> + WORK_STRUCT_PENDING = 0, /* work item is pending execution */
>> +
>> + /*
>> + * Reserve 3bits off of cwq pointer. This is enough and
>> + * provides acceptable alignment on both 32 and 64bit
>> + * machines.
>> + */
>> + WORK_STRUCT_FLAG_BITS = 3,
>> +
>> + WORK_STRUCT_FLAG_MASK = (1UL<< WORK_STRUCT_FLAG_BITS) - 1,
>> + WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
>> +};
>
> There's some great enum abuse going on here:-)

The quoted code is a standard kernel technique for creating typed
constants that are visible to the C compiler (rather than CPP) and debugger.

It is found in a great many Linux kernel drivers at this point, and is
definitely not abuse, IMO.

Jeff

2009-10-01 16:22:20

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available

Jeff Garzik <[email protected]> wrote:

> The quoted code is a standard kernel technique for creating typed constants
> that are visible to the C compiler (rather than CPP) and debugger.
>
> It is found in a great many Linux kernel drivers at this point, and is
> definitely not abuse, IMO.

Just because it's common practice, and just because you like it, doesn't make
it not abuse. But that's a side issue, and fairly irrelevant to the actual
implementation.

David

2009-10-01 16:31:55

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available

David Howells wrote:
> Jeff Garzik <[email protected]> wrote:
>
>> The quoted code is a standard kernel technique for creating typed constants
>> that are visible to the C compiler (rather than CPP) and debugger.
>>
>> It is found in a great many Linux kernel drivers at this point, and is
>> definitely not abuse, IMO.
>
> Just because it's common practice, and just because you like it, doesn't make
> it not abuse. But that's a side issue, and fairly irrelevant to the actual
> implementation.

Yeap, that's something I picked up from Jeff while working on libata
and I'm quite fond of it. I haven't encountered any downside of it
yet. Any reason why you think it's an abuse? But, at any rate, yeap,
it's a pretty peripheral issue.

Thanks.

--
tejun

2009-10-01 16:35:01

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue

Andrew Morton wrote:
> On Thu, 1 Oct 2009 17:09:18 +0900 Tejun Heo <[email protected]> wrote:
>
>> To solve the above issues, this patch implements concurrency-managed
>> workqueue.
>
> Seems reasonable.
>
> This approach would appear to rule out the option of setting a work
> thread's state (scheduling policy, scheduling priority, uid, etc) to
> anything other than some default.
>
> I guess that's unlikely to be a problem if we haven't yet had a need to
> do that, but I'd be a bit surprised to discover that nobody has done
> that sort of thing yet? Nobody has niced up their workqueue threads?

There were only two users in mainline which diddled with the worker
kthread: the stop_machine call, which uses RT priority, and osl, which
binds its worker to cpu0. Both are updated not to do that in earlier
patches. For most cases, I don't think it would matter. For special
cases, given their rarity, I think we're better off with a custom
kthread for now. If they become more prevalent, we'll need to add
support so that it can be done more easily, but even if that turns out
to be the case, I think it would be better to implement it separately
from the generic workqueue.

Thanks.

--
tejun

2009-10-01 16:37:35

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue

Jens Axboe wrote:
> Awesome work so far Tejun, I have high hopes for this patchset! I'll
> take some time to review this when I have it, just consider this so far
> a big encouragement to crank away on this. It's always annoyed me that
> we have various methods for doing async work, this promises to unify
> that very nicely.

Ah... thanks. I've had this in my queue for too long (workqueue
semantics was quite intricate and getting hotplug right took me quite
some time). It's nice to see it turns out to be not such a bad idea.
:-)

--
tejun

2009-10-01 16:44:43

by Alan

[permalink] [raw]
Subject: Re: [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available

On Thu, 01 Oct 2009 14:05:20 +0100
David Howells <[email protected]> wrote:

> Tejun Heo <[email protected]> wrote:
>
> > +enum {
> > + WORK_STRUCT_PENDING = 0, /* work item is pending execution */
> > +
> > + /*
> > + * Reserve 3bits off of cwq pointer. This is enough and
> > + * provides acceptable alignment on both 32 and 64bit
> > + * machines.
> > + */
> > + WORK_STRUCT_FLAG_BITS = 3,
> > +
> > + WORK_STRUCT_FLAG_MASK = (1UL << WORK_STRUCT_FLAG_BITS) - 1,
> > + WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
> > +};
>
> There's some great enum abuse going on here:-)

It's actually a very sensible use of enum - enum constants don't contain
the deathtraps that CPP constants do.

Alan

2009-10-01 16:44:53

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue

Hello,

Ingo Molnar wrote:
> Ok, this looks fairly interesting - and the way you reused scheduler
> classes to auto-regulate with no impact on regular performance is quite
> an ingenious idea as well. (KVM's preempt notifiers should probably use
> this trick too, instead of an ugly notifier in the scheduler hotpath)
>
> This mechanism could be used to implement threadlets/syslets too btw.,
> and other forms of asynchronous IO.
>
> My main worry is that in practice workqueues arent all that performance
> critical - so we are shooting to optimize something that doesnt
> necessarily use all the potential goodness inherent in this approach.

The scheduler code was pretty nice to hook into. But as Jens said,
this patchset is more about getting the async framework which can
scale and be used universally. Performance-wise, I was mainly aiming
for not introducing a noticeable slowdown as I expect workqueues to be
used more widely with this change. Workqueue is already pretty
heavily used in certain paths - e.g. the block IO completion path with
SSDs - and any significant overhead would be noticeable. With a mixture
of works with different run times, I think the net effect would be
positive. It should be able to achieve better latency and throughput.
I don't have any numbers to back my rather optimistic projection yet
tho.

Thanks.

--
tejun

2009-10-01 16:56:12

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue

Hello,

Avi Kivity wrote:
> On 10/01/2009 10:40 AM, Ingo Molnar wrote:
>> Ok, this looks fairly interesting - and the way you reused scheduler
>> classes to auto-regulate with no impact on regular performance is quite
>> an ingenious idea as well. (KVM's preempt notifiers should probably use
>> this trick too, instead of an ugly notifier in the scheduler hotpath)
>>
>> This mechanism could be used to implement threadlets/syslets too btw.,
>> and other forms of asynchronous IO.
>>
>
> In fact I've thought of implementing threadlets and concurrency-managed
> workqueues with preempt notifiers ;)

:-)

> Isn't a scheduling class overkill for two existing callbacks? Note we
> can easily use a thread flag and __switch_to_xtra() to avoid the overhead.

The partial overriding only takes about seventy lines of code and is
conceptually trivial. I don't think it's overkill.

> For kvm, we don't want to force a specific scheduling class for vcpu
> threads, so we'd need infrastructure to create a new scheduling class
> out of an existing one to hook the two callbacks. Seems like quite a
> lot of work, for something that is orthogonal to scheduling.
>
> Tejun, would preempt notifiers work for your workqueues? see bottom of
> include/linux/preempt.h.

I considered that, but the thing is that workqueue needs to know when a
thread wakes up, not when it gets scheduled. Of course we could add
another notifier op and call it from try_to_wake_up(), but I really
didn't want to add yet another hook in a very hot path which is only
useful for a very small number of tasks yet has to be called for every
operation. The sched_class mechanism means that we already have hooks
at all the interesting spots, so I think it's better to make use of
them instead of adding another set of callbacks.
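
(For concreteness, the two hook points being described boil down to
something like the following -- a sketch with illustrative names, not the
actual interface from the scheduler patches.)

/* called from the try_to_wake_up() path when a worker becomes runnable,
 * so the workqueue can bump its count of running workers for that cpu */
void workqueue_worker_waking_up(struct task_struct *task, unsigned int cpu);

/* called when a running worker is about to block; the workqueue may
 * return another idle worker for the scheduler to wake so the cpu
 * stays busy while this one sleeps */
struct task_struct *workqueue_worker_sleeping(struct task_struct *task);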

Thanks.

--
tejun

2009-10-01 16:59:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 03/19] scheduler: implement workqueue scheduler class



On Thu, 1 Oct 2009, Tejun Heo wrote:
>
> Implement workqueue scheduler class. Workqueue sched_class inherits
> fair sched_class and behaves exactly the same as sched_class except
> that it has two callback functions which get called when a task is put
> to sleep and wakes up and doesn't allow switching to different
> scheduler class.

So this looks odd to me.

I agree completely with the callback functions, but what I don't agree
with is that this is somehow workqueue-related. I bet that others could
use this, and in fact, I suspect that it should not be tied to the
scheduler class at all, but to the _thread_.

Just as an example, I could imagine that we would do lock_kernel()
releasing the kernel lock on scheduling (and re-taking it on wakeup) as a
callback. Or async IO handling - both done independently of any
"workqueue" logic.

So tying this to the scheduler class seems a bit odd. But maybe I'm
missing something?

Linus

2009-10-01 17:05:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 15/19] workqueue: reimplement workqueue flushing using color coded works



On Thu, 1 Oct 2009, Tejun Heo wrote:
>
> Reimplement workqueue flushing using color coded works. There are two
> colors and each cwq has the current color which is painted on the
> works being issued via the cwq. Flushing a workqueue is achieved by
> flipping the current colors of each cwq and wait for the works which
> have the old color to drain.

Is there any reason for the "two colors" choice? I could imagine that it
could end up being a limitation (and possible deadlock?) to allow just a
single flush pending at any time.

Could the color be an 8-bit counter or something like that instead?

Linus

2009-10-01 17:08:25

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue

On 10/01/2009 06:55 PM, Tejun Heo wrote:
>> For kvm, we don't want to force a specific scheduling class for vcpu
>> threads, so we'd need infrastructure to create a new scheduling class
>> out of an existing one to hook the two callbacks. Seems like quite a
>> lot of work, for something that is orthogonal to scheduling.
>>
>> Tejun, would preempt notifiers work for your workqueues? see bottom of
>> include/linux/preempt.h.
>>
> I considered that but the thing is workqueue needs to know when a
> thread wakes up not when it gets scheduled. Of course we can add
> another notifier op and call it from try_to_wake_up() but I really
> didn't want to add yet another hook in a very hot path which will only
> be useful for very small number of tasks but yet has to be called for
> every operation and the sched_class mechanism means that we already
> have hooks at all the interesting spots, so I think it's better to
> make use of them instead of adding another set of callbacks.
>

Right, it's a subtle difference that makes p_n unusable for you.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2009-10-01 17:08:39

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 06/19] acpi: use queue_work_on() instead of binding workqueue worker to cpu0

David Howells wrote:
> Tejun Heo <[email protected]> wrote:
>
>> - kacpid_wq = create_singlethread_workqueue("kacpid");
>> - bind_workqueue(kacpid_wq);
>> - kacpi_notify_wq = create_singlethread_workqueue("kacpi_notify");
>> - bind_workqueue(kacpi_notify_wq);
>> - kacpi_hotplug_wq = create_singlethread_workqueue("kacpi_hotplug");
>> - bind_workqueue(kacpi_hotplug_wq);
>> + kacpid_wq = create_workqueue("kacpid");
>> + kacpi_notify_wq = create_workqueue("kacpi_notify");
>> + kacpi_hotplug_wq = create_workqueue("kacpi_hotplug");
>
> Doesn't that then create one worker thread per CPU and then eschew all but
> those attached to CPU 0? Sounds excessive, but presumably you deal with that
> in later patches.

Yeah, I'm just trying to strip down odd usages to ease conversion to
the new mechanism, so between this patch and the actual new workqueue
implementation there will be a whole lot of unused kthreads. Once the
new workqueue is in place, they automatically disappear. Also, the above can be
cleaned up such that only one of the workqueues remains afterwards.

Thanks.

--
tejun

2009-10-01 17:12:53

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 15/19] workqueue: reimplement workqueue flushing using color coded works

Hello, Linus.

Linus Torvalds wrote:
> On Thu, 1 Oct 2009, Tejun Heo wrote:
>> Reimplement workqueue flushing using color coded works. There are two
>> colors and each cwq has the current color which is painted on the
>> works being issued via the cwq. Flushing a workqueue is achieved by
>> flipping the current colors of each cwq and wait for the works which
>> have the old color to drain.
>
> Is there any reason for the "two colors" choice? I could imagine that it
> could end up being a limitation (and possible deadlock?) to allow just a
> single flush pending at any time.
>
> Could the color be an 8-bit counter or something like that instead?

It's related to how many bits can be used from work_struct->data which
in turn is determined by the alignment of cwq. Currently, the
alignment is 8 bytes so 3 bits are available. One is used for
PENDING, another for LINKED, and the last bit is used for COLOR.
Aligning cwq to, say, 64 bytes wouldn't be bad at all and then we can
have 6 bits of coloring. Hmmm... yeap, I'll work on that.

Thanks.

--
tejun

2009-10-01 17:14:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue



On Thu, 1 Oct 2009, Tejun Heo wrote:
>
> To solve the above issues, this patch implements concurrency-managed
> workqueue.

Ok, all the other patches in the series looked great, this one looks
scary. Everything else was pretty small and targeted, and if bugs appear,
bisection would show the step that was nasty.

And then comes the big flag-day patch that actually introduces all the new
logic, and is the most complex of the lot.

Maybe it's inevitable, but if this one could be split up a bit more, I'd
find the series less scary.

I don't know how, though (except by adding code that simply isn't used in
previous patches, and that doesn't really help anything at all), so maybe
the big "switch everything over at once" thing is unavoidable.

Linus

2009-10-01 17:17:33

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 15/19] workqueue: reimplement workqueue flushing using color coded works

Tejun Heo wrote:
> Hello, Linus.
>
> Linus Torvalds wrote:
>> On Thu, 1 Oct 2009, Tejun Heo wrote:
>>> Reimplement workqueue flushing using color coded works. There are two
>>> colors and each cwq has the current color which is painted on the
>>> works being issued via the cwq. Flushing a workqueue is achieved by
>>> flipping the current colors of each cwq and wait for the works which
>>> have the old color to drain.
>> Is there any reason for the "two colors" choice? I could imagine that it
>> could end up being a limitation (and possible deadlock?) to allow just a
>> single flush pending at any time.
>>
>> Could the color be an 8-bit counter or something like that instead?
>
> It's related to how many bits can be used from work_struct->data which
> in turn is determined by the alignment of cwq. Currently, the
> alignment is 8 bytes so 3 bits are available. One is used for
> PENDING, another for LINKED, and the last bit is used for COLOR.
> Aligning cwq to, say, 64 bytes wouldn't be bad at all and then we can
> have 6 bits of coloring. Hmmm... yeap, I'll work on that.

Oops, 4 not 6.

--
tejun
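
(Putting the corrected arithmetic together -- an illustrative sketch, not
the eventual patch: a 64-byte-aligned cwq leaves the low six bits of the
pointer stored in work_struct->data free; PENDING and LINKED take two of
them, leaving four bits of flush color, i.e. 16 colors.)

enum {
	WORK_STRUCT_PENDING_BIT	= 0,	/* work item is pending execution */
	WORK_STRUCT_LINKED_BIT	= 1,	/* next work is linked to this one */
	WORK_STRUCT_COLOR_SHIFT	= 2,	/* flush color starts here */
	WORK_STRUCT_COLOR_BITS	= 4,	/* 6 free bits minus PENDING and LINKED */

	WORK_STRUCT_FLAG_BITS	= WORK_STRUCT_COLOR_SHIFT + WORK_STRUCT_COLOR_BITS,
	WORK_NR_COLORS		= 1 << WORK_STRUCT_COLOR_BITS,	/* 16 */
};

/*
 * With cwq aligned to (1 << WORK_STRUCT_FLAG_BITS) == 64 bytes, the low
 * six bits of a cwq pointer are always zero and can carry the flags and
 * the color.
 */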

2009-10-01 17:23:33

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue

Hello,

Linus Torvalds wrote:
>
> On Thu, 1 Oct 2009, Tejun Heo wrote:
>> To solve the above issues, this patch implements concurrency-managed
>> workqueue.
>
> Ok, all the other patches in the series looked great, this one looks
> scary. Everything else was pretty small and targeted, and if bugs appear,
> bisection would show the step that was nasty.
>
> And then comes the big flag-day patch that actually introduces all the new
> logic, and is the most complex of the lot.
>
> Maybe it's inevitable, but if this one could be split up a bit more, I'd
> find the series less scary.
>
> I don't know how, though (except by adding code that simply isn't used in
> previous patches, and that doesn't really help anything at all), so maybe
> the big "switch everything over at once" thing is unavoidable.

Yeap, this one is pretty scary. I tried pretty hard to at least make
the diffs fall into relevant sections (ie. updates to a certain part of
the logic show up as diffs to the original part) but I agree this patch
is a tad too big. Maybe I can introduce gcwq first. The big problem
is the strong inter-dependency between the single global worker pool
and the hotplug logic. I tried to introduce the hotplug logic first
but it basically required implementing different interim mechanism
altogether. One solution could be disabling or crippling cpu hotplug
support for several commits so that the processing part and hotplug
part can be introduced separately. Would that be acceptable?

Thanks.

--
tejun

2009-10-01 17:57:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue



On Fri, 2 Oct 2009, Tejun Heo wrote:
>
> Yeap, this one is pretty scary. I tried pretty hard to at least make
> the diffs fall into relevant sections (ie. updates to certain part of
> logic shows up as diffs to the original part) but I agree this patch
> is a tad too big. Maybe I can introduce gcwq first. The big problem
> is the strong inter-dependency between the single global worker pool
> and the hotplug logic. I tried to introduce the hotplug logic first
> but it basically required implementing different interim mechanism
> altogether. One solution could be disabling or crippling cpu hotplug
> support for several commits so that the processing part and hotplug
> part can be introduced separately. Would that be acceptable?

I would say that disabling CPU hotplug for a while is likely a good way to
at least limit the damage, except for one thing - suspend and resume.

That's one of the things that is hard to debug anyway, is tied to
workqueues (with that whole freezing thing) _and_ uses CPU hotplug to
actually get its work done. So if you effectively disable CPU hotplug for
a part of the series, then that just means that bisection doesn't work
over that part - and is probably one of the areas where it would be the
most important that it works ;^/

So I dunno. I do like the series even as-is. It seems to have fairly solid
reasoning behind it, and I absolutely hate the deadlocks we have now in
workqueue handling and the insane things we do to work around them. At the
same time, this code is something I get worried about - I can easily
imaging subtle bugs that only happen under certain circumstances and with
certain timings - and the last patch was the scariest of the lot.

So if the thing could be split up while limiting CPU hotplug only a bit
(so that the case of suspend/resume would hopefully still be ok), that
would be wonderful.

Linus

2009-10-01 18:36:52

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 01/19] freezer: don't get over-anxious while waiting


> Freezing isn't exactly the most latency sensitive operation and
> there's no reason to burn cpu cycles and power waiting for it to
> complete. msleep(10) instead of yield(). This should improve
> reliability of emergency hibernation.

I don't see how it improves reliability, but it's probably ok.

Well... for hibernation anyway. I can imagine cgroup users where
freeze is so fast that this matters. rjw cc-ed. pavel

> Signed-off-by: Tejun Heo <[email protected]>
> ---
> kernel/power/process.c | 13 +++++++++----
> 1 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/power/process.c b/kernel/power/process.c
> index cc2e553..9d26a0a 100644
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -41,7 +41,7 @@ static int try_to_freeze_tasks(bool sig_only)
> do_gettimeofday(&start);
>
> end_time = jiffies + TIMEOUT;
> - do {
> + while (true) {
> todo = 0;
> read_lock(&tasklist_lock);
> do_each_thread(g, p) {
> @@ -62,10 +62,15 @@ static int try_to_freeze_tasks(bool sig_only)
> todo++;
> } while_each_thread(g, p);
> read_unlock(&tasklist_lock);
> - yield(); /* Yield is okay here */
> - if (time_after(jiffies, end_time))
> + if (!todo || time_after(jiffies, end_time))
> break;
> - } while (todo);
> +
> + /*
> + * We need to retry. There's no reason to be
> + * over-anxious about it and waste power.
> + */
> + msleep(10);
> + }
>
> do_gettimeofday(&end);
> elapsed_csecs64 = timeval_to_ns(&end) - timeval_to_ns(&start);

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-10-01 18:45:03

by Ben Pfaff

[permalink] [raw]
Subject: Re: [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available

David Howells <[email protected]> writes:

> Tejun Heo <[email protected]> wrote:
>
>> +enum {
>> + WORK_STRUCT_PENDING = 0, /* work item is pending execution */
>> +
>> + /*
>> + * Reserve 3bits off of cwq pointer. This is enough and
>> + * provides acceptable alignment on both 32 and 64bit
>> + * machines.
>> + */
>> + WORK_STRUCT_FLAG_BITS = 3,
>> +
>> + WORK_STRUCT_FLAG_MASK = (1UL << WORK_STRUCT_FLAG_BITS) - 1,
>> + WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
>> +};
>
> There's some great enum abuse going on here:-)

The "1UL" part is a bit worrisome. Enumeration constants always
have type "int"[*], so if code that uses WORK_STRUCT_WQ_DATA_MASK
actually depends on getting a full "long" worth of bits, it is
not going to work on 64-bit systems.

[*] See C99:

6.4.4.3 Enumeration constants
Syntax
1 enumeration-constant:
identifier
Semantics
2 An identifier declared as an enumeration constant has type int.

--
Ben Pfaff
http://benpfaff.org
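
(One way to sidestep the concern, sketched for illustration only: keep the
small bit counts in the enum and define the wide masks outside it, so they
keep their full unsigned long width.)

enum {
	WORK_STRUCT_PENDING	= 0,	/* work item is pending execution */
	WORK_STRUCT_FLAG_BITS	= 3,	/* small values are safe as int-typed enums */
};

/*
 * C99 6.4.4.3 types enumeration constants as int, so a wide mask such as
 * ~7UL is better kept out of the enum; plain macros (or const unsigned
 * long objects) preserve the intended 64-bit width.
 */
#define WORK_STRUCT_FLAG_MASK		((1UL << WORK_STRUCT_FLAG_BITS) - 1)
#define WORK_STRUCT_WQ_DATA_MASK	(~WORK_STRUCT_FLAG_MASK)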

2009-10-01 18:49:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 03/19] scheduler: implement workqueue scheduler class


* Linus Torvalds <[email protected]> wrote:

> On Thu, 1 Oct 2009, Tejun Heo wrote:
> >
> > Implement workqueue scheduler class. Workqueue sched_class inherits
> > fair sched_class and behaves exactly the same as sched_class except
> > that it has two callback functions which get called when a task is
> > put to sleep and wakes up and doesn't allow switching to different
> > scheduler class.
>
> So this looks odd to me.
>
> I agree completely with the callback functions, but what I don't agree
> with is that this is somehow workqueue-related. I bet that others
> could use this, and in fact, I suspect that it should not be tied to
> the scheduler class at all, but to the _thread_.
>
> Just as an example, I could imagine that we would do lock_kernel()
> releasing the kernel lock on scheduling (and re-taking it on wakeup)
> as a callback. Or async IO handling - both done independently of any
> "workqueue" logic.
>
> So tying this to the scheduler class seems a bit odd. But maybe I'm
> missing something?

We could do what Avi suggested: not use scheduler classes at all for
this (that brings in other limitations like lack of p->policy freedom),
but use the existing preempt-notifications callbacks.

They are per task - we would simply make preempt notifiers
unconditional, i.e. remove CONFIG_PREEMPT_NOTIFIERS and make it all
unconditional scheduler logic.

( Sidenote: we could even unify that logic with the perf event
sched-in/sched-out logic which is per task in a similar fashion and
which can have callback properties as well. That would mean a single,
well-defined callback facility for scheduler events. )

Ingo

2009-10-01 19:03:06

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 03/19] scheduler: implement workqueue scheduler class

On 10/01/2009 08:48 PM, Ingo Molnar wrote:
> We could do what Avi suggested: not use scheduler classes at all for
> this (that brings in other limitations like lack of p->policy freedom),
> but use the existing preempt-notifications callbacks.
>
> They are per task - we would simply make preempt notifiers
> unconditional, i.e. remove CONFIG_PREEMPT_NOTIFIERS and make it all
> unconditional scheduler logic.
>
>

Sure, but it would mean that we need a new notifier. sched_out,
sched_in, and wakeup (and, return to userspace, with the new notifier).

btw, I've been thinking we should extend concurrency managed workqueues
to userspace. Right now userspace can spawn a massive amount of
threads, hoping to hide any waiting by making more work available to the
scheduler. That has the drawback of increasing latency due to
involuntary preemption. Or userspace can use one thread per cpu, hope
it's the only application on the machine, and go all-aio.

But what if we added a way for userspace to tell the kernel to fork off
threads when processing power and work to do are both available? The
scheduler knows when there is processing power, and an epoll fd can tell
it when there is work to do. So the scheduler will create threads to
saturate the processors, if one of them waits for I/O the scheduler
forks off another one until all queues are busy again.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2009-10-01 19:08:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 03/19] scheduler: implement workqueue scheduler class



On Thu, 1 Oct 2009, Ingo Molnar wrote:
>
> > So tying this to the scheduler class seems a bit odd. But maybe I'm
> > missing something?
>
> We could do what Avi suggested: not use scheduler classes at all for
> this (that brings in other limitations like lack of p->policy freedom),
> but use the existing preempt-notifications callbacks.

I don't think that works. I think the preempt notifiers are done too early
and too late (ie they are done at the actual context switch), whereas what
Tejun's code wants is to be called so that he can actually _do_ something
about the task before the next task is selected.

But if the preempt-notification users are ok with being called earlier,
then yes, I guess we could share the logic (and rename it). I do agree
that what I'd like to see is more _like_ those preempt notifications, with
a list of things to do before/after.

> They are per task - we would simply make preempt notifiers
> unconditional, i.e. remove CONFIG_PREEMPT_NOTIFIERS and make it all
> unconditional scheduler logic.

I don't mind that, but see above: I think the semantics are fundamentally
different. One wants to be called exactly when the actual context switch
happens, the other wants to be called before the choice of the next thread
is even done (which in turn can mean that no context switch actually
happens at all, because maybe you end up picking the same thread after
all - possibly _due_ to the callback doing something)

But it's possible that I'm making a semantic difference out of something
that nobody actually cares about.

Linus

2009-10-01 19:13:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 03/19] scheduler: implement workqueue scheduler class



On Thu, 1 Oct 2009, Avi Kivity wrote:
>
> Sure, but it would mean that we need a new notifier. sched_out, sched_in, and
> wakeup (and, return to userspace, with the new notifier).

Ok, see the email I just sent out.

And I don't think we want a new notifier - mainly because I don't think we
want to walk the list four times (prepare, out, in, final - we need to
make sure that these things nest properly, so even if "in" and "final"
happen with the same state, they aren't the same, because "in" only
happens if "out" was called, while "final" would happen if "prepare" was
called)

So it would be better to have separate lists, in order to avoid walking
the lists four times just because there was a single notifier that just
wanted to be called for the inner (or outer) cases.

> btw, I've been thinking we should extend concurrency managed workqueues to
> userspace. Right now userspace can spawn a massive amount of threads, hoping
> to hide any waiting by making more work available to the scheduler. That has
> the drawback of increasing latency due to involuntary preemption. Or
> userspace can use one thread per cpu, hope it's the only application on the
> machine, and go all-aio.

This is what the whole next-gen AIO was supposed to do with the
threadlets, ie avoid doing a new thread if it could do the IO all cached
and without being preempted.

Linus

2009-10-01 19:15:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 03/19] scheduler: implement workqueue scheduler class


* Avi Kivity <[email protected]> wrote:

> On 10/01/2009 08:48 PM, Ingo Molnar wrote:
>> We could do what Avi suggested: not use scheduler classes at all for
>> this (that brings in other limitations like lack of p->policy freedom),
>> but use the existing preempt-notifications callbacks.
>>
>> They are per task - we would simply make preempt notifiers
>> unconditional, i.e. remove CONFIG_PREEMPT_NOTIFIERS and make it all
>> unconditional scheduler logic.
>
> Sure, but it would mean that we need a new notifier. sched_out,
> sched_in, and wakeup (and, return to userspace, with the new
> notifier).

perf events have sched out, sched in and wakeup events. Return to
user-space would be interesting to add as well. (and overhead of that
can be hidden via TIF - like you did via the return-to-userspace
notifiers)

Sounds more generally useful (and less scary) than (clever but somewhat
limiting) sched_class hackery.

I.e. i'd prefer if we had just one callback facility in those codepaths,
minimizing the hotpath overhead and providing a coherent API.

> btw, I've been thinking we should extend concurrency managed
> workqueues to userspace. Right now userspace can spawn a massive
> amount of threads, hoping to hide any waiting by making more work
> available to the scheduler. That has the drawback of increasing
> latency due to involuntary preemption. Or userspace can use one
> thread per cpu, hope it's the only application on the machine, and go
> all-aio.
>
> But what if we added a way for userspace to tell the kernel to fork
> off threads when processing power and work to do are both available?
> The scheduler knows when there is processing power, and an epoll fd
> can tell it when there is work to do. So the scheduler will create
> threads to saturate the processors, if one of them waits for I/O the
> scheduler forks off another one until all queues are busy again.

Sounds like syslets done right?

Ingo

2009-10-01 19:24:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 03/19] scheduler: implement workqueue scheduler class


* Linus Torvalds <[email protected]> wrote:

> On Thu, 1 Oct 2009, Avi Kivity wrote:
> >
> > Sure, but it would mean that we need a new notifier. sched_out,
> > sched_in, and wakeup (and, return to userspace, with the new
> > notifier).
>
> Ok, see the email I just sent out.
>
> And I don't think we want a new notifier - mainly because I don't
> think we want to walk the list four times (prepare, out, in, final -
> we need to make sure that these things nest properly, so even if "in"
> and "final" happen with the same state, they aren't the same, because
> "in" only happens if "out" was called, while "final" would happen if
> "prepare" was called)
>
> So it would be better to have separate lists, in order to avoid
> walking the lists four times just because there was a single notifier
> that just wanted to be called for the inner (or outer) cases.

Sounds a bit like perf events with callbacks, triggered at those places.
(allowing arbitrary permutation of the callbacks)

But ... it needs some work to take precisely that shape. Primarily
it would need a splitting/slimming of struct perf_event, to allow the
callback properties to be separated out for in-kernel users that are
only interested in the callbacks, not in the other abstractions.

But it looks straightforward and useful ... the kind of useful work
interested parties would be able to complete by the next merge window
;-)

Other places could use this too - we really want just one callback
facility for certain system events - be that in-kernel use for other
kernel facilities, or external instrumentation injected by user-space.

> > btw, I've been thinking we should extend concurrency managed
> > workqueues to userspace. Right now userspace can spawn a massive
> > amount of threads, hoping to hide any waiting by making more work
> > available to the scheduler. That has the drawback of increasing
> > latency due to involuntary preemption. Or userspace can use one
> > thread per cpu, hope it's the only application on the machine, and
> > go all-aio.
>
> This is what the whole next-gen AIO was supposed to do with the
> threadlets, ie avoid doing a new thread if it could do the IO all
> cached and without being preempted.

Yeah. That scheme was hobbled by signal semantics: it looked hard to do
the 'flip a reserve thread with a blocked thread' trick in the scheduler
while still keeping all the signal details in place.

Ingo

2009-10-01 20:05:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 03/19] scheduler: implement workqueue scheduler class



On Thu, 1 Oct 2009, Ingo Molnar wrote:
>
> Yeah. That scheme was hobbled by signal semantics: it looked hard to do
> the 'flip a reserve thread with a blocked thread' trick in the scheduler
> while still keeping all the signal details in place.

I think we should look at David Wheeler's advice: "Any problem in computer
science can be solved with another level of indirection".

In particular, my favourite solution to this is to split "struct
thread_struct" into a new part, which would be "CPU state".

In other words, all threadlets would share one single "struct
thread_struct" (and thus local signal state), but they would then each
have a "struct cpu_state" associated with them, which includes the kernel
stack. And that cpu_state thing would not be preempted, it would be
something like a round-robin cooperative scheduling that is only invoked
when a threadlet goes to sleep.
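
Roughly - and this is only a sketch of the shape of the split, with
made-up names, not real code:

#include <stddef.h>

struct cpu_state {
	void			*kernel_stack;	/* per-execution kernel stack */
	void			*saved_regs;	/* register snapshot */
	struct cpu_state	*next;		/* cooperative run list */
};

struct threadlet_group {
	unsigned long		 shared_signal_state;	/* one copy for all */
	struct cpu_state	*run_list;		/* round-robin queue */
};

/* Called when the running threadlet goes to sleep: hand the CPU to the
 * next queued cpu_state instead of preempting anything. */
static struct cpu_state *pick_next_threadlet(struct threadlet_group *grp,
					     struct cpu_state *sleeping)
{
	struct cpu_state *next = grp->run_list;

	if (next)
		grp->run_list = next->next;
	sleeping->next = NULL;		/* requeued when its wait completes */
	return next;			/* NULL: nothing else to run */
}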

Linus

2009-10-01 21:02:39

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 01/19] freezer: don't get over-anxious while waiting

On Thursday 01 October 2009, Pavel Machek wrote:
>
> > Freezing isn't exactly the most latency sensitive operation and
> > there's no reason to burn cpu cycles and power waiting for it to
> > complete. msleep(10) instead of yield(). This should improve
> > reliability of emergency hibernation.
>
> i don't see how it improves reliability, but its probably ok.
>
> Well... for hibernation anyway. I can imagine cgroup users where
> freeze is so fast that this matters. rjw cc-ed. pavel

Thanks. I'd like to hear from the cgroup freezer people about that.

> > Signed-off-by: Tejun Heo <[email protected]>
> > ---
> > kernel/power/process.c | 13 +++++++++----
> > 1 files changed, 9 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/power/process.c b/kernel/power/process.c
> > index cc2e553..9d26a0a 100644
> > --- a/kernel/power/process.c
> > +++ b/kernel/power/process.c
> > @@ -41,7 +41,7 @@ static int try_to_freeze_tasks(bool sig_only)
> > do_gettimeofday(&start);
> >
> > end_time = jiffies + TIMEOUT;
> > - do {
> > + while (true) {
> > todo = 0;
> > read_lock(&tasklist_lock);
> > do_each_thread(g, p) {
> > @@ -62,10 +62,15 @@ static int try_to_freeze_tasks(bool sig_only)
> > todo++;
> > } while_each_thread(g, p);
> > read_unlock(&tasklist_lock);
> > - yield(); /* Yield is okay here */
> > - if (time_after(jiffies, end_time))
> > + if (!todo || time_after(jiffies, end_time))
> > break;
> > - } while (todo);
> > +
> > + /*
> > + * We need to retry. There's no reason to be
> > + * over-anxious about it and waste power.
> > + */

The comment above looks like it's only meaningful in the context of the patch.
After it's been applied the meaning of the comment won't be so obvious, I'm
afraid.

> > + msleep(10);
> > + }
> >
> > do_gettimeofday(&end);
> > elapsed_csecs64 = timeval_to_ns(&end) - timeval_to_ns(&start);

Thanks,
Rafael

2009-10-02 00:42:44

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue

Tejun Heo <[email protected]> writes:

> Currently each workqueue has its own dedicated worker pool. This
> causes the following problems.

I definitely like the basic idea (without having read all the code so
far). Thanks for doing that work. On large systems the number of
kernel threads around can be mind boggling, and it will also get rid
of bizarreness like the ps2mouse thread.

I wonder have you ever thought about exposing this to user space too?
It can face similar problems for parallelization.

-Andi

--
[email protected] -- Speaking for myself only.

2009-10-02 10:57:37

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 01/19] freezer: don't get over-anxious while waiting

Hello, Pavel, Rafael.

Rafael J. Wysocki wrote:
> On Thursday 01 October 2009, Pavel Machek wrote:
>>> Freezing isn't exactly the most latency sensitive operation and
>>> there's no reason to burn cpu cycles and power waiting for it to
>>> complete. msleep(10) instead of yield(). This should improve
>>> reliability of emergency hibernation.
>> i don't see how it improves reliability, but its probably ok.

It's about battery. When emergency hibernation kicks in and something
is taking a while to freeze (usually nfs does this for me) burning
power waiting for it to finish is a pretty bad idea.

>> Well... for hibernation anyway. I can imagine cgroup users where
>> freeze is so fast that this matters. rjw cc-ed. pavel
>
> Thanks. I'd like to hear from the cgroup freezer people about that.

Oh... didn't know that. 10ms sleeps really matter there?

--
tejun

2009-10-02 11:46:19

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue

Hello, David Howells.

David Howells wrote:
> Sounds interesting as a replacement for slow-work. Some thoughts for you:
>
> The most important features of slow-work are:
>
> (1) Work items are not re-entered whilst they are executing.
>
> (2) The slow-work facility keeps references on its work items by asking the
> client to get and put on the client's refcount.
>
> (3) The slow-work facility can create a lot more threads than the number of
> CPUs on a system, and the system doesn't grind to a halt if they're all
> taken up with long term I/O (performing several mkdirs for example).
>
> I think you have (1) and (3) covered, but I'm unsure about (2).

Given that slow-work isn't being used too extensively yet, I was
thinking whether that part could be pushed down to the caller. Or, we
can also wrap work and export an interface which supports the get/put
reference.
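
To make that concrete, a rough sketch of how such a wrapper could look
from the user side - just kref + work_struct with made-up names, not an
interface this patchset actually exports:

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/kref.h>
#include <linux/slab.h>

struct refcounted_work {
	struct kref		ref;
	struct work_struct	work;
	void			(*func)(struct refcounted_work *rwork);
};

static void refcounted_work_release(struct kref *ref)
{
	kfree(container_of(ref, struct refcounted_work, ref));
}

static void refcounted_work_fn(struct work_struct *work)
{
	struct refcounted_work *rwork =
		container_of(work, struct refcounted_work, work);

	rwork->func(rwork);
	kref_put(&rwork->ref, refcounted_work_release);	/* drop queueing ref */
}

/* kmalloc'ed by the caller, initial ref owned by the caller */
static void refcounted_work_init(struct refcounted_work *rwork,
				 void (*func)(struct refcounted_work *))
{
	kref_init(&rwork->ref);
	INIT_WORK(&rwork->work, refcounted_work_fn);
	rwork->func = func;
}

/* Pin the item for the duration of the queueing; drop the extra ref if
 * the work was already pending. */
static bool queue_refcounted_work(struct workqueue_struct *wq,
				  struct refcounted_work *rwork)
{
	kref_get(&rwork->ref);
	if (queue_work(wq, &rwork->work))
		return true;
	kref_put(&rwork->ref, refcounted_work_release);
	return false;
}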

> Also, does it actually make sense to bind threads that are
> performing long-term I/O to particular CPUs? Threads that are going
> to spend a lot more time sleeping on disk I/O than actually running
> on a CPU?

Binding is usually beneficial and doesn't matter for IO intensive
ones, so...

Thanks.

--
tejun

2009-10-02 11:57:14

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available

(Restoring cc list. Please don't drop them)

Ben Pfaff wrote:
> David Howells <[email protected]> writes:
>
>> Tejun Heo <[email protected]> wrote:
>>
>>> +enum {
>>> + WORK_STRUCT_PENDING = 0, /* work item is pending execution */
>>> +
>>> + /*
>>> + * Reserve 3bits off of cwq pointer. This is enough and
>>> + * provides acceptable alignment on both 32 and 64bit
>>> + * machines.
>>> + */
>>> + WORK_STRUCT_FLAG_BITS = 3,
>>> +
>>> + WORK_STRUCT_FLAG_MASK = (1UL << WORK_STRUCT_FLAG_BITS) - 1,
>>> + WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
>>> +};
>> There's some great enum abuse going on here:-)
>
> The "1UL" part is a bit worrisome. Enumeration constants always
> have type "int"[*], so if code that uses WORK_STRUCT_WQ_DATA_MASK
> actually depends on getting a full "long" worth of bits, it is
> not going to work on 64-bit systems.
>
> [*] See C99:
>
> 6.4.4.3 Enumeration constants
> Syntax
> 1 enumeration-constant:
> identifier
> Semantics
> 2 An identifier declared as an enumeration constant has type int.

Aieee... oops. Well, this isn't how gcc behaves.

$ cat test.c
#include <stdio.h>

enum {
ENUM = ~0U,
};

enum {
LONG_ENUM = ~0UL,
};

int main(void)
{
printf("%zu %zu\n", sizeof(ENUM), sizeof(LONG_ENUM));
return 0;
}
$ gcc test.c && ./a.out; gcc -std=c99 test.c && ./a.out
4 8
4 8

But, yeah, this definitely is a problem. 6.7.2.2 also says that

Each enumerated type shall be compatible with char, a signed integer
type, or an unsigned integer type. The choice of type is
implementation-defined,113) but shall be capable of representing the
values of all the members of the enumeration. The enumerated type is
incomplete until after the } that terminates the list of enumerator
declarations.

gcc probably is being a bit too generous with possible integer types
here. BTW, does c++ define it like this too?

Hmmm... So, should we go back to using defines for these or keep
(ab)using gcc's generosity and hope that the next iteration of the
standard becomes a bit more generous too?
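
For comparison, the #define variant being discussed would look
something like the following - the masks then carry an explicit
unsigned long type instead of relying on how the compiler sizes the
enum:

#define WORK_STRUCT_PENDING	0	/* work item is pending execution */

/*
 * Reserve 3 bits off of the cwq pointer.  This is enough and provides
 * acceptable alignment on both 32 and 64 bit machines.
 */
#define WORK_STRUCT_FLAG_BITS	3

#define WORK_STRUCT_FLAG_MASK		((1UL << WORK_STRUCT_FLAG_BITS) - 1)
#define WORK_STRUCT_WQ_DATA_MASK	(~WORK_STRUCT_FLAG_MASK)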

Thanks.

--
tejun

2009-10-02 12:04:15

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue

David Howells wrote:
> Tejun Heo <[email protected]> wrote:
>
>> gcwq always keeps at least single idle worker around. When a new
>> worker is necessary and the worker is the last idle one, the worker
>> assumes the role of "manager" and manages the worker pool -
>> ie. creates another worker. Forward-progress is guaranteed by having
>> dedicated rescue workers for workqueues which may be necessary while
>> creating a new worker. When the manager is having problem creating a
>> new worker, mayday timer activates and rescue workers are summoned to
>> the cpu and execute works which may be necessary to create new
>> workers.
>
> I take it that means that the rescue-workers are donated to the pool
> to become worker threads (perhaps on a temporary basis) in the event
> that forking fails due to ENOMEM, such that resources can be freed
> up for fork() to use.

Yes, when kthread_create() doesn't return fast enough
(MAYDAY_INITIAL_TIMEOUT, currently 10ms), rescue workers are summoned
to the cpu. They scan the pending works and process only the works
belonging to the workqueue each rescuer is bound to, which should solve
the possible allocation deadlock. Currently, all workqueues are
created with a rescuer by default but that will be changed as most of
them don't need it.
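
Just to illustrate the flow in rough form - struct gcwq,
summon_rescuers() and the rest below are simplified placeholders, not
the actual code:

#include <linux/timer.h>
#include <linux/kthread.h>
#include <linux/jiffies.h>
#include <linux/err.h>

#define MAYDAY_INITIAL_TIMEOUT	msecs_to_jiffies(10)

struct gcwq {				/* stand-in for the per-cpu gcwq */
	struct timer_list	mayday_timer;
};

static void summon_rescuers(struct gcwq *gcwq)
{
	/* wake the rescuer of every workqueue with works pending on this
	 * gcwq; each rescuer processes only its own workqueue's works */
}

static int worker_thread(void *data)
{
	return 0;			/* placeholder worker body */
}

static void mayday_timer_fn(unsigned long data)
{
	summon_rescuers((struct gcwq *)data);
}

/* Arm the mayday timer around worker creation: if kthread_create()
 * takes longer than MAYDAY_INITIAL_TIMEOUT (e.g. stuck in reclaim),
 * rescuers are summoned so forward progress is still possible. */
static int create_worker_with_mayday(struct gcwq *gcwq)
{
	struct task_struct *task;

	setup_timer(&gcwq->mayday_timer, mayday_timer_fn,
		    (unsigned long)gcwq);
	mod_timer(&gcwq->mayday_timer, jiffies + MAYDAY_INITIAL_TIMEOUT);

	task = kthread_create(worker_thread, gcwq, "kworker");

	del_timer_sync(&gcwq->mayday_timer);
	return IS_ERR(task) ? PTR_ERR(task) : 0;
}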

Thanks.

--
tejun

2009-10-02 12:10:09

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue

Hello, Andi.

Andi Kleen wrote:
> Tejun Heo <[email protected]> writes:
>
>> Currently each workqueue has its own dedicated worker pool. This
>> causes the following problems.
>
> I definitely like the basic idea (without having read all the code so
> far). Thanks for doing that work. On large systems the number of
> kernel threads around can be mind boggling, and it will also get rid
> of bizarreness like the ps2mouse thread.
>
> I wonder have you ever thought about exposing this to user space too?
> It can face similar problems for parallelization.

Yeah, a similar mechanism could be useful for managing a userland worker
pool too, but determining what to export would be the difficult
question. Events for when all threads in a certain group are blocked
could be useful, and so could cpu idleness.

One thing is that for userspace there's already pretty thick layer of
abstraction in place and having this type of bare metal mechanism
might not buy much. In most cases, monitoring system parameters
periodically and supplying a bit of extra threads should be able to
achieve about the same result. Do you have anything specific on your
mind?

Thanks.

--
tejun

2009-10-02 12:23:42

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 03/19] scheduler: implement workqueue scheduler class

Hello, Linus.

Linus Torvalds wrote:
> On Thu, 1 Oct 2009, Tejun Heo wrote:
>> Implement workqueue scheduler class. Workqueue sched_class inherits
>> fair sched_class and behaves exactly the same as sched_class except
>> that it has two callback functions which get called when a task is put
>> to sleep and wakes up and doesn't allow switching to different
>> scheduler class.
>
> So this looks odd to me.
>
> I agree completely with the callback functions, but what I don't agree
> with is that this is somehow workqueue-related. I bet that others could
> use this, and in fact, I suspect that it should not be tied to the
> scheduler class at all, but to the _thread_.
>
> Just as an example, I could imagine that we would do lock_kernel()
> releasing the kernel lock on scheduling (and re-taking it on wakeup) as a
> callback. Or async IO handling - both done independently of any
> "workqueue" logic.
>
> So tying this to the scheduler class seems a bit odd. But maybe I'm
> missing something?

The only reason the callbacks are implemented by inheriting
sched_class is that I thought it was a pretty special case, so I went
slightly to the hackish side while keeping the added hotpath overhead
minimal. If this type of notification mechanism is something which can
be used more widely, it would definitely be better to have a proper
per-task callback mechanism instead - probably a per-task chain of
notifier ops so that only tasks which are interested in such events
can opt in.
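
Something along the lines of the existing preempt notifier bits
(CONFIG_PREEMPT_NOTIFIERS) could probably serve as the base - a rough
sketch only, and a wakeup hook would still have to be added on top:

#include <linux/preempt.h>
#include <linux/sched.h>

struct worker_hooks {
	struct preempt_notifier	notifier;
	/* workqueue (or kvm, ...) specific state would hang off here */
};

static void worker_sched_in(struct preempt_notifier *pn, int cpu)
{
	/* the task was scheduled back in on @cpu */
}

static void worker_sched_out(struct preempt_notifier *pn,
			     struct task_struct *next)
{
	/* the task is being scheduled out; if it's going to sleep, the
	 * workqueue could decide to wake up or create another worker */
}

static struct preempt_ops worker_preempt_ops = {
	.sched_in	= worker_sched_in,
	.sched_out	= worker_sched_out,
};

/* Called by a task which wants the callbacks, e.g. a workqueue worker;
 * registration only affects the calling task. */
static void worker_register_hooks(struct worker_hooks *wh)
{
	preempt_notifier_init(&wh->notifier, &worker_preempt_ops);
	preempt_notifier_register(&wh->notifier);
}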

Avi, such per-task notification ops should be usable for kvm too,
right?

Thanks.

--
tejun

2009-10-02 12:46:48

by Stefan Richter

[permalink] [raw]
Subject: Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue

Tejun Heo wrote:
> David Howells wrote:
>> Sounds interesting as a replacement for slow-work. Some thoughts for you:
>>
>> The most important features of slow-work are:
>>
>> (1) Work items are not re-entered whilst they are executing.
>>
>> (2) The slow-work facility keeps references on its work items by asking the
>> client to get and put on the client's refcount.
>>
>> (3) The slow-work facility can create a lot more threads than the number of
>> CPUs on a system, and the system doesn't grind to a halt if they're all
>> taken up with long term I/O (performing several mkdirs for example).
>>
>> I think you have (1) and (3) covered, but I'm unsure about (2).
>
> Given that slow-work isn't being used too extensively yet, I was
> thinking whether that part could be pushed down to the caller. Or, we
> can also wrap work and export an interface which supports the get/put
> reference.

BTW, not only slow-work users need reference counting, many existing
regular workqueue users could use it as well. (Well, I guess I stated
the obvious.) Currently we have local wrappers for that like these ones:
http://lxr.linux.no/#linux+v2.6.31/drivers/firewire/core-card.c#L212
http://lxr.linux.no/#linux+v2.6.31/drivers/firewire/sbp2.c#L838
--
Stefan Richter
-=====-==--= =-=- ---=-
http://arcgraph.de/sr/

2009-10-02 14:28:53

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue

2009/10/1 Tejun Heo <[email protected]>:
> Currently each workqueue has its own dedicated worker pool.  This
> causes the following problems.
>
> * Works which are dependent on each other can cause a deadlock by
>  depending on the same execution resource.  This is bad because this
>  type of dependency is quite difficult to find.
>
> * Works which may sleep and take long time to finish need to have
>  separate workqueues so that it doesn't block other works.  Similarly
>  works which want to be executed in timely manner often need to
>  create it custom workqueue too to avoid being blocked by long
>  running ones.  This leads to large number of workqueues and thus
>  many workers.
>
> * The static one-per-cpu worker isn't good enough for jobs which
>  require higher level of concurrency necessiating other worker pool
>  mechanism.  slow-work and async are good examples and there are also
>  some custom implementations buried in subsystems.
>
> * Combined, the above factors lead to many workqueues with large
>  number of dedicated and mostly unused workers.  This also makes work
>  processing less optimal as the dedicated workers end up switching
>  among themselves costing scheduleing overhead and wasting cache
>  footprint for their stacks and as the system gets busy, these
>  workers end up competing with each other.
>
> To solve the above issues, this patch implements concurrency-managed
> workqueue.
>
> There is single global cpu workqueue (gcwq) for each cpu which serves
> all the workqueues.  gcwq maintains single pool of workers which is
> shared by all cwqs on the cpu.
>
> gcwq keeps the number of concurrent active workers to minimum but no
> less.  As long as there's one or more running workers on the cpu, no
> new worker is scheduled so that works can be processed in batch as
> much as possible but when the last running worker blocks, gcwq
> immediately schedules new worker so that the cpu doesn't sit idle
> while there are works to be processed.



That's really a cool thing.
So once such new workers are created, what's the state/event that triggers their
destruction?

Is it the following, propagated recursively?

Worker A blocks.
B is created.
B has just finished a worklet and A has been woken up.
Then destroy B?

2009-10-02 15:40:12

by David Howells

[permalink] [raw]
Subject: Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue

Tejun Heo <[email protected]> wrote:

> Given that slow-work isn't being used too extensively yet, I was
> thinking whether that part could be pushed down to the caller. Or, we
> can also wrap work and export an interface which supports the get/put
> reference.

The caller of what?

I found the refcounting much easier to manage in slow-work when slow-work
actively got/put refs on the work items it was queueing. The reason for that
is that slow-work can handle the queue/queue races and the requeue/execute
races much more efficiently.

Part of this was due to the fact I wanted to prevent re-entry into the work
executor, and to do that I had maintenance flags in the work item struct - but
that meant that slow-work had to modify the work item after execution.

So I should adjust point 1 on my list.

(1) Work items can be requeued whilst they are executing, but the execution
function will not be re-entered until after the current execution
completes; instead the execution will be deferred.

One possible problem with assuming that you can no longer access the work item
after you call the execution function is that it's slightly dodgy to retain
the pointer to it to prevent reentry as the item can be destroyed, reallocated
and queued before the execution function returns.

Anyway, don't let me put you off re-implementing the whole shebang - it needs
doing.

> Binding is usually beneficial and doesn't matter for IO intensive
> ones, so...

The scenario I'm thinking of is this: someone who has an NFS volume cached
through FS-Cache does a tar of a large tree of files (say a kernel source
tree). FS-Cache adds a long duration work item for each of those files
(~32000) to create structure in the cache. Will all of those wind up bound to
the same CPU as was running tar?

David

2009-10-02 19:47:15

by Oren Laadan

[permalink] [raw]
Subject: Re: [PATCH 01/19] freezer: don't get over-anxious while waiting



Rafael J. Wysocki wrote:
> On Thursday 01 October 2009, Pavel Machek wrote:
>>> Freezing isn't exactly the most latency sensitive operation and
>>> there's no reason to burn cpu cycles and power waiting for it to
>>> complete. msleep(10) instead of yield(). This should improve
>>> reliability of emergency hibernation.
>> i don't see how it improves reliability, but its probably ok.
>>
>> Well... for hibernation anyway. I can imagine cgroup users where
>> freeze is so fast that this matters. rjw cc-ed. pavel
>
> Thanks. I'd like to hear from the cgroup freezer people about that.
>

[Adding Matt Helsley to the CC list]

To checkpoint or migrate an application, the cgroup to which it belongs
must be frozen first.

It's a bit down the road, but if one seeks minimum application downtime
during application checkpoint and/or migration, then a (minimum of)
10ms - or multiples of it - may result in a visible/undesired hiccup.

Perhaps avoid it when freezing a cgroup? Or maybe add a way for the user
to control this behavior per cgroup?

Oren.

>>> Signed-off-by: Tejun Heo <[email protected]>
>>> ---
>>> kernel/power/process.c | 13 +++++++++----
>>> 1 files changed, 9 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/kernel/power/process.c b/kernel/power/process.c
>>> index cc2e553..9d26a0a 100644
>>> --- a/kernel/power/process.c
>>> +++ b/kernel/power/process.c
>>> @@ -41,7 +41,7 @@ static int try_to_freeze_tasks(bool sig_only)
>>> do_gettimeofday(&start);
>>>
>>> end_time = jiffies + TIMEOUT;
>>> - do {
>>> + while (true) {
>>> todo = 0;
>>> read_lock(&tasklist_lock);
>>> do_each_thread(g, p) {
>>> @@ -62,10 +62,15 @@ static int try_to_freeze_tasks(bool sig_only)
>>> todo++;
>>> } while_each_thread(g, p);
>>> read_unlock(&tasklist_lock);
>>> - yield(); /* Yield is okay here */
>>> - if (time_after(jiffies, end_time))
>>> + if (!todo || time_after(jiffies, end_time))
>>> break;
>>> - } while (todo);
>>> +
>>> + /*
>>> + * We need to retry. There's no reason to be
>>> + * over-anxious about it and waste power.
>>> + */
>
> The comment above looks like it's only meaningful in the context of the patch.
> After it's been applied the meaning of the comment won't be so obvious, I'm
> afraid.
>
>>> + msleep(10);
>>> + }
>>>
>>> do_gettimeofday(&end);
>>> elapsed_csecs64 = timeval_to_ns(&end) - timeval_to_ns(&start);
>
> Thanks,
> Rafael

2009-10-02 21:04:46

by Matt Helsley

[permalink] [raw]
Subject: Re: [PATCH 01/19] freezer: don't get over-anxious while waiting

On Fri, Oct 02, 2009 at 03:47:14PM -0400, Oren Laadan wrote:
>
>
> Rafael J. Wysocki wrote:
> > On Thursday 01 October 2009, Pavel Machek wrote:
> >>> Freezing isn't exactly the most latency sensitive operation and
> >>> there's no reason to burn cpu cycles and power waiting for it to
> >>> complete. msleep(10) instead of yield(). This should improve
> >>> reliability of emergency hibernation.
> >> i don't see how it improves reliability, but its probably ok.

From what little of the patch I can see at this point I agree.
On a single cpu system the yield gives up the cpu so other tasks
are more likely to make the progress necessary to become freezable.

> >>
> >> Well... for hibernation anyway. I can imagine cgroup users where
> >> freeze is so fast that this matters. rjw cc-ed. pavel

It doesn't (more below), though I appreciate your keeping us in mind.

> >
> > Thanks. I'd like to hear from the cgroup freezer people about that.
> >
>
> [Adding Matt Helsley to the CC list]
>
> To checkpoint or migrate an application, the cgroup to which it belongs
> must be frozen first.
>
> It's a bit down the road, but if one seeks minimum application downtime
> during application checkpoint and/or migration, then a (minimum of)
> 10ms - or multiples of it - may result in a visible/undesired hiccup.
>
> Perhaps avoid it when freezing a cgroup? Or maybe add a way for the user
> to control this behavior per cgroup?

This is already the case.

The cgroup freezer does not use this yield-loop to iterate over all the tasks.
Instead of yield() the cgroup freezer has its own "loop". It changes its
own state to FREEZING and returns to userspace so that userspace can decide
what to do -- sleep? keep trying to freeze? go back to THAWED? etc.

[ In the future this may change depending on the blocking/non-blocking
flag of the open freezer.state cgroup file handle. ]
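
In userspace that loop looks roughly like the following sketch; the
path to freezer.state depends on where the freezer cgroup hierarchy is
mounted on a given system:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* state_path: e.g. "<freezer mount>/<group>/freezer.state" */
static int freeze_cgroup(const char *state_path)
{
	char state[32];

	for (;;) {
		FILE *f = fopen(state_path, "w");

		if (!f)
			return -1;
		fputs("FROZEN\n", f);
		fclose(f);

		f = fopen(state_path, "r");
		if (!f)
			return -1;
		if (!fgets(state, sizeof(state), f))
			state[0] = '\0';
		fclose(f);

		if (!strncmp(state, "FROZEN", 6))
			return 0;	/* every task in the group is frozen */

		/* still FREEZING: userspace decides how to back off */
		usleep(1000);
	}
}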

Cheers,
-Matt Helsley

>
> Oren.
>
> >>> Signed-off-by: Tejun Heo <[email protected]>
> >>> ---
> >>> kernel/power/process.c | 13 +++++++++----
> >>> 1 files changed, 9 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/kernel/power/process.c b/kernel/power/process.c
> >>> index cc2e553..9d26a0a 100644
> >>> --- a/kernel/power/process.c
> >>> +++ b/kernel/power/process.c
> >>> @@ -41,7 +41,7 @@ static int try_to_freeze_tasks(bool sig_only)
> >>> do_gettimeofday(&start);
> >>>
> >>> end_time = jiffies + TIMEOUT;
> >>> - do {
> >>> + while (true) {
> >>> todo = 0;
> >>> read_lock(&tasklist_lock);
> >>> do_each_thread(g, p) {
> >>> @@ -62,10 +62,15 @@ static int try_to_freeze_tasks(bool sig_only)
> >>> todo++;
> >>> } while_each_thread(g, p);
> >>> read_unlock(&tasklist_lock);
> >>> - yield(); /* Yield is okay here */
> >>> - if (time_after(jiffies, end_time))
> >>> + if (!todo || time_after(jiffies, end_time))
> >>> break;
> >>> - } while (todo);
> >>> +
> >>> + /*
> >>> + * We need to retry. There's no reason to be
> >>> + * over-anxious about it and waste power.
> >>> + */
> >
> > The comment above looks like it's only meaningful in the context of the patch.
> > After it's been applied the meaning of the comment won't be so obvious, I'm
> > afraid.
> >
> >>> + msleep(10);
> >>> + }
> >>>
> >>> do_gettimeofday(&end);
> >>> elapsed_csecs64 = timeval_to_ns(&end) - timeval_to_ns(&start);
> >
> > Thanks,
> > Rafael

2009-10-02 21:19:51

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 01/19] freezer: don't get over-anxious while waiting

On Friday 02 October 2009, Matt Helsley wrote:
> On Fri, Oct 02, 2009 at 03:47:14PM -0400, Oren Laadan wrote:
> >
> >
> > Rafael J. Wysocki wrote:
> > > On Thursday 01 October 2009, Pavel Machek wrote:
> > >>> Freezing isn't exactly the most latency sensitive operation and
> > >>> there's no reason to burn cpu cycles and power waiting for it to
> > >>> complete. msleep(10) instead of yield(). This should improve
> > >>> reliability of emergency hibernation.
> > >> i don't see how it improves reliability, but its probably ok.
>
> From what little of the patch I can see at this point I agree.
> On a single cpu system the yield gives up the cpu so other tasks
> are more likely to make the progress necessary to become freezable.
>
> > >>
> > >> Well... for hibernation anyway. I can imagine cgroup users where
> > >> freeze is so fast that this matters. rjw cc-ed. pavel
>
> It doesn't (more below), though I appreciate your keeping us in mind.
>
> > >
> > > Thanks. I'd like to hear from the cgroup freezer people about that.
> > >
> >
> > [Adding Matt Helsley to the CC list]
> >
> > To checkpoint or migrate an application, the cgroup to which it belongs
> > must be frozen first.
> >
> > It's a bit down the road, but if one seeks minimum application downtime
> > during application checkpoint and/or migration, then a (minimum of)
> > 10ms - or multiples of it - may result in a visible/undesired hiccup.
> >
> > Perhaps avoid it when freezing a cgroup? Or maybe add a way for the user
> > to control this behavior per cgroup?
>
> This is already the case.
>
> The cgroup freezer does not use this yield-loop to iterate over all the tasks.
> Instead of yield() the cgroup freezer has its own "loop". It changes its
> own state to FREEZING and returns to userspace so that userspace can decide
> what to do -- sleep? keep trying to freeze? go back to THAWED? etc.
>
> [ In the future this may change depending on the blocking/non-blocking
> flag of the open freezer.state cgroup file handle. ]

OK, thanks for the info.

Tejun, do you want me to take the patch or is it more convenient to you to
push it yourself?

Also, care to respond to my previous comment?

Rafael

2009-10-03 00:45:10

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 01/19] freezer: don't get over-anxious while waiting

Hello, Rafael.

Rafael J. Wysocki wrote:
> Tejun, do you want me to take the patch or is it more convenient to you to
> push it yourself?

Please take the patch. As long as it's in a stable git tree, it
should be fine for me.

> Also, care to respond to my previous comment?

About the comment? Dropping it is fine. Do you want me to resend the
patch w/o the comment?

Thanks.

--
tejun

2009-10-03 02:59:20

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue

> One thing is that for userspace there's already pretty thick layer of
> abstraction in place and having this type of bare metal mechanism
> might not buy much. In most cases, monitoring system parameters

That can be costly in terms of power/cpu cycles.

> periodically and supplying a bit of extra threads should be able to
> achieve about the same result. Do you have anything specific on your
> mind?

I was thinking of a special file descriptor that could be polled on.
Register threads to be monitored and get a notification back this
way when they are busy above some threshold. Then a runtime could
start more threads up to the maximum number of cores. This would
be mainly useful with existing work distributing libraries
like threadbuildingblocks.
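
Something like this from the runtime's point of view - completely
hypothetical, monfd here stands in for the special polled descriptor,
which doesn't exist anywhere yet:

#include <sys/epoll.h>
#include <pthread.h>
#include <unistd.h>

static void monitor_and_spawn(int monfd, int max_threads,
			      void *(*worker)(void *))
{
	int nthreads = 0;
	int epfd = epoll_create(1);
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = monfd };

	epoll_ctl(epfd, EPOLL_CTL_ADD, monfd, &ev);

	/* The hypothetical fd becomes readable whenever the registered
	 * threads are busy above the chosen threshold; respond by adding
	 * workers until there is one per core (max_threads). */
	while (nthreads < max_threads) {
		pthread_t tid;

		if (epoll_wait(epfd, &ev, 1, -1) < 1)
			break;
		if (pthread_create(&tid, NULL, worker, NULL) == 0)
			nthreads++;
	}
	close(epfd);
}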

-Andi
--
[email protected] -- Speaking for myself only.

2009-10-03 05:07:10

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue

David Howells wrote:
> Tejun Heo <[email protected]> wrote:
>
>> Given that slow-work isn't being used too extensively yet, I was
>> thinking whether that part could be pushed down to the caller. Or, we
>> can also wrap work and export an interface which supports the get/put
>> reference.
>
> The caller of what?

The user of the API. It can be implemented there too, right?

> I found the refcounting much easier to manage in slow-work when slow-work
> actively got/put refs on the work items it was queueing. The reason for that
> is that slow-work can handle the queue/queue races and the requeue/execute
> races much more efficiently.
>
> Part of this was due to the fact I wanted to prevent re-entry into the work
> executor, and to do that I had maintenance flags in the work item struct - but
> that meant that slow-work had to modify the work item after execution.
>
> So I should adjust point 1 on my list.
>
> (1) Work items can be requeued whilst they are executing, but the execution
> function will not be re-entered until after the current execution
> completes; instead the execution will be deferred.

This is already guaranteed on a single cpu, so unless a work ends up
being scheduled on a different cpu, it will be okay. This actually is
about the same problem as how to support singlethread workqueue. I'm
not entirely sure how to choose the cpu for such works yet.

> One possible problem with assuming that you can no longer access the work item
> after you call the execution function is that it's slightly dodgy to retain
> the pointer to it to prevent reentry as the item can be destroyed, reallocated
> and queued before the execution function returns.

All of the above is already implemented, to avoid running the same work
in parallel on the same cpu and for flushing.

>> Binding is usually beneficial and doesn't matter for IO intensive
>> ones, so...
>
> The scenario I'm thinking of is this: someone who has an NFS volume cached
> through FS-Cache does a tar of a large tree of files (say a kernel source
> tree). FS-Cache adds a long duration work item for each of those files
> (~32000) to create structure in the cache. Will all of those wind up bound to
> the same CPU as was running tar?

Yeap, something to think about. I considered adding a workqueue
which isn't bound to any cpu to serve that type of workload, but it
seemed too complex for the problem. Maybe simple round robin with
per-cpu throttling should do the trick?
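
Something as dumb as the following, maybe - illustration only, with
made-up names and an arbitrary threshold:

#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

#define MAX_UNBOUND_WORKS_PER_CPU	16	/* arbitrary */

static DEFINE_PER_CPU(unsigned int, unbound_work_count);
static unsigned int last_unbound_cpu;
static DEFINE_SPINLOCK(unbound_lock);

/* Pick the next online cpu round-robin, skipping cpus which already
 * have too many unbound works queued on them. */
static unsigned int pick_unbound_cpu(void)
{
	unsigned int cpu, tries = 0;

	spin_lock(&unbound_lock);
	cpu = last_unbound_cpu;
	do {
		cpu = cpumask_next(cpu, cpu_online_mask);
		if (cpu >= nr_cpu_ids)
			cpu = cpumask_first(cpu_online_mask);
		tries++;
	} while (per_cpu(unbound_work_count, cpu) >=
		 MAX_UNBOUND_WORKS_PER_CPU && tries < num_online_cpus());

	per_cpu(unbound_work_count, cpu)++;	/* dropped when the work finishes */
	last_unbound_cpu = cpu;
	spin_unlock(&unbound_lock);

	return cpu;
}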

Thanks.

--
tejun

2009-10-03 19:35:45

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 01/19] freezer: don't get over-anxious while waiting

On Saturday 03 October 2009, Tejun Heo wrote:
> Hello, Rafael.
>
> Rafael J. Wysocki wrote:
> > Tejun, do you want me to take the patch or is it more convenient to you to
> > push it yourself?
>
> Please take the patch. As long as it's in a stable git tree, it
> should be fine for me.

OK, I'll put it into suspend-2.6/for-linus, but first it'll stay in
suspend-2.6/linux-next for a few days.

> > Also, care to respond to my previous comment?
>
> About the comment? Dropping it is fine. Do you want me to resend the
> patch w/o the comment?

No, thanks.

Best,
Rafael

2009-10-04 08:39:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue

On Thu, 2009-10-01 at 17:08 +0900, Tejun Heo wrote:
> Hello, all.
>
> This rather large patchset implements concurrency managed workqueue.
> It's not complete yet. Singlethread workqueue handling needs more
> work and workqueue users need to be audited and simplified and async
> and slow-work should be reimplemented in terms of workqueue. Although
> this patchset currently adds ~2000 lines of code, I'm fairly
> optimistic that after the whole conversion is done, it would be a net
> decrease in lines of code.
>
> This patchset reimplements workqueue such that it auto-regulates
> concurrency and thus relieves its users from the managing duty. It
> works by managing single shared pool of per-cpu workers and hooking
> into the scheduler to get notifications about workers going to sleep
> and waking up. Using the mechanism, workqueue implementation keeps
> track of the current level of concurrency and schedules only the
> necessary number of workers to keep the cpu occupied.
>
> Concurrency managed workqueue has the following benefits.
>
> * Workqueue users no longer have to worry about managing concurrency
> and, in most cases, deadlocks. The workqueue will manage it
> automatically and unless the deadlock chain involves many (currently
> 127) works, it won't happen.
>
> * There's one single shared pool of workers per cpu and one rescuer
> for each workqueue which requires it, so there are far fewer number
> of kthreads.
>
> * More efficient. Although it adds considerable amount of code, the
> code added to hot path isn't big and works will be executed on the
> local cpu and in batch as much as possible using minimal number of
> kthreads leading to fewer task switches and lower cache
> footprint. <NEED SOME BACKING NUMBERS>
>
> * As concurrency is no longer a problem, most types of asynchronous
> jobs can be done using generic workqueue and other async mechanisms,
> including slow-work, async and adhoc subsystem custom ones, can be
> removed. ie. It can serve as the unified async thread pool
> mechanism.
>
> Please read the patch description of the last patch for more details.
>
> This patchset contains the following 19 patches and most of these are
> not signed off yet.

Like Linus, I dislike the sched_class bits (as in really hate them).

Also, from a quick look it looks like this scheme does not allow
priority inheritance of worklets, like we used to do in -rt.

2009-10-04 08:48:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue

On Thu, 2009-10-01 at 07:49 -0700, Andrew Morton wrote:
> This approach would appear to rule out the option of setting a work
> thread's state (scheduling policy, scheduling priority, uid, etc) to
> anything other than some default.
>
> I guess that's unlikely to be a problem if we haven't yet had a need to
> do that, but I'd be a bit surprised to discover that nobody has done
> that sort of thing yet? Nobody has niced up their workqueue threads?

-rt actually used to do PI on worklets, and that sure flipped the
scheduling class around a lot.

2009-10-06 23:27:36

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 07/19] stop_machine: reimplement without using workqueue

On Thu, 1 Oct 2009 05:39:06 pm Tejun Heo wrote:
> stop_machine() is the only user of RT workqueue. Reimplement it using
> kthreads directly and rip RT support from workqueue. This is in
> preparation of concurrency managed workqueue.

Seems reasonable, but:

> void stop_machine_destroy(void)
> {
> + get_online_cpus();
> mutex_lock(&setup_lock);

Did you test hotplug cpu after this change? Maybe this works fine, but it's
always the one we break when we play with this code :(

Thanks!
Rusty.

2009-10-06 23:43:33

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 07/19] stop_machine: reimplement without using workqueue

Rusty Russell wrote:
> On Thu, 1 Oct 2009 05:39:06 pm Tejun Heo wrote:
>> stop_machine() is the only user of RT workqueue. Reimplement it using
>> kthreads directly and rip RT support from workqueue. This is in
>> preparation of concurrency managed workqueue.
>
> Seems reasonable, but:
>
>> void stop_machine_destroy(void)
>> {
>> + get_online_cpus();
>> mutex_lock(&setup_lock);
>
> Did you test hotplug cpu after this change? Maybe this works fine, but it's
> always the one we break when we play with this code :(

Yeap, played with it and seems to work fine.

Thanks.

--
tejun