2009-01-16 02:26:27

by Miao Xie

Subject: [PATCH] cpuset: fix possible deadlock in async_rebuild_sched_domains

Lockdep reported a possible circular locking dependency when we tested
cpuset on a NUMA/fake-NUMA box.

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.29-rc1-00224-ga652504 #111
-------------------------------------------------------
bash/2968 is trying to acquire lock:
(events){--..}, at: [<ffffffff8024c8cd>] flush_work+0x24/0xd8

but task is already holding lock:
(cgroup_mutex){--..}, at: [<ffffffff8026ad1e>] cgroup_lock_live_group+0x12/0x29

which lock already depends on the new lock.
......
-------------------------------------------------------

Steps to reproduce:
# mkdir /dev/cpuset
# mount -t cpuset xxx /dev/cpuset
# mkdir /dev/cpuset/0
# echo 0 > /dev/cpuset/0/cpus
# echo 0 > /dev/cpuset/0/mems
# echo 1 > /dev/cpuset/0/memory_migrate
# cat /dev/zero > /dev/null &
# echo $! > /dev/cpuset/0/tasks

This is because async_rebuild_sched_domains has the following lock sequence:
run_workqueue(async_rebuild_sched_domains)
-> do_rebuild_sched_domains -> cgroup_lock

But attaching tasks when memory_migrate is set has the following:
cgroup_lock_live_group(cgroup_tasks_write)
-> do_migrate_pages -> flush_work
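
For illustration only (not part of the patch), the inverted ordering can be
modelled in user space with two plain mutexes; flush_work() waiting on the
shared "events" workqueue is approximated by taking the workqueue's lock:

/* Standalone pthread sketch of the inverted lock order; not kernel code. */
#include <pthread.h>

static pthread_mutex_t cgroup_mutex_m = PTHREAD_MUTEX_INITIALIZER; /* models cgroup_mutex */
static pthread_mutex_t events_wq_m = PTHREAD_MUTEX_INITIALIZER;    /* models the "events" workqueue */

/* Path A: the work item runs on the shared workqueue, then takes cgroup_mutex. */
static void *rebuild_path(void *unused)
{
	(void)unused;
	pthread_mutex_lock(&events_wq_m);	/* run_workqueue() */
	pthread_mutex_lock(&cgroup_mutex_m);	/* do_rebuild_sched_domains() -> cgroup_lock() */
	pthread_mutex_unlock(&cgroup_mutex_m);
	pthread_mutex_unlock(&events_wq_m);
	return NULL;
}

/* Path B: the writer holds cgroup_mutex, then waits for the workqueue. */
static void *attach_path(void *unused)
{
	(void)unused;
	pthread_mutex_lock(&cgroup_mutex_m);	/* cgroup_lock_live_group() */
	pthread_mutex_lock(&events_wq_m);	/* do_migrate_pages() -> flush_work() */
	pthread_mutex_unlock(&events_wq_m);
	pthread_mutex_unlock(&cgroup_mutex_m);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	/*
	 * Running both paths concurrently can hang: A holds the workqueue
	 * lock and wants cgroup_mutex, B holds cgroup_mutex and wants the
	 * workqueue lock -- the ABBA cycle lockdep warns about above.
	 */
	pthread_create(&a, NULL, rebuild_path, NULL);
	pthread_create(&b, NULL, attach_path, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}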

This patch fixes it by using a separate workqueue thread.

Signed-off-by: Miao Xie <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Cc: Max Krasnyansky <[email protected]>
---
kernel/cpuset.c | 13 ++++++++++++-
1 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index a856788..f76db9d 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -61,6 +61,14 @@
#include <linux/cgroup.h>

/*
+ * Workqueue for cpuset related tasks.
+ *
+ * Using kevent workqueue may cause deadlock when memory_migrate
+ * is set. So we create a separate workqueue thread for cpuset.
+ */
+static struct workqueue_struct *cpuset_wq;
+
+/*
* Tracks how many cpusets are currently defined in system.
* When there is only one cpuset (the root cpuset) we can
* short circuit some hooks.
@@ -831,7 +839,7 @@ static DECLARE_WORK(rebuild_sched_domains_work, do_rebuild_sched_domains);
*/
static void async_rebuild_sched_domains(void)
{
- schedule_work(&rebuild_sched_domains_work);
+ queue_work(cpuset_wq, &rebuild_sched_domains_work);
}

/*
@@ -2111,6 +2119,9 @@ void __init cpuset_init_smp(void)

hotcpu_notifier(cpuset_track_online_cpus, 0);
hotplug_memory_notifier(cpuset_track_online_nodes, 10);
+
+ cpuset_wq = create_singlethread_workqueue("cpuset");
+ BUG_ON(!cpuset_wq);
}

/**
--
1.6.0.3


2009-01-16 03:34:32

by Lai Jiangshan

Subject: Re: [PATCH] cpuset: fix possible deadlock in async_rebuild_sched_domains


But queuing work to another thread adds some overhead for cpuset. And a
new, separate workqueue thread is wasteful: that thread would be sleeping
most of the time.

Here is an alternative, more efficient fix:

This patch adds cgroup_queue_defer_work(). Queued works are deferred and
processed after cgroup_mutex has been released. This adds only very little
overhead to cgroup_unlock()'s fast path.

Lai

From: Lai Jiangshan <[email protected]>

Lockdep reported a possible circular locking dependency when we tested
cpuset on a NUMA/fake-NUMA box.

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.29-rc1-00224-ga652504 #111
-------------------------------------------------------
bash/2968 is trying to acquire lock:
(events){--..}, at: [<ffffffff8024c8cd>] flush_work+0x24/0xd8

but task is already holding lock:
(cgroup_mutex){--..}, at: [<ffffffff8026ad1e>] cgroup_lock_live_group+0x12/0x29

which lock already depends on the new lock.
......
-------------------------------------------------------

Steps to reproduce:
# mkdir /dev/cpuset
# mount -t cpuset xxx /dev/cpuset
# mkdir /dev/cpuset/0
# echo 0 > /dev/cpuset/0/cpus
# echo 0 > /dev/cpuset/0/mems
# echo 1 > /dev/cpuset/0/memory_migrate
# cat /dev/zero > /dev/null &
# echo $! > /dev/cpuset/0/tasks

This is because async_rebuild_sched_domains has the following lock sequence:
run_workqueue(async_rebuild_sched_domains)
-> do_rebuild_sched_domains -> cgroup_lock

But attaching tasks when memory_migrate is set has the following:
cgroup_lock_live_group(cgroup_tasks_write)
-> do_migrate_pages -> flush_work

This can be fixed by using a separate workqueue thread.

But queuing work to another thread adds some overhead for cpuset. And a
new, separate workqueue thread is wasteful: that thread would be sleeping
most of the time.

This patch adds cgroup_queue_defer_work(). Queued works are deferred and
processed after cgroup_mutex has been released. This adds only very little
overhead to cgroup_unlock()'s fast path.

Reported-by: Miao Xie <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Cc: Max Krasnyansky <[email protected]>
---
include/linux/cgroup.h | 13 ++++
kernel/cgroup.c | 139 ++++++++++++++++++++++++++++++++++---------------
kernel/cpuset.c | 28 ++++-----
3 files changed, 125 insertions(+), 55 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index e267e62..bb025ad 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -437,6 +437,19 @@ void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
int cgroup_scan_tasks(struct cgroup_scanner *scan);
int cgroup_attach_task(struct cgroup *, struct task_struct *);

+struct cgroup_defer_work {
+ struct list_head list;
+ void (*func)(struct cgroup_defer_work *);
+};
+
+#define CGROUP_DEFER_WORK(name, function) \
+ struct cgroup_defer_work name = { \
+ .list = LIST_HEAD_INIT((name).list), \
+ .func = (function), \
+ };
+
+int cgroup_queue_defer_work(struct cgroup_defer_work *defer_work);
+
#else /* !CONFIG_CGROUPS */

static inline int cgroup_init_early(void) { return 0; }
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index c298310..3036723 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -540,6 +540,7 @@ void cgroup_lock(void)
mutex_lock(&cgroup_mutex);
}

+static void cgroup_flush_defer_work_locked(void);
/**
* cgroup_unlock - release lock on cgroup changes
*
@@ -547,9 +548,67 @@ void cgroup_lock(void)
*/
void cgroup_unlock(void)
{
+ cgroup_flush_defer_work_locked();
mutex_unlock(&cgroup_mutex);
}

+static LIST_HEAD(defer_work_list);
+
+/* flush deferred works with cgroup_mutex released */
+static void cgroup_flush_defer_work_locked(void)
+{
+ static bool running_dely_work;
+
+ if (likely(list_empty(&defer_work_list)))
+ return;
+
+ /*
+ * Insure it's not recursive and also
+ * insure deferred works are run orderly.
+ */
+ if (running_dely_work)
+ return;
+ running_dely_work = true;
+
+ for ( ; ; ) {
+ struct cgroup_defer_work *defer_work;
+
+ defer_work = list_first_entry(&defer_work_list,
+ struct cgroup_defer_work, list);
+ list_del_init(&defer_work->list);
+ mutex_unlock(&cgroup_mutex);
+
+ defer_work->func(defer_work);
+
+ mutex_lock(&cgroup_mutex);
+ if (list_empty(&defer_work_list))
+ break;
+ }
+
+ running_dely_work = false;
+}
+
+/**
+ * cgroup_queue_defer_work - queue a deferred work
+ * @defer_work: work to queue
+ *
+ * Returns 0 if @defer_work was already on the queue, non-zero otherwise.
+ *
+ * Must called when cgroup_mutex held.
+ * The defered work will be run after cgroup_mutex released.
+ */
+int cgroup_queue_defer_work(struct cgroup_defer_work *defer_work)
+{
+ int ret = 0;
+
+ if (list_empty(&defer_work->list)) {
+ list_add_tail(&defer_work->list, &defer_work_list);
+ ret = 1;
+ }
+
+ return ret;
+}
+
/*
* A couple of forward declarations required, due to cyclic reference loop:
* cgroup_mkdir -> cgroup_create -> cgroup_populate_dir ->
@@ -616,7 +675,7 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
* agent */
synchronize_rcu();

- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
/*
* Release the subsystem state objects.
*/
@@ -624,7 +683,7 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
ss->destroy(ss, cgrp);

cgrp->root->number_of_cgroups--;
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();

/*
* Drop the active superblock reference that we took when we
@@ -761,14 +820,14 @@ static int cgroup_show_options(struct seq_file *seq, struct vfsmount *vfs)
struct cgroupfs_root *root = vfs->mnt_sb->s_fs_info;
struct cgroup_subsys *ss;

- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
for_each_subsys(root, ss)
seq_printf(seq, ",%s", ss->name);
if (test_bit(ROOT_NOPREFIX, &root->flags))
seq_puts(seq, ",noprefix");
if (strlen(root->release_agent_path))
seq_printf(seq, ",release_agent=%s", root->release_agent_path);
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
return 0;
}

@@ -843,7 +902,7 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data)
struct cgroup_sb_opts opts;

mutex_lock(&cgrp->dentry->d_inode->i_mutex);
- mutex_lock(&cgroup_mutex);
+ cgroup_lock();

/* See what subsystems are wanted */
ret = parse_cgroupfs_options(data, &opts);
@@ -867,7 +926,7 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data)
out_unlock:
if (opts.release_agent)
kfree(opts.release_agent);
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
mutex_unlock(&cgrp->dentry->d_inode->i_mutex);
return ret;
}
@@ -1015,7 +1074,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
inode = sb->s_root->d_inode;

mutex_lock(&inode->i_mutex);
- mutex_lock(&cgroup_mutex);
+ cgroup_lock();

/*
* We're accessing css_set_count without locking
@@ -1026,14 +1085,14 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
*/
ret = allocate_cg_links(css_set_count, &tmp_cg_links);
if (ret) {
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
mutex_unlock(&inode->i_mutex);
goto drop_new_super;
}

ret = rebind_subsystems(root, root->subsys_bits);
if (ret == -EBUSY) {
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
mutex_unlock(&inode->i_mutex);
goto free_cg_links;
}
@@ -1068,7 +1127,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,

cgroup_populate_dir(root_cgrp);
mutex_unlock(&inode->i_mutex);
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
}

return simple_set_mnt(mnt, sb);
@@ -1094,7 +1153,7 @@ static void cgroup_kill_sb(struct super_block *sb) {
BUG_ON(!list_empty(&cgrp->children));
BUG_ON(!list_empty(&cgrp->sibling));

- mutex_lock(&cgroup_mutex);
+ cgroup_lock();

/* Rebind all subsystems back to the default hierarchy */
ret = rebind_subsystems(root, 0);
@@ -1118,7 +1177,7 @@ static void cgroup_kill_sb(struct super_block *sb) {
list_del(&root->root_list);
root_count--;

- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();

kfree(root);
kill_litter_super(sb);
@@ -1345,9 +1404,9 @@ enum cgroup_filetype {
*/
bool cgroup_lock_live_group(struct cgroup *cgrp)
{
- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
if (cgroup_is_removed(cgrp)) {
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
return false;
}
return true;
@@ -2392,7 +2451,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
* fs */
atomic_inc(&sb->s_active);

- mutex_lock(&cgroup_mutex);
+ cgroup_lock();

init_cgroup_housekeeping(cgrp);

@@ -2427,7 +2486,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
err = cgroup_populate_dir(cgrp);
/* If err < 0, we have a half-filled directory - oh well ;) */

- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
mutex_unlock(&cgrp->dentry->d_inode->i_mutex);

return 0;
@@ -2444,7 +2503,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
ss->destroy(ss, cgrp);
}

- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();

/* Release the reference count that we took on the superblock */
deactivate_super(sb);
@@ -2550,16 +2609,16 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)

/* the vfs holds both inode->i_mutex already */

- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
if (atomic_read(&cgrp->count) != 0) {
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
return -EBUSY;
}
if (!list_empty(&cgrp->children)) {
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
return -EBUSY;
}
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();

/*
* Call pre_destroy handlers of subsys. Notify subsystems
@@ -2567,13 +2626,13 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
*/
cgroup_call_pre_destroy(cgrp);

- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
parent = cgrp->parent;

if (atomic_read(&cgrp->count)
|| !list_empty(&cgrp->children)
|| !cgroup_clear_css_refs(cgrp)) {
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
return -EBUSY;
}

@@ -2598,7 +2657,7 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
set_bit(CGRP_RELEASABLE, &parent->flags);
check_for_release(parent);

- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
return 0;
}

@@ -2752,7 +2811,7 @@ static int proc_cgroup_show(struct seq_file *m, void *v)

retval = 0;

- mutex_lock(&cgroup_mutex);
+ cgroup_lock();

for_each_active_root(root) {
struct cgroup_subsys *ss;
@@ -2774,7 +2833,7 @@ static int proc_cgroup_show(struct seq_file *m, void *v)
}

out_unlock:
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
put_task_struct(tsk);
out_free:
kfree(buf);
@@ -2801,14 +2860,14 @@ static int proc_cgroupstats_show(struct seq_file *m, void *v)
int i;

seq_puts(m, "#subsys_name\thierarchy\tnum_cgroups\tenabled\n");
- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
seq_printf(m, "%s\t%lu\t%d\t%d\n",
ss->name, ss->root->subsys_bits,
ss->root->number_of_cgroups, !ss->disabled);
}
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
return 0;
}

@@ -2984,11 +3043,11 @@ int cgroup_clone(struct task_struct *tsk, struct cgroup_subsys *subsys,

/* First figure out what hierarchy and cgroup we're dealing
* with, and pin them so we can drop cgroup_mutex */
- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
again:
root = subsys->root;
if (root == &rootnode) {
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
return 0;
}
task_lock(tsk);
@@ -2998,14 +3057,14 @@ int cgroup_clone(struct task_struct *tsk, struct cgroup_subsys *subsys,
/* Pin the hierarchy */
if (!atomic_inc_not_zero(&parent->root->sb->s_active)) {
/* We race with the final deactivate_super() */
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
return 0;
}

/* Keep the cgroup alive */
get_css_set(cg);
task_unlock(tsk);
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();

/* Now do the VFS work to create a cgroup */
inode = parent->dentry->d_inode;
@@ -3036,7 +3095,7 @@ int cgroup_clone(struct task_struct *tsk, struct cgroup_subsys *subsys,
/* The cgroup now exists. Retake cgroup_mutex and check
* that we're still in the same state that we thought we
* were. */
- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
if ((root != subsys->root) ||
(parent != task_cgroup(tsk, subsys->subsys_id))) {
/* Aargh, we raced ... */
@@ -3061,14 +3120,14 @@ int cgroup_clone(struct task_struct *tsk, struct cgroup_subsys *subsys,

/* All seems fine. Finish by moving the task into the new cgroup */
ret = cgroup_attach_task(child, tsk);
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();

out_release:
mutex_unlock(&inode->i_mutex);

- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
put_css_set(cg);
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
deactivate_super(parent->root->sb);
return ret;
}
@@ -3162,7 +3221,7 @@ void __css_put(struct cgroup_subsys_state *css)
static void cgroup_release_agent(struct work_struct *work)
{
BUG_ON(work != &release_agent_work);
- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
spin_lock(&release_list_lock);
while (!list_empty(&release_list)) {
char *argv[3], *envp[3];
@@ -3196,16 +3255,16 @@ static void cgroup_release_agent(struct work_struct *work)
/* Drop the lock while we invoke the usermode helper,
* since the exec could involve hitting disk and hence
* be a slow process */
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);
- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
continue_free:
kfree(pathbuf);
kfree(agentbuf);
spin_lock(&release_list_lock);
}
spin_unlock(&release_list_lock);
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
}

static int __init cgroup_disable(char *str)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 647c77a..f2dedb0 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -57,7 +57,6 @@
#include <asm/uaccess.h>
#include <asm/atomic.h>
#include <linux/mutex.h>
-#include <linux/workqueue.h>
#include <linux/cgroup.h>

/*
@@ -789,7 +788,7 @@ done:
* to the cpuset pseudo-filesystem, because it cannot be called
* from code that already holds cgroup_mutex.
*/
-static void do_rebuild_sched_domains(struct work_struct *unused)
+static void do_rebuild_sched_domains(struct cgroup_defer_work *unused)
{
struct sched_domain_attr *attr;
struct cpumask *doms;
@@ -808,10 +807,10 @@ static void do_rebuild_sched_domains(struct work_struct *unused)
put_online_cpus();
}

-static DECLARE_WORK(rebuild_sched_domains_work, do_rebuild_sched_domains);
+static CGROUP_DEFER_WORK(rebuild_sched_domains_work, do_rebuild_sched_domains);

/*
- * Rebuild scheduler domains, asynchronously via workqueue.
+ * Rebuild scheduler domains, defer it after cgroup_lock released.
*
* If the flag 'sched_load_balance' of any cpuset with non-empty
* 'cpus' changes, or if the 'cpus' allowed changes in any cpuset
@@ -826,19 +825,18 @@ static DECLARE_WORK(rebuild_sched_domains_work, do_rebuild_sched_domains);
*
* So in order to avoid an ABBA deadlock, the cpuset code handling
* these user changes delegates the actual sched domain rebuilding
- * to a separate workqueue thread, which ends up processing the
- * above do_rebuild_sched_domains() function.
+ * to a deferred work queue, and cgroup_unlock() will flush the deferred
+ * work queue and process the above do_rebuild_sched_domains() function.
*/
-static void async_rebuild_sched_domains(void)
+static void defer_rebuild_sched_domains(void)
{
- schedule_work(&rebuild_sched_domains_work);
+ cgroup_queue_defer_work(&rebuild_sched_domains_work);
}

/*
* Accomplishes the same scheduler domain rebuild as the above
- * async_rebuild_sched_domains(), however it directly calls the
- * rebuild routine synchronously rather than calling it via an
- * asynchronous work thread.
+ * defer_rebuild_sched_domains(), however it directly calls the
+ * rebuild routine synchronously rather than deferring it.
*
* This can only be called from code that is not holding
* cgroup_mutex (not nested in a cgroup_lock() call.)
@@ -965,7 +963,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
heap_free(&heap);

if (is_load_balanced)
- async_rebuild_sched_domains();
+ defer_rebuild_sched_domains();
return 0;
}

@@ -1191,7 +1189,7 @@ static int update_relax_domain_level(struct cpuset *cs, s64 val)
cs->relax_domain_level = val;
if (!cpumask_empty(cs->cpus_allowed) &&
is_sched_load_balance(cs))
- async_rebuild_sched_domains();
+ defer_rebuild_sched_domains();
}

return 0;
@@ -1234,7 +1232,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
mutex_unlock(&callback_mutex);

if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
- async_rebuild_sched_domains();
+ defer_rebuild_sched_domains();

out:
free_trial_cpuset(trialcs);
@@ -1821,7 +1819,7 @@ static struct cgroup_subsys_state *cpuset_create(
/*
* If the cpuset being removed has its flag 'sched_load_balance'
* enabled, then simulate turning sched_load_balance off, which
- * will call async_rebuild_sched_domains().
+ * will call defer_rebuild_sched_domains().
*/

static void cpuset_destroy(struct cgroup_subsys *ss, struct cgroup *cont)

2009-01-16 20:58:16

by Andrew Morton

Subject: Re: [PATCH] cpuset: fix possible deadlock in async_rebuild_sched_domains

On Fri, 16 Jan 2009 11:33:34 +0800
Lai Jiangshan <[email protected]> wrote:

>
> But queuing a work to an other thread is adding some overhead for cpuset.
> And a new separate workqueue thread is wasteful, this thread is sleeping
> at most time.
>
> This is an effective fix:
>
> This patch add cgroup_queue_defer_work(). And the works will be deferring
> processed with cgroup_mutex released. And this patch just add very very
> little overhead for cgroup_unlock()'s fast path.
>
> Lai
>
> From: Lai Jiangshan <[email protected]>
>
> Lockdep reported some possible circular locking info when we tested cpuset on
> NUMA/fake NUMA box.
>
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.29-rc1-00224-ga652504 #111
> -------------------------------------------------------
> bash/2968 is trying to acquire lock:
> (events){--..}, at: [<ffffffff8024c8cd>] flush_work+0x24/0xd8
>
> but task is already holding lock:
> (cgroup_mutex){--..}, at: [<ffffffff8026ad1e>] cgroup_lock_live_group+0x12/0x29
>
> which lock already depends on the new lock.
> ......
> -------------------------------------------------------
>
> Steps to reproduce:
> # mkdir /dev/cpuset
> # mount -t cpuset xxx /dev/cpuset
> # mkdir /dev/cpuset/0
> # echo 0 > /dev/cpuset/0/cpus
> # echo 0 > /dev/cpuset/0/mems
> # echo 1 > /dev/cpuset/0/memory_migrate
> # cat /dev/zero > /dev/null &
> # echo $! > /dev/cpuset/0/tasks
>
> This is because async_rebuild_sched_domains has the following lock sequence:
> run_workqueue(async_rebuild_sched_domains)
> -> do_rebuild_sched_domains -> cgroup_lock
>
> But, attaching tasks when memory_migrate is set has following:
> cgroup_lock_live_group(cgroup_tasks_write)
> -> do_migrate_pages -> flush_work
>
> This can be fixed by using a separate workqueue thread.
>
> But queuing a work to an other thread is adding some overhead for cpuset.
> And a new separate workqueue thread is wasteful, this thread is sleeping
> at most time.
>
> This patch add cgroup_queue_defer_work(). And the works will be deferring
> processed with cgroup_mutex released. And this patch just add very very
> little overhead for cgroup_unlock()'s fast path.

hm. There's little discussion for such a large-looking patch?

> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -437,6 +437,19 @@ void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
> int cgroup_scan_tasks(struct cgroup_scanner *scan);
> int cgroup_attach_task(struct cgroup *, struct task_struct *);
>
> +struct cgroup_defer_work {
> + struct list_head list;
> + void (*func)(struct cgroup_defer_work *);
> +};
> +
> +#define CGROUP_DEFER_WORK(name, function) \
> + struct cgroup_defer_work name = { \
> + .list = LIST_HEAD_INIT((name).list), \
> + .func = (function), \
> + };
> +
> +int cgroup_queue_defer_work(struct cgroup_defer_work *defer_work);

These should be called "cgroup_deferred_work" and
"queue_deferred_work", I think? Maybe not.

> +static void cgroup_flush_defer_work_locked(void);

cgroup_flush_deferred_work_locked()?

> /**
> * cgroup_unlock - release lock on cgroup changes
> *
> @@ -547,9 +548,67 @@ void cgroup_lock(void)
> */
> void cgroup_unlock(void)
> {
> + cgroup_flush_defer_work_locked();
> mutex_unlock(&cgroup_mutex);
> }
>
> +static LIST_HEAD(defer_work_list);

Please add a comment telling readers what the locking protocol is for
this list.

> +/* flush deferred works with cgroup_mutex released */
> +static void cgroup_flush_defer_work_locked(void)
> +{
> + static bool running_dely_work;

"running_delayed_work"

> + if (likely(list_empty(&defer_work_list)))
> + return;
> +
> + /*
> + * Insure it's not recursive and also
> + * insure deferred works are run orderly.

"ensure"

> + */
> + if (running_dely_work)
> + return;
> + running_dely_work = true;
> +
> + for ( ; ; ) {
> + struct cgroup_defer_work *defer_work;
> +
> + defer_work = list_first_entry(&defer_work_list,
> + struct cgroup_defer_work, list);
> + list_del_init(&defer_work->list);
> + mutex_unlock(&cgroup_mutex);
> +
> + defer_work->func(defer_work);
> +
> + mutex_lock(&cgroup_mutex);
> + if (list_empty(&defer_work_list))
> + break;
> + }
> +
> + running_dely_work = false;
> +}
> +
> +/**
> + * cgroup_queue_defer_work - queue a deferred work
> + * @defer_work: work to queue
> + *
> + * Returns 0 if @defer_work was already on the queue, non-zero otherwise.
> + *
> + * Must called when cgroup_mutex held.

"must be"

> + * The defered work will be run after cgroup_mutex released.

"deferred"

> + */
> +int cgroup_queue_defer_work(struct cgroup_defer_work *defer_work)
> +{
> + int ret = 0;
> +
> + if (list_empty(&defer_work->list)) {
> + list_add_tail(&defer_work->list, &defer_work_list);
> + ret = 1;
> + }
> +
> + return ret;
> +}
> +
> /*
> * A couple of forward declarations required, due to cyclic reference loop:
> * cgroup_mkdir -> cgroup_create -> cgroup_populate_dir ->
> @@ -616,7 +675,7 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
> * agent */
> synchronize_rcu();
>
> - mutex_lock(&cgroup_mutex);
> + cgroup_lock();

All these changes add rather a lot of noise. It would be better to do
this as two patches:

1: "convert open-coded mutex_lock(&cgroup_mutex) calls into cgroup_lock() calls"
2: "cpuset: fix possible deadlock in async_rebuild_sched_domains"

Although it doesn't matter a lot.


Also, as it now appears to be compulsory to use cgroup_lock(), you
should rename cgroup_mutex to something else, to prevent accidental
direct usage of the lock.

2009-01-18 08:06:47

by Lai Jiangshan

Subject: [PATCH 2/3] cgroup: introduce cgroup_queue_deferred_work()

Sometimes we need a lock to protect something, but that lock cannot
nest inside cgroup_lock, so the work has to be moved out of
cgroup_lock's critical region.

Using schedule_work() can move the work out of cgroup_lock's critical
region, but handing the work to another process is overkill. And if we
need flush_work() while cgroup_lock is held, schedule_work() does not
help, because flush_work() would deadlock.

Another solution is to defer the work and process it after cgroup_lock
has been released. This patch introduces cgroup_queue_deferred_work()
for queueing a cgroup_deferred_work.

Signed-off-by: Lai Jiangshan <[email protected]>
Cc: Max Krasnyansky <[email protected]>
Cc: Miao Xie <[email protected]>
---
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index e267e62..6d3e6dc 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -437,6 +437,19 @@ void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
int cgroup_scan_tasks(struct cgroup_scanner *scan);
int cgroup_attach_task(struct cgroup *, struct task_struct *);

+struct cgroup_deferred_work {
+ struct list_head list;
+ void (*func)(struct cgroup_deferred_work *);
+};
+
+#define CGROUP_DEFERRED_WORK(name, function) \
+ struct cgroup_deferred_work name = { \
+ .list = LIST_HEAD_INIT((name).list), \
+ .func = (function), \
+ };
+
+int cgroup_queue_deferred_work(struct cgroup_deferred_work *deferred_work);
+
#else /* !CONFIG_CGROUPS */

static inline int cgroup_init_early(void) { return 0; }
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index c298310..75a352b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -540,6 +540,7 @@ void cgroup_lock(void)
mutex_lock(&cgroup_mutex);
}

+static void cgroup_flush_deferred_work_locked(void);
/**
* cgroup_unlock - release lock on cgroup changes
*
@@ -547,9 +548,80 @@ void cgroup_lock(void)
*/
void cgroup_unlock(void)
{
+ cgroup_flush_deferred_work_locked();
mutex_unlock(&cgroup_mutex);
}

+/* deferred_work_list is protected by cgroup_mutex */
+static LIST_HEAD(deferred_work_list);
+
+/* flush deferred works with cgroup_lock released */
+static void cgroup_flush_deferred_work_locked(void)
+{
+ static bool running_deferred_work;
+
+ if (likely(list_empty(&deferred_work_list)))
+ return;
+
+ /*
+ * Ensure it's not recursive and also
+ * ensure deferred works are run orderly.
+ */
+ if (running_deferred_work)
+ return;
+ running_deferred_work = true;
+
+ for ( ; ; ) {
+ struct cgroup_deferred_work *deferred_work;
+
+ /* dequeue the first work, and mark it dequeued */
+ deferred_work = list_first_entry(&deferred_work_list,
+ struct cgroup_deferred_work, list);
+ list_del_init(&deferred_work->list);
+
+ mutex_unlock(&cgroup_mutex);
+
+ /*
+ * cgroup_mutex is released. The callback function can use
+ * cgroup_lock()/cgroup_unlock(). This behavior is safe
+ * for running_deferred_work is set to 'true'.
+ */
+ deferred_work->func(deferred_work);
+
+ /*
+ * regain cgroup_mutex to access deferred_work_list
+ * and running_deferred_work.
+ */
+ mutex_lock(&cgroup_mutex);
+
+ if (list_empty(&deferred_work_list))
+ break;
+ }
+
+ running_deferred_work = false;
+}
+
+/**
+ * cgroup_queue_deferred_work - queue a deferred work
+ * @deferred_work: work to queue.
+ *
+ * Returns 0 if @deferred_work was already on the queue, non-zero otherwise.
+ *
+ * Must be called when cgroup_lock held.
+ * The deferred work will be run after cgroup_lock released.
+ */
+int cgroup_queue_deferred_work(struct cgroup_deferred_work *deferred_work)
+{
+ int ret = 0;
+
+ if (list_empty(&deferred_work->list)) {
+ list_add_tail(&deferred_work->list, &deferred_work_list);
+ ret = 1;
+ }
+
+ return ret;
+}
+
/*
* A couple of forward declarations required, due to cyclic reference loop:
* cgroup_mkdir -> cgroup_create -> cgroup_populate_dir ->
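
For reference, a minimal usage sketch of the interface added above; the
function names below are made up for illustration, the real user is the
cpuset change in PATCH 3/3:

/* Hypothetical caller of the API above; names are illustrative only. */
static void my_deferred_func(struct cgroup_deferred_work *unused)
{
	/*
	 * Runs from cgroup_unlock() after cgroup_mutex has been dropped,
	 * so it may take cgroup_lock() itself or wait for other work
	 * without risking the flush_work() deadlock.
	 */
}

static CGROUP_DEFERRED_WORK(my_deferred_work, my_deferred_func);

static void my_write_handler(void)
{
	cgroup_lock();
	/* ... update state under cgroup_mutex ... */
	cgroup_queue_deferred_work(&my_deferred_work); /* returns 0 if already queued */
	cgroup_unlock(); /* flushes the deferred work list with the mutex released */
}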



2009-01-18 08:07:13

by Lai Jiangshan

Subject: [PATCH 1/3] cgroup: convert open-coded mutex_lock(&cgroup_mutex) calls into cgroup_lock() calls


Convert open-coded mutex_lock(&cgroup_mutex) calls into cgroup_lock()
calls and convert mutex_unlock(&cgroup_mutex) calls into cgroup_unlock()
calls.

Signed-off-by: Lai Jiangshan <[email protected]>
Cc: Max Krasnyansky <[email protected]>
Cc: Miao Xie <[email protected]>
---
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index c298310..75a352b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -616,7 +688,7 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
* agent */
synchronize_rcu();

- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
/*
* Release the subsystem state objects.
*/
@@ -624,7 +696,7 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
ss->destroy(ss, cgrp);

cgrp->root->number_of_cgroups--;
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();

/*
* Drop the active superblock reference that we took when we
@@ -761,14 +833,14 @@ static int cgroup_show_options(struct seq_file *seq, struct vfsmount *vfs)
struct cgroupfs_root *root = vfs->mnt_sb->s_fs_info;
struct cgroup_subsys *ss;

- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
for_each_subsys(root, ss)
seq_printf(seq, ",%s", ss->name);
if (test_bit(ROOT_NOPREFIX, &root->flags))
seq_puts(seq, ",noprefix");
if (strlen(root->release_agent_path))
seq_printf(seq, ",release_agent=%s", root->release_agent_path);
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
return 0;
}

@@ -843,7 +915,7 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data)
struct cgroup_sb_opts opts;

mutex_lock(&cgrp->dentry->d_inode->i_mutex);
- mutex_lock(&cgroup_mutex);
+ cgroup_lock();

/* See what subsystems are wanted */
ret = parse_cgroupfs_options(data, &opts);
@@ -867,7 +939,7 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data)
out_unlock:
if (opts.release_agent)
kfree(opts.release_agent);
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
mutex_unlock(&cgrp->dentry->d_inode->i_mutex);
return ret;
}
@@ -1015,7 +1087,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
inode = sb->s_root->d_inode;

mutex_lock(&inode->i_mutex);
- mutex_lock(&cgroup_mutex);
+ cgroup_lock();

/*
* We're accessing css_set_count without locking
@@ -1026,14 +1098,14 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
*/
ret = allocate_cg_links(css_set_count, &tmp_cg_links);
if (ret) {
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
mutex_unlock(&inode->i_mutex);
goto drop_new_super;
}

ret = rebind_subsystems(root, root->subsys_bits);
if (ret == -EBUSY) {
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
mutex_unlock(&inode->i_mutex);
goto free_cg_links;
}
@@ -1068,7 +1140,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,

cgroup_populate_dir(root_cgrp);
mutex_unlock(&inode->i_mutex);
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
}

return simple_set_mnt(mnt, sb);
@@ -1094,7 +1166,7 @@ static void cgroup_kill_sb(struct super_block *sb) {
BUG_ON(!list_empty(&cgrp->children));
BUG_ON(!list_empty(&cgrp->sibling));

- mutex_lock(&cgroup_mutex);
+ cgroup_lock();

/* Rebind all subsystems back to the default hierarchy */
ret = rebind_subsystems(root, 0);
@@ -1118,7 +1190,7 @@ static void cgroup_kill_sb(struct super_block *sb) {
list_del(&root->root_list);
root_count--;

- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();

kfree(root);
kill_litter_super(sb);
@@ -2392,7 +2464,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
* fs */
atomic_inc(&sb->s_active);

- mutex_lock(&cgroup_mutex);
+ cgroup_lock();

init_cgroup_housekeeping(cgrp);

@@ -2427,7 +2499,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
err = cgroup_populate_dir(cgrp);
/* If err < 0, we have a half-filled directory - oh well ;) */

- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
mutex_unlock(&cgrp->dentry->d_inode->i_mutex);

return 0;
@@ -2444,7 +2516,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
ss->destroy(ss, cgrp);
}

- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();

/* Release the reference count that we took on the superblock */
deactivate_super(sb);
@@ -2550,16 +2622,16 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)

/* the vfs holds both inode->i_mutex already */

- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
if (atomic_read(&cgrp->count) != 0) {
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
return -EBUSY;
}
if (!list_empty(&cgrp->children)) {
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
return -EBUSY;
}
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();

/*
* Call pre_destroy handlers of subsys. Notify subsystems
@@ -2567,13 +2639,13 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
*/
cgroup_call_pre_destroy(cgrp);

- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
parent = cgrp->parent;

if (atomic_read(&cgrp->count)
|| !list_empty(&cgrp->children)
|| !cgroup_clear_css_refs(cgrp)) {
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
return -EBUSY;
}

@@ -2598,7 +2670,7 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
set_bit(CGRP_RELEASABLE, &parent->flags);
check_for_release(parent);

- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
return 0;
}

@@ -2752,7 +2824,7 @@ static int proc_cgroup_show(struct seq_file *m, void *v)

retval = 0;

- mutex_lock(&cgroup_mutex);
+ cgroup_lock();

for_each_active_root(root) {
struct cgroup_subsys *ss;
@@ -2774,7 +2846,7 @@ static int proc_cgroup_show(struct seq_file *m, void *v)
}

out_unlock:
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
put_task_struct(tsk);
out_free:
kfree(buf);
@@ -2801,14 +2873,14 @@ static int proc_cgroupstats_show(struct seq_file *m, void *v)
int i;

seq_puts(m, "#subsys_name\thierarchy\tnum_cgroups\tenabled\n");
- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
seq_printf(m, "%s\t%lu\t%d\t%d\n",
ss->name, ss->root->subsys_bits,
ss->root->number_of_cgroups, !ss->disabled);
}
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
return 0;
}

@@ -2984,11 +3056,11 @@ int cgroup_clone(struct task_struct *tsk, struct cgroup_subsys *subsys,

/* First figure out what hierarchy and cgroup we're dealing
* with, and pin them so we can drop cgroup_mutex */
- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
again:
root = subsys->root;
if (root == &rootnode) {
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
return 0;
}
task_lock(tsk);
@@ -2998,14 +3070,14 @@ int cgroup_clone(struct task_struct *tsk, struct cgroup_subsys *subsys,
/* Pin the hierarchy */
if (!atomic_inc_not_zero(&parent->root->sb->s_active)) {
/* We race with the final deactivate_super() */
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
return 0;
}

/* Keep the cgroup alive */
get_css_set(cg);
task_unlock(tsk);
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();

/* Now do the VFS work to create a cgroup */
inode = parent->dentry->d_inode;
@@ -3036,7 +3108,7 @@ int cgroup_clone(struct task_struct *tsk, struct cgroup_subsys *subsys,
/* The cgroup now exists. Retake cgroup_mutex and check
* that we're still in the same state that we thought we
* were. */
- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
if ((root != subsys->root) ||
(parent != task_cgroup(tsk, subsys->subsys_id))) {
/* Aargh, we raced ... */
@@ -3061,14 +3133,14 @@ int cgroup_clone(struct task_struct *tsk, struct cgroup_subsys *subsys,

/* All seems fine. Finish by moving the task into the new cgroup */
ret = cgroup_attach_task(child, tsk);
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();

out_release:
mutex_unlock(&inode->i_mutex);

- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
put_css_set(cg);
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
deactivate_super(parent->root->sb);
return ret;
}
@@ -3162,7 +3234,7 @@ void __css_put(struct cgroup_subsys_state *css)
static void cgroup_release_agent(struct work_struct *work)
{
BUG_ON(work != &release_agent_work);
- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
spin_lock(&release_list_lock);
while (!list_empty(&release_list)) {
char *argv[3], *envp[3];
@@ -3196,16 +3268,16 @@ static void cgroup_release_agent(struct work_struct *work)
/* Drop the lock while we invoke the usermode helper,
* since the exec could involve hitting disk and hence
* be a slow process */
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);
- mutex_lock(&cgroup_mutex);
+ cgroup_lock();
continue_free:
kfree(pathbuf);
kfree(agentbuf);
spin_lock(&release_list_lock);
}
spin_unlock(&release_list_lock);
- mutex_unlock(&cgroup_mutex);
+ cgroup_unlock();
}

static int __init cgroup_disable(char *str)


2009-01-18 08:07:43

by Lai Jiangshan

Subject: [PATCH 3/3] cpuset: fix possible deadlock in async_rebuild_sched_domains

Lockdep reported a possible circular locking dependency when we tested
cpuset on a NUMA/fake-NUMA box.

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.29-rc1-00224-ga652504 #111
-------------------------------------------------------
bash/2968 is trying to acquire lock:
(events){--..}, at: [<ffffffff8024c8cd>] flush_work+0x24/0xd8

but task is already holding lock:
(cgroup_mutex){--..}, at: [<ffffffff8026ad1e>] cgroup_lock_live_group+0x12/0x29

which lock already depends on the new lock.
......
-------------------------------------------------------

Steps to reproduce:
# mkdir /dev/cpuset
# mount -t cpuset xxx /dev/cpuset
# mkdir /dev/cpuset/0
# echo 0 > /dev/cpuset/0/cpus
# echo 0 > /dev/cpuset/0/mems
# echo 1 > /dev/cpuset/0/memory_migrate
# cat /dev/zero > /dev/null &
# echo $! > /dev/cpuset/0/tasks

This is because async_rebuild_sched_domains has the following lock sequence:
run_workqueue(async_rebuild_sched_domains)
-> do_rebuild_sched_domains -> cgroup_lock

But attaching tasks when memory_migrate is set has the following:
cgroup_lock_live_group(cgroup_tasks_write)
-> do_migrate_pages -> flush_work

This can be fixed by using a separate workqueue thread.

But queuing work to another thread adds some overhead for cpuset. And a
new, separate workqueue thread is wasteful: that thread would be sleeping
most of the time.

This patch adds cgroup_queue_deferred_work(). Queued works are deferred and
processed after cgroup_mutex has been released. This adds only very little
overhead to cgroup_unlock()'s fast path.

Reported-by: Miao Xie <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Cc: Max Krasnyansky <[email protected]>
---
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 647c77a..5a8c6b7 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -57,7 +57,6 @@
#include <asm/uaccess.h>
#include <asm/atomic.h>
#include <linux/mutex.h>
-#include <linux/workqueue.h>
#include <linux/cgroup.h>

/*
@@ -789,7 +788,7 @@ done:
* to the cpuset pseudo-filesystem, because it cannot be called
* from code that already holds cgroup_mutex.
*/
-static void do_rebuild_sched_domains(struct work_struct *unused)
+static void do_rebuild_sched_domains(struct cgroup_deferred_work *unused)
{
struct sched_domain_attr *attr;
struct cpumask *doms;
@@ -808,10 +807,11 @@ static void do_rebuild_sched_domains(struct work_struct *unused)
put_online_cpus();
}

-static DECLARE_WORK(rebuild_sched_domains_work, do_rebuild_sched_domains);
+static CGROUP_DEFERRED_WORK(rebuild_sched_domains_work,
+ do_rebuild_sched_domains);

/*
- * Rebuild scheduler domains, asynchronously via workqueue.
+ * Rebuild scheduler domains, defer it after cgroup_lock released.
*
* If the flag 'sched_load_balance' of any cpuset with non-empty
* 'cpus' changes, or if the 'cpus' allowed changes in any cpuset
@@ -826,19 +826,18 @@ static DECLARE_WORK(rebuild_sched_domains_work, do_rebuild_sched_domains);
*
* So in order to avoid an ABBA deadlock, the cpuset code handling
* these user changes delegates the actual sched domain rebuilding
- * to a separate workqueue thread, which ends up processing the
- * above do_rebuild_sched_domains() function.
+ * to a deferred work queue, and cgroup_unlock() will flush the deferred
+ * work queue and process the above do_rebuild_sched_domains() function.
*/
-static void async_rebuild_sched_domains(void)
+static void defer_rebuild_sched_domains(void)
{
- schedule_work(&rebuild_sched_domains_work);
+ cgroup_queue_deferred_work(&rebuild_sched_domains_work);
}

/*
* Accomplishes the same scheduler domain rebuild as the above
- * async_rebuild_sched_domains(), however it directly calls the
- * rebuild routine synchronously rather than calling it via an
- * asynchronous work thread.
+ * defer_rebuild_sched_domains(), however it directly calls the
+ * rebuild routine synchronously rather than deferring it.
*
* This can only be called from code that is not holding
* cgroup_mutex (not nested in a cgroup_lock() call.)
@@ -965,7 +964,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
heap_free(&heap);

if (is_load_balanced)
- async_rebuild_sched_domains();
+ defer_rebuild_sched_domains();
return 0;
}

@@ -1191,7 +1190,7 @@ static int update_relax_domain_level(struct cpuset *cs, s64 val)
cs->relax_domain_level = val;
if (!cpumask_empty(cs->cpus_allowed) &&
is_sched_load_balance(cs))
- async_rebuild_sched_domains();
+ defer_rebuild_sched_domains();
}

return 0;
@@ -1234,7 +1233,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
mutex_unlock(&callback_mutex);

if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
- async_rebuild_sched_domains();
+ defer_rebuild_sched_domains();

out:
free_trial_cpuset(trialcs);
@@ -1821,7 +1820,7 @@ static struct cgroup_subsys_state *cpuset_create(
/*
* If the cpuset being removed has its flag 'sched_load_balance'
* enabled, then simulate turning sched_load_balance off, which
- * will call async_rebuild_sched_domains().
+ * will call defer_rebuild_sched_domains().
*/

static void cpuset_destroy(struct cgroup_subsys *ss, struct cgroup *cont)



2009-01-18 09:05:12

by Ingo Molnar

Subject: Re: [PATCH 2/3] cgroup: introduce cgroup_queue_deferred_work()


* Lai Jiangshan <[email protected]> wrote:

> Sometimes we need require a lock to prevent something,
> but this lock cannot nest in cgroup_lock. So this work
> should be moved out of cgroup_lock's critical region.
>
> Using schedule_work() can move this work out of cgroup_lock's
> critical region. But it's a overkill for move a work to
> other process. And if we need flush_work() with cgroup_lock
> held, schedule_work() can not work for flush_work() will
> cause deadlock.
>
> Another solution is that deferring the work, and processing
> it after cgroup_lock released. This patch introduces
> cgroup_queue_deferred_work() for queue a cgroup_deferred_work.
>
> Signed-off-by: Lai Jiangshan <[email protected]>
> Cc: Max Krasnyansky <[email protected]>
> Cc: Miao Xie <[email protected]>
> ---
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index e267e62..6d3e6dc 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -437,6 +437,19 @@ void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
> int cgroup_scan_tasks(struct cgroup_scanner *scan);
> int cgroup_attach_task(struct cgroup *, struct task_struct *);
>
> +struct cgroup_deferred_work {
> + struct list_head list;
> + void (*func)(struct cgroup_deferred_work *);
> +};
> +
> +#define CGROUP_DEFERRED_WORK(name, function) \
> + struct cgroup_deferred_work name = { \
> + .list = LIST_HEAD_INIT((name).list), \
> + .func = (function), \
> + };
> +
> +int cgroup_queue_deferred_work(struct cgroup_deferred_work *deferred_work);
> +
> #else /* !CONFIG_CGROUPS */
>
> static inline int cgroup_init_early(void) { return 0; }
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index c298310..75a352b 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -540,6 +540,7 @@ void cgroup_lock(void)
> mutex_lock(&cgroup_mutex);
> }
>
> +static void cgroup_flush_deferred_work_locked(void);
> /**
> * cgroup_unlock - release lock on cgroup changes
> *
> @@ -547,9 +548,80 @@ void cgroup_lock(void)
> */
> void cgroup_unlock(void)
> {
> + cgroup_flush_deferred_work_locked();
> mutex_unlock(&cgroup_mutex);

So in cgroup_unlock() [which is called all over the place] we first call
cgroup_flush_deferred_work_locked(), then drop the cgroup_mutex. Then:

> }
>
> +/* deferred_work_list is protected by cgroup_mutex */
> +static LIST_HEAD(deferred_work_list);
> +
> +/* flush deferred works with cgroup_lock released */
> +static void cgroup_flush_deferred_work_locked(void)
> +{
> + static bool running_deferred_work;
> +
> + if (likely(list_empty(&deferred_work_list)))
> + return;

we check whether there's any pending work, then:

> +
> + /*
> + * Ensure it's not recursive and also
> + * ensure deferred works are run orderly.
> + */
> + if (running_deferred_work)
> + return;
> + running_deferred_work = true;

we set a recursion flag, then:

> +
> + for ( ; ; ) {

[ please change this to the standard 'for (;;)' style. ]

> + struct cgroup_deferred_work *deferred_work;
> +
> + /* dequeue the first work, and mark it dequeued */
> + deferred_work = list_first_entry(&deferred_work_list,
> + struct cgroup_deferred_work, list);
> + list_del_init(&deferred_work->list);
> +
> + mutex_unlock(&cgroup_mutex);

we drop the cgroup_mutex and start processing deferred work, then:

> +
> + /*
> + * cgroup_mutex is released. The callback function can use
> + * cgroup_lock()/cgroup_unlock(). This behavior is safe
> + * for running_deferred_work is set to 'true'.
> + */
> + deferred_work->func(deferred_work);
> +
> + /*
> + * regain cgroup_mutex to access deferred_work_list
> + * and running_deferred_work.
> + */
> + mutex_lock(&cgroup_mutex);

then we retake the mutex and:

> +
> + if (list_empty(&deferred_work_list))
> + break;
> + }
> +
> + running_deferred_work = false;

clear the recursion flag.

So this is already a high-complexity, high-overhead codepath for the
deferred work case.

Why isnt this in a workqueue? That way there's no overhead for the normal
fastpath _at all_ - the deferred wakeup would be handled as side-effect of
the mutex unlock in essence. Nor would you duplicate core kernel
infrastructure that way.
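
(For reference, that would essentially be the approach of Miao's patch at
the top of this thread -- roughly:)

	/* sketch, taken from the first patch in this thread */
	static struct workqueue_struct *cpuset_wq;

	static void async_rebuild_sched_domains(void)
	{
		queue_work(cpuset_wq, &rebuild_sched_domains_work);
	}

	/* at init time: */
	cpuset_wq = create_singlethread_workqueue("cpuset");
	BUG_ON(!cpuset_wq);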

Plus:

> +int cgroup_queue_deferred_work(struct cgroup_deferred_work *deferred_work)
> +{
> + int ret = 0;
> +
> + if (list_empty(&deferred_work->list)) {
> + list_add_tail(&deferred_work->list, &deferred_work_list);
> + ret = 1;
> + }
> +
> + return ret;

Why is the addition of work dependent on whether it's queued up already?
Callers should know whether it's queued or not - and if they dont then
this is hiding a code structure problem elsewhere.

Ingo

2009-01-18 09:07:23

by Ingo Molnar

Subject: Re: [PATCH 3/3] cpuset: fix possible deadlock in async_rebuild_sched_domains


* Lai Jiangshan <[email protected]> wrote:

> Lockdep reported some possible circular locking info when we tested cpuset on
> NUMA/fake NUMA box.
>
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.29-rc1-00224-ga652504 #111
> -------------------------------------------------------
> bash/2968 is trying to acquire lock:
> (events){--..}, at: [<ffffffff8024c8cd>] flush_work+0x24/0xd8
>
> but task is already holding lock:
> (cgroup_mutex){--..}, at: [<ffffffff8026ad1e>] cgroup_lock_live_group+0x12/0x29
>
> which lock already depends on the new lock.
> ......
> -------------------------------------------------------
>
> Steps to reproduce:
> # mkdir /dev/cpuset
> # mount -t cpuset xxx /dev/cpuset
> # mkdir /dev/cpuset/0
> # echo 0 > /dev/cpuset/0/cpus
> # echo 0 > /dev/cpuset/0/mems
> # echo 1 > /dev/cpuset/0/memory_migrate
> # cat /dev/zero > /dev/null &
> # echo $! > /dev/cpuset/0/tasks
>
> This is because async_rebuild_sched_domains has the following lock sequence:
> run_workqueue(async_rebuild_sched_domains)
> -> do_rebuild_sched_domains -> cgroup_lock
>
> But, attaching tasks when memory_migrate is set has following:
> cgroup_lock_live_group(cgroup_tasks_write)
> -> do_migrate_pages -> flush_work
>
> This can be fixed by using a separate workqueue thread.
>
> But queuing a work to an other thread is adding some overhead for cpuset.

Can you measure any overhead from that? In any case, this is triggered on
admin activities (when reconfiguring cpusets), so it's a slowpath and thus
using existing infrastructure is preferred in the 99.9% of the cases.

Thanks,

Ingo

2009-01-18 09:11:27

by Ingo Molnar

Subject: Re: [PATCH 1/3] cgroup: convert open-coded mutex_lock(&cgroup_mutex) calls into cgroup_lock() calls


* Lai Jiangshan <[email protected]> wrote:

> Convert open-coded mutex_lock(&cgroup_mutex) calls into cgroup_lock()
> calls and convert mutex_unlock(&cgroup_mutex) calls into cgroup_unlock()
> calls.
>
> Signed-off-by: Lai Jiangshan <[email protected]>
> Cc: Max Krasnyansky <[email protected]>
> Cc: Miao Xie <[email protected]>
> ---

(please include diffstat output in patches, so that the general source
code impact can be seen at a glance.)

> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index c298310..75a352b 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -616,7 +688,7 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
> * agent */
> synchronize_rcu();
>
> - mutex_lock(&cgroup_mutex);
> + cgroup_lock();

this just changes over a clean mutex call to a wrapped lock/unlock
sequence that has higher overhead in the common case.

We should do the exact opposite, we should change this opaque API:

void cgroup_lock(void)
{
mutex_lock(&cgroup_mutex);
}

To something more explicit (and more maintainable) like:

cgroup_mutex_lock(&cgroup_mutex);
cgroup_mutex_unlock(&cgroup_mutex);

Which is a NOP in the !CGROUPS case and maps to mutex_lock/unlock in the
CGROUPS=y case.
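
(For illustration, such wrappers could look roughly like the sketch below;
this is only a reading of the suggestion, not code from any patch in this
thread:)

/* Sketch of the suggested explicit API; illustrative only. */
#ifdef CONFIG_CGROUPS
# define cgroup_mutex_lock(m)		mutex_lock(m)
# define cgroup_mutex_unlock(m)		mutex_unlock(m)
#else
# define cgroup_mutex_lock(m)		do { } while (0)
# define cgroup_mutex_unlock(m)		do { } while (0)
#endif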

Ingo

2009-01-19 01:37:39

by Paul Menage

Subject: Re: [PATCH 1/3] cgroup: convert open-coded mutex_lock(&cgroup_mutex) calls into cgroup_lock() calls

On Sun, Jan 18, 2009 at 1:10 AM, Ingo Molnar <[email protected]> wrote:
> this just changes over a clean mutex call to a wrapped lock/unlock
> sequence that has higher overhead in the common case.
>
> We should do the exact opposite, we should change this opaque API:
>
> void cgroup_lock(void)
> {
> mutex_lock(&cgroup_mutex);
> }
>
> To something more explicit (and more maintainable) like:

I disagree - cgroup_mutex is a very coarse lock that can be held for
pretty long periods of time by the cgroups framework, and should never
be part of any fastpath code. So the overhead of a function call
should be irrelevant.

The change that you're proposing would send the message that
cgroup_mutex_lock(&cgroup_mutex) is appropriate to use in a
performance-sensitive function, when in fact we want to discourage
such code from taking this lock and instead use more appropriately
fine-grained locks.

Paul

2009-01-19 01:40:36

by Lai Jiangshan

Subject: Re: [PATCH 3/3] cpuset: fix possible deadlock in async_rebuild_sched_domains

Ingo Molnar wrote:
>
> Can you measure any overhead from that? In any case, this is triggered on
> admin activities (when reconfiguring cpusets), so it's a slowpath and thus
> using existing infrastructure is preferred in the 99.9% of the cases.
>
>

I thought creating a thread wastes some memory, and that thread would be
asleep most of the time.

Thanks, Ingo.


Paul and Andrew:
Could you apply Miao's patch? This is an urgent bugfix.

Lai.

2009-01-19 01:42:13

by Ingo Molnar

Subject: Re: [PATCH 1/3] cgroup: convert open-coded mutex_lock(&cgroup_mutex) calls into cgroup_lock() calls


* Paul Menage <[email protected]> wrote:

> On Sun, Jan 18, 2009 at 1:10 AM, Ingo Molnar <[email protected]> wrote:
> > this just changes over a clean mutex call to a wrapped lock/unlock
> > sequence that has higher overhead in the common case.
> >
> > We should do the exact opposite, we should change this opaque API:
> >
> > void cgroup_lock(void)
> > {
> > mutex_lock(&cgroup_mutex);
> > }
> >
> > To something more explicit (and more maintainable) like:
>
> I disagree - cgroup_mutex is a very coarse lock that can be held for
> pretty long periods of time by the cgroups framework, and should never
> be part of any fastpath code. So the overhead of a function call should
> be irrelevant.
>
> The change that you're proposing would send the message that
> cgroup_mutex_lock(&cgroup_mutex) is appropriate to use in a
> performance-sensitive function, when in fact we want to discourage such
> code from taking this lock and instead use more appropriately
> fine-grained locks.

Uhm, how does that 'discourage' its use in fastpath code?

It just hides the real lock and invites bad locking/work constructs like
the one proposed in this thread.

Ingo

2009-01-19 01:55:31

by Lai Jiangshan

Subject: Re: [PATCH 2/3] cgroup: introduce cgroup_queue_deferred_work()

Ingo Molnar wrote:
> * Lai Jiangshan <[email protected]> wrote:
>
>> Sometimes we need require a lock to prevent something,
>> but this lock cannot nest in cgroup_lock. So this work
>> should be moved out of cgroup_lock's critical region.
>>
>> Using schedule_work() can move this work out of cgroup_lock's
>> critical region. But it's a overkill for move a work to
>> other process. And if we need flush_work() with cgroup_lock
>> held, schedule_work() can not work for flush_work() will
>> cause deadlock.
>>
>> Another solution is that deferring the work, and processing
>> it after cgroup_lock released. This patch introduces
>> cgroup_queue_deferred_work() for queue a cgroup_deferred_work.
>>
>> Signed-off-by: Lai Jiangshan <[email protected]>
>> Cc: Max Krasnyansky <[email protected]>
>> Cc: Miao Xie <[email protected]>
>> ---
>> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
>> index e267e62..6d3e6dc 100644
>> --- a/include/linux/cgroup.h
>> +++ b/include/linux/cgroup.h
>> @@ -437,6 +437,19 @@ void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
>> int cgroup_scan_tasks(struct cgroup_scanner *scan);
>> int cgroup_attach_task(struct cgroup *, struct task_struct *);
>>
>> +struct cgroup_deferred_work {
>> + struct list_head list;
>> + void (*func)(struct cgroup_deferred_work *);
>> +};
>> +
>> +#define CGROUP_DEFERRED_WORK(name, function) \
>> + struct cgroup_deferred_work name = { \
>> + .list = LIST_HEAD_INIT((name).list), \
>> + .func = (function), \
>> + };
>> +
>> +int cgroup_queue_deferred_work(struct cgroup_deferred_work *deferred_work);
>> +
>> #else /* !CONFIG_CGROUPS */
>>
>> static inline int cgroup_init_early(void) { return 0; }
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index c298310..75a352b 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -540,6 +540,7 @@ void cgroup_lock(void)
>> mutex_lock(&cgroup_mutex);
>> }
>>
>> +static void cgroup_flush_deferred_work_locked(void);
>> /**
>> * cgroup_unlock - release lock on cgroup changes
>> *
>> @@ -547,9 +548,80 @@ void cgroup_lock(void)
>> */
>> void cgroup_unlock(void)
>> {
>> + cgroup_flush_deferred_work_locked();
>> mutex_unlock(&cgroup_mutex);
>
> So in cgroup_unlock() [which is called all over the places] we first call
> cgroup_flush_deferred_work_locked(), then drop the cgroup_mutex. Then:
>
>> }
>>
>> +/* deferred_work_list is protected by cgroup_mutex */
>> +static LIST_HEAD(deferred_work_list);
>> +
>> +/* flush deferred works with cgroup_lock released */
>> +static void cgroup_flush_deferred_work_locked(void)
>> +{
>> + static bool running_deferred_work;
>> +
>> + if (likely(list_empty(&deferred_work_list)))
>> + return;
>
> we check whether there's any work done, then:
>
>> +
>> + /*
>> + * Ensure it's not recursive and also
>> + * ensure deferred works are run orderly.
>> + */
>> + if (running_deferred_work)
>> + return;
>> + running_deferred_work = true;
>
> we set a recursion flag, then:
>
>> +
>> + for ( ; ; ) {
>
> [ please change this to the standard 'for (;;)' style. ]
>
>> + struct cgroup_deferred_work *deferred_work;
>> +
>> + /* dequeue the first work, and mark it dequeued */
>> + deferred_work = list_first_entry(&deferred_work_list,
>> + struct cgroup_deferred_work, list);
>> + list_del_init(&deferred_work->list);
>> +
>> + mutex_unlock(&cgroup_mutex);
>
> we drop the cgroup_mutex and start processing deferred work, then:
>
>> +
>> + /*
>> + * cgroup_mutex is released. The callback function can use
>> + * cgroup_lock()/cgroup_unlock(). This is safe because
>> + * running_deferred_work is set to 'true'.
>> + */
>> + deferred_work->func(deferred_work);
>> +
>> + /*
>> + * regain cgroup_mutex to access deferred_work_list
>> + * and running_deferred_work.
>> + */
>> + mutex_lock(&cgroup_mutex);
>
> then we drop the mutex and:
>
>> +
>> + if (list_empty(&deferred_work_list))
>> + break;
>> + }
>> +
>> + running_deferred_work = false;
>
> clear the recursion flag.
>
> So this is already a high-complexity, high-overhead codepath for the
> deferred work case.
>
> Why isn't this in a workqueue? That way there's no overhead for the normal

We can't use kevent_wq (the kernel events workqueue) here.

> fastpath _at all_ - the deferred wakeup would be handled as side-effect of
> the mutex unlock in essence. Nor would you duplicate core kernel
> infrastructure that way.
>
> Plus:
>
>> +int cgroup_queue_deferred_work(struct cgroup_deferred_work *deferred_work)
>> +{
>> + int ret = 0;
>> +
>> + if (list_empty(&deferred_work->list)) {
>> + list_add_tail(&deferred_work->list, &deferred_work_list);
>> + ret = 1;
>> + }
>> +
>> + return ret;
>
> Why is the addition of work dependent on whether it's queued up already?
> Callers should know whether it's queued or not - and if they dont then
> this is hiding a code structure problem elsewhere.
>

The caller doesn't know whether the work is already on the queue.
See also queue_work_on() in kernel/workqueue.c:

/**
* queue_work_on - queue work on specific cpu
......
* Returns 0 if @work was already on a queue, non-zero otherwise.
......
*/

Lai.
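
To make the queueing semantics above concrete, here is a minimal caller sketch against the proposed API. Only CGROUP_DEFERRED_WORK() and cgroup_queue_deferred_work() come from the posted patch; the work function and the write handler below are hypothetical names invented for illustration.

#include <linux/cgroup.h>
#include <linux/kernel.h>

static void my_deferred_fn(struct cgroup_deferred_work *work)
{
        /*
         * Runs from cgroup_unlock() after cgroup_mutex has been dropped;
         * per the patch it may take cgroup_lock()/cgroup_unlock() itself.
         */
}

/* Note: the CGROUP_DEFERRED_WORK() macro already supplies the ';'. */
static CGROUP_DEFERRED_WORK(my_deferred_work, my_deferred_fn)

/* Hypothetical cgroup file handler, entered with cgroup_lock() held. */
static void my_write_handler(void)
{
        /*
         * As with queue_work_on(): non-zero means the item was newly
         * queued, 0 means it was already pending. Either way it runs
         * exactly once after cgroup_lock is released.
         */
        if (!cgroup_queue_deferred_work(&my_deferred_work))
                pr_debug("deferred work was already pending\n");
}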

2009-01-20 01:18:58

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 1/3] cgroup: convert open-coded mutex_lock(&cgroup_mutex) calls into cgroup_lock() calls

On Sun, Jan 18, 2009 at 12:06 AM, Lai Jiangshan <[email protected]> wrote:
>
> Convert open-coded mutex_lock(&cgroup_mutex) calls into cgroup_lock()
> calls and convert mutex_unlock(&cgroup_mutex) calls into cgroup_unlock()
> calls.

I don't really see a lot of value in this patch. I'd prefer to leave
cgroup.c as it is, and work towards removing the need for external
code to call cgroup_lock() at all.

Paul

>
> Signed-off-by: Lai Jiangshan <[email protected]>
> Cc: Max Krasnyansky <[email protected]>
> Cc: Miao Xie <[email protected]>
> ---
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index c298310..75a352b 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -616,7 +688,7 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
> * agent */
> synchronize_rcu();
>
> - mutex_lock(&cgroup_mutex);
> + cgroup_lock();
> /*
> * Release the subsystem state objects.
> */
> @@ -624,7 +696,7 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
> ss->destroy(ss, cgrp);
>
> cgrp->root->number_of_cgroups--;
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
>
> /*
> * Drop the active superblock reference that we took when we
> @@ -761,14 +833,14 @@ static int cgroup_show_options(struct seq_file *seq, struct vfsmount *vfs)
> struct cgroupfs_root *root = vfs->mnt_sb->s_fs_info;
> struct cgroup_subsys *ss;
>
> - mutex_lock(&cgroup_mutex);
> + cgroup_lock();
> for_each_subsys(root, ss)
> seq_printf(seq, ",%s", ss->name);
> if (test_bit(ROOT_NOPREFIX, &root->flags))
> seq_puts(seq, ",noprefix");
> if (strlen(root->release_agent_path))
> seq_printf(seq, ",release_agent=%s", root->release_agent_path);
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
> return 0;
> }
>
> @@ -843,7 +915,7 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data)
> struct cgroup_sb_opts opts;
>
> mutex_lock(&cgrp->dentry->d_inode->i_mutex);
> - mutex_lock(&cgroup_mutex);
> + cgroup_lock();
>
> /* See what subsystems are wanted */
> ret = parse_cgroupfs_options(data, &opts);
> @@ -867,7 +939,7 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data)
> out_unlock:
> if (opts.release_agent)
> kfree(opts.release_agent);
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
> mutex_unlock(&cgrp->dentry->d_inode->i_mutex);
> return ret;
> }
> @@ -1015,7 +1087,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
> inode = sb->s_root->d_inode;
>
> mutex_lock(&inode->i_mutex);
> - mutex_lock(&cgroup_mutex);
> + cgroup_lock();
>
> /*
> * We're accessing css_set_count without locking
> @@ -1026,14 +1098,14 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
> */
> ret = allocate_cg_links(css_set_count, &tmp_cg_links);
> if (ret) {
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
> mutex_unlock(&inode->i_mutex);
> goto drop_new_super;
> }
>
> ret = rebind_subsystems(root, root->subsys_bits);
> if (ret == -EBUSY) {
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
> mutex_unlock(&inode->i_mutex);
> goto free_cg_links;
> }
> @@ -1068,7 +1140,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
>
> cgroup_populate_dir(root_cgrp);
> mutex_unlock(&inode->i_mutex);
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
> }
>
> return simple_set_mnt(mnt, sb);
> @@ -1094,7 +1166,7 @@ static void cgroup_kill_sb(struct super_block *sb) {
> BUG_ON(!list_empty(&cgrp->children));
> BUG_ON(!list_empty(&cgrp->sibling));
>
> - mutex_lock(&cgroup_mutex);
> + cgroup_lock();
>
> /* Rebind all subsystems back to the default hierarchy */
> ret = rebind_subsystems(root, 0);
> @@ -1118,7 +1190,7 @@ static void cgroup_kill_sb(struct super_block *sb) {
> list_del(&root->root_list);
> root_count--;
>
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
>
> kfree(root);
> kill_litter_super(sb);
> @@ -2392,7 +2464,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
> * fs */
> atomic_inc(&sb->s_active);
>
> - mutex_lock(&cgroup_mutex);
> + cgroup_lock();
>
> init_cgroup_housekeeping(cgrp);
>
> @@ -2427,7 +2499,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
> err = cgroup_populate_dir(cgrp);
> /* If err < 0, we have a half-filled directory - oh well ;) */
>
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
> mutex_unlock(&cgrp->dentry->d_inode->i_mutex);
>
> return 0;
> @@ -2444,7 +2516,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
> ss->destroy(ss, cgrp);
> }
>
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
>
> /* Release the reference count that we took on the superblock */
> deactivate_super(sb);
> @@ -2550,16 +2622,16 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
>
> /* the vfs holds both inode->i_mutex already */
>
> - mutex_lock(&cgroup_mutex);
> + cgroup_lock();
> if (atomic_read(&cgrp->count) != 0) {
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
> return -EBUSY;
> }
> if (!list_empty(&cgrp->children)) {
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
> return -EBUSY;
> }
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
>
> /*
> * Call pre_destroy handlers of subsys. Notify subsystems
> @@ -2567,13 +2639,13 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
> */
> cgroup_call_pre_destroy(cgrp);
>
> - mutex_lock(&cgroup_mutex);
> + cgroup_lock();
> parent = cgrp->parent;
>
> if (atomic_read(&cgrp->count)
> || !list_empty(&cgrp->children)
> || !cgroup_clear_css_refs(cgrp)) {
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
> return -EBUSY;
> }
>
> @@ -2598,7 +2670,7 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
> set_bit(CGRP_RELEASABLE, &parent->flags);
> check_for_release(parent);
>
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
> return 0;
> }
>
> @@ -2752,7 +2824,7 @@ static int proc_cgroup_show(struct seq_file *m, void *v)
>
> retval = 0;
>
> - mutex_lock(&cgroup_mutex);
> + cgroup_lock();
>
> for_each_active_root(root) {
> struct cgroup_subsys *ss;
> @@ -2774,7 +2846,7 @@ static int proc_cgroup_show(struct seq_file *m, void *v)
> }
>
> out_unlock:
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
> put_task_struct(tsk);
> out_free:
> kfree(buf);
> @@ -2801,14 +2873,14 @@ static int proc_cgroupstats_show(struct seq_file *m, void *v)
> int i;
>
> seq_puts(m, "#subsys_name\thierarchy\tnum_cgroups\tenabled\n");
> - mutex_lock(&cgroup_mutex);
> + cgroup_lock();
> for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> struct cgroup_subsys *ss = subsys[i];
> seq_printf(m, "%s\t%lu\t%d\t%d\n",
> ss->name, ss->root->subsys_bits,
> ss->root->number_of_cgroups, !ss->disabled);
> }
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
> return 0;
> }
>
> @@ -2984,11 +3056,11 @@ int cgroup_clone(struct task_struct *tsk, struct cgroup_subsys *subsys,
>
> /* First figure out what hierarchy and cgroup we're dealing
> * with, and pin them so we can drop cgroup_mutex */
> - mutex_lock(&cgroup_mutex);
> + cgroup_lock();
> again:
> root = subsys->root;
> if (root == &rootnode) {
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
> return 0;
> }
> task_lock(tsk);
> @@ -2998,14 +3070,14 @@ int cgroup_clone(struct task_struct *tsk, struct cgroup_subsys *subsys,
> /* Pin the hierarchy */
> if (!atomic_inc_not_zero(&parent->root->sb->s_active)) {
> /* We race with the final deactivate_super() */
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
> return 0;
> }
>
> /* Keep the cgroup alive */
> get_css_set(cg);
> task_unlock(tsk);
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
>
> /* Now do the VFS work to create a cgroup */
> inode = parent->dentry->d_inode;
> @@ -3036,7 +3108,7 @@ int cgroup_clone(struct task_struct *tsk, struct cgroup_subsys *subsys,
> /* The cgroup now exists. Retake cgroup_mutex and check
> * that we're still in the same state that we thought we
> * were. */
> - mutex_lock(&cgroup_mutex);
> + cgroup_lock();
> if ((root != subsys->root) ||
> (parent != task_cgroup(tsk, subsys->subsys_id))) {
> /* Aargh, we raced ... */
> @@ -3061,14 +3133,14 @@ int cgroup_clone(struct task_struct *tsk, struct cgroup_subsys *subsys,
>
> /* All seems fine. Finish by moving the task into the new cgroup */
> ret = cgroup_attach_task(child, tsk);
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
>
> out_release:
> mutex_unlock(&inode->i_mutex);
>
> - mutex_lock(&cgroup_mutex);
> + cgroup_lock();
> put_css_set(cg);
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
> deactivate_super(parent->root->sb);
> return ret;
> }
> @@ -3162,7 +3234,7 @@ void __css_put(struct cgroup_subsys_state *css)
> static void cgroup_release_agent(struct work_struct *work)
> {
> BUG_ON(work != &release_agent_work);
> - mutex_lock(&cgroup_mutex);
> + cgroup_lock();
> spin_lock(&release_list_lock);
> while (!list_empty(&release_list)) {
> char *argv[3], *envp[3];
> @@ -3196,16 +3268,16 @@ static void cgroup_release_agent(struct work_struct *work)
> /* Drop the lock while we invoke the usermode helper,
> * since the exec could involve hitting disk and hence
> * be a slow process */
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
> call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);
> - mutex_lock(&cgroup_mutex);
> + cgroup_lock();
> continue_free:
> kfree(pathbuf);
> kfree(agentbuf);
> spin_lock(&release_list_lock);
> }
> spin_unlock(&release_list_lock);
> - mutex_unlock(&cgroup_mutex);
> + cgroup_unlock();
> }
>
> static int __init cgroup_disable(char *str)
>
>
>
>

2009-01-20 01:26:36

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 2/3] cgroup: introduce cgroup_queue_deferred_work()

On Sun, Jan 18, 2009 at 12:06 AM, Lai Jiangshan <[email protected]> wrote:
> Sometimes we need to take a lock to protect something, but that
> lock cannot nest inside cgroup_lock, so the work has to be moved
> out of cgroup_lock's critical region.
>
> Using schedule_work() can move the work out of cgroup_lock's
> critical region, but handing the work off to another process is
> overkill. And if we need flush_work() while cgroup_lock is held,
> schedule_work() does not help, because flush_work() would then
> deadlock.
>
> Another solution is to defer the work and process it after
> cgroup_lock is released. This patch introduces
> cgroup_queue_deferred_work() for queueing a cgroup_deferred_work.

I agree with Ingo - this seems rather complex. I think it would be
cleaner to address specific lock-dependency cases individually - in
some cases we can punt to a work queue; in other cases we can sanitize
the locking, etc.

Paul

>
> Signed-off-by: Lai Jiangshan <[email protected]>
> Cc: Max Krasnyansky <[email protected]>
> Cc: Miao Xie <[email protected]>
> ---
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index e267e62..6d3e6dc 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -437,6 +437,19 @@ void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
> int cgroup_scan_tasks(struct cgroup_scanner *scan);
> int cgroup_attach_task(struct cgroup *, struct task_struct *);
>
> +struct cgroup_deferred_work {
> + struct list_head list;
> + void (*func)(struct cgroup_deferred_work *);
> +};
> +
> +#define CGROUP_DEFERRED_WORK(name, function) \
> + struct cgroup_deferred_work name = { \
> + .list = LIST_HEAD_INIT((name).list), \
> + .func = (function), \
> + };
> +
> +int cgroup_queue_deferred_work(struct cgroup_deferred_work *deferred_work);
> +
> #else /* !CONFIG_CGROUPS */
>
> static inline int cgroup_init_early(void) { return 0; }
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index c298310..75a352b 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -540,6 +540,7 @@ void cgroup_lock(void)
> mutex_lock(&cgroup_mutex);
> }
>
> +static void cgroup_flush_deferred_work_locked(void);
> /**
> * cgroup_unlock - release lock on cgroup changes
> *
> @@ -547,9 +548,80 @@ void cgroup_lock(void)
> */
> void cgroup_unlock(void)
> {
> + cgroup_flush_deferred_work_locked();
> mutex_unlock(&cgroup_mutex);
> }
>
> +/* deferred_work_list is protected by cgroup_mutex */
> +static LIST_HEAD(deferred_work_list);
> +
> +/* flush deferred works with cgroup_lock released */
> +static void cgroup_flush_deferred_work_locked(void)
> +{
> + static bool running_deferred_work;
> +
> + if (likely(list_empty(&deferred_work_list)))
> + return;
> +
> + /*
> + * Ensure it's not recursive and also
> + * ensure deferred works are run orderly.
> + */
> + if (running_deferred_work)
> + return;
> + running_deferred_work = true;
> +
> + for ( ; ; ) {
> + struct cgroup_deferred_work *deferred_work;
> +
> + /* dequeue the first work, and mark it dequeued */
> + deferred_work = list_first_entry(&deferred_work_list,
> + struct cgroup_deferred_work, list);
> + list_del_init(&deferred_work->list);
> +
> + mutex_unlock(&cgroup_mutex);
> +
> + /*
> + * cgroup_mutex is released. The callback function can use
> + * cgroup_lock()/cgroup_unlock(). This is safe because
> + * running_deferred_work is set to 'true'.
> + */
> + deferred_work->func(deferred_work);
> +
> + /*
> + * regain cgroup_mutex to access deferred_work_list
> + * and running_deferred_work.
> + */
> + mutex_lock(&cgroup_mutex);
> +
> + if (list_empty(&deferred_work_list))
> + break;
> + }
> +
> + running_deferred_work = false;
> +}
> +
> +/**
> + * cgroup_queue_deferred_work - queue a deferred work
> + * @deferred_work: work to queue.
> + *
> + * Returns 0 if @deferred_work was already on the queue, non-zero otherwise.
> + *
> + * Must be called when cgroup_lock held.
> + * The deferred work will be run after cgroup_lock released.
> + */
> +int cgroup_queue_deferred_work(struct cgroup_deferred_work *deferred_work)
> +{
> + int ret = 0;
> +
> + if (list_empty(&deferred_work->list)) {
> + list_add_tail(&deferred_work->list, &deferred_work_list);
> + ret = 1;
> + }
> +
> + return ret;
> +}
> +
> /*
> * A couple of forward declarations required, due to cyclic reference loop:
> * cgroup_mkdir -> cgroup_create -> cgroup_populate_dir ->
>
>
>
>
>
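
To spell out the dependency that the changelog quoted in this message keeps pointing at (flush_work() called while cgroup_lock is held), here is an annotated sketch of the deadlock; the work item and the handler below are illustrative names, not code from the patches.

#include <linux/cgroup.h>
#include <linux/workqueue.h>

/* A work item queued on the shared keventd workqueue. */
static void some_work_fn(struct work_struct *work)
{
        cgroup_lock();          /* (B) blocks: the writer below holds it */
        /* ... touch cgroup state ... */
        cgroup_unlock();
}
static DECLARE_WORK(some_work, some_work_fn);

/* A handler that wants the work finished before it returns. */
static void some_write_handler(void)
{
        cgroup_lock();                  /* (A) taken first */
        schedule_work(&some_work);
        /*
         * (C) flush_work() waits for some_work_fn() to complete, but
         * some_work_fn() is stuck at (B) waiting for cgroup_mutex,
         * which this task still holds from (A): deadlock.
         */
        flush_work(&some_work);
        cgroup_unlock();
}

On the shared keventd queue the cycle can also arise indirectly: the worker that has to run the flushed item may already be running, or have queued ahead of it, some other work that takes cgroup_lock. That is why the thread weighs deferring the work until cgroup_unlock() against punting it to a separate workqueue whose items never take cgroup_mutex.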

2009-01-20 01:29:07

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 1/3] cgroup: convert open-coded mutex_lock(&cgroup_mutex) calls into cgroup_lock() calls

On Sun, Jan 18, 2009 at 5:41 PM, Ingo Molnar <[email protected]> wrote:
>
> * Paul Menage <[email protected]> wrote:
>
>> On Sun, Jan 18, 2009 at 1:10 AM, Ingo Molnar <[email protected]> wrote:
>> > this just changes over a clean mutex call to a wrapped lock/unlock
>> > sequence that has higher overhead in the common case.
>> >
>> > We should do the exact opposite, we should change this opaque API:
>> >
>> > void cgroup_lock(void)
>> > {
>> > mutex_lock(&cgroup_mutex);
>> > }
>> >
>> > To something more explicit (and more maintainable) like:
>>
>> I disagree - cgroup_mutex is a very coarse lock that can be held for
>> pretty long periods of time by the cgroups framework, and should never
>> be part of any fastpath code. So the overhead of a function call should
>> be irrelevant.
>>
>> The change that you're proposing would send the message that
>> cgroup_mutex_lock(&cgroup_mutex) is appropriate to use in a
>> performance-sensitive function, when in fact we want to discourage such
>> code from taking this lock and instead use more appropriately
>> fine-grained locks.
>
> Uhm, how does that 'discourage' its use in fastpath code?

I agree, the existing code doesn't exactly discourage its use in
fastpath code, but we should be doing so. (The recent addition of
hierarchy_mutex is a step in that direction, although it still has
some issues to be cleaned up.) But it seems to me that exposing the
lock is an invitation for people to use it even more than they do
currently. There's certainly no performance argument for exposing it.

>
> It just hides the real lock

Yes, I'd rather hide the real lock in this case.

Paul

2009-01-20 18:22:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/3] cgroup: convert open-coded mutex_lock(&cgroup_mutex) calls into cgroup_lock() calls

On Mon, 2009-01-19 at 17:28 -0800, Paul Menage wrote:
> Yes, I'd rather hide the real lock in this case.

You can never hide locks, and wanting to do so will only lead to a
horrible mess.