2008-06-27 07:59:36

by Paul Menage

[permalink] [raw]
Subject: [RFC][PATCH] CGroups: Add a per-subsystem hierarchy lock


This patch adds a hierarchy_mutex to the cgroup_subsys object that
protects changes to the hierarchy observed by that subsystem. It is
taken by the cgroup subsystem for the following operations:

- linking a cgroup into that subsystem's cgroup tree
- unlinking a cgroup from that subsystem's cgroup tree
- moving the subsystem to/from a hierarchy (including across the
bind() callback)

Thus if the subsystem holds its own hierarchy_mutex, it can safely
traverse its own hierarchy.

This patch also changes the cpuset hotplug code to take
cpuset_subsys.hierarchy_mutex rather than cgroup_mutex, removing a
cyclical locking dependency between cpu_hotplug.lock and cgroup_mutex.
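
As an illustration of the traversal guarantee (not part of the patch
itself; walk_descendants() is a hypothetical helper that would live in
kernel/cpuset.c alongside the usual includes), a subsystem could then
walk its own portion of the tree roughly like this:

static void walk_descendants(struct cgroup *cgrp)
{
	struct cgroup *child;

	/* sibling/children links are stable while hierarchy_mutex is held */
	list_for_each_entry(child, &cgrp->children, sibling) {
		/* ... inspect this subsystem's state for 'child' ... */
		walk_descendants(child);
	}
}

static void cpuset_walk_tree(struct cgroup *root_cgrp)
{
	mutex_lock(&cpuset_subsys.hierarchy_mutex);
	walk_descendants(root_cgrp);
	mutex_unlock(&cpuset_subsys.hierarchy_mutex);
}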

---

This is just a first cut - a full version of this patch would also
make use of the hierarchy_mutex at other places in cpuset.c that use
the cgroup hierarchy, but I wanted to see what people thought of the
concept. It appears to keep lockdep quiet.

include/linux/cgroup.h | 10 ++++++++--
kernel/cgroup.c | 35 ++++++++++++++++++++++++++++++++++-
kernel/cpuset.c | 4 ++--
3 files changed, 44 insertions(+), 5 deletions(-)

Index: lockfix-2.6.26-rc5-mm3/include/linux/cgroup.h
===================================================================
--- lockfix-2.6.26-rc5-mm3.orig/include/linux/cgroup.h
+++ lockfix-2.6.26-rc5-mm3/include/linux/cgroup.h
@@ -319,9 +319,15 @@ struct cgroup_subsys {
#define MAX_CGROUP_TYPE_NAMELEN 32
const char *name;

- /* Protected by RCU */
- struct cgroupfs_root *root;
+ /*
+ * Protects sibling/children links of cgroups in this
+ * hierarchy, plus protects which hierarchy (or none) the
+ * subsystem is a part of (i.e. root/sibling)
+ */
+ struct mutex hierarchy_mutex;

+ /* Protected by RCU, this->hierarchy_mutex and cgroup_lock() */
+ struct cgroupfs_root *root;
struct list_head sibling;

void *private;
Index: lockfix-2.6.26-rc5-mm3/kernel/cgroup.c
===================================================================
--- lockfix-2.6.26-rc5-mm3.orig/kernel/cgroup.c
+++ lockfix-2.6.26-rc5-mm3/kernel/cgroup.c
@@ -728,23 +728,26 @@ static int rebind_subsystems(struct cgro
BUG_ON(cgrp->subsys[i]);
BUG_ON(!dummytop->subsys[i]);
BUG_ON(dummytop->subsys[i]->cgroup != dummytop);
+ mutex_lock(&ss->hierarchy_mutex);
cgrp->subsys[i] = dummytop->subsys[i];
cgrp->subsys[i]->cgroup = cgrp;
list_add(&ss->sibling, &root->subsys_list);
rcu_assign_pointer(ss->root, root);
if (ss->bind)
ss->bind(ss, cgrp);
-
+ mutex_unlock(&ss->hierarchy_mutex);
} else if (bit & removed_bits) {
/* We're removing this subsystem */
BUG_ON(cgrp->subsys[i] != dummytop->subsys[i]);
BUG_ON(cgrp->subsys[i]->cgroup != cgrp);
+ mutex_lock(&ss->hierarchy_mutex);
if (ss->bind)
ss->bind(ss, dummytop);
dummytop->subsys[i]->cgroup = dummytop;
cgrp->subsys[i] = NULL;
rcu_assign_pointer(subsys[i]->root, &rootnode);
list_del(&ss->sibling);
+ mutex_unlock(&ss->hierarchy_mutex);
} else if (bit & final_bits) {
/* Subsystem state should already exist */
BUG_ON(!cgrp->subsys[i]);
@@ -2291,6 +2294,29 @@ static void init_cgroup_css(struct cgrou
cgrp->subsys[ss->subsys_id] = css;
}

+static void cgroup_lock_hierarchy(struct cgroupfs_root *root)
+{
+ /* We need to take each hierarchy_mutex in a consistent order */
+ int i;
+
+ for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+ struct cgroup_subsys *ss = subsys[i];
+ if (ss->root == root)
+ mutex_lock_nested(&ss->hierarchy_mutex, i);
+ }
+}
+
+static void cgroup_unlock_hierarchy(struct cgroupfs_root *root)
+{
+ int i;
+
+ for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+ struct cgroup_subsys *ss = subsys[i];
+ if (ss->root == root)
+ mutex_unlock(&ss->hierarchy_mutex);
+ }
+}
+
/*
* cgroup_create - create a cgroup
* @parent: cgroup that will be parent of the new cgroup
@@ -2342,8 +2368,10 @@ static long cgroup_create(struct cgroup
init_cgroup_css(css, ss, cgrp);
}

+ cgroup_lock_hierarchy(root);
list_add(&cgrp->sibling, &cgrp->parent->children);
root->number_of_cgroups++;
+ cgroup_unlock_hierarchy(root);

err = cgroup_create_dir(cgrp, dentry, mode);
if (err < 0)
@@ -2460,8 +2488,12 @@ static int cgroup_rmdir(struct inode *un
if (!list_empty(&cgrp->release_list))
list_del(&cgrp->release_list);
spin_unlock(&release_list_lock);
+
+ cgroup_lock_hierarchy(root);
/* delete my sibling from parent->children */
list_del(&cgrp->sibling);
+ cgroup_unlock_hierarchy(root);
+
spin_lock(&cgrp->dentry->d_lock);
d = dget(cgrp->dentry);
cgrp->dentry = NULL;
@@ -2504,6 +2536,7 @@ static void __init cgroup_init_subsys(st
* need to invoke fork callbacks here. */
BUG_ON(!list_empty(&init_task.tasks));

+ mutex_init(&ss->hierarchy_mutex);
ss->active = 1;
}

Index: lockfix-2.6.26-rc5-mm3/kernel/cpuset.c
===================================================================
--- lockfix-2.6.26-rc5-mm3.orig/kernel/cpuset.c
+++ lockfix-2.6.26-rc5-mm3/kernel/cpuset.c
@@ -1929,7 +1929,7 @@ static void scan_for_empty_cpusets(const

static void common_cpu_mem_hotplug_unplug(void)
{
- cgroup_lock();
+ mutex_lock(&cpuset_subsys.hierarchy_mutex);

top_cpuset.cpus_allowed = cpu_online_map;
top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
@@ -1941,7 +1941,7 @@ static void common_cpu_mem_hotplug_unplu
*/
rebuild_sched_domains();

- cgroup_unlock();
+ mutex_unlock(&cpuset_subsys.hierarchy_mutex);
}

/*


2008-06-28 00:02:52

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC][PATCH] CGroups: Add a per-subsystem hierarchy lock

On Fri, Jun 27, 2008 at 12:59 AM, Paul Menage <[email protected]> wrote:
>
> This patch adds a hierarchy_mutex to the cgroup_subsys object that
> protects changes to the hierarchy observed by that subsystem. It is
> taken by the cgroup subsystem for the following operations:
>
> - linking a cgroup into that subsystem's cgroup tree
> - unlinking a cgroup from that subsystem's cgroup tree
> - moving the subsystem to/from a hierarchy (including across the
> bind() callback)
>
> Thus if the subsystem holds its own hierarchy_mutex, it can safely
> traverse its own hierarchy.
>

It struck me that there's a small race in this code now that we're not
doing cgroup_lock() in the hotplug path.

- we start to attach a task T to cpuset C, with a single CPU "X" in
its "cpus" list
- cpuset_can_attach() returns successfully since the cpuset has a cpu
- CPU X gets hot-unplugged; any tasks in C are moved to their parent
cpuset and C loses its cpu.
- we update T->cgroups to point to C, which is broken since C has no cpus.

So we'll need some additional locking work on top of this - but I
think this patch is still a step in the right direction.
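
Roughly, the window looks like this (a sketch only; cpuset_can_attach()
and scan_for_empty_cpusets() are the existing callbacks involved):

/*
 *   attach path ('tasks' write)           CPU hotplug
 *   ---------------------------           -----------
 *   attach task T to cpuset C
 *     cpuset_can_attach()       // succeeds, C still owns CPU X
 *                                         cpu_down(X)
 *                                           scan_for_empty_cpusets()
 *                                             // C loses X; its tasks are
 *                                             // moved to C's parent
 *     T->cgroups updated to C   // T now sits in an empty cpuset
 */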

Paul

2008-07-02 03:55:38

by Max Krasnyansky

[permalink] [raw]
Subject: Re: [RFC][PATCH] CGroups: Add a per-subsystem hierarchy lock

Hi Paul,

Sorry for the delay.

Paul Menage wrote:
> On Fri, Jun 27, 2008 at 12:59 AM, Paul Menage <[email protected]> wrote:
>> This patch adds a hierarchy_mutex to the cgroup_subsys object that
>> protects changes to the hierarchy observed by that subsystem. It is
>> taken by the cgroup subsystem for the following operations:
>>
>> - linking a cgroup into that subsystem's cgroup tree
>> - unlinking a cgroup from that subsystem's cgroup tree
>> - moving the subsystem to/from a hierarchy (including across the
>> bind() callback)
>>
>> Thus if the subsystem holds its own hierarchy_mutex, it can safely
>> traverse its own hierarchy.
>>
>
> It struck me that there's a small race in this code now that we're not
> doing cgroup_lock() in the hotplug path.
>
> - we start to attach a task T to cpuset C, with a single CPU "X" in
> its "cpus" list
> - cpuset_can_attach() returns successfully since the cpuset has a cpu
> - CPU X gets hot-unplugged; any tasks in C are moved to their parent
> cpuset and C loses its cpu.
> - we update T->cgroups to point to C, which is broken since C has no cpus.
>
> So we'll need some additional locking work on top of this - but I
> think this patch is still a step in the right direction.
I was about to say "yeah, looks good" and then tried a couple of
different hot-plug scenarios.
We still have circular locking even with your patch:

[ INFO: possible circular locking dependency detected ]
2.6.26-rc8 #4
-------------------------------------------------------
bash/2779 is trying to acquire lock:
(&cpu_hotplug.lock){--..}, at: [<ffffffff8025e024>] get_online_cpus+0x24/0x40

but task is already holding lock:
(sched_domains_mutex){--..}, at: [<ffffffff8022f5e9>]
partition_sched_domains+0x29/0x2b0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (sched_domains_mutex){--..}:
[<ffffffff8025988f>] __lock_acquire+0x9cf/0xe50
[<ffffffff80259d6b>] lock_acquire+0x5b/0x80
[<ffffffff804d12c4>] mutex_lock_nested+0x94/0x250
[<ffffffff8022f5e9>] partition_sched_domains+0x29/0x2b0
[<ffffffff80268f9d>] rebuild_sched_domains+0x9d/0x3f0
[<ffffffff80269f05>] cpuset_handle_cpuhp+0x205/0x220
[<ffffffff804d688f>] notifier_call_chain+0x3f/0x80
[<ffffffff80250679>] __raw_notifier_call_chain+0x9/0x10
[<ffffffff804c1748>] _cpu_down+0xa8/0x290
[<ffffffff804c196b>] cpu_down+0x3b/0x60
[<ffffffff804c2c68>] store_online+0x48/0xa0
[<ffffffff803a46c4>] sysdev_store+0x24/0x30
[<ffffffff802eebba>] sysfs_write_file+0xca/0x140
[<ffffffff8029cb3b>] vfs_write+0xcb/0x170
[<ffffffff8029ccd0>] sys_write+0x50/0x90
[<ffffffff8020b92b>] system_call_after_swapgs+0x7b/0x80
[<ffffffffffffffff>] 0xffffffffffffffff

-> #1 (&ss->hierarchy_mutex){--..}:
[<ffffffff8025988f>] __lock_acquire+0x9cf/0xe50
[<ffffffff80259d6b>] lock_acquire+0x5b/0x80
[<ffffffff804d12c4>] mutex_lock_nested+0x94/0x250
[<ffffffff80269d39>] cpuset_handle_cpuhp+0x39/0x220
[<ffffffff804d688f>] notifier_call_chain+0x3f/0x80
[<ffffffff80250679>] __raw_notifier_call_chain+0x9/0x10
[<ffffffff804c1748>] _cpu_down+0xa8/0x290
[<ffffffff804c196b>] cpu_down+0x3b/0x60
[<ffffffff804c2c68>] store_online+0x48/0xa0
[<ffffffff803a46c4>] sysdev_store+0x24/0x30
[<ffffffff802eebba>] sysfs_write_file+0xca/0x140
[<ffffffff8029cb3b>] vfs_write+0xcb/0x170
[<ffffffff8029ccd0>] sys_write+0x50/0x90
[<ffffffff8020b92b>] system_call_after_swapgs+0x7b/0x80
[<ffffffffffffffff>] 0xffffffffffffffff

-> #0 (&cpu_hotplug.lock){--..}:
[<ffffffff80259913>] __lock_acquire+0xa53/0xe50
[<ffffffff80259d6b>] lock_acquire+0x5b/0x80
[<ffffffff804d12c4>] mutex_lock_nested+0x94/0x250
[<ffffffff8025e024>] get_online_cpus+0x24/0x40
[<ffffffff8022fee1>] sched_getaffinity+0x11/0x80
[<ffffffff8026e6d9>] __synchronize_sched+0x19/0x90
[<ffffffff8022ed46>] detach_destroy_domains+0x46/0x50
[<ffffffff8022f6b9>] partition_sched_domains+0xf9/0x2b0
[<ffffffff80268f9d>] rebuild_sched_domains+0x9d/0x3f0
[<ffffffff8026a858>] cpuset_common_file_write+0x2b8/0x5c0
[<ffffffff8026657c>] cgroup_file_write+0x7c/0x1a0
[<ffffffff8029cb3b>] vfs_write+0xcb/0x170
[<ffffffff8029ccd0>] sys_write+0x50/0x90
[<ffffffff8020b92b>] system_call_after_swapgs+0x7b/0x80
[<ffffffffffffffff>] 0xffffffffffffffff

other info that might help us debug this:

2 locks held by bash/2779:
#0: (cgroup_mutex){--..}, at: [<ffffffff802653e2>] cgroup_lock+0x12/0x20
#1: (sched_domains_mutex){--..}, at: [<ffffffff8022f5e9>]
partition_sched_domains+0x29/0x2b0

stack backtrace:
Pid: 2779, comm: bash Not tainted 2.6.26-rc8 #4

Call Trace:
[<ffffffff80258c0c>] print_circular_bug_tail+0x8c/0x90
[<ffffffff802589c4>] ? print_circular_bug_entry+0x54/0x60
[<ffffffff80259913>] __lock_acquire+0xa53/0xe50
[<ffffffff8025e024>] ? get_online_cpus+0x24/0x40
[<ffffffff80259d6b>] lock_acquire+0x5b/0x80
[<ffffffff8025e024>] ? get_online_cpus+0x24/0x40
[<ffffffff804d12c4>] mutex_lock_nested+0x94/0x250
[<ffffffff8025867d>] ? mark_held_locks+0x4d/0x90
[<ffffffff8025e024>] get_online_cpus+0x24/0x40
[<ffffffff8022fee1>] sched_getaffinity+0x11/0x80
[<ffffffff8026e6d9>] __synchronize_sched+0x19/0x90
[<ffffffff8022ed46>] detach_destroy_domains+0x46/0x50
[<ffffffff8022f6b9>] partition_sched_domains+0xf9/0x2b0
[<ffffffff80258801>] ? trace_hardirqs_on+0xc1/0xe0
[<ffffffff80268f9d>] rebuild_sched_domains+0x9d/0x3f0
[<ffffffff8026a858>] cpuset_common_file_write+0x2b8/0x5c0
[<ffffffff80268c00>] ? cpuset_test_cpumask+0x0/0x20
[<ffffffff80269f20>] ? cpuset_change_cpumask+0x0/0x20
[<ffffffff80265260>] ? started_after+0x0/0x50
[<ffffffff8026657c>] cgroup_file_write+0x7c/0x1a0
[<ffffffff8029cb3b>] vfs_write+0xcb/0x170
[<ffffffff8029ccd0>] sys_write+0x50/0x90
[<ffffffff8020b92b>] system_call_after_swapgs+0x7b/0x80

CPU3 attaching NULL sched-domain.


2008-07-02 22:32:22

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC][PATCH] CGroups: Add a per-subsystem hierarchy lock

On Tue, Jul 1, 2008 at 8:55 PM, Max Krasnyansky <[email protected]> wrote:
> I was about to say "yeah, looks good" and then tried a couple of
> different hot-plug scenarios.
> We still have circular locking even with your patch
>

What sequence of actions do you do? I've not been able to reproduce a
lockdep failure.

Paul

> [ INFO: possible circular locking dependency detected ]
> 2.6.26-rc8 #4
> -------------------------------------------------------
> bash/2779 is trying to acquire lock:
> (&cpu_hotplug.lock){--..}, at: [<ffffffff8025e024>] get_online_cpus+0x24/0x40
>
> but task is already holding lock:
> (sched_domains_mutex){--..}, at: [<ffffffff8022f5e9>]
> partition_sched_domains+0x29/0x2b0
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (sched_domains_mutex){--..}:
> [<ffffffff8025988f>] __lock_acquire+0x9cf/0xe50
> [<ffffffff80259d6b>] lock_acquire+0x5b/0x80
> [<ffffffff804d12c4>] mutex_lock_nested+0x94/0x250
> [<ffffffff8022f5e9>] partition_sched_domains+0x29/0x2b0
> [<ffffffff80268f9d>] rebuild_sched_domains+0x9d/0x3f0
> [<ffffffff80269f05>] cpuset_handle_cpuhp+0x205/0x220
> [<ffffffff804d688f>] notifier_call_chain+0x3f/0x80
> [<ffffffff80250679>] __raw_notifier_call_chain+0x9/0x10
> [<ffffffff804c1748>] _cpu_down+0xa8/0x290
> [<ffffffff804c196b>] cpu_down+0x3b/0x60
> [<ffffffff804c2c68>] store_online+0x48/0xa0
> [<ffffffff803a46c4>] sysdev_store+0x24/0x30
> [<ffffffff802eebba>] sysfs_write_file+0xca/0x140
> [<ffffffff8029cb3b>] vfs_write+0xcb/0x170
> [<ffffffff8029ccd0>] sys_write+0x50/0x90
> [<ffffffff8020b92b>] system_call_after_swapgs+0x7b/0x80
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> -> #1 (&ss->hierarchy_mutex){--..}:
> [<ffffffff8025988f>] __lock_acquire+0x9cf/0xe50
> [<ffffffff80259d6b>] lock_acquire+0x5b/0x80
> [<ffffffff804d12c4>] mutex_lock_nested+0x94/0x250
> [<ffffffff80269d39>] cpuset_handle_cpuhp+0x39/0x220
> [<ffffffff804d688f>] notifier_call_chain+0x3f/0x80
> [<ffffffff80250679>] __raw_notifier_call_chain+0x9/0x10
> [<ffffffff804c1748>] _cpu_down+0xa8/0x290
> [<ffffffff804c196b>] cpu_down+0x3b/0x60
> [<ffffffff804c2c68>] store_online+0x48/0xa0
> [<ffffffff803a46c4>] sysdev_store+0x24/0x30
> [<ffffffff802eebba>] sysfs_write_file+0xca/0x140
> [<ffffffff8029cb3b>] vfs_write+0xcb/0x170
> [<ffffffff8029ccd0>] sys_write+0x50/0x90
> [<ffffffff8020b92b>] system_call_after_swapgs+0x7b/0x80
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> -> #0 (&cpu_hotplug.lock){--..}:
> [<ffffffff80259913>] __lock_acquire+0xa53/0xe50
> [<ffffffff80259d6b>] lock_acquire+0x5b/0x80
> [<ffffffff804d12c4>] mutex_lock_nested+0x94/0x250
> [<ffffffff8025e024>] get_online_cpus+0x24/0x40
> [<ffffffff8022fee1>] sched_getaffinity+0x11/0x80
> [<ffffffff8026e6d9>] __synchronize_sched+0x19/0x90
> [<ffffffff8022ed46>] detach_destroy_domains+0x46/0x50
> [<ffffffff8022f6b9>] partition_sched_domains+0xf9/0x2b0
> [<ffffffff80268f9d>] rebuild_sched_domains+0x9d/0x3f0
> [<ffffffff8026a858>] cpuset_common_file_write+0x2b8/0x5c0
> [<ffffffff8026657c>] cgroup_file_write+0x7c/0x1a0
> [<ffffffff8029cb3b>] vfs_write+0xcb/0x170
> [<ffffffff8029ccd0>] sys_write+0x50/0x90
> [<ffffffff8020b92b>] system_call_after_swapgs+0x7b/0x80
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> other info that might help us debug this:
>
> 2 locks held by bash/2779:
> #0: (cgroup_mutex){--..}, at: [<ffffffff802653e2>] cgroup_lock+0x12/0x20
> #1: (sched_domains_mutex){--..}, at: [<ffffffff8022f5e9>]
> partition_sched_domains+0x29/0x2b0
>
> stack backtrace:
> Pid: 2779, comm: bash Not tainted 2.6.26-rc8 #4
>
> Call Trace:
> [<ffffffff80258c0c>] print_circular_bug_tail+0x8c/0x90
> [<ffffffff802589c4>] ? print_circular_bug_entry+0x54/0x60
> [<ffffffff80259913>] __lock_acquire+0xa53/0xe50
> [<ffffffff8025e024>] ? get_online_cpus+0x24/0x40
> [<ffffffff80259d6b>] lock_acquire+0x5b/0x80
> [<ffffffff8025e024>] ? get_online_cpus+0x24/0x40
> [<ffffffff804d12c4>] mutex_lock_nested+0x94/0x250
> [<ffffffff8025867d>] ? mark_held_locks+0x4d/0x90
> [<ffffffff8025e024>] get_online_cpus+0x24/0x40
> [<ffffffff8022fee1>] sched_getaffinity+0x11/0x80
> [<ffffffff8026e6d9>] __synchronize_sched+0x19/0x90
> [<ffffffff8022ed46>] detach_destroy_domains+0x46/0x50
> [<ffffffff8022f6b9>] partition_sched_domains+0xf9/0x2b0
> [<ffffffff80258801>] ? trace_hardirqs_on+0xc1/0xe0
> [<ffffffff80268f9d>] rebuild_sched_domains+0x9d/0x3f0
> [<ffffffff8026a858>] cpuset_common_file_write+0x2b8/0x5c0
> [<ffffffff80268c00>] ? cpuset_test_cpumask+0x0/0x20
> [<ffffffff80269f20>] ? cpuset_change_cpumask+0x0/0x20
> [<ffffffff80265260>] ? started_after+0x0/0x50
> [<ffffffff8026657c>] cgroup_file_write+0x7c/0x1a0
> [<ffffffff8029cb3b>] vfs_write+0xcb/0x170
> [<ffffffff8029ccd0>] sys_write+0x50/0x90
> [<ffffffff8020b92b>] system_call_after_swapgs+0x7b/0x80
>
> CPU3 attaching NULL sched-domain.
>
>
>
>

2008-07-03 07:10:13

by Max Krasnyansky

[permalink] [raw]
Subject: Re: [RFC][PATCH] CGroups: Add a per-subsystem hierarchy lock



Paul Menage wrote:
> On Tue, Jul 1, 2008 at 8:55 PM, Max Krasnyansky <[email protected]> wrote:
>> I was about to say "yeah, looks good" and then tried a couple of
>> different hot-plug scenarios.
>> We still have circular locking even with your patch
>>
>
> What sequence of actions do you do? I've not been able to reproduce a
> lockdep failure.

mkdir /dev/cpuset
mount -t cgroup -o cpuset cpuset /dev/cpuset
mkdir /dev/cpuset/0
mkdir /dev/cpuset/1
echo 0-2 > /dev/cpuset/0/cpuset.cpus
echo 3 > /dev/cpuset/1/cpuset.cpus
echo 0 > /dev/cpuset/cpuset.sched_load_balance
echo 0 > /sys/devices/system/cpu/cpu3/online

I just tried it again and got the exact same lockdep output that I sent before.

Max

2008-07-10 17:26:31

by Max Krasnyansky

[permalink] [raw]
Subject: Re: [RFC][PATCH] CGroups: Add a per-subsystem hierarchy lock



Max Krasnyansky wrote:
>
> Paul Menage wrote:
>> On Tue, Jul 1, 2008 at 8:55 PM, Max Krasnyansky <[email protected]> wrote:
>>> I was about to say "yeah, looks good" and then tried a couple of
>>> different hot-plug scenarios.
>>> We still have circular locking even with your patch
>>>
>> What sequence of actions do you do? I've not been able to reproduce a
>> lockdep failure.
>
> mkdir /dev/cpuset
> mount -t cgroup -o cpuset cpuset /dev/cpuset
> mkdir /dev/cpuset/0
> mkdir /dev/cpuset/1
> echo 0-2 > /dev/cpuset/0/cpuset.cpus
> echo 3 > /dev/cpuset/1/cpuset.cpus
> echo 0 > /dev/cpuset/cpuset.sched_load_balance
> echo 0 > /sys/devices/system/cpu/cpu3/online
>
> I just tried it again and got the exact same lockdep output that I sent before.

Paul, any luck with that ?

Max

2008-07-11 00:33:07

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC][PATCH] CGroups: Add a per-subsystem hierarchy lock

On Wed, Jul 2, 2008 at 9:46 PM, Max Krasnyansky <[email protected]> wrote:
>
> mkdir /dev/cpuset
> mount -t cgroup -o cpuset cpuset /dev/cpuset
> mkdir /dev/cpuset/0
> mkdir /dev/cpuset/1
> echo 0-2 > /dev/cpuset/0/cpuset.cpus
> echo 3 > /dev/cpuset/1/cpuset.cpus
> echo 0 > /dev/cpuset/cpuset.sched_load_balance
> echo 0 > /sys/devices/system/cpu/cpu3/online
>

OK, I still can't reproduce this on a 2-cpu system, using one cpu for
each cpuset.

But the basic problem seems to be that we have cpu_hotplug.lock taken
at the outer level (when offlining a CPU) and at the inner level (via
get_online_cpus() called from the guts of partition_sched_domains(), if
we didn't already take it at the outer level).

While looking at the code trying to figure out a nice way around this,
it struck me that we have the call path

cpuset_track_online_nodes() ->
common_cpu_mem_hotplug_unplug() ->
scan_for_empty_cpusets() ->
access cpu_online_map with no calls to get_online_cpus()

Is that allowed? Maybe we need separate versions of
scan_for_empty_cpusets() that look at memory and cpus?
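
For reference, the bracketing I'd expect around such a walk looks
roughly like this (illustrative only; example_scan() is a made-up name,
not proposed code):

static void example_scan(void)
{
	int cpu;

	get_online_cpus();	/* pin cpu_online_map against hotplug */
	for_each_online_cpu(cpu) {
		/* ... e.g. check whether some cpuset still owns 'cpu' ... */
	}
	put_online_cpus();
}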

I think that eventually we're going to want a solution such as pushing
the locking of cpuset_subsys.hierarchy_mutex down into the first part
of partition_sched_domains, the part that actually walks the cpuset tree.
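
Very roughly, something along these lines (a sketch only;
generate_cpuset_domains() is a hypothetical helper for the tree-walking
half, and the sched_domain_attr handling is omitted):

static void rebuild_sched_domains_sketch(void)
{
	cpumask_t *doms;
	int ndoms;

	/* walk the cpuset tree under the subsystem's own hierarchy lock */
	mutex_lock(&cpuset_subsys.hierarchy_mutex);
	ndoms = generate_cpuset_domains(&doms);
	mutex_unlock(&cpuset_subsys.hierarchy_mutex);

	/* only the actual domain rebuild needs the sched-domain locking */
	partition_sched_domains(ndoms, doms, NULL);
}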

Paul

2008-07-12 22:48:45

by Max Krasnyansky

[permalink] [raw]
Subject: Re: [RFC][PATCH] CGroups: Add a per-subsystem hierarchy lock

Paul Menage wrote:
> On Wed, Jul 2, 2008 at 9:46 PM, Max Krasnyansky <[email protected]> wrote:
>> mkdir /dev/cpuset
>> mount -t cgroup -o cpuset cpuset /dev/cpuset
>> mkdir /dev/cpuset/0
>> mkdir /dev/cpuset/1
>> echo 0-2 > /dev/cpuset/0/cpuset.cpus
>> echo 3 > /dev/cpuset/1/cpuset.cpus
>> echo 0 > /dev/cpuset/cpuset.sched_load_balance
>> echo 0 > /sys/devices/system/cpu/cpu3/online
>>
>
> OK, I still can't reproduce this, on a 2-cpu system using one cpu for
> each cpuset.
>
> But the basic problem seems to be that we have cpu_hotplug.lock taken
> at the outer level (when offlining a CPU) and at the inner level (via
> get_online_cpus() called from the guts of partition_sched_domains(),
> if we didn't already take it at the outer level.
Yes. I'm looking at this stuff right now.

> While looking at the code trying to figure out a nice way around this,
> it struck me that we have the call path
>
> cpuset_track_online_nodes() ->
> common_cpu_mem_hotplug_unplug() ->
> scan_for_empty_cpusets() ->
> access cpu_online_map with no calls to get_online_cpus()
>
> Is that allowed? Maybe we need separate versions of
> scan_for_empty_cpusets() that look at memory and cpus?
Yes. This also came up in the other discussions, i.e. that we need to have this
split up somehow. Looks like my fix for domain handling exposed all kinds of
problems that we punted on before :).

> I think that we're going to want eventually a solution such as pushing
> the locking of cpuset_subsys.hierarchy_mutex down into the first part
> of partition_sched_domains, that actually walks the cpuset tree
Right now I'm trying to get away with not using new locks, since we need some
solution for 2.6.26 and the new cgroup locking that you added is probably
.27 material.

Max