2008-06-26 07:57:25

by Paul Menage

Subject: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue

CPUsets: Move most calls to rebuild_sched_domains() to the workqueue

In the current cpusets code the lock nesting between cgroup_mutex and
cpuhotplug.lock when calling rebuild_sched_domains is inconsistent -
in the CPU hotplug path cpuhotplug.lock nests outside cgroup_mutex,
and in all other paths that call rebuild_sched_domains() it nests
inside.

This patch makes most calls to rebuild_sched_domains() asynchronous
via the workqueue, which removes the nesting of the two locks in that
case. In the case of an actual hotplug event, cpuhotplug.lock nests
outside cgroup_mutex as now.

Signed-off-by: Paul Menage <[email protected]>

---

Note that all I've done with this patch is verify that it compiles
without warnings; I'm not sure how to trigger a hotplug event to test
the lock dependencies or verify that scheduler domain support is still
behaving correctly. Vegard, does this fix the problems that you were
seeing? Paul/Max, does this still seem sane with regard to scheduler domains?


kernel/cpuset.c | 35 +++++++++++++++++++++++------------
1 file changed, 23 insertions(+), 12 deletions(-)

Index: lockfix-2.6.26-rc5-mm3/kernel/cpuset.c
===================================================================
--- lockfix-2.6.26-rc5-mm3.orig/kernel/cpuset.c
+++ lockfix-2.6.26-rc5-mm3/kernel/cpuset.c
@@ -522,13 +522,9 @@ update_domain_attr(struct sched_domain_a
  * domains when operating in the severe memory shortage situations
  * that could cause allocation failures below.
  *
- * Call with cgroup_mutex held. May take callback_mutex during
- * call due to the kfifo_alloc() and kmalloc() calls. May nest
- * a call to the get_online_cpus()/put_online_cpus() pair.
- * Must not be called holding callback_mutex, because we must not
- * call get_online_cpus() while holding callback_mutex. Elsewhere
- * the kernel nests callback_mutex inside get_online_cpus() calls.
- * So the reverse nesting would risk an ABBA deadlock.
+ * Call with cgroup_mutex held, and inside get_online_cpus(). May
+ * take callback_mutex during call due to the kfifo_alloc() and
+ * kmalloc() calls.
  *
  * The three key local variables below are:
  *  q  - a kfifo queue of cpuset pointers, used to implement a
@@ -689,9 +685,7 @@ restart:
 
 rebuild:
         /* Have scheduler rebuild sched domains */
-        get_online_cpus();
         partition_sched_domains(ndoms, doms, dattr);
-        put_online_cpus();
 
 done:
         if (q && !IS_ERR(q))
@@ -701,6 +695,21 @@ done:
         /* Don't kfree(dattr) -- partition_sched_domains() does that. */
 }
 
+/*
+ * Due to the need to nest cgroup_mutex inside cpuhotplug.lock, most
+ * of our invocations of rebuild_sched_domains() are done
+ * asynchronously via the workqueue
+ */
+static void delayed_rebuild_sched_domains(struct work_struct *work)
+{
+        get_online_cpus();
+        cgroup_lock();
+        rebuild_sched_domains();
+        cgroup_unlock();
+        put_online_cpus();
+}
+static DECLARE_WORK(rebuild_sched_domains_work, delayed_rebuild_sched_domains);
+
 static inline int started_after_time(struct task_struct *t1,
                                      struct timespec *time,
                                      struct task_struct *t2)
@@ -853,7 +862,7 @@ static int update_cpumask(struct cpuset
                 return retval;
 
         if (is_load_balanced)
-                rebuild_sched_domains();
+                schedule_work(&rebuild_sched_domains_work);
         return 0;
 }
 
@@ -1080,7 +1089,7 @@ static int update_relax_domain_level(str
 
         if (val != cs->relax_domain_level) {
                 cs->relax_domain_level = val;
-                rebuild_sched_domains();
+                schedule_work(&rebuild_sched_domains_work);
         }
 
         return 0;
@@ -1121,7 +1130,7 @@ static int update_flag(cpuset_flagbits_t
         mutex_unlock(&callback_mutex);
 
         if (cpus_nonempty && balance_flag_changed)
-                rebuild_sched_domains();
+                schedule_work(&rebuild_sched_domains_work);
 
         return 0;
 }
@@ -1929,6 +1938,7 @@ static void scan_for_empty_cpusets(const
 
 static void common_cpu_mem_hotplug_unplug(void)
 {
+        get_online_cpus();
         cgroup_lock();
 
         top_cpuset.cpus_allowed = cpu_online_map;
@@ -1942,6 +1952,7 @@ static void common_cpu_mem_hotplug_unplu
         rebuild_sched_domains();
 
         cgroup_unlock();
+        put_online_cpus();
 }
 
 /*


2008-06-26 09:34:29

by Vegard Nossum

Subject: Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue

On Thu, Jun 26, 2008 at 9:56 AM, Paul Menage <[email protected]> wrote:
> CPUsets: Move most calls to rebuild_sched_domains() to the workqueue
>
> In the current cpusets code the lock nesting between cgroup_mutex and
> cpuhotplug.lock when calling rebuild_sched_domains is inconsistent -
> in the CPU hotplug path cpuhotplug.lock nests outside cgroup_mutex,
> and in all other paths that call rebuild_sched_domains() it nests
> inside.
>
> This patch makes most calls to rebuild_sched_domains() asynchronous
> via the workqueue, which removes the nesting of the two locks in that
> case. In the case of an actual hotplug event, cpuhotplug.lock nests
> outside cgroup_mutex as now.
>
> Signed-off-by: Paul Menage <[email protected]>
>
> ---
>
> Note that all I've done with this patch is verify that it compiles
> without warnings; I'm not sure how to trigger a hotplug event to test
> the lock dependencies or verify that scheduler domain support is still
> behaving correctly. Vegard, does this fix the problems that you were
> seeing? Paul/Max, does this still seem sane with regard to scheduler
> domains?

Nope, sorry :-(

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.26-rc8-dirty #39
-------------------------------------------------------
bash/3510 is trying to acquire lock:
(events){--..}, at: [<c0145690>] cleanup_workqueue_thread+0x10/0x70

but task is already holding lock:
(&cpu_hotplug.lock){--..}, at: [<c015d9da>] cpu_hotplug_begin+0x1a/0x50

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (&cpu_hotplug.lock){--..}:
[<c0158e65>] __lock_acquire+0xf45/0x1040
[<c0158ff8>] lock_acquire+0x98/0xd0
[<c057e6a1>] mutex_lock_nested+0xb1/0x300
[<c015da3c>] get_online_cpus+0x2c/0x40
[<c0162c98>] delayed_rebuild_sched_domains+0x8/0x30
[<c014548b>] run_workqueue+0x15b/0x1f0
[<c0145f09>] worker_thread+0x99/0xf0
[<c0148772>] kthread+0x42/0x70
[<c0105a63>] kernel_thread_helper+0x7/0x14
[<ffffffff>] 0xffffffff

-> #1 (rebuild_sched_domains_work){--..}:
[<c0158e65>] __lock_acquire+0xf45/0x1040
[<c0158ff8>] lock_acquire+0x98/0xd0
[<c0145486>] run_workqueue+0x156/0x1f0
[<c0145f09>] worker_thread+0x99/0xf0
[<c0148772>] kthread+0x42/0x70
[<c0105a63>] kernel_thread_helper+0x7/0x14
[<ffffffff>] 0xffffffff

-> #0 (events){--..}:
[<c0158a15>] __lock_acquire+0xaf5/0x1040
[<c0158ff8>] lock_acquire+0x98/0xd0
[<c01456b6>] cleanup_workqueue_thread+0x36/0x70
[<c055d91a>] workqueue_cpu_callback+0x7a/0x130
[<c014d497>] notifier_call_chain+0x37/0x70
[<c014d509>] __raw_notifier_call_chain+0x19/0x20
[<c014d52a>] raw_notifier_call_chain+0x1a/0x20
[<c055bb28>] _cpu_down+0x148/0x240
[<c055bc4b>] cpu_down+0x2b/0x40
[<c055ce69>] store_online+0x39/0x80
[<c02fb91b>] sysdev_store+0x2b/0x40
[<c01dd0a2>] sysfs_write_file+0xa2/0x100
[<c019ecc6>] vfs_write+0x96/0x130
[<c019f38d>] sys_write+0x3d/0x70
[<c0104ceb>] sysenter_past_esp+0x78/0xd1
[<ffffffff>] 0xffffffff

other info that might help us debug this:

3 locks held by bash/3510:
#0: (&buffer->mutex){--..}, at: [<c01dd02b>] sysfs_write_file+0x2b/0x100
#1: (cpu_add_remove_lock){--..}, at: [<c015d97f>]
cpu_maps_update_begin+0xf/0x20
#2: (&cpu_hotplug.lock){--..}, at: [<c015d9da>] cpu_hotplug_begin+0x1a/0x50

stack backtrace:
Pid: 3510, comm: bash Not tainted 2.6.26-rc8-dirty #39
[<c0156517>] print_circular_bug_tail+0x77/0x90
[<c0155b93>] ? print_circular_bug_entry+0x43/0x50
[<c0158a15>] __lock_acquire+0xaf5/0x1040
[<c010aeb5>] ? native_sched_clock+0xb5/0x110
[<c0157895>] ? mark_held_locks+0x65/0x80
[<c0158ff8>] lock_acquire+0x98/0xd0
[<c0145690>] ? cleanup_workqueue_thread+0x10/0x70
[<c01456b6>] cleanup_workqueue_thread+0x36/0x70
[<c0145690>] ? cleanup_workqueue_thread+0x10/0x70
[<c055d91a>] workqueue_cpu_callback+0x7a/0x130
[<c0580613>] ? _spin_unlock_irqrestore+0x43/0x70
[<c014d497>] notifier_call_chain+0x37/0x70
[<c014d509>] __raw_notifier_call_chain+0x19/0x20
[<c014d52a>] raw_notifier_call_chain+0x1a/0x20
[<c055bb28>] _cpu_down+0x148/0x240
[<c015d97f>] ? cpu_maps_update_begin+0xf/0x20
[<c055bc4b>] cpu_down+0x2b/0x40
[<c055ce69>] store_online+0x39/0x80
[<c055ce30>] ? store_online+0x0/0x80
[<c02fb91b>] sysdev_store+0x2b/0x40
[<c01dd0a2>] sysfs_write_file+0xa2/0x100
[<c019ecc6>] vfs_write+0x96/0x130
[<c01dd000>] ? sysfs_write_file+0x0/0x100
[<c019f38d>] sys_write+0x3d/0x70
[<c0104ceb>] sysenter_past_esp+0x78/0xd1
=======================


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-06-26 09:50:37

by Paul Menage

Subject: Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue

On Thu, Jun 26, 2008 at 2:34 AM, Vegard Nossum <[email protected]> wrote:
> On Thu, Jun 26, 2008 at 9:56 AM, Paul Menage <[email protected]> wrote:
>> CPUsets: Move most calls to rebuild_sched_domains() to the workqueue
>>
>> In the current cpusets code the lock nesting between cgroup_mutex and
>> cpuhotplug.lock when calling rebuild_sched_domains is inconsistent -
>> in the CPU hotplug path cpuhotplug.lock nests outside cgroup_mutex,
>> and in all other paths that call rebuild_sched_domains() it nests
>> inside.
>>
>> This patch makes most calls to rebuild_sched_domains() asynchronous
>> via the workqueue, which removes the nesting of the two locks in that
>> case. In the case of an actual hotplug event, cpuhotplug.lock nests
>> outside cgroup_mutex as now.
>>
>> Signed-off-by: Paul Menage <[email protected]>
>>
>> ---
>>
>> Note that all I've done with this patch is verify that it compiles
>> without warnings; I'm not sure how to trigger a hotplug event to test
>> the lock dependencies or verify that scheduler domain support is still
>> behaving correctly. Vegard, does this fix the problems that you were
>> seeing? Paul/Max, does this still seem sane with regard to scheduler
>> domains?
>
> Nope, sorry :-(
>
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.26-rc8-dirty #39
> -------------------------------------------------------
> bash/3510 is trying to acquire lock:
> (events){--..}, at: [<c0145690>] cleanup_workqueue_thread+0x10/0x70
>
> but task is already holding lock:
> (&cpu_hotplug.lock){--..}, at: [<c015d9da>] cpu_hotplug_begin+0x1a/0x50
>
> which lock already depends on the new lock.
>

Does that mean that you can't ever call get_online_cpus() from a
workqueue thread?

Paul.

2008-06-26 18:50:13

by Max Krasnyansky

Subject: Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue

Paul Menage wrote:
> On Thu, Jun 26, 2008 at 2:34 AM, Vegard Nossum <[email protected]> wrote:
>> On Thu, Jun 26, 2008 at 9:56 AM, Paul Menage <[email protected]> wrote:
>>> CPUsets: Move most calls to rebuild_sched_domains() to the workqueue
>>>
>>> In the current cpusets code the lock nesting between cgroup_mutex and
>>> cpuhotplug.lock when calling rebuild_sched_domains is inconsistent -
>>> in the CPU hotplug path cpuhotplug.lock nests outside cgroup_mutex,
>>> and in all other paths that call rebuild_sched_domains() it nests
>>> inside.
>>>
>>> This patch makes most calls to rebuild_sched_domains() asynchronous
>>> via the workqueue, which removes the nesting of the two locks in that
>>> case. In the case of an actual hotplug event, cpuhotplug.lock nests
>>> outside cgroup_mutex as now.
>>>
>>> Signed-off-by: Paul Menage <[email protected]>
>>>
>>> ---
>>>
>>> Note that all I've done with this patch is verify that it compiles
>>> without warnings; I'm not sure how to trigger a hotplug event to test
>>> the lock dependencies or verify that scheduler domain support is still
>>> behaving correctly.
You can just do:
echo 0 > /sys/devices/system/cpu/cpuN/online
echo 1 > /sys/devices/system/cpu/cpuN/online

>>> Vegard, does this fix the problems that you were
>>> seeing? Paul/Max, does this still seem sane with regard to scheduler
>>> domains?
>> Nope, sorry :-(
>>
>> =======================================================
>> [ INFO: possible circular locking dependency detected ]
>> 2.6.26-rc8-dirty #39
>> -------------------------------------------------------
>> bash/3510 is trying to acquire lock:
>> (events){--..}, at: [<c0145690>] cleanup_workqueue_thread+0x10/0x70
>>
>> but task is already holding lock:
>> (&cpu_hotplug.lock){--..}, at: [<c015d9da>] cpu_hotplug_begin+0x1a/0x50
>>
>> which lock already depends on the new lock.
>>
>
> Does that mean that you can't ever call get_online_cpus() from a
> workqueue thread?

In general it should be ok (no different from a user-space task calling
it). But there is still a circular dependency, because we're calling into
the domain partitioning code.
Below is a more detailed lockdep report with your patch applied on top of
-rc8.
Looks like this might be a good time to rethink the overall locking in there.

> [ INFO: possible circular locking dependency detected ]
> 2.6.26-rc8 #3
> -------------------------------------------------------
> bash/2836 is trying to acquire lock:
> (cgroup_mutex){--..}, at: [<ffffffff802653e2>] cgroup_lock+0x12/0x20
>
> but task is already holding lock:
> (&cpu_hotplug.lock){--..}, at: [<ffffffff8025e062>] cpu_hotplug_begin+0x22/0x60
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (&cpu_hotplug.lock){--..}:
> [<ffffffff8025988f>] __lock_acquire+0x9cf/0xe50
> [<ffffffff80259d6b>] lock_acquire+0x5b/0x80
> [<ffffffff804d11b4>] mutex_lock_nested+0x94/0x250
> [<ffffffff8025e024>] get_online_cpus+0x24/0x40
> [<ffffffff8022fee1>] sched_getaffinity+0x11/0x80
> [<ffffffff8026e5c9>] __synchronize_sched+0x19/0x90
> [<ffffffff8022ed46>] detach_destroy_domains+0x46/0x50
> [<ffffffff8022f6b9>] partition_sched_domains+0xf9/0x2b0
> [<ffffffff80268eea>] rebuild_sched_domains+0x9a/0x3e0
> [<ffffffff80269243>] delayed_rebuild_sched_domains+0x13/0x30
> [<ffffffff80247d5e>] run_workqueue+0xde/0x220
> [<ffffffff80247f00>] worker_thread+0x60/0xb0
> [<ffffffff8024c069>] kthread+0x49/0x90
> [<ffffffff8020c898>] child_rip+0xa/0x12
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> -> #1 (sched_domains_mutex){--..}:
> [<ffffffff8025988f>] __lock_acquire+0x9cf/0xe50
> [<ffffffff80259d6b>] lock_acquire+0x5b/0x80
> [<ffffffff804d11b4>] mutex_lock_nested+0x94/0x250
> [<ffffffff8022f5e9>] partition_sched_domains+0x29/0x2b0
> [<ffffffff80268eea>] rebuild_sched_domains+0x9a/0x3e0
> [<ffffffff80269243>] delayed_rebuild_sched_domains+0x13/0x30
> [<ffffffff80247d5e>] run_workqueue+0xde/0x220
> [<ffffffff80247f00>] worker_thread+0x60/0xb0
> [<ffffffff8024c069>] kthread+0x49/0x90
> [<ffffffff8020c898>] child_rip+0xa/0x12
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> -> #0 (cgroup_mutex){--..}:
> [<ffffffff80259913>] __lock_acquire+0xa53/0xe50
> [<ffffffff80259d6b>] lock_acquire+0x5b/0x80
> [<ffffffff804d11b4>] mutex_lock_nested+0x94/0x250
> [<ffffffff802653e2>] cgroup_lock+0x12/0x20
> [<ffffffff80269bf1>] cpuset_handle_cpuhp+0x31/0x230
> [<ffffffff804d677f>] notifier_call_chain+0x3f/0x80
> [<ffffffff80250679>] __raw_notifier_call_chain+0x9/0x10
> [<ffffffff804c1638>] _cpu_down+0xa8/0x290
> [<ffffffff804c185b>] cpu_down+0x3b/0x60
> [<ffffffff804c2b58>] store_online+0x48/0xa0
> [<ffffffff803a45b4>] sysdev_store+0x24/0x30
> [<ffffffff802eeaaa>] sysfs_write_file+0xca/0x140
> [<ffffffff8029ca2b>] vfs_write+0xcb/0x170
> [<ffffffff8029cbc0>] sys_write+0x50/0x90
> [<ffffffff8020b92b>] system_call_after_swapgs+0x7b/0x80
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> other info that might help us debug this:
>
> 3 locks held by bash/2836:
> #0: (&buffer->mutex){--..}, at: [<ffffffff802eea23>] sysfs_write_file+0x43/0x140
> #1: (cpu_add_remove_lock){--..}, at: [<ffffffff804c1847>] cpu_down+0x27/0x60
> #2: (&cpu_hotplug.lock){--..}, at: [<ffffffff8025e062>] cpu_hotplug_begin+0x22/0x60
>
> stack backtrace:
> Pid: 2836, comm: bash Not tainted 2.6.26-rc8 #3
>
> Call Trace:
> [<ffffffff80258c0c>] print_circular_bug_tail+0x8c/0x90
> [<ffffffff802589c4>] ? print_circular_bug_entry+0x54/0x60
> [<ffffffff80259913>] __lock_acquire+0xa53/0xe50
> [<ffffffff8025867d>] ? mark_held_locks+0x4d/0x90
> [<ffffffff804d1345>] ? mutex_lock_nested+0x225/0x250
> [<ffffffff802653e2>] ? cgroup_lock+0x12/0x20
> [<ffffffff80259d6b>] lock_acquire+0x5b/0x80
> [<ffffffff802653e2>] ? cgroup_lock+0x12/0x20
> [<ffffffff804d11b4>] mutex_lock_nested+0x94/0x250
> [<ffffffff802653e2>] cgroup_lock+0x12/0x20
> [<ffffffff80269bf1>] cpuset_handle_cpuhp+0x31/0x230
> [<ffffffff8022edaa>] ? update_sched_domains+0x5a/0x70
> [<ffffffff804d677f>] notifier_call_chain+0x3f/0x80
> [<ffffffff80250679>] __raw_notifier_call_chain+0x9/0x10
> [<ffffffff804c1638>] _cpu_down+0xa8/0x290
> [<ffffffff804c185b>] cpu_down+0x3b/0x60
> [<ffffffff804c2b58>] store_online+0x48/0xa0
> [<ffffffff803a45b4>] sysdev_store+0x24/0x30
> [<ffffffff802eeaaa>] sysfs_write_file+0xca/0x140
> [<ffffffff8029ca2b>] vfs_write+0xcb/0x170
> [<ffffffff8029cbc0>] sys_write+0x50/0x90
> [<ffffffff8020b92b>] system_call_after_swapgs+0x7b/0x80



2008-06-26 19:20:18

by Peter Zijlstra

Subject: Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue

On Thu, 2008-06-26 at 11:49 -0700, Max Krasnyansky wrote:

> > Does that mean that you can't ever call get_online_cpus() from a
> > workqueue thread?
>
> In general it should be ok (no different from a user-space task calling
> it). But there is still a circular dependency, because we're calling into
> the domain partitioning code.
> Below is a more detailed lockdep report with your patch applied on top of
> -rc8.
> Looks like this might be a good time to rethink the overall locking in there.

Gautham has been working on some of this recently. Perhaps he can share
his patch-stack again.


2008-06-26 20:34:35

by Paul Menage

Subject: Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue

On Thu, Jun 26, 2008 at 11:49 AM, Max Krasnyansky <[email protected]> wrote:
>>
>> Does that mean that you can't ever call get_online_cpus() from a
>> workqueue thread?
>
> In general it should be ok (no different from a user-space task calling it).

No, I think it is a problem.

When a CPU goes down, the hotplug code holds cpu_hotplug.lock and
calls cleanup_workqueue_thread(), which waits for any running work on
that thread to finish.

So if the workqueue thread running on that CPU calls get_online_cpus()
while the hotplug thread is waiting for it, it will deadlock. (It's
not clear to me how lockdep figured this out - I guess it's something
to do with the *_acquire() annotations that tell lockdep to treat the
workqueue structure as a pseudo-lock?)

I guess the fix for that would be to have a non-workqueue thread to
handle the async domain rebuilding, which isn't tied to a particular
cpu the way workqueue threads are.
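
Something like this completely untested sketch (the helper names are
invented) is roughly what I mean:

#include <linux/kthread.h>

/*
 * Do the rebuild in a one-shot kernel thread. Unlike a work item on
 * the per-cpu kevent threads, this thread is not bound to a dying
 * CPU, so cleanup_workqueue_thread() never has to wait for it.
 */
static int rebuild_sched_domains_fn(void *unused)
{
        get_online_cpus();
        cgroup_lock();
        rebuild_sched_domains();
        cgroup_unlock();
        put_online_cpus();
        return 0;
}

static void async_rebuild_sched_domains(void)
{
        /* kthread_run() returns ERR_PTR() on failure; error handling elided */
        kthread_run(rebuild_sched_domains_fn, NULL, "cpuset_rebuild");
}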

> But there is still a circular dependency, because we're calling into
> the domain partitioning code.

OK, so the problem is that since cpu_hotplug contains a hand-rolled
rwsem, lockdep is going to spot false deadlocks.

Is there any reason not to replace cpu_hotplug.lock and
cpu_hotplug.refcount with an rw_semaphore, and have the following:

void get_online_cpus(void)
{
        might_sleep();
        if (cpu_hotplug.active_writer == current)
                return;
        down_read(&cpu_hotplug.lock);
}

void put_online_cpus(void)
{
        if (cpu_hotplug.active_writer == current)
                return;
        up_read(&cpu_hotplug.lock);
}

static void cpu_hotplug_begin(void)
{
        down_write(&cpu_hotplug.lock);
        cpu_hotplug.active_writer = current;
}

static void cpu_hotplug_done(void)
{
        cpu_hotplug.active_writer = NULL;
        up_write(&cpu_hotplug.lock);
}

I think that combined with moving the async rebuild_sched_domains to a
separate thread should solve the problem, but I'm wondering why
cpu_hotplug.lock was implemented this way in the first place.

Paul

2008-06-26 21:17:54

by Paul Menage

Subject: Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue

On Thu, Jun 26, 2008 at 1:34 PM, Paul Menage <[email protected]> wrote:
>
> void get_online_cpus(void)
> {
>         might_sleep();
>         if (cpu_hotplug.active_writer == current)
>                 return;
>         down_read(&cpu_hotplug.lock);
> }
>
> void put_online_cpus(void)
> {
>         if (cpu_hotplug.active_writer == current)
>                 return;
>         up_read(&cpu_hotplug.lock);
> }
>
> static void cpu_hotplug_begin(void)
> {
>         down_write(&cpu_hotplug.lock);
>         cpu_hotplug.active_writer = current;
> }
>
> static void cpu_hotplug_done(void)
> {
>         cpu_hotplug.active_writer = NULL;
>         up_write(&cpu_hotplug.lock);
> }
>
> I think that combined with moving the async rebuild_sched_domains to a
> separate thread should solve the problem, but I'm wondering why
> cpu_hotplug.lock was implemented this way in the first place.

Oh, I guess that doesn't work because of recursive calls to
get_online_cpus(). Maybe we need a down_read_recursive() that skips
ahead of waiting writers if the lock is already held in read mode?
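
Something like this (a completely untested strawman; none of it is an
existing kernel API) is the semantics I have in mind: readers wait only
for an *active* writer, never for a queued one, so recursive read
acquisition is safe, at the cost of possible writer starvation:

#include <linux/sched.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

struct recursive_rwsem {
        spinlock_t lock;
        int readers;                    /* tasks currently in read sections */
        struct task_struct *writer;     /* active writer, or NULL */
        wait_queue_head_t wait;
};

static void down_read_recursive(struct recursive_rwsem *rws)
{
        spin_lock(&rws->lock);
        /* only an *active* writer blocks us, so recursion is safe */
        while (rws->writer) {
                spin_unlock(&rws->lock);
                wait_event(rws->wait, !rws->writer);
                spin_lock(&rws->lock);
        }
        rws->readers++;
        spin_unlock(&rws->lock);
}

static void up_read_recursive(struct recursive_rwsem *rws)
{
        spin_lock(&rws->lock);
        if (!--rws->readers)
                wake_up_all(&rws->wait);
        spin_unlock(&rws->lock);
}

static void down_write_exclusive(struct recursive_rwsem *rws)
{
        spin_lock(&rws->lock);
        while (rws->readers || rws->writer) {
                spin_unlock(&rws->lock);
                wait_event(rws->wait, !rws->readers && !rws->writer);
                spin_lock(&rws->lock);
        }
        rws->writer = current;
        spin_unlock(&rws->lock);
}

static void up_write_exclusive(struct recursive_rwsem *rws)
{
        spin_lock(&rws->lock);
        rws->writer = NULL;
        wake_up_all(&rws->wait);
        spin_unlock(&rws->lock);
}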

Paul

2008-06-27 03:22:28

by Gautham R Shenoy

Subject: Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue

On Thu, Jun 26, 2008 at 12:56:49AM -0700, Paul Menage wrote:
> CPUsets: Move most calls to rebuild_sched_domains() to the workqueue
>
> In the current cpusets code the lock nesting between cgroup_mutex and
> cpuhotplug.lock when calling rebuild_sched_domains is inconsistent -
> in the CPU hotplug path cpuhotplug.lock nests outside cgroup_mutex,
> and in all other paths that call rebuild_sched_domains() it nests
> inside.
>
> This patch makes most calls to rebuild_sched_domains() asynchronous
> via the workqueue, which removes the nesting of the two locks in that
> case. In the case of an actual hotplug event, cpuhotplug.lock nests
> outside cgroup_mutex as now.
>
> Signed-off-by: Paul Menage <[email protected]>
>
> ---
>
> Note that all I've done with this patch is verify that it compiles
> without warnings; I'm not sure how to trigger a hotplug event to test
> the lock dependencies or verify that scheduler domain support is still
> behaving correctly. Vegard, does this fix the problems that you were
> seeing? Paul/Max, does this still seem sane with regard to scheduler domains?
>

Hi Paul,

Using a multithreaded workqueue (kevent here) for this is not such a
great idea, since currently we cannot call get_online_cpus() from a
work item executed by a multithreaded workqueue.

Can one use a single-threaded workqueue here instead?
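
Something like this (untested; the queue name and init hook are just
for illustration):

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

static struct workqueue_struct *cpuset_wq;

static int __init cpuset_wq_init(void)
{
        /* one dedicated thread instead of the per-cpu kevent threads */
        cpuset_wq = create_singlethread_workqueue("cpuset");
        BUG_ON(!cpuset_wq);
        return 0;
}
core_initcall(cpuset_wq_init);

/* the callers in the patch would then queue here instead of using
 * schedule_work(): */
static void async_rebuild_sched_domains(void)
{
        queue_work(cpuset_wq, &rebuild_sched_domains_work);
}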

Or, better, I think we can ask Oleg to re-submit the patch he had to make
get_online_cpus() safe to be called from within the workqueue. It does
require a special post CPU_DEAD notification, but it did work the
last time I checked.

>
> [full patch snipped]

--
Thanks and Regards
gautham

by Gautham R Shenoy

Subject: Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue

On Fri, Jun 27, 2008 at 08:52:28AM +0530, Gautham R Shenoy wrote:
> On Thu, Jun 26, 2008 at 12:56:49AM -0700, Paul Menage wrote:
> > CPUsets: Move most calls to rebuild_sched_domains() to the workqueue
> >
> > In the current cpusets code the lock nesting between cgroup_mutex and
> > cpuhotplug.lock when calling rebuild_sched_domains is inconsistent -
> > in the CPU hotplug path cpuhotplug.lock nests outside cgroup_mutex,
> > and in all other paths that call rebuild_sched_domains() it nests
> > inside.
> >
> > This patch makes most calls to rebuild_sched_domains() asynchronous
> > via the workqueue, which removes the nesting of the two locks in that
> > case. In the case of an actual hotplug event, cpuhotplug.lock nests
> > outside cgroup_mutex as now.
> >
> > Signed-off-by: Paul Menage <[email protected]>
> >
> > ---
> >
> > Note that all I've done with this patch is verify that it compiles
> > without warnings; I'm not sure how to trigger a hotplug event to test
> > the lock dependencies or verify that scheduler domain support is still
> > behaving correctly. Vegard, does this fix the problems that you were
> > seeing? Paul/Max, does this still seem sane with regard to scheduler domains?
> >
>
> Hi Paul,
>

This time CC'ing Oleg!

> Using a multithreaded workqueue (kevent here) for this is not such a
> great idea, since currently we cannot call get_online_cpus() from a
> work item executed by a multithreaded workqueue.
>
> Can one use a single-threaded workqueue here instead?
>
> Or, better, I think we can ask Oleg to re-submit the patch he had to make
> get_online_cpus() safe to be called from within the workqueue. It does
> require a special post CPU_DEAD notification, but it did work the
> last time I checked.
>
> >
> > [full patch snipped]
>
> --
> Thanks and Regards
> gautham

--
Thanks and Regards
gautham

2008-06-27 04:54:11

by Max Krasnyansky

Subject: Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue

Gautham R Shenoy wrote:
> On Fri, Jun 27, 2008 at 08:52:28AM +0530, Gautham R Shenoy wrote:
>> On Thu, Jun 26, 2008 at 12:56:49AM -0700, Paul Menage wrote:
>>> CPUsets: Move most calls to rebuild_sched_domains() to the workqueue
>>>
>>> In the current cpusets code the lock nesting between cgroup_mutex and
>>> cpuhotplug.lock when calling rebuild_sched_domains is inconsistent -
>>> in the CPU hotplug path cpuhotplug.lock nests outside cgroup_mutex,
>>> and in all other paths that call rebuild_sched_domains() it nests
>>> inside.
>>>
>>> This patch makes most calls to rebuild_sched_domains() asynchronous
>>> via the workqueue, which removes the nesting of the two locks in that
>>> case. In the case of an actual hotplug event, cpuhotplug.lock nests
>>> outside cgroup_mutex as now.
>>>
>> Using a multithreaded workqueue (kevent here) for this is not such a
>> great idea, since currently we cannot call get_online_cpus() from a
>> work item executed by a multithreaded workqueue.
>>
>> Can one use a single-threaded workqueue here instead?
We could certainly do it in a single-threaded workqueue. It won't help to
avoid the circular locking dependencies though.
Did you get a chance to read the entire thread on this topic? There were some
questions that you might be able to answer.

Max

2008-06-27 05:10:51

by Max Krasnyansky

Subject: Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue

Paul Menage wrote:
> On Thu, Jun 26, 2008 at 1:34 PM, Paul Menage <[email protected]> wrote:
>> void get_online_cpus(void)
>> {
>>         might_sleep();
>>         if (cpu_hotplug.active_writer == current)
>>                 return;
>>         down_read(&cpu_hotplug.lock);
>> }
>>
>> void put_online_cpus(void)
>> {
>>         if (cpu_hotplug.active_writer == current)
>>                 return;
>>         up_read(&cpu_hotplug.lock);
>> }
>>
>> static void cpu_hotplug_begin(void)
>> {
>>         down_write(&cpu_hotplug.lock);
>>         cpu_hotplug.active_writer = current;
>> }
>>
>> static void cpu_hotplug_done(void)
>> {
>>         cpu_hotplug.active_writer = NULL;
>>         up_write(&cpu_hotplug.lock);
>> }
>>
>> I think that combined with moving the async rebuild_sched_domains to a
>> separate thread should solve the problem, but I'm wondering why
>> cpu_hotplug.lock was implemented this way in the first place.
>
> Oh, I guess that doesn't work because of recursive calls to
> get_online_cpus(). Maybe we need a down_read_recursive() that skips
> ahead of waiting writers if the lock is already held in read mode?

Instead of changing cpu_hotplug locking, should we maybe try to avoid using
cgroup_lock in rebuild_sched_domains()?
There is a comment in cpuset.c that says
* If a task is only holding callback_mutex, then it has read-only
* access to cpusets.

I'm not sure if it's still valid. rebuild_sched_domains() only needs read-only
access; it does not really modify any cpuset structures.

Max

2008-06-27 05:51:39

by Paul Menage

Subject: Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue

On Thu, Jun 26, 2008 at 10:10 PM, Max Krasnyansky <[email protected]> wrote:
>
> Instead of changing cpu_hotplug locking, should we maybe try to avoid using
> cgroup_lock in rebuild_sched_domains()?

Yes, that would be good too.

> There is a comment in cpuset.c that says
> * If a task is only holding callback_mutex, then it has read-only
> * access to cpusets.
>
> I'm not sure if it's still valid. rebuild_sched_domains() only needs read-only
> access; it does not really modify any cpuset structures.

The comment is still valid, if you interpret it narrowly enough.
Holding callback_mutex gives you read-only access to structures that
are under the control of cpusets. But rebuild_sched_domains() needs to
traverse the hierarchy of cpusets, and that hierarchy is controlled by
cgroups. Currently the only synchronization primitives exposed by
cgroups are:

- cgroup_lock()/cgroup_unlock() to prevent all cgroup modifications
(also used as the main synchronization primitive by some subsystems,
i.e. it's in danger of becoming the cgroups equivalent of the BKL).

- task_lock()/task_unlock() to prevent a specific task from changing cgroups

Possible options for richer locking support include:

- lock/unlock a hierarchy, to prevent creation/deletion of cgroups in
that hierarchy

- lock/unlock a cgroup to prevent deletion of that cgroup

- lock/unlock a cgroup to prevent task movement in/out of that cgroup

For the case of rebuild_sched_domains, we need the first of these
options. This lock would be taken in cgroup.c at the points where it
attaches and removes cgroups from a cgroup tree, and could be taken by
something like cpusets that needed to keep the hierarchy stable while
scanning it. I think it would be fine to make it a mutex rather than a
spinlock.
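
As a strawman (completely untested, and the names are invented), with a
single global mutex standing in for the per-hierarchy lock:

#include <linux/mutex.h>

static DEFINE_MUTEX(hierarchy_mutex);

/* cgroup.c would take this wherever it links or unlinks a cgroup */
void cgroup_lock_hierarchy(void)
{
        mutex_lock(&hierarchy_mutex);
}

void cgroup_unlock_hierarchy(void)
{
        mutex_unlock(&hierarchy_mutex);
}

/* cpusets could then keep the hierarchy stable while scanning it,
 * without taking cgroup_mutex at all: */
static void rebuild_sched_domains_stable(void)
{
        get_online_cpus();
        cgroup_lock_hierarchy();
        rebuild_sched_domains();
        cgroup_unlock_hierarchy();
        put_online_cpus();
}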

cpu_hotplug.lock has to nest outside this hierarchy lock due to it
being taken at the root of the hotplug/unplug path. So as long as we
can ensure that we can always nest the hierarchy lock inside any
get_online_cpus()/put_online_cpus() pairs, we should be OK.

Paul

2008-06-27 16:40:30

by Oleg Nesterov

Subject: Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue

On 06/27, Gautham R Shenoy wrote:
>
> > Using a multithreaded workqueue (kevent here) for this is not such a
> > great idea, since currently we cannot call get_online_cpus() from a
> > work item executed by a multithreaded workqueue.
> >
> > Can one use a single-threaded workqueue here instead?

Yes, this should help.

But I think you are right, we should just fix this annoying problem,

> > Or, better, I think we can ask Oleg to re-submit the patch he had to make
> > get_online_cpus() safe to be called from within the workqueue. It does
> > require a special post CPU_DEAD notification, but it did work the
> > last time I checked.

OK, will do on Sunday.

Oleg.

2008-06-27 17:31:29

by Max Krasnyansky

Subject: Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue

Paul Menage wrote:
> On Thu, Jun 26, 2008 at 10:10 PM, Max Krasnyansky <[email protected]> wrote:
>> Instead of changing cpu_hotplug locking, should we maybe try to avoid using
>> cgroup_lock in rebuild_sched_domains()?
>
> Yes, that would be good too.
>
>> There is a comment in cpuset.c that says
>> * If a task is only holding callback_mutex, then it has read-only
>> * access to cpusets.
>>
>> I'm not sure if it's still valid. rebuild_sched_domains() only needs read-only
>> access; it does not really modify any cpuset structures.
>
> The comment is still valid, if you interpret it narrowly enough.
> Holding callback_mutex gives you read-only access to structures that
> are under the control of cpusets. But rebuild_sched_domains() needs to
> traverse the hierarchy of cpusets, and that hierarchy is controlled by
> cgroups.
Yes, that's what I meant by "not sure if it's still valid". I looked at
the code and it did not look like callback_mutex protected the overall
hierarchy. Thanks for confirming that.

> Currently the only synchronization primitives exposed by
> cgroups are:
>
> - cgroup_lock()/cgroup_unlock() to prevent all cgroup modifications
> (also used as the main synchronization primitive by some subsystems,
> i.e. it's in danger of becoming the cgroups equivalent of the BKL).
>
> - task_lock()/task_unlock() to prevent a specific task from changing cgroups
>
> Possible options for richer locking support include:
>
> - lock/unlock a hierarchy, to prevent creation/deletion of cgroups in
> that hierarchy
Sounds good.

> - lock/unlock a cgroup to prevent deletion of that cgroup
Can that be just an atomic refcount?

> - lock/unlock a cgroup to prevent task movement in/out of that cgroup
Sounds good.

> For the case of rebuild_sched_domains, we need the first of these
> options. This lock would be taken in cgroup.c at the points where it
> attaches and removes cgroups from a cgroup tree, and could be taken by
> something like cpusets that needed to keep the hierarchy stable while
> scanning it. I think it would be fine to make it a mutex rather than a
> spinlock.
Agree

> cpu_hotplug.lock has to nest outside this hierarchy lock due to it
> being taken at the root of the hotplug/unplug path. So as long as we
> can ensure that we can always nest the hierarchy lock inside any
> get_online_cpus()/put_online_cpus() pairs, we should be OK.
Yes. Although that basically means that we always have to take
cpu_hotplug.lock before the hierarchy lock.

I like the proposal in general. Specifically for
rebuild_sched_domains() I'm now thinking that maybe we can get away with
not involving cpusets at all. I think that's what Peter meant originally.
I'll send more thoughts on this separately.

Max