2008-07-15 11:45:09

by Max Krasnyansky

Subject: [PATCH] cpuset: Make rebuild_sched_domains() usable from any context

From: Max Krasnyanskiy <[email protected]>

I do not really like the current solution of dropping cgroup lock
but it shows what I have in mind in general.

Basically as Paul J. pointed out rebuild_sched_domains() is the
only way to rebuild sched domains correctly based on the current
cpuset settings. What this means is that we need to be able to
call it from different contexts, like cpuhotplug for example.

In order to get there we need to redo the lock nesting a bit.
This patch tries to do that. New lock nesting is explained in the
comments.
Paul M has some patches for fine-grained cgroup locking, which should
simplify things in the future.

This passes light testing (building stuff and creating/removing
domains while bringing cpus off/online at the same time) and keeps
lockdep mostly happy. The only thing that I haven't resolved yet is
some complaints from lockdep when preemptible RCU is enabled.

Signed-off-by: Max Krasnyanskiy <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
kernel/cpuset.c | 91 ++++++++++++++++++++++++++++--------------------------
1 files changed, 47 insertions(+), 44 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 3c3ef02..7b9fa17 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -522,13 +522,8 @@ update_domain_attr(struct sched_domain_attr *dattr, struct cpuset *c)
* domains when operating in the severe memory shortage situations
* that could cause allocation failures below.
*
- * Call with cgroup_mutex held. May take callback_mutex during
- * call due to the kfifo_alloc() and kmalloc() calls. May nest
- * a call to the get_online_cpus()/put_online_cpus() pair.
- * Must not be called holding callback_mutex, because we must not
- * call get_online_cpus() while holding callback_mutex. Elsewhere
- * the kernel nests callback_mutex inside get_online_cpus() calls.
- * So the reverse nesting would risk an ABBA deadlock.
+ * Call under get_online_cpus().
+ * Must not be called with cgroup_lock or callback_mutex held.
*
* The three key local variables below are:
* q - a kfifo queue of cpuset pointers, used to implement a
@@ -581,6 +576,10 @@ void rebuild_sched_domains(void)
doms = NULL;
dattr = NULL;

+	/* We have to iterate over the cgroup hierarchy, so make sure
+	 * nobody is messing with it. */
+ cgroup_lock();
+
/* Special case for the 99% of systems with one, full, sched domain */
if (is_sched_load_balance(&top_cpuset)) {
ndoms = 1;
@@ -598,10 +597,10 @@ void rebuild_sched_domains(void)

q = kfifo_alloc(number_of_cpusets * sizeof(cp), GFP_KERNEL, NULL);
if (IS_ERR(q))
- goto done;
+ goto unlock;
csa = kmalloc(number_of_cpusets * sizeof(cp), GFP_KERNEL);
if (!csa)
- goto done;
+ goto unlock;
csn = 0;

cp = &top_cpuset;
@@ -688,10 +687,16 @@ restart:
BUG_ON(nslot != ndoms);

rebuild:
+	/* Drop cgroup lock before calling the scheduler.
+	 * This is not strictly necessary but simplifies lock nesting. */
+ cgroup_unlock();
+
/* Have scheduler rebuild sched domains */
- get_online_cpus();
partition_sched_domains(ndoms, doms, dattr);
- put_online_cpus();
+ goto done;
+
+unlock:
+ cgroup_unlock();

done:
if (q && !IS_ERR(q))
@@ -701,6 +706,27 @@ done:
/* Don't kfree(dattr) -- partition_sched_domains() does that. */
}

+/*
+ * Internal version of rebuild_sched_domains() that ensures proper
+ * lock nesting. rebuild_sched_domains() must be called under
+ * get_online_cpus() and it needs to take cgroup_lock(). Since most of
+ * the cpuset code is already holding cgroup_lock() while calling
+ * rebuild_sched_domains() we must drop it here and take it under
+ * get_online_cpus().
+ * This should go away when fine-grained cgroup locking is available.
+ *
+ * Must be called with cgroup_lock() held.
+ * Must not be called under get_online_cpus().
+ */
+static void __rebuild_sched_domains(void)
+{
+ cgroup_unlock();
+ get_online_cpus();
+ rebuild_sched_domains();
+ put_online_cpus();
+ cgroup_lock();
+}
+
static inline int started_after_time(struct task_struct *t1,
struct timespec *time,
struct task_struct *t2)
@@ -831,7 +857,7 @@ static int update_cpumask(struct cpuset *cs, char *buf)
heap_free(&heap);

if (is_load_balanced)
- rebuild_sched_domains();
+ __rebuild_sched_domains();
return 0;
}

@@ -1042,7 +1068,7 @@ static int update_relax_domain_level(struct cpuset *cs, s64 val)

if (val != cs->relax_domain_level) {
cs->relax_domain_level = val;
- rebuild_sched_domains();
+ __rebuild_sched_domains();
}

return 0;
@@ -1083,7 +1109,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
mutex_unlock(&callback_mutex);

if (cpus_nonempty && balance_flag_changed)
- rebuild_sched_domains();
+ __rebuild_sched_domains();

return 0;
}
@@ -1194,15 +1220,6 @@ static int cpuset_can_attach(struct cgroup_subsys *ss,

if (cpus_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
return -ENOSPC;
- if (tsk->flags & PF_THREAD_BOUND) {
- cpumask_t mask;
-
- mutex_lock(&callback_mutex);
- mask = cs->cpus_allowed;
- mutex_unlock(&callback_mutex);
- if (!cpus_equal(tsk->cpus_allowed, mask))
- return -EINVAL;
- }

return security_task_setscheduler(tsk, 0, NULL);
}
@@ -1216,14 +1233,11 @@ static void cpuset_attach(struct cgroup_subsys *ss,
struct mm_struct *mm;
struct cpuset *cs = cgroup_cs(cont);
struct cpuset *oldcs = cgroup_cs(oldcont);
- int err;

mutex_lock(&callback_mutex);
guarantee_online_cpus(cs, &cpus);
- err = set_cpus_allowed_ptr(tsk, &cpus);
+ set_cpus_allowed_ptr(tsk, &cpus);
mutex_unlock(&callback_mutex);
- if (err)
- return;

from = oldcs->mems_allowed;
to = cs->mems_allowed;
@@ -1677,15 +1691,9 @@ static struct cgroup_subsys_state *cpuset_create(
}

/*
- * Locking note on the strange update_flag() call below:
- *
* If the cpuset being removed has its flag 'sched_load_balance'
* enabled, then simulate turning sched_load_balance off, which
- * will call rebuild_sched_domains(). The get_online_cpus()
- * call in rebuild_sched_domains() must not be made while holding
- * callback_mutex. Elsewhere the kernel nests callback_mutex inside
- * get_online_cpus() calls. So the reverse nesting would risk an
- * ABBA deadlock.
+ * will call rebuild_sched_domains().
*/

static void cpuset_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
@@ -1894,7 +1902,7 @@ static void scan_for_empty_cpusets(const struct cpuset *root)
* in order to minimize text size.
*/

-static void common_cpu_mem_hotplug_unplug(int rebuild_sd)
+static void common_cpu_mem_hotplug_unplug(void)
{
cgroup_lock();

@@ -1902,13 +1910,6 @@ static void common_cpu_mem_hotplug_unplug(int rebuild_sd)
top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
scan_for_empty_cpusets(&top_cpuset);

- /*
- * Scheduler destroys domains on hotplug events.
- * Rebuild them based on the current settings.
- */
- if (rebuild_sd)
- rebuild_sched_domains();
-
cgroup_unlock();
}

@@ -1934,12 +1935,14 @@ static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,
case CPU_ONLINE_FROZEN:
case CPU_DEAD:
case CPU_DEAD_FROZEN:
- common_cpu_mem_hotplug_unplug(1);
break;
+
default:
return NOTIFY_DONE;
}

+ common_cpu_mem_hotplug_unplug();
+ rebuild_sched_domains();
return NOTIFY_OK;
}

@@ -1953,7 +1956,7 @@ static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,

void cpuset_track_online_nodes(void)
{
- common_cpu_mem_hotplug_unplug(0);
+ common_cpu_mem_hotplug_unplug();
}
#endif

--
1.5.5.1


2008-07-15 16:07:38

by Paul Jackson

Subject: Re: [PATCH] cpuset: Make rebuild_sched_domains() usable from any context

Max K wrote:
> Basically as Paul J. pointed out rebuild_sched_domains() is the
> only way to rebuild sched domains correctly based on the current
> cpuset settings. What this means is that we need to be able to
> call it from different contexts, like cpuhotplug for example.

The following is just a wild idea.

On systems using cpusets to setup scheduler domains (anything more
elaborate than the default of a single, system-wide, domain) have a
special purpose thread sit around, that does nothing except rebuild
sched domains on demand.

If this rebuild thread was the -only- way that sched domains were
allowed to be rebuilt, and if this rebuild was done -asynchronously-
sometime shortly after requested, without any error or status feedback,
then it would seem to simplify the locking issues.

I would guess that any system big enough to have a non-default, cpuset
determined, sched domain configuration is big enough to not mind one
more thread sitting around, doing nothing most of the time.

If the above is a really stupid idea, have fun taking out your
agression on it ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-07-15 16:11:29

by Paul Menage

Subject: Re: [PATCH] cpuset: Make rebuild_sched_domains() usable from any context

On Tue, Jul 15, 2008 at 9:07 AM, Paul Jackson <[email protected]> wrote:
> If this rebuild thread was the -only- way that sched domains were
> allowed to be rebuilt, and if this rebuild was done -asynchronously-
> sometime shortly after requested, without any error or status feedback,
> then it would seem to simplify the locking issues.

I sent a similar patch a couple of weeks ago that used a workqueue
to do the rebuild. It didn't quite work at the time, since it wasn't
safe to call get_online_cpus() from a multi-threaded workqueue,
but I believe there's been a patch since then that makes this safe.
And if not, we could always have a single-threaded workqueue that
wasn't bound to any particular CPU.

Paul

2008-07-15 16:14:40

by Paul Menage

Subject: Re: [PATCH] cpuset: Make rebuild_sched_domains() usable from any context

On Tue, Jul 15, 2008 at 4:44 AM, Max Krasnyansky <[email protected]> wrote:
> From: Max Krasnyanskiy <[email protected]>
>
> I do not really like the current solution of dropping cgroup lock
> but it shows what I have in mind in general.

I think that dropping the cgroup lock will open up races for cpusets.
The idea of a separate workqueue/thread to do the sched domain
rebuilding is simplest.

Paul

2008-07-15 17:19:46

by Max Krasnyansky

Subject: Re: [PATCH] cpuset: Make rebuild_sched_domains() usable from any context

Paul Menage wrote:
> On Tue, Jul 15, 2008 at 9:07 AM, Paul Jackson <[email protected]> wrote:
>> If this rebuild thread was the -only- way that sched domains were
>> allowed to be rebuilt, and if this rebuild was done -asynchronously-
>> sometime shortly after requested, without any error or status feedback,
>> then it would seem to simplify the locking issues.
>
> I sent a patch that was similar a couple of weeks ago, that used a
> workqueue to do the rebuild. It didn't quite work then since it wasn't
> safe to call get_online_cpus() from a multi-threaded workqueue then,
> but I believe there's been a patch since then that makes this safe.
> And if not, we could always have a single-threaded workqueue that
> wasn't bound to any particular CPU.

Actually I think we do not have to make it a super-strict "only
rebuild from that thread" rule. I'd only off-load cpuset_write64()
and update_flag() to the thread. It'd be nice to keep the hotplug path
cleanly synchronous. It's synchronous without cpusets, so there is
really no good reason why it needs to be async with them. And the
toughest part is not even hotplug, where the lock nesting is pretty clear:
get_online_cpus() ->
	rebuild_sched_domains() ->
		cgroup_lock();
		// Build cpumaps
		cpuset_callback_lock();
		...
		cpuset_callback_unlock();
		cgroup_unlock();

	partition_sched_domains() ->
		mutex_lock(&sched_domains_mutex);
		// Rebuild sched domains
		mutex_unlock(&sched_domains_mutex);
put_online_cpus()

It's the other paths, where cgroup_lock() is taken by cgroups before
even calling into cpusets, like the cgroup destroy case.
So I think we should just off-load those.

Max

2008-07-15 17:28:00

by Max Krasnyansky

Subject: Re: [PATCH] cpuset: Make rebuild_sched_domains() usable from any context

Paul Menage wrote:
> On Tue, Jul 15, 2008 at 4:44 AM, Max Krasnyansky <[email protected]> wrote:
>> From: Max Krasnyanskiy <[email protected]>
>>
>> I do not really like the current solution of dropping cgroup lock
>> but it shows what I have in mind in general.
>
> I think that dropping the cgroup lock will open up races for cpusets.
> The idea of a separate workqueue/thread to do the sched domain
> rebuilding is simplest.

Actually I audited (to the best of my knowledge) all the paths in
cpusets, and rebuild_sched_domains() is the last action, i.e. we drop
the lock right after it anyway. It's just that it's embedded deep in
the call stack and therefore I cannot drop it at the higher level.

The only path where I think it's not safe is the cgroup destroy case,
where we do

cgroup.c
cgroup_lock();
for_each_cgroups(...)
cg->destroy();
cgroup_unlock();

So in theory it's just that one path that really needs the workqueue
trick. But I do agree that it'll make things less tricky across the
board. So I'll pick up your workqueue-based patch, convert it to
single-threaded, bang on it a bit later today, and send a patch on top
of this one.

Max
