2005-09-12 11:30:54

by Paul Jackson

Subject: [PATCH] cpuset semaphore depth check optimize

Optimize the deadlock avoidance check on the global cpuset
semaphore cpuset_sem. Instead of adding a depth counter to the
task struct of each task, just two global words are enough: one
to store the depth and the other the current cpuset_sem holder.

Thanks to Nikita Danilov for the idea.

Signed-off-by: Paul Jackson <[email protected]>

Index: linux-2.6.13-mem_exclusive_oom/include/linux/sched.h
===================================================================
--- linux-2.6.13-mem_exclusive_oom.orig/include/linux/sched.h
+++ linux-2.6.13-mem_exclusive_oom/include/linux/sched.h
@@ -765,7 +765,6 @@ struct task_struct {
 	short il_next;
 #endif
 #ifdef CONFIG_CPUSETS
-	short cpuset_sem_nest_depth;
 	struct cpuset *cpuset;
 	nodemask_t mems_allowed;
 	int cpuset_mems_generation;
Index: linux-2.6.13-mem_exclusive_oom/kernel/cpuset.c
===================================================================
--- linux-2.6.13-mem_exclusive_oom.orig/kernel/cpuset.c
+++ linux-2.6.13-mem_exclusive_oom/kernel/cpuset.c
@@ -180,6 +180,8 @@ static struct super_block *cpuset_sb = N
  */

 static DECLARE_MUTEX(cpuset_sem);
+static struct task_struct *cpuset_sem_owner;
+static int cpuset_sem_depth;

 /*
  * The global cpuset semaphore cpuset_sem can be needed by the
@@ -200,16 +202,19 @@ static DECLARE_MUTEX(cpuset_sem);

 static inline void cpuset_down(struct semaphore *psem)
 {
-	if (current->cpuset_sem_nest_depth == 0)
+	if (cpuset_sem_owner != current) {
 		down(psem);
-	current->cpuset_sem_nest_depth++;
+		cpuset_sem_owner = current;
+	}
+	cpuset_sem_depth++;
 }

 static inline void cpuset_up(struct semaphore *psem)
 {
-	current->cpuset_sem_nest_depth--;
-	if (current->cpuset_sem_nest_depth == 0)
+	if (--cpuset_sem_depth == 0) {
+		cpuset_sem_owner = NULL;
 		up(psem);
+	}
 }

 /*

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373


2005-09-12 11:41:44

by Andrew Morton

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Paul Jackson <[email protected]> wrote:
>
> +static struct task_struct *cpuset_sem_owner;
> +static int cpuset_sem_depth;
>
>  /*
>   * The global cpuset semaphore cpuset_sem can be needed by the
> @@ -200,16 +202,19 @@ static DECLARE_MUTEX(cpuset_sem);
>
>  static inline void cpuset_down(struct semaphore *psem)
>  {
> -	if (current->cpuset_sem_nest_depth == 0)
> +	if (cpuset_sem_owner != current) {
>  		down(psem);
> -	current->cpuset_sem_nest_depth++;
> +		cpuset_sem_owner = current;
> +	}
> +	cpuset_sem_depth++;
>  }

Better, but still hacky. The rest of the kernel manages to avoid the need
for nestable semaphores by getting the locking design sorted out. Can
cpusets do that sometime?

2005-09-12 11:48:51

by Nikita Danilov

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Paul Jackson writes:

[...]

>
>  static inline void cpuset_down(struct semaphore *psem)
>  {
> -	if (current->cpuset_sem_nest_depth == 0)
> +	if (cpuset_sem_owner != current) {
>  		down(psem);
> -	current->cpuset_sem_nest_depth++;
> +		cpuset_sem_owner = current;
> +	}
> +	cpuset_sem_depth++;
>  }

Err... note that now cpuset_{down,up}() take semaphore as a parameter,
but use global cpuset_sem_{owner,depth} to track recursion. This, I
believe, is an inconsistent API---it only works for a single semaphore
as passing different @psem's would lead to deadlocks and meaningless
owner and depth.

What about making these functions

static void cpuset_{down,up}(void);

operating on cpuset_sem internally?

"I won't rest till it's the best ..." :-)

Nikita.

2005-09-12 13:26:19

by Paul Jackson

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Nikita wrote:
> static void cpuset_{down,up}(void);

I started with (void) calls when I first wrote this hack,
then changed it to taking a semaphore pointer, intentionally.

The calls:
	cpuset_down(&cpuset_sem);
	cpuset_up(&cpuset_sem);
exactly replace calls:
	down(&cpuset_sem);
	up(&cpuset_sem);

I wanted that visual resemblance.

I agree, it's asymmetric, which is not so good.

But the resemblance is more valuable, in my view.

So I will stick with what I've got, unless I see stronger
signs of a consensus to the contrary.

Is that ok?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-12 14:39:18

by Linus Torvalds

Subject: Re: [PATCH] cpuset semaphore depth check optimize



On Mon, 12 Sep 2005, Andrew Morton wrote:
>
> Better, but still hacky. The rest of the kernel manages to avoid the need
> for nestable semaphores by getting the locking design sorted out. Can
> cpusets do that sometime?

Well, if by "rest of the kernel" you ignore the BKL, then yes.

Nesting isn't wrong per se - sometimes it allows things that would
otherwise be very nasty to code around. We've been very strict with not
allowing nesting for the low-level primitives (ie spinlocks etc), and
instead requiring that people use them very carefully, but I don't think
nesting is necessarily wrong for high-level constructs.

Personally, the thing that makes me think the patch is ugly is the fact
that the different parts of the nested semaphore are all separate. I'd
prefer to see a

	struct nested_semaphore {
		struct semaphore sem;
		struct task_struct *owner;
		unsigned int count;
	};

and then operate on _that_ level instead.
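
For concreteness, the logic of Paul's patch transplanted onto such a
struct might look like this (nested_down/nested_up are illustrative
names only, not from any tree):

	static inline void nested_down(struct nested_semaphore *nsem)
	{
		if (nsem->owner != current) {
			down(&nsem->sem);
			nsem->owner = current;
		}
		nsem->count++;
	}

	static inline void nested_up(struct nested_semaphore *nsem)
	{
		if (--nsem->count == 0) {
			nsem->owner = NULL;
			up(&nsem->sem);
		}
	}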

But keep it internal to the cpuset stuff - while I don't think nested
semaphores are evil, they _are_ sometimes an excuse to be lazy and do
things wrong just because it's easier.

Maybe that's the case in cpusets too, and Andrew may be right about this.

Linus

2005-09-12 14:55:35

by Paul Jackson

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Andrew wrote:
> Better, but still hacky. The rest of the kernel manages to avoid the need
> for nestable semaphores by getting the locking design sorted out. Can
> cpusets do that sometime?

Short answer:
=============
Yup - absolutely a hack. So far it's working better than what I had
before. I have no plans to replace it. I'm open to suggestions.


Long answer:
============
I had this without such a hack for a year, with the code that required
calling refresh_mems() after getting cpuset_sem in the top level
cpuset code, before any path that might get to __alloc_pages().
It was *the* primary source of serious bugs in the cpuset code this
last year. You've been lucky enough to only see a couple of them.

I've got paths in __alloc_pages() and below that acquire cpuset_sem,
but only rarely. I am not aware of any other global kernel lock
that routinely shows up both _above_ and _below_ __alloc_pages on
the stack. Normally such would be a big problem for numa scaling.
This one avoids the scaling problem by only being on rare paths when
called within __alloc_pages. But that means I can't test for bugs;
I have to code so it is obviously right in the first place (not a
bad idea in general ;).

I have *never* actually recreated such a deadlock myself, on a stock
kernel. I've seen exactly one live kernel backtrace showing such
a deadlock from a customer site in my life, and reliable reports of
a few more. I've never seen such a backtrace or heard evidence of
such a deadlock on an internal test machine running a stock kernel.

I've spent days trying to concoct test scenarios that would trip these
deadlocks on a stock kernel, with no success. The specific low memory
conditions required, along with other timing races, are too difficult
(so far anyway) to recreate on demand.

Except for that one time, every one of these bugs (double tripping on
cpuset_sem) was found, and fixed, solely by inspection. And the bugs
kept coming.

The final straw came when I took a vacation, came back, forgot how my
scheme worked and started writing totally broken code and stupid
comments, based on my newly acquired misunderstanding.

Something had to give.

It was time to cast elegance aside and find a locking scheme that even
a stupid person could understand and get right.

This hack has two serious advantages:
1) It's as obvious as a big old scar on one's face. In just a
couple days, it has generated more informed discussion than the
old one ever did, by readers who had no trouble seeing what it
did, commenting without hesitation on alternatives. The rare
comments I got on the previous scheme came only after hours of
head scratching, and with some uncertainty.
2) It works. Every variation of this I've coded has, so far as I
know, been rock solid. The discussion has only been over style
and data space issues. It looks like it will continue to work
if per chance I add more cpuset features that allocate memory
somewhere in the code. The only rule is that every up/down on
cpuset_sem has to be done by the wrappers, which is why I have
insisted on having the wrapper calls look like regular calls
to "up(&cpuset_sem)" and "down(&cpuset_sem)", so this convention
would be rather obvious.

"If you're going to hack, at least do it in plain sight."

If someone would like to recommend a better scheme, I'm all ears.
I am well aware that there are others in the kernel who know locking
better than I do. And I am well aware that the experts in such stuff
don't end up coding hacks such as this.

Barring some inspiration however, probably requiring the help of input
from others, it is unlikely that I will be posting a patch to remove
this hack anytime soon.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-12 15:32:21

by Paul Jackson

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Linus wrote:
> and then operate on _that_ level instead.

As I noted in my reply a few minutes ago, the one unusual rule that
this scheme imposes is that all up's and down's on cpuset_sem must be
done via the wrappers.

So I have continued to strive to have the lock and unlock calls have as
literal substrings "up(&cpuset_sem)" and "down(&cpuset_sem)", such
as with "cpuset_up(&cpuset_sem)" and "cpuset_down(&cpuset_sem)".

This serves as a clear visual reminder of this extra wrapper rule.

The usual "best practices" of:
1) consistent API's (referring to Nikita's suggestion that these
routines have "void" arguments instead of "&cpuset_sem"),
2) encapsulating related data (your suggestion here), and
3) [in my inbox] Nikita's cpuset_lock/cpuset_unlock hiding, echoing
an earlier suggestion of Linus's

are appropriate and desirable mechanisms for building clean abstraction
layers.

I am more of a mind to code this as a thinly veiled hack for use just
within cpuset.c, not another abstraction layer.

I can certainly code this as a proper layer, if you like. My intuition
is that, in this case, doing so would slightly increase the mental load
on the reader, not decrease it.

In actuality, I don't code for elegance so much as I code to minimize
the time it takes the typical reader to -correctly- understand what's
going on.

But if after all my eloquence of the last hour, Linus, Nikita and
Andrew are all in agreement that cpuset_lock/cpuset_unlock with
struct encapsulation of the 3 data items is preferable, I'll gladly
code that up. Well, actually, just a single clear "make it so"
from Linus or Andrew would likely be sufficient.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-12 15:38:09

by Paul Jackson

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Nikita wrote:
> static void cpusets_lock(void);
> static void cpusets_unlock(void);

Great minds think alike. That's exactly what
Linus first proposed, when confronted with this.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-12 17:08:10

by Roman Zippel

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Hi,

On Mon, 12 Sep 2005, Paul Jackson wrote:

> I've got paths in __alloc_pages() and below that acquire cpuset_sem,
> but only rarely. I am not aware of any other global kernel lock
> that routinely shows up both _above_ and _below_ __alloc_pages on
> the stack. Normally such would be a big problem for numa scaling.
> This one avoids the scaling problem by only being on rare paths when
> called within __alloc_pages. But that means I can't test for bugs;
> I have to code so it is obviously right in the first place (not a
> bad idea in general ;).

Maybe I'm missing something, but why don't you use two locks?
One general management lock to create/insert/remove cpusets and a
low-level lock (maybe even a spinlock) which manages the state of an
active cpuset.
For that you have to figure out (and document) how the fields in the
cpuset structure are used in various situations.
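
For concreteness, such a split might look like this (both names are
illustrative only):

	/* guards create/insert/remove of cpusets in the hierarchy */
	static DECLARE_MUTEX(cpuset_manage_sem);

	/* guards the state of an active cpuset, cheap enough to take
	 * from allocator callbacks */
	static DEFINE_SPINLOCK(cpuset_state_lock);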

bye, Roman

2005-09-13 07:04:15

by Ingo Molnar

Subject: Re: [PATCH] cpuset semaphore depth check optimize


* Linus Torvalds <[email protected]> wrote:

> Personally, the thing that makes me think the patch is ugly is the
> fact that the different parts of the nested semaphore are all
> separate. I'd prefer to see a
>
> 	struct nested_semaphore {
> 		struct semaphore sem;
> 		struct task_struct *owner;
> 		unsigned int count;
> 	};
>
> and then operate on _that_ level instead.

btw., this is how the -rt tree implements (read-)nesting for rwsems and
rwlocks. The more sharing and embedding of types and primitives, the
more compact the whole code becomes, and the easier it is to change
fundamental properties.

Ingo

2005-09-13 11:43:16

by Roman Zippel

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Hi,

On Mon, 12 Sep 2005, Paul Jackson wrote:

> There are two reasons a cpuset is accessed from within the
> allocation code.
> 1) Update the per-task mems_allowed if the current tasks cpuset
> has changed its memory placement due to some other task
> independently modifying that cpuset.
> 2) Walk up the cpuset hierarchy, starting from the tasks
> cpuset, looking for a cpuset that is marked mem_exclusive,
> and using the mems_allowed from that exclusive cpuset.

If I read the source correctly, a cpuset cannot be removed or moved while
it's attached to a task, which makes it a lot simpler.

This means you have to take the second lock when accessing tsk->cpuset
(you can basically replace task_lock()). Any allocator callback can now
use the second lock to scan the cpusets. IOW as soon as count is different
from zero, the cpuset is active and certain members become read-only.
Activation/deactivation is controlled by the second lock.

You can BTW avoid locking in cpuset_exit() completely in the common case:

	tsk->cpuset = NULL;
	if (atomic_dec_and_test(&cs->count) && notify_on_release(cs)) {
		...
	}

Here you only need to release the reference; no one else should use that
task anymore. You only have to check in attach_task() that you don't
change a dead task.

There may be a subtle problem with cpuset_fork(): there is a window from
dup_task_struct() until cpuset_fork(), where we have two pointers to a
cpuset but only a single reference. Another process could now change the
cpuset of the parent process to a different cpuset and the child process
may end up with an invalid cpuset, and I don't see how this is protected. The
only (simple) solution I see is to do this:

	lock();
	tsk->cpuset = current->cpuset;
	atomic_inc(&tsk->cpuset->count);
	unlock();

bye, Roman

2005-09-13 17:37:39

by Paul Jackson

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Thank you, Roman, for looking into cpuset locks.

I am starting to see the picture you are painting, with a global
cpuset_sem to guard changes in the cpuset hierarchy (add/remove)
and a per-cpuset spinlock to briefly pin down an active cpuset
and guard select values (such as mems_allowed) for access from
allocator callbacks and other tasks.

Roman wrote:
> If I read the source correctly, a cpuset cannot be removed or moved while
> it's attached to a task, which makes it a lot simpler.

Yes - a cpuset cannot be removed while attached (count > 0). And there
is no 'move' that I know of. A rename(2) system call on a cpuset in
the cpuset filesystem gets the vfs default -EINVAL response.

So, yes, if I can pin a cpuset with a per-cpuset spinlock on it, then
its parent chain (and whatever else I take care to guard with that
spinlock) is held as well (the cpuset->parent chain is pinned). I
guess this is some of what you meant by your phrase "makes it a lot
simpler".


> This means you have to take the second lock ...

By "second lock", did you mean what you described in your earlier
message as:

> low-level lock (maybe even a spinlock) which manages the state of an
> active cpuset.


You also wrote:
> You can BTW avoid locking in cpuset_exit() completely in the common case:
>
> tsk->cpuset = NULL;
> if (atomic_dec_and_test(&cs->count) && notify_on_release(cs)) {

I don't think that works. And I suspect you are proposing the same bug
that I had, and fixed with the following patch:

> changeset: 1793:feaa2e5b1e0026994365daf8d1ad3bd745ba8a14
> user: Paul Jackson <[email protected]>
> date: Fri May 27 08:07:26 2005 +0019
> files: kernel/cpuset.c
> description:
> [PATCH] cpuset exit NULL dereference fix
>
> There is a race in the kernel cpuset code, between the code
> to handle notify_on_release, and the code to remove a cpuset.
> The notify_on_release code can end up trying to access a
> cpuset that has been removed. In the most common case, this
> causes a NULL pointer dereference from the routine cpuset_path.
> However all manner of bad things are possible, in theory at least.
>
> The existing code decrements the cpuset use count, and if the
> count goes to zero, processes the notify_on_release request,
> if appropriate. However, once the count goes to zero, unless we
> are holding the global cpuset_sem semaphore, there is nothing to
> stop another task from immediately removing the cpuset entirely,
> and recycling its memory.


You also wrote:
> There may be a subtle problem with cpuset_fork()

Hmmm ... interesting. I will think about this some more.

> The only (simple) solution I see is to do this:
>
> lock();
> tsk->cpuset = current->cpuset;
> atomic_inc(&tsk->cpuset->count);
> unlock();

What "lock()" and "unlock()" is this? Your "second lock",
aka "low-level lock (maybe even a spinlock)" ?

Separate question:

When cpuset_excl_nodes_overlap() or cpuset_zone_allowed()
callbacks are walking back up the cpuset hierarchy in the
nearest_exclusive_ancestor() code, how do they handle the
per-cpuset spinlocks?

My assumption is that it would be like this. Say my current task
is attached to cpuset "/dev/cpuset/A/B/C", and it happens I will
have to walk all the way to the top to find the exclusive ancestor
that I am searching for. I'd expect the spinlocks to get taken
and released in the following order on these cpusets:

lock /A/B/C
lock /A/B
lock /A
lock /
==> found what I was searching for
unlock /
unlock /A
unlock /A/B
unlock /A/B/C

If at any point, perhaps in the creation of a cpuset, I had reason
to want to get a child cpuset's spinlock while already holding the
parent's, I would have to release the parent, and relock starting
with the child first, working upward. The proper order to obtain
these locks would always be bottom up.

Does this resemble what you are suggesting, Roman?

Your review of this is most appreciated. Once again - thanks.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-13 22:22:12

by Paul Jackson

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Ingo, confirming Linus's suggestion:
> btw., this is how the -rt tree implements (read-)nesting for rwsems and
> rwlocks. The more sharing and embedding of types and primitives, the
> more compact the whole code becomes, and the easier it is to change
> fundamental properties.

I completely agree.

Such is the art of fine programming.

My basic concern was that Linus was trying to put lipstick
on a pig.

If one gets the underlying structure right, then one should
package it in the best way one can, such as you and Linus
describe.

If one has a hack, better to leave it naked to the world,
with a minimum of artifice.

That way it attracts attention from those who know better
and are repulsed. And that way, when something better
comes along, it will be easy to remove the simple hack.

It looks like Roman is on my case. This is good.

(Of course, if you have a barn full of hogs, maybe
it's time to paint the barn ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-14 06:01:41

by Paul Jackson

Subject: Re: [PATCH] cpuset semaphore depth check optimize

pj wrote:
> I'd expect the spinlocks to get taken
> and released in the following order on these cpusets:
>
> lock /A/B/C
> lock /A/B
> lock /A
> lock /
> ==> found what I was searching for
> unlock /
> unlock /A
> unlock /A/B
> unlock /A/B/C

The appropriate condition required to prevent deadlock is weaker than
stated above. The condition should be:

* A task can hold the spinlocks for multiple cpusets, but only
* if it acquires in bottom up order. That is, whenever a task
* tries to lock a cpuset, the only cpusets it may already have
* locked must be descendants of the one it is going for.

With this, the following sequence of lock operations would also
be acceptable, holding the bottom lock, while walking up the tree,
locking and unlocking each ancestor in turn, until one is found
that satisfies the present query. Only the bottom most lock has
to be held in this approach, for the duration.

lock /A/B/C
lock /A/B
unlock /A/B
lock /A
unlock /A
lock /
==> found what I was searching for
unlock /
unlock /A/B/C

This sequence is a little easier to implement, because there is no need
to keep a variable length queue of locks to be undone. At most two
locks are held at any time.
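
A sketch of that walk, assuming a hypothetical per-cpuset spinlock
field (the cs->lock field is made up, and the exclusive test stands in
for whatever nearest_exclusive_ancestor() really checks; the starting
cpuset is assumed to have been examined already):

	static struct cpuset *walk_up_sketch(struct cpuset *cs)
	{
		struct cpuset *p;

		spin_lock(&cs->lock);		/* pin the bottom cpuset */
		for (p = cs->parent; p; p = p->parent) {
			spin_lock(&p->lock);	/* at most two locks held */
			if (is_mem_exclusive(p)) {
				spin_unlock(&p->lock);
				break;	/* found what I was searching for */
			}
			spin_unlock(&p->lock);
		}
		spin_unlock(&cs->lock);
		return p;
	}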

If a variable length queue of locks to be undone had been needed, it
could have been implemented using one more field in each cpuset,
forming a LIFO linked list of cpusets to be unlocked.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-14 15:57:10

by Roman Zippel

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Hi,

On Tue, 13 Sep 2005, Paul Jackson wrote:

> > If I read the source correctly, a cpuset cannot be removed or moved while
> > it's attached to a task, which makes it a lot simpler.
>
> Yes - a cpuset cannot be removed while attached (count > 0). And there
> is no 'move' that I know of. A rename(2) system call on a cpuset in
> the cpuset filesystem gets the vfs default -EINVAL response.
>
> So, yes, if I can pin a cpuset with a per-cpuset spinlock on it, then
> its parent chain (and whatever else I take care to guard with that
> spinlock) is held as well (the cpuset->parent chain is pinned). I
> guess this is some of what you meant by your phrase "makes it a lot
> simpler".

I don't think a per-cpuset spinlock will be necessary (at least
initially).
The complete active condition is actually (atomic_read(&cs->count) ||
!list_empty(&cs->children)). This means that if any child is possibly
active, so is the parent.
Modifications in the cpuset hierarchy require the cpuset_sem and an
inactive cpuset; (de)activating a cpuset requires the cpuset_sem and
(let's call it) cpuset_tasklock.
Callbacks from the allocator now only need cpuset_tasklock to access the
cpuset via tsk->cpuset and to keep it active, and an active cpuset can't
be removed from the hierarchy.

> You also wrote:
> > You can BTW avoid locking in cpuset_exit() completely in the common case:
> >
> > tsk->cpuset = NULL;
> > if (atomic_dec_and_test(&cs->count) && notify_on_release(cs)) {
>
> I don't think that works. And I suspect you are proposing the same bug
> that I had, and fixed with the following patch:

You're right, it should better look like this:

	tsk->cpuset = NULL;
	if (atomic_read(&cs->count) == 1 && notify_on_release(cs)) {
		...
	}
	atomic_dec(&cs->count);

This way the worst that can happen is that two notifications are sent.

bye, Roman

2005-09-14 19:47:09

by Paul Jackson

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Roman wrote:
> I don't think a per-cpuset spinlock will be necessary ...

Arghh. I'm still playing 'pin the tail on the donkey' guessing games
here, trying to guess what you think is necessary.

So, if a per-cpuset spinlock isn't necessary, then are you saying that
going from one global lock to two global locks (where the second one
might be a spinlock) might work? Or are you saying just the current
one global semaphore should work? I'm guessing the former, as you can
see from my further replies below.

Could you please be a little more verbose? Thanks.

And could you answer a couple of my previous questions, on the same
area:

Question 1:

Roman: This means you have to take the second lock ...

Paul: By "second lock", did you mean what you described in your
earlier message as:
> low-level lock (maybe even a spinlock) which manages the
> state of an active cpuset.

[Aside - note your phrase 'manages the state of an active cpuset'.
It doesn't surprise me that I thought from this you had in
mind a per-cpuset lock, not just a second global lock.]

Question 2:

Roman: There may be a subtle problem with cpuset_fork()

Paul: Hmmm ... interesting. I will think about this some more.

Roman:
> The only (simple) solution I see is to do this:
>
> lock();
> tsk->cpuset = current->cpuset;
> atomic_inc(&tsk->cpuset->count);
> unlock();

Paul: What "lock()" and "unlock()" is this? Your "second lock",
aka "low-level lock (maybe even a spinlock)" ?

Back to new comments ...

Roman wrote:
> The complete active condition is actually (atomic_read(&cs->count) ||
> !list_empty(&cs->children)). These means if any child is possibly active
> so is the parent.

Yes - agreed.


Roman wrote:
> Modifications in the cpuset hierarchy require the cpuset_sem and an
> inactive cpuset, (de)activating a cpuset requires the cpuset_sem and
> (let's call it) cpuset_tasklock.

Is this 'cpuset_tasklock' the same as the earlier 'second lock'
and the 'lock()/unlock()'? My current guess is yes - same.

I suspect, though I haven't gotten it clear enough yet in my
mind to be confident, that something like what I guess you're describing
would be sufficient to keep a cpuset from evaporating out from under
us.

And from this last comment of yours, I am guessing that 'cpuset_tasklock'
is one global lock, not per cpuset, and that the answer to my first
question above is that you are suggesting going from one global lock
to two global locks.

But I don't see what, in your proposal, ensures that it is safe to
read out the mems_allowed vector (multi-word, perhaps). I need to
do more than make sure cpusets don't evaporate out from under me
at inopportune times. I also need to freeze their values, so I
can do non atomic reads of multiple distinct values, or of multiword
values, out of them. What does that?

And I am also still confused as to how this second cpuset_tasklock
works, though that might be more due to my stupidity than any lack of
clarity in your explanations. I'll probably need a little more
tutorial there, before we're done.

I'm also inclined, if I see that it is within reach, to prefer a
per-cpuset lock, rather than just global locks. If I could get
the locks that are required by the callbacks, such as from beneath
__alloc_pages(), to only need per-cpuset locks, then this would reduce
the risk that these turn into performance and scalability issues
someday on really large systems. I've got a nice hierarchy to the
cpusets, so imposing the partial order on per-cpuset locks necessary
to avoid deadlock should be easy enough.


Roman wrote [modified to reinsert some elided code - pj]:
> You're right, it should better look like this:
>
> 	tsk->cpuset = NULL;
> 	if (atomic_read(&cs->count) == 1 && notify_on_release(cs)) {
> 		char *pathbuf = NULL;
>
> 		cpuset_down(&cpuset_sem);
> 		if (atomic_dec_and_test(&cs->count))
> 			check_for_release(cs, &pathbuf);
> 		cpuset_up(&cpuset_sem);
> 		cpuset_release_agent(pathbuf);
> 	}
> 	atomic_dec(&cs->count);
>
> This way the worst that can happen is that two notifications are sent.

I don't think that works at all. Consider the following sequence:
1) The first 'atomic_read' returns 2
2) [ The other task holding a reference drops out. ]
(so count is 1 now)
3) The atomic_dec() moves the count from 1 to 0.
4) Oops - we just missed doing a release.

Your comment "This way the worst that can happen is that two
notifications are sent." went whizzing right past me ...


==> I suspect that I am actually close to understanding what you're
suggesting, and having an informal agreement with you on what
to do.

When I get to that point, I will need to put this aside for a
week, and spend more time on another task my manager needs.

Then I should be able to return to this, and code up a polished
version of what we agreed to, and present it for review.

Once again, thanks for your assistance.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-15 10:51:35

by Roman Zippel

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Hi,

On Wed, 14 Sep 2005, Paul Jackson wrote:

> Arghh. I'm still playing 'pin the tail on the donkey' guessing games
> here, trying to guess what you think is necessary.

Sorry, I thought it was more obvious... :)

> So, if a per-cpuset spinlock isn't necessary, then are you saying that
> going from one global local to two global locks (where the second one
> might be a spinlock) might work? Or are you saying just the current
> one global semaphore should work? I'm guessing the former, as you can
> see from my further replies below.

In the basic model you need two locks: 1) the current semaphore for
general cpuset management, and 2) a second lock to synchronize with kernel
callbacks; the type of this lock depends on the callback requirements,
but for simplicity let's assume a spinlock. It's possible to split it
further into per-cpuset spinlocks, but let's get the basic scheme
working, more fine grained locking can be built on top of this.
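
As a sketch, a kernel callback under this scheme would then look
something like this (the function name is made up; cpuset_tasklock is
the second lock):

	static int cpuset_node_allowed_sketch(struct task_struct *tsk, int node)
	{
		struct cpuset *cs;
		int allowed;

		spin_lock(&cpuset_tasklock);
		cs = tsk->cpuset;	/* stable while the spinlock is held */
		allowed = node_isset(node, cs->mems_allowed);
		spin_unlock(&cpuset_tasklock);
		return allowed;
	}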

> Question 1:
>
> Roman: This means you have to take the second lock ...
>
> Paul: By "second lock", did you mean what you described in your
> earlier message as:
> > low-level lock (maybe even a spinlock) which manages the
> > state of an active cpuset.
>
> [Aside - note your phrase 'manages the state of an active cpuset'.
> It doesn't surprise me that I thought from this you had in
> mind a per-cpuset lock, not just a second global lock.]

Active meant here that it's attached to a task. Any access to
"tsk->cpuset" has to be protected by the spinlock (in the general case,
e.g. for cpuset_exit() it's avoidable).

> Roman:
> > The only (simple) solution I see is to do this:
> >
> > lock();
> > tsk->cpuset = current->cpuset;
> > atomic_inc(&tsk->cpuset->count);
> > unlock();
>
> Paul: What "lock()" and "unlock()" is this? Your "second lock",
> aka "low-level lock (maybe even a spinlock)" ?

Yes, see above.

> > Modifications in the cpuset hierarchy require the cpuset_sem and an
> > inactive cpuset, (de)activating a cpuset requires the cpuset_sem and
> > (let's call it) cpuset_tasklock.
>
> Is this 'cpuset_tasklock' the same as the earlier 'second lock'
> and the 'lock()/unlock()'? My current guess is yes - same.

Yes.

> I suspect, though I haven't gotten it clear enough yet in my
> mind to be confident, that something like I guess you're describing
> would be sufficient to keep a cpuset from evaporating out from under
> us.
>
> And from this last comment of yours, I am guessing that 'cpuset_tasklock'
> is one global lock, not per cpuset, and that the answer to my first
> question above is that you are suggesting going from one global lock
> to two global locks.
>
> But I don't see what, in your proposal, ensures that it is safe to
> read out the mems_allowed vector (multi-word, perhaps). I need to
> do more than make sure cpusets don't evaporate out from under me
> at inopportune times. I also need to freeze their values, so I
> can do non atomic reads of multiple distinct values, or of multiword
> values, out of them. What does that?

All kernel callbacks are done using the spinlock; as soon as you hold this
spinlock it's safe to dereference tsk->cpuset, as attach_task() can't
change this field. Also attach_task() makes sure that the count of an
attached cpuset is always different from zero, so cpuset_rmdir() cannot
remove it. attach_task() and cpuset_rmdir() are synchronized using the
semaphore.

You have to concentrate on the state of a cpuset: it can be either active
or inactive. Certain operations (e.g. removal) are only possible if it's
inactive, and these are protected by the semaphore. Other operations are
only possible when the cpuset is active, and you need the spinlock for
them; as long as the cpuset is active and you only hold the semaphore,
some operations are not possible (e.g. changing the parent).
attach_task() switches between these two states and so has to briefly
take both locks.
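
In code, that switch would briefly nest both locks, roughly like this
(just a sketch, ignoring notify_on_release; tsk is the target task and
cs the new cpuset):

	struct cpuset *oldcs;

	down(&cpuset_sem);		/* hierarchy side */
	spin_lock(&cpuset_tasklock);	/* callback side */
	oldcs = tsk->cpuset;
	tsk->cpuset = cs;
	atomic_inc(&cs->count);
	spin_unlock(&cpuset_tasklock);
	up(&cpuset_sem);
	atomic_dec(&oldcs->count);	/* drop the old cpuset afterwards */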

> I'm also inclined, if I see that it is within reach, to prefer a
> per-cpuset lock, rather than just global locks. If I could get
> the locks that are required by the callbacks, such as from beneath
> __alloc_pages(), to only need per-cpuset locks, then this would reduce
> the risk that these turn into performance and scalability issues
> someday on really large systems. I've got a nice hierarchy to the
> cpusets, so imposing the partial order on per-cpuset locks necessary
> to avoid deadlock should be easy enough.

Yes, but that's a separate step.

> Roman wrote [modified to reinsert some elided code - pj]:
> > You're right, it should better look like this:
> >
> > 	tsk->cpuset = NULL;
> > 	if (atomic_read(&cs->count) == 1 && notify_on_release(cs)) {
> > 		char *pathbuf = NULL;
> >
> > 		cpuset_down(&cpuset_sem);
> > 		if (atomic_dec_and_test(&cs->count))
> > 			check_for_release(cs, &pathbuf);
> > 		cpuset_up(&cpuset_sem);
> > 		cpuset_release_agent(pathbuf);
> > 	}
> > 	atomic_dec(&cs->count);
> >
> > This way the worst that can happen is that two notifications are sent.
>
> I don't think that works at all. Consider the following sequence:
> 1) The first 'atomic_read' returns 2
> 2) [ The other task holding a reference drops out. ]
> (so count is 1 now)
> 3) The atomic_dec() moves the count from 1 to 0.
> 4) Oops - we just missed doing a release.

If you want to do it like this, just add an else before the last
atomic_dec().
The main point I was trying to make is that clearing tsk->cpuset doesn't
require the spinlock, as long as you check for dead tasks in
attach_task(). Ignore the rest if it's too confusing.
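
Spelled out, the else variant of the snippet above would read:

	tsk->cpuset = NULL;
	if (atomic_read(&cs->count) == 1 && notify_on_release(cs)) {
		char *pathbuf = NULL;

		cpuset_down(&cpuset_sem);
		if (atomic_dec_and_test(&cs->count))
			check_for_release(cs, &pathbuf);
		cpuset_up(&cpuset_sem);
		cpuset_release_agent(pathbuf);
	} else
		atomic_dec(&cs->count);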

bye, Roman

2005-09-15 17:46:07

by Paul Jackson

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Roman wrote:
> Sorry, I thought it was more obvious... :)

Likely it was obvious. On the other hand, I wouldn't be in this mess
if I understood everything I should here.

Most of what you wrote this time actually made sense to me as I read
it. Thanks for taking the time.

One part left to discuss and then I will put this aside for a few
days while I focus on my day job.


Roman, replying to pj:
> > > I don't think that works at all. Consider the following sequence:
> > > 1) ...
> > > 4) Oops - we just missed doing a release.
> >
> > If you want to do it like this, just add an else before the last
> > atomic_dec().
> > The main point I was trying to make is that clearing tsk->cpuset doesn't
> > require the spinlock, as long as you check for dead tasks in
> > attach_task(). Ignore the rest if it's too confusing.

I don't think confusion is the problem here. I think your proposal was
wrong.

There are *three* ways to get to a cpuset:
1) via some task->cpuset pointer,
2) via that cpuset's path below /dev/cpuset, and
3) walking up the parent chain from a child.

In the first half of your line:

> > if (atomic_read(&cs->count) == 1 && notify_on_release(cs)) {

if per chance the cs->count was one, then for an instant no other task
was using this cpuset, and it had no children. But you can still get
to the cpuset via its /dev/cpuset path.

So by the time we get to the second half of this line where we check
for "notify_on_release(cs)", all hell could have broken loose, and
there might be 17 tasks using this self same cpuset, and 19 child
cpusets of this cpuset. These interlopers could have arrived by
accessing the cpuset using its path below /dev/cpuset.

The flip side is just as plausible. We cannot, in any case, execute an
unguarded atomic_dec on cpuset->count, if that cpuset has been marked
notify_on_release, and if that cpuset is accessible by any of the
above three possible ways, due to the risk the decrement will put the
count to zero, and we'd miss issuing a release notifier.

Actually, even the single check that is in the code now:

if (notify_on_release(cs)) {

would be bogus if we required instruction level synchronization, but it
is racing on an attribute set by some poor schlob user, so the kernel
need only preserve ordering at the system call level, not at the machine
instruction level.

Putting that part aside, why would you make a point of stating
"that clearing tsk->cpuset doesn't require the spinlock"? I don't
take cpuset_sem when I clear tsk->cpuset, so why would you think I'd
take this new spinlock instead?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-15 19:18:20

by Paul Jackson

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Paul wrote:
> if per chance the cs->count was one, then for an instant no other task
> was using this cpuset, and it had no children.

Correction - a count of one means no other tasks using, period.

As you well know, whether there are children or not depends on
the state of the cpuset->children list.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-17 02:07:00

by Roman Zippel

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Hi,

On Thu, 15 Sep 2005, Paul Jackson wrote:

> > > if (atomic_read(&cs->count) == 1 && notify_on_release(cs)) {
>
> if per chance the cs->count was one, then for an instant no other task
> was using this cpuset, and it had no children. But you can still get
> to the cpuset via its /dev/cpuset path.
>
> So by the time we get to the second half of this line where we check
> for "notify_on_release(cs)", all hell could have broken loose, and
> there might be 17 tasks using this self same cpuset, and 19 child
> cpusets of this cpuset. These interlopers. could have arrived by
> accessing the cpuset using its path below /dev/cpuset.

Define "using", as long as the count is different from the cpuset is
active and the possible actions on it are limited.

> The flip side is just as plausible. We cannot, in any case, execute an
> unguarded atomic_dec on cpuset->count, if that cpuset has been marked
> notify_on_release, and if that cpuset is accessible by any of the
> above three possible ways, due to the risk the decrement will put the
> count to zero, and we'd miss issuing a release notifier.

No, as long as you own the cpuset (i.e. increased count) no one else can
take it away from you (without proper locking). So once you get the
locking right, the rest will fall in place. :-)

> Putting that part aside, why would you make a point of stating that
> "that clearing tsk->cpuset doesn't require the spinlock"? I don't
> take cpuset_sem when I clear tsk->cpuset, so why would you think I'd
> take this new spinlock instead?

But you take task_lock currently, which is replaced by the new spinlock.
In general _any_ access to tsk->cpuset is protected by the spinlock.

bye, Roman

2005-09-17 02:28:08

by Roman Zippel

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Hi,

On Sat, 17 Sep 2005, Roman Zippel wrote:

> Define "using", as long as the count is different from the cpuset is
> active and the possible actions on it are limited.

Oops, add a "zero," after "from" to make this an understandable sentence. :)

bye, Roman

2005-09-20 07:58:00

by Paul Jackson

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Either I'm as dense as a petrified log, or we are still talking past
each other on this topic of what locking is needed in cpuset_exit()
when clearing tsk->cpuset, checking the cs->count and checking for
notify_on_release.

You're saying I don't need the spinlock when clearing tsk->cpuset in
cpuset_exit(), and I am saying that I do need cpuset_sem when handling
the count field if notify_on_release is enabled.

You keep saying I don't need the spinlock, and showing code that
has -neither- lock around the section that checks or decrements
the count and conditionally does the notify_on_release stuff.

I keep protesting that the portion of your code that handles the
count and notify_on_release stuff is unsafe, which I believe it
is, for lack of holding cpuset_sem.

You keep pointing out that clearing tsk->cpuset doesn't need
the spinlock to be safe.

I agree that I don't need the spinlock when clearing tsk->cpuset in the
cpuset_exit code:

> > tsk->cpuset = NULL;
> > if (atomic_read(&cs->count) == 1 && notify_on_release(cs)) {

But I do need to hold cpuset_sem around the portion that deals with
count, if the cpuset is marked notify_on_release, to avoid missing a
notify_on_release due to confusions from a second task in cpuset_attach
or cpuset_exit at the same time.

If I don't hold cpuset_sem while doing that atomic_read(&cs->count),
then that atomic_read() is utterly totally bleeping useless, for
attach_task can bump the count right back up on the next memory
cycle, from some other task on another cpu. Worse than totally
useless, because I might have read a count of 2, just as the other
task executing the same cpuset_exit code at the same time also read a
count of 2. Then we both decrement the count, and abandon the cpuset,
failing to handle the notify_on_release.

Similarly, the atomic_dec(&cs->count) a few lines later, also
unguarded by cpuset_sem, is not safe, if I have a cpuset marked
notify_on_release. If I am not holding cpuset_sem, then regardless of
any checks I might have made in previous lines of code, the cs->count
could be one (1) when I get here, and the atomic_dec() could send it to
zero without my realizing it, resulting in a missed notify on release.

Unless I hold cpuset_sem, modifying cs->count is only safe if I don't
care to look at the result, as happens in fork when I blindly increment
it, or happens in exit if notify_on_release is false and I can blindly
decrement the count. Reading cs->count when not holding cpuset_sem is
never of any use for reference counting purposes, for what you read
means nothing one cycle later.
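
In other words, the only shape I trust here is the fully guarded one,
with cpuset_sem held across both the decrement and the notify check:

	char *pathbuf = NULL;

	tsk->cpuset = NULL;
	cpuset_down(&cpuset_sem);
	if (atomic_dec_and_test(&cs->count))
		check_for_release(cs, &pathbuf);
	cpuset_up(&cpuset_sem);
	cpuset_release_agent(pathbuf);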

We were talking past each other. I'm sure of it.

And I'm pretty sure I understand enough of this to code it now.

So I plan to put it aside for a few days, while I tend to more
pressing matters.

Thank you, Roman. I owe you one.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-20 12:06:14

by Robin Holt

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Paul,

Can you give a _short_ explanation of why notify_on_release is
essential? Could the intent be accomplished with something
like destroy on exit which then goes through and does the
remove of children and finally removes the cpuset?

If we can agree on that, then the exit path becomes
	if (atomic_dec_and_lock(&current->cpuset.refcount)) {
		/* Code to remove children. */
	}
which no longer needs to call a usermode helper and is _FAR_
better in my personal biased opinion.


Thanks,
Robin


PS: For reference, here is what /sbin/cpuset_release_agent
looks like:

[holt@attica:sbin] cat cpuset_release_agent
#!/bin/sh

# Do not modify this file /sbin/cpuset_release_agent.
#
# It is invoked directly from the kernel's call_usermodehelper()
# routine, whenever the last user of cpuset goes away, if that
# cpuset had its 'notify_on_release' option set to '1'. This
# cpuset_release_agent script is responsible for removing the
# abandoned cpuset, whose cpuset file system path is passed
# in argv[1].

rmdir /dev/cpuset/$1

2005-09-20 12:21:36

by Robin Holt

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Wow, too early in the morning... Aside from other typos

> If we can agree on that, then the exit path becomes
> if (atomic_dec_and_lock(&current->cpuset.refcount)) {

if (atomic_dec_and_lock(&cs->refcount, &cs->parent->child_list))

Sorry,
Robin

2005-09-20 13:10:23

by Simon Derr

Subject: Re: [PATCH] cpuset semaphore depth check optimize

On Tue, 20 Sep 2005, Robin Holt wrote:

> Paul,
>
> Can you give a _short_ explanation of why notify_on_release is
> essential? Could the intent be accomplished with something
> like destroy on exit which then goes through and does the
> remove of shildren and finally removes the cpuset?
>
> If we can agree on that, then the exit path becomes
> if (atomic_dec_and_lock(&current->cpuset.refcount)) {
> /* Code to remove children. */
> }
> which no longer needs to call a usermode helper and is _FAR_
> better in my personal biased opinion.
>
>

The original cpuset patch did not have a notify_on_release feature; it
had an 'autoclean' feature that did exactly what you describe.

But the locking was not clean, mostly at the filesystem level. Paul and I
tried together to fix it, but at some point Paul proposed this solution,
to use a usermode helper, and it seemed to be the easiest way to get out
of this locking issue, so we agreed to drop the old 'autoclean' scheme.

(Even if Paul had a hard time convincing me, as I did not like the idea of
this usermode helper too much).


Simon.

2005-09-20 14:23:28

by Paul Jackson

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Robin asked:
> Can you give a _short_ explanation of why notify_on_release is
> essential?

short - me - hah !

Ok - I'll try ;).

It's not the children, it's the parents.

Only leaf nodes in the /dev/cpuset tree can get removed,
when they have no children and no tasks. This might
trigger walking up the parent chain, if in turn the parent
now has no children and no tasks.

Removing one or more vfs file system directories, working
from the bottom up, while in the task exit code, would have
required far more vfs locking voodoo than I can even imagine
possible.

With the usermodehelper, running rmdir(2) from a separate
thread asynchronously from userland, the only way these
directories ever get removed is via an ordinary rmdir(2)
system call, one empty directory at a time. I trust that
the vfs file system can handle that locking; I don't even
need to know how it does it.

Let's say we have the following notify_on_release cpusets,
each with exactly one task in them:

/dev/cpuset/A task1
/dev/cpuset/A/B task2
/dev/cpuset/A/B/C task3

Let's further say that tasks 1, 2 and 3 exit, in that
order, task1 first.

When task1 exits, no cpusets are removed.

When task2 exits, no cpusets are removed.

When task3 exits, cpuset A/B/C needs to be removed. That in
turn triggers removing A/B. That in turn triggers removing A.

I was confident that I could do these three removals safely
by doing three rmdir(2) system calls, one at a time, in order,
working from the bottom up.

I had no clue how to do these three removals safely, working
from the bottom up, while in the kernel exit code for task3.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-20 14:55:16

by Robin Holt

Subject: Re: [PATCH] cpuset semaphore depth check optimize

On Tue, Sep 20, 2005 at 07:22:55AM -0700, Paul Jackson wrote:
> Robin asked:
> Let's say we have the following notify_on_release cpusets,
> each with exactly one task in them:
>
> /dev/cpuset/A task1
> /dev/cpuset/A/B task2
> /dev/cpuset/A/B/C task3
>
> Let's further say that tasks 1, 2 and 3 exit, in that
> order, task1 first.
>
> When task1 exits, no cpusets are removed.
>
> When task2 exits, no cpusets are removed.
>
> When task3 exits, cpuset A/B/C needs to be removed. That in
> turn triggers removing A/B. That in turn triggers removing A.
>
> I was confident that I could do these three removals safely
> by doing three rmdir(2) system calls, one at a time, in order,
> working from the bottom up.
>
> I had no clue how to do these three removals safely, working
> from the bottom up, while in the kernel exit code for task3.

This makes things even easier!!!

When you create a cpuset, set the refcount to 0. The root
cpuset is the exception and has a refcount of 1.

When tasks are added to the cpuset, increment the refcount.

When child cpusets are created, increment the refcount. Each
cpuset has a list of children that is protected by a single
lock.

Whenever you are decrementing the cpuset's refcount, use
atomic_dec_and_lock on the parent's child list lock. If the
notify_on_release property is set, you remove the child from
the list.
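
Concretely, something like this (the field names are invented for the
sketch):

	if (atomic_dec_and_lock(&cs->refcount, &cs->parent->children_lock)) {
		/* count hit zero; we now hold the parent's child list lock */
		if (notify_on_release(cs))
			list_del(&cs->sibling);
		spin_unlock(&cs->parent->children_lock);
	}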

When the vfs code is traversing the list, you need to ensure
that it does not iterate unless the child list lock is held.
I have not looked at how you implemented the vfs stuff, but
that should be easily accomplished.

Where are the holes?

Thanks,
Robin

2005-09-20 15:08:25

by Linus Torvalds

Subject: Re: [PATCH] cpuset semaphore depth check optimize



On Tue, 20 Sep 2005, Robin Holt wrote:
>
> This makes things even easier!!!
>
> When you create a cpuset, set the refcount to 0. The root
> cpuset is the exception and has a refcount of 1.
>
> When tasks are added to the cpuset, increment the refcount.
>
> When child cpusets are created, increment the refcount. Each
> cpuset has a list of children that is protected by a single
> lock.

You just described the "dentry" reference counting. Which has the same
issues: as long as a dentry has any children, it needs to remain.

Except dentries are a lot more complex, because we want to keep cached
versions of them around even when the count goes to zero (and zero only
means that we're _allowed_ to remove it under memory pressure). And
dentries can move from one parent to another.

Linus

2005-09-20 15:14:41

by Paul Jackson

Subject: Re: [PATCH] cpuset semaphore depth check optimize

Robin wrote:
> This makes things even easier!!!

Easy is good.


> When the vfs code is traversing the list, you need to ensure
> that it does not iterate unless the child list lock is held.
> I have not looked at how you implemented the vfs stuff, but
> that should be easily accomplished.
>
> Where are the holes?

The hole is my understanding of vfs.

My gut instinct is to recoil in horror from this suggestion, because it
seems to involve walking the cpuset tree from bottom up, setting locks,
while some other task might be doing a normal file system open on one
of these cpusets, walking the tree top-down, setting locks.

That smells like deadlock city to me.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-20 15:17:41

by Simon Derr

Subject: Re: [PATCH] cpuset semaphore depth check optimize

On Tue, 20 Sep 2005, Robin Holt wrote:

> This makes things even easier!!!
>
> When you create a cpuset, set the refcount to 0. The root
> cpuset is the exception and has a refcount of 1.
>
> When tasks are added to the cpuset, increment the refcount.
>
> When child cpusets are created, increment the refcount. Each
> cpuset has a list of children that is protected by a single
> lock.
>
> Whenever you are decrementing the cpuset's refcount, use
> atomic_dec_and_lock on the parent's child list lock. If the
> notify_on_release property is set, you remove the child from
> the list.
>
> When the vfs code is traversing the list, you need to ensure
> that it does not iterate unless the child list lock is held.
> I have not looked at how you implemented the vfs stuff, but
> that should be easily accomplished.

IIRC, that's the key.
There was never a real issue with notify_on_release in the internal
locking of the cpusets. But with the VFS...

The problem lies in how the VFS takes the semaphores on the inodes
when doing a rmdir(). It locks the parent's inode, and then the child's
inode. And because of the cascading cpuset removal with the previous
'autoclean' feature, the cpuset code tried to do the same thing, but in
the reverse order. There was once a version of the code that seemed to
work, but with inode semaphores released and re-taken in the right order.
Ugly, and unsafe.
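
Schematically, the two conflicting orders were (i_sem being the inode
semaphore of the day; this is only an illustration):

	/* VFS rmdir(): parent first, then child */
	down(&parent->d_inode->i_sem);
	down(&child->d_inode->i_sem);

	/* cascading 'autoclean' removal: child first, then parent --
	 * the reverse order, hence the trouble */
	down(&child->d_inode->i_sem);
	down(&parent->d_inode->i_sem);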

(All this from memory, I don't guarantee its accuracy).

Simon.