2005-05-26 08:26:14

by Paul Jackson

[permalink] [raw]
Subject: [PATCH 2.6.12-rc4] cpuset exit NULL dereference fix

Andrew,

This one fixes a problem that actual cpuset users are seeing
crash their system. I hope it will be pushed along fairly soon.

There is a race in the kernel cpuset code, between the code
to handle notify_on_release, and the code to remove a cpuset.
The notify_on_release code can end up trying to access a
cpuset that has been removed. In the most common case, this
causes a NULL pointer dereference from the routine cpuset_path.
However all manner of bad things are possible, in theory at least.

The existing code decrements the cpuset use count, and if the
count goes to zero, processes the notify_on_release request,
if appropriate. However, once the count goes to zero, unless we
are holding the global cpuset_sem semaphore, there is nothing to
stop another task from immediately removing the cpuset entirely,
and recycling its memory.

The obvious fix would be to always hold the cpuset_sem
semaphore while decrementing the use count and dealing with
notify_on_release. However we don't want to force a global
semaphore into the mainline task exit path, as that might create
a scaling problem.

The actual fix is almost as easy - since this is only an issue
for cpusets using notify_on_release, which the top level big
cpusets don't normally need to use, only take the cpuset_sem
for cpusets using notify_on_release.

This code has been run for hours without a hiccup, while running
a cpuset create/destroy stress test that could crash the existing
kernel in seconds. This patch applies to the current -linus
git kernel.

Signed-off-by: Paul Jackson <[email protected]>

Index: 2.6-cpuset_path_fix/kernel/cpuset.c
===================================================================
--- 2.6-cpuset_path_fix.orig/kernel/cpuset.c 2005-05-25 19:20:27.000000000 -0700
+++ 2.6-cpuset_path_fix/kernel/cpuset.c 2005-05-26 00:50:32.000000000 -0700
@@ -166,9 +166,8 @@ static struct super_block *cpuset_sb = N
* The hooks from fork and exit, cpuset_fork() and cpuset_exit(), don't
* (usually) grab cpuset_sem. These are the two most performance
* critical pieces of code here. The exception occurs on exit(),
- * if the last task using a cpuset exits, and the cpuset was marked
- * notify_on_release. In that case, the cpuset_sem is taken, the
- * path to the released cpuset calculated, and a usermode call made
+ * when a task in a notify_on_release cpuset exits. Then cpuset_sem
+ * is taken, and if the cpuset count is zero, a usermode call made
* to /sbin/cpuset_release_agent with the name of the cpuset (path
* relative to the root of cpuset file system) as the argument.
*
@@ -1404,6 +1403,18 @@ void cpuset_fork(struct task_struct *tsk
*
* Description: Detach cpuset from @tsk and release it.
*
+ * Note that cpusets marked notify_on_release force every task
+ * in them to take the global cpuset_sem semaphore when exiting.
+ * This could impact scaling on very large systems. Be reluctant
+ * to use notify_on_release cpusets where very high task exit
+ * scaling is required on large systems.
+ *
+ * Don't even think about derefencing 'cs' after the cpuset use
+ * count goes to zero, except inside a critical section guarded
+ * by the cpuset_sem semaphore. If you don't hold cpuset_sem,
+ * then a zero cpuset use count is a license to any other task to
+ * nuke the cpuset immediately.
+ *
**/

void cpuset_exit(struct task_struct *tsk)
@@ -1415,10 +1426,13 @@ void cpuset_exit(struct task_struct *tsk
tsk->cpuset = NULL;
task_unlock(tsk);

- if (atomic_dec_and_test(&cs->count)) {
+ if (notify_on_release(cs)) {
down(&cpuset_sem);
- check_for_release(cs);
+ if (atomic_dec_and_test(&cs->count))
+ check_for_release(cs);
up(&cpuset_sem);
+ } else {
+ atomic_dec(&cs->count);
}
}


--
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373


2005-05-26 09:01:13

by Simon Derr

[permalink] [raw]
Subject: Re: [PATCH 2.6.12-rc4] cpuset exit NULL dereference fix

On Thu, 26 May 2005, Paul Jackson wrote:

> The existing code decrements the cpuset use count, and if the
> count goes to zero, processes the notify_on_release request,
> if appropriate. However, once the count goes to zero, unless we
> are holding the global cpuset_sem semaphore, there is nothing to
> stop another task from immediately removing the cpuset entirely,
> and recycling its memory.
Good catch.

>
> The obvious fix would be to always hold the cpuset_sem
> semaphore while decrementing the use count and dealing with
> notify_on_release. However we don't want to force a global
> semaphore into the mainline task exit path, as that might create
> a scaling problem.
>
> The actual fix is almost as easy - since this is only an issue
> for cpusets using notify_on_release, which the top level big
> cpusets don't normally need to use, only take the cpuset_sem
> for cpusets using notify_on_release.

I'm a bit concerned about this. Since there might well be
'notify_on_release' cpusets all over the system, and that there is only
one cpuset_sem semaphore, I feel like this 'scaling problem' still exists
even with:

if (notify_on_release(cs)) {
down(&cpuset_sem);
...

Maybe adding more per-cpuset data such as a per-cpuset removal_sem might
be worth it ?

Simon.




2005-05-26 12:09:34

by Robin Holt

[permalink] [raw]
Subject: Re: [PATCH 2.6.12-rc4] cpuset exit NULL dereference fix

On Thu, May 26, 2005 at 11:00:10AM +0200, Simon Derr wrote:
>
> I'm a bit concerned about this. Since there might well be
> 'notify_on_release' cpusets all over the system, and that there is only
> one cpuset_sem semaphore, I feel like this 'scaling problem' still exists
> even with:
>
> if (notify_on_release(cs)) {
> down(&cpuset_sem);
> ...
>
> Maybe adding more per-cpuset data such as a per-cpuset removal_sem might
> be worth it ?

Why not change the atomic into a lock and a refcount. Grab the lock before
each increment/decrement of the refcount and only continue with the removal
code when the refcount reaches 0. For a normal cpuset, the refcount could
be biased to 1. Then child cpusets are created, they could increment their
parent cpuset's refcount. When the notify_on_release flag used to be set,
we decrement the refcount by one. Whenever the refcount reaches 0, we
automatically remove the cpuset. Seems really clear, but would require
touching a larger chunk of the code.

Robin

2005-05-26 13:46:31

by Dinakar Guniguntala

[permalink] [raw]
Subject: Re: [PATCH 2.6.12-rc4] cpuset exit NULL dereference fix

On Thu, May 26, 2005 at 11:00:10AM +0200, Simon Derr wrote:
> > The obvious fix would be to always hold the cpuset_sem
> > semaphore while decrementing the use count and dealing with
> > notify_on_release. However we don't want to force a global
> > semaphore into the mainline task exit path, as that might create
> > a scaling problem.
> >
> > The actual fix is almost as easy - since this is only an issue
> > for cpusets using notify_on_release, which the top level big
> > cpusets don't normally need to use, only take the cpuset_sem
> > for cpusets using notify_on_release.
>
> I'm a bit concerned about this. Since there might well be
> 'notify_on_release' cpusets all over the system, and that there is only
> one cpuset_sem semaphore, I feel like this 'scaling problem' still exists
> even with:
>
> if (notify_on_release(cs)) {
> down(&cpuset_sem);
> ...
>
> Maybe adding more per-cpuset data such as a per-cpuset removal_sem might
> be worth it ?

This would have to be in addition to the existing cpuset_sem wont it ??
Not sure if we need to add any more complexity unless the scaling problem
is really huge.

However if we do end up making any changes then IMO locking can be made
more granular, instead of one sem for cpus, memory and all operations
on cpusets

-Dinakar

2005-05-26 15:43:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2.6.12-rc4] cpuset exit NULL dereference fix



On Thu, 26 May 2005, Robin Holt wrote:
>
> Why not change the atomic into a lock and a refcount. Grab the lock before
> each increment/decrement of the refcount and only continue with the removal
> code when the refcount reaches 0.

For this, there is a nice

atomic_dec_and_lock()

function that is the same as "atomic_dec_and_test()", except it grabs the
lock if the count decrements to zero.

I'm surprised people haven't picked up on it - it's been around for a
while, the VFS layer uses it for some quite fundamental data structures
(inode and dcache refcounts), and it's even documented in "atomic_ops.txt"

And it's _designed_ for refcountign things like this.

Basic rule of kernel programming: if a globally visible object does not
have a refcount, IT IS A BUG.

Linus

2005-05-26 18:46:46

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 2.6.12-rc4] cpuset exit NULL dereference fix

Linus wrote:
> For this, there is a nice
>
> atomic_dec_and_lock()
>
> function that is the same as "atomic_dec_and_test()", except it grabs the
> lock if the count decrements to zero.

If we take Robin's idea to add a spinlock per cpuset, then yes this
atomic_dec_and_lock() might be useful. Thanks for mentioning it.

I am not convinced we need to go that way. See my upcoming replies to
Simon and Robin, for alternative approaches that are simpler.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-26 20:27:41

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 2.6.12-rc4] cpuset exit NULL dereference fix

Simon wrote:
> I feel like this 'scaling problem' still exists even with:

Can you back this feeling with numbers, with real measurements?

I'll bet it's not actually a problem for any size system or load you
care about.

I have not yet actually seen any measurements that taking a quick
semaphore in the exit path is a scaling problem.

Until I see such a problem, I am hoping to only take the most simple and
basic precautions to avoid it.


> Maybe adding more per-cpuset data such as a per-cpuset removal_sem

Robin suggests more possibilities along this same line, in his reply.

I don't see where any further locks are needed. And for some of my
customers configurations, running with the majority of the system in a
single cpuset, per-cpuset locks aren't a big gain over a system wide
semaphore anyway.

We have two issues here:

1) After the exiting task releases its cpuset (decrements the cpuset
reference count), we may need the cpusets 'notify_on_release' flag
and its path in the cpuset filesystem, if the count went to zero,
in order to perform notify_on_release processing.

For the benefit of those who haven't memorized the kernel/cpuset.c
code, notify_on_release processing means calling a usermodehelper
to invoke the /sbin/cpuset_release_agent command, with the path
of the just released cpuset as a command line argument. This /sbin
script usually just does "rmdir /dev/cpuset/$1", removing the abandoned
cpuset.

2) We have two distinct ways of marking a cpuset 'in use'. The tasks
that point to a cpuset increment its reference 'count'. The children
cpusets of that cpuset put themselves on the 'children' list of that
cpuset.

The problem is that we cannot atomically examine both ways at the
same time, using the mechanisms in the code now, except by holding
the global cpuset_sem semaphore.

So far as we know, both of these two issues are only causing us grief in
the exit code (cpuset_exit). Everywhere else that it might matter, we
are far enough off the beaten code path that grabbing the global
cpuset_sem is no problem.

Perhaps we could solve issue (A) by having the cpuset_exit code bundle
up in local variables whatever information it might need from the
cpuset, before decrementing the count, for possible notify_on_release
processing. Then if the atomic_dec_and_test hits the zero count,
the cpuset_exit code will be able to process the notify_on_release
requests, without further referencing the cpuset.

Perhaps we can solve issue (B) by having child cpusets _also_ bump the
atomic reference count. Then when the count hits zero, it will mean
both that there are no more referencing tasks and that there are no
child cpusets. With this, we can atomically identify when a cpuset is
released, without holding the global cpuset_sem semaphore.


So ... I'm inclined to think that there is really no scaling problem
with this patch as proposed. But if there really is a scaling problem,
I'd next persue solution steps (A) and (B) above, before attempting
per-cpuset locks.


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-26 20:52:22

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 2.6.12-rc4] cpuset exit NULL dereference fix

Robin wrote:
> Whenever the refcount reaches 0, we
> automatically remove the cpuset.

If by this you mean replacing the usermodehelper callout to
/sbin/cpuset_release_agent, with a direct removal of the cpuset
in the kernel, then:
* this changes a kernel API - not to be done lightly, and
* adding an in-kernel way to nuke a cpuset, in addition to
having the rmdir(2) system call do it, seems like it would
present even thornier locking issues.

The usermodehelper callout is working just fine for us.

As to the rest of your post, proposing a per-cpuset spinlock,
see my previous reply to Simon.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-26 23:40:37

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 2.6.12-rc4] cpuset exit NULL dereference fix

Would it make sense, Simon, to recommend to Andrew that
he take the simple patch I submitted yesterday for this
now, since it avoids a kernel crash and the risk of other
uglies?

Then, when we understand that improved scalability is needed,
and we have agreed on a solution to that, offer up a second
patch?

I don't mind doing fancier solutions if we need them, but
unless you sense we can obtain rapid agreement on something,
I don't think it's a good idea to avoid the simple fix, while
we are evaluating more complex solutions.

I for one do not have a sense of rapid agreement ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-27 00:24:35

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 2.6.12-rc4] cpuset exit NULL dereference fix

On rereading my post above, I realize that it is confusingly presented.

Let me try again ...

Issue (1) is that we might need parts of the cpuset to process
notify_on_release after we have released the cpuset by decrementing its
reference count to zero. Unless we hold the global cpuset_sem
semaphore, if we try to access that released cpuset to get what we need,
we can crash the kernel. This is the issue that prompted this patch.

Issue (2) is that we have two ways of tracking users of a cpuset, with
both a reference count of tasks linked to the cpuset, and also a linked
list of child cpusets. Unless we hold the global cpuset_sem semaphore,
there is no atomicly safe way to answer the question "is this cpuset
free now?"

The solution to Issue (1) is to make local variable copies of the
information we need from the cpuset, before we let go of it (before we
decrement its reference count).

The solution to Issue (2) is to have child cpusets also manipulate the
cpuset reference count, so that the cpuset reference count provides an
atomic way to detect all uses of a cpuset.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-27 08:07:54

by Simon Derr

[permalink] [raw]
Subject: Re: [PATCH 2.6.12-rc4] cpuset exit NULL dereference fix

On Thu, 26 May 2005, Paul Jackson wrote:

> Would it make sense, Simon, to recommend to Andrew that
> he take the simple patch I submitted yesterday for this
> now, since it avoids a kernel crash and the risk of other
> uglies?
>
> Then, when we understand that improved scalability is needed,
> and we have agreed on a solution to that, offer up a second
> patch?
>
Of course !



> > I feel like this 'scaling problem' still exists even with:
> Can you back this feeling with numbers, with real measurements?

I don't.
My point is only that if you think there is a scaling problem in
taking cpuset_sem for each call to cpuset_exit(), that scaling problem
won't disappear by taking cpuset_sem only for 'notify_on_remove' cpusets,
as such cpusets might exist over the whole system.

Simon.

2005-05-27 08:45:13

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 2.6.12-rc4] cpuset exit NULL dereference fix

Simon wrote:
> > Would it make sense, Simon, to recommend to Andrew that
> > he take the simple patch I submitted yesterday ...
> >
> > Then, when we understand ... offer up a second patch?
>
> Of course !

Ok - I'll resubmit that patch. Hopefully you can reply to that
resubmitted patch with an "Acked-by: ..."

Andrew withdrew the original patch, when it became a matter needing
further discussion.

> My point is only that if you think there is a scaling problem in
> taking cpuset_sem for each call to cpuset_exit(), that scaling problem
> won't disappear by taking cpuset_sem only for 'notify_on_remove' cpusets,

Yes - that is a good and valid point.

I also lack any real evidence of a scaling problem. It's just a
theoretical concern. My unreliable weather forecast is that it will be
a while before it's a serious concern.

This means I am willing to take simple measures to minimize the concern,
but I'd prefer to await hard evidence of the problem before trying more
ambitious measures.

==

My impression is that cpusets has two classes of users:
1) Extreme HPC apps, scaling to hundreds or thousands of CPUs, and
2) More mixed or service oriented apps, with less extreme scaling.

I also suspect that it is the second class that most requires
notify_on_release. The extreme HPC guys have less need for
notify_on_release.

So I find a solution that lets one trade off extreme scaling versus
heavier use of notify_on_release to be appealing, if it can be done
trivially, as this patch does.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401