A program that repeatedly forks and waits is susceptible to having the
same pid repeated, especially when it competes with another instance of the
same program. This is really bad for bash implementation. Furthermore, many shell
scripts assume that pid numbers will not be used for some length of time.
Thanks to Ted Tso for the key ideas of this implementation.
Signed-off-by: Salman Qazi <[email protected]>
---
kernel/pid.c | 11 ++++++++++-
1 files changed, 10 insertions(+), 1 deletions(-)
diff --git a/kernel/pid.c b/kernel/pid.c
index e9fd8c1..8cedeab 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -153,8 +153,17 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
if (likely(atomic_read(&map->nr_free))) {
do {
if (!test_and_set_bit(offset, map->page)) {
+ int prev;
atomic_dec(&map->nr_free);
- pid_ns->last_pid = pid;
+
+ do {
+ prev = last;
+ last = cmpxchg(&pid_ns->last_pid,
+ prev, pid);
+ if (last >= pid)
+ break;
+ } while (prev != last);
+
return pid;
}
offset = find_next_offset(map, offset);
Salman <[email protected]> writes:
> +++ b/kernel/pid.c
> @@ -153,8 +153,17 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
> if (likely(atomic_read(&map->nr_free))) {
> do {
> if (!test_and_set_bit(offset, map->page)) {
> + int prev;
> atomic_dec(&map->nr_free);
> - pid_ns->last_pid = pid;
> +
> + do {
> + prev = last;
> + last = cmpxchg(&pid_ns->last_pid,
> + prev,
> + pid);
At some point not all architectures in Linux supported cmpxchg,
so it was not allowed to use it unconditionally in portable code.
This might have changed now (at least the UP only architectures fall
back to a generic cmpxchg now I think), but I'm not sure you have full
coverage on SMP.
-Andi
--
[email protected] -- Speaking for myself only.
* Salman <[email protected]> wrote:
> A program that repeatedly forks and waits is susceptible to having the
> same pid repeated, especially when it competes with another instance of the
> same program. This is really bad for bash implementation. Furthermore, many shell
> scripts assume that pid numbers will not be used for some length of time.
>
> Thanks to Ted Tso for the key ideas of this implementation.
>
> Signed-off-by: Salman Qazi <[email protected]>
> ---
> kernel/pid.c | 11 ++++++++++-
> 1 files changed, 10 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index e9fd8c1..8cedeab 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -153,8 +153,17 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
> if (likely(atomic_read(&map->nr_free))) {
> do {
> if (!test_and_set_bit(offset, map->page)) {
> + int prev;
> atomic_dec(&map->nr_free);
> - pid_ns->last_pid = pid;
> +
> + do {
> + prev = last;
> + last = cmpxchg(&pid_ns->last_pid,
> + prev, pid);
> + if (last >= pid)
> + break;
> + } while (prev != last);
> +
> return pid;
> }
> offset = find_next_offset(map, offset);
Looks rather interesting. (Cleanliness-wise i'd suggest to split out the while
loop into a helper function.)
Ingo
On Tue, Jun 08, 2010 at 11:24:38PM -0700, Salman wrote:
> A program that repeatedly forks and waits is susceptible to having the
> same pid repeated, especially when it competes with another instance of the
> same program. This is really bad for bash implementation. Furthermore, many shell
> scripts assume that pid numbers will not be used for some length of time.
>
> Thanks to Ted Tso for the key ideas of this implementation.
>
> Signed-off-by: Salman Qazi <[email protected]>
> ---
> kernel/pid.c | 11 ++++++++++-
> 1 files changed, 10 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index e9fd8c1..8cedeab 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -153,8 +153,17 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
> if (likely(atomic_read(&map->nr_free))) {
> do {
> if (!test_and_set_bit(offset, map->page)) {
> + int prev;
> atomic_dec(&map->nr_free);
> - pid_ns->last_pid = pid;
> +
> + do {
> + prev = last;
> + last = cmpxchg(&pid_ns->last_pid,
> + prev, pid);
> + if (last >= pid)
> + break;
You should make sure to handle pid wrap-around for this last/pid comparison.
I think proper way to do that would be:
/* last is the pid we started scanning at
* last_read is the last observed value of pid_ns->last_pid
*/
last_read = last;
do {
prev = last_read;
last_read = cmpxchg(&pid_ns->last_pid, prev, pid);
/* Exit if one of these conditions is true:
* - cmpxchg succeeded
* - last <= pid <= last_read (other thread already bumped last_pid)
* - last_read <= last <= pid (same with wraparound)
* - pid <= last_read <= last (same with different wraparound)
*/
} while (last_read != prev &&
(last > pid || pid > last_read) &&
(last_read > last || last > pid) &&
(pid > last_read || last_read > last));
The last_read == pid case is also interesting - it means another thread found
the same pid, forked a child with that pid, and the child exited already
(since the bitmap was cleared). However we don't need to handle that case -
first, that race is much less likely to happen, and second, the duplicate
pid would be returned in two separate tasks - so this would not cause problems
in bash as in your example.
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
On Tue, Jun 08, 2010 at 11:24:38PM -0700, Salman wrote:
> A program that repeatedly forks and waits is susceptible to having
> the same pid repeated, especially when it competes with another
> instance of the same program. This is really bad for bash
> implementation. Furthermore, many shell scripts assume that pid
> numbers will not be used for some length of time.
>
> Thanks to Ted Tso for the key ideas of this implementation.
>
> Signed-off-by: Salman Qazi <[email protected]>
Here's a slightly more succint way of expressing it. Others will have
decide if it's easier to understand. (It is for me, but I wrote it. :-P)
- Ted
kernel/pid.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/kernel/pid.c b/kernel/pid.c
index e9fd8c1..c51f413 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -154,8 +154,13 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
do {
if (!test_and_set_bit(offset, map->page)) {
atomic_dec(&map->nr_free);
- pid_ns->last_pid = pid;
- return pid;
+ while (1) {
+ i = cmpxchg(&pid_ns->last_pid,
+ last, pid);
+ if (i == last || i >= pid)
+ return pid;
+ last = i;
+ }
}
offset = find_next_offset(map, offset);
pid = mk_pid(pid_ns, map, offset);
On Wed, Jun 09, 2010 at 04:49:03AM -0700, Michel Lespinasse wrote:
> You should make sure to handle pid wrap-around for this last/pid comparison.
> I think proper way to do that would be:
Good point! I forgot about checking for pid wrap-around.
> The last_read == pid case is also interesting - it means another
> thread found the same pid, forked a child with that pid, and the
> child exited already (since the bitmap was cleared). However we
> don't need to handle that case - first, that race is much less
> likely to happen, and second, the duplicate pid would be returned in
> two separate tasks - so this would not cause problems in bash as in
> your example.
In order for that to happen, all of this would have to happen between
the time that last_pid was initially sampled at the very beginning of
alloc_pidmap(). Could that happen? I think it could, if kzalloc()
took a very long time to run, and the scheduler was _very_ unfair. We
could try to guard against that by checking to see if last_pid has
changed after allocating map->page (in the unlikely case of !map->page
being NULL in the first place) and if so, restarting alloc_pidmap() by
jumping back to the beginning of the function.
Could that happen with bash? I'm not as confident as you that it
could never happen. The fact that we saw the race in the first place
in bash means that it could still happen in this case, I think. In
any case, if we have two processes getting the same pid in the absence
of a fork, that would be a bad thing and could lead to all sorts of
confusion, so it's probably a good thing to guard against even if it
is a rare case.
BTW, Salman, I haven't seen it, but I vaguely remember in the mail
exchange with the person who reported this that we have a C/C++
reproduction case? If it's small enough, it might be a good idea to
include it in the patch description.
- Ted
On Wed, 9 Jun 2010, Ingo Molnar wrote:
>
> * Salman <[email protected]> wrote:
> >
> > A program that repeatedly forks and waits is susceptible to having the
> > same pid repeated, especially when it competes with another instance of the
> > same program. This is really bad for bash implementation. Furthermore, many shell
> > scripts assume that pid numbers will not be used for some length of time.
> >
> > Thanks to Ted Tso for the key ideas of this implementation.
>
> Looks rather interesting. (Cleanliness-wise i'd suggest to split out the while
> loop into a helper function.)
I have to say that usually I can look at a patch and see what it does.
This time I had absolutely no clue.
There was a whole big context missing: that original load of "last_pid"
into "last" at the top of the function, far outside the patch. And in
this case I don't think splitting out the trivial cmpxchg() loop would
help that problem - that would just make the "load original" part of the
big picture be even further away from the final "replace if same" part,
and I think _that_ is a much more critical part of the subtleties there.
So I had to read the patch _and_ go read the code it patched, in order to
at all understand what it did. I think the patch explanation should have
done it, and I also think this would need a bit comment at the top.
[ In fact, I'd argue that the _old_ code would have needed a big comment
at the top about last_pid usage, but i somebody had done that, they'd
probably also have seen the race while explaning how the code worked ;]
So having looked at the patch and the code, I agree with the patch, but
I'd like some more explanation to go with it.
[ Or Ted's version: as mentioned, I don't think the complexity is actually
in the final cmpxchg loop itself, but in the bigger picture, so I don't
think the differences between Ted's and Salman's versions are that big ]
Linus
On Wed, Jun 09, 2010 at 08:39:00AM -0700, Linus Torvalds wrote:
>
> So I had to read the patch _and_ go read the code it patched, in order to
> at all understand what it did. I think the patch explanation should have
> done it, and I also think this would need a bit comment at the top.
>
> [ In fact, I'd argue that the _old_ code would have needed a big comment
> at the top about last_pid usage, but i somebody had done that, they'd
> probably also have seen the race while explaning how the code worked ;]
>
Salman had created a very nice ASCII art diagram of the race in the
mail thread with the internal bug reporter who noticed the problem.
We could include that, if you don't mind the commit description
growing by 30-40 lines. :-) I agree though that better documentation
n the source code about _how_ alloc_pidmap was supposed to avoid all
possible races would have probably been a good idea.
> [ Or Ted's version: as mentioned, I don't think the complexity is actually
> in the final cmpxchg loop itself, but in the bigger picture, so I don't
> think the differences between Ted's and Salman's versions are that big ]
Yah, I had been staring at the code for a while, so I had the feeling
that my intuition of which patch would be clearer was probably biased.
We do need to deal with pid wrap possibility just to be completely
correct, although the chance of hitting _that_ are pretty remote.
- Ted
On Wed, 9 Jun 2010, [email protected] wrote:
>
> Salman had created a very nice ASCII art diagram of the race in the
> mail thread with the internal bug reporter who noticed the problem.
> We could include that, if you don't mind the commit description
> growing by 30-40 lines. :-)
I don't horribly mind, but I also don't think it's necessarily all that
useful to go that far. For the commit message, there are two real reasons:
- explaining the patch itself upstream to make people like me understand
_why_ it needs to be applied.
But then a denser explanation than a 30-40 line ASCII art diagram would
probably suffice.
- explaning the code after-the-fact to people who end up seeing it in
log/blame output
And then we don't care so much about the old bug per se, as about how
the code is supposed to work - so a lot of verbiage about what used to
happen is much less important than describing just what's going on with
the cmpxchg (where the "loop" is all the way from the original value
load, especially in this case where we don't have to loop all the way
back).
In fact, conceptually the cmpxchg loop is pretty much the whole function,
it's just that if we know that any racing elements will be idempotent wrt
anything but the final assignment of 'last_pid'. So we can end up just
looping over the final part. I think _that_ is the clever part, and I
think that is the part that needs explanation.
(And that is also why the diff itself doesn't include the "early part of
the loop", and why I couldn't understand it purely as a patch - because it
in this case the patch is "artificially small" because of how we don't
need to loop all the way up)
> We do need to deal with pid wrap possibility just to be completely
> correct, although the chance of hitting _that_ are pretty remote.
That seems to be purely an artifact of the cleverness of avoiding re-doing
the whole loop, no? If we _had_ re-done the whole loop (the "normal"
cmpxchg model), we would have re-started the whole pid search and handled
the overflow in the existing overflow handling code.
Linus
On Wed, Jun 09, 2010 at 09:06:37AM -0700, Linus Torvalds wrote:
>
> That seems to be purely an artifact of the cleverness of avoiding re-doing
> the whole loop, no? If we _had_ re-done the whole loop (the "normal"
> cmpxchg model), we would have re-started the whole pid search and handled
> the overflow in the existing overflow handling code.
This brings up a question. If we're going to use a cmpxchg() loop, is
there any point to doing the test-and-set game with the bitmap? It
should be just as fast to just use cmpxchg as it is to do the
test-and-set, and isn't it more likely that various CPU architectures
will have the cmpxchg than test-and-set-bit?
We might be able to radically simplify alloc_pidmap(). Would that
cause a huge problem on some non-Intel architectures, though?
- Ted
On Wed, 9 Jun 2010, [email protected] wrote:
>
> This brings up a question. If we're going to use a cmpxchg() loop, is
> there any point to doing the test-and-set game with the bitmap?
I think there very much is.
Otherwise you have three threads, two of which pick the same pid (because
the test-and-set isn't atomic), and a third of which picks a new one. The
cmpxchg succeeds (the third one wins, and everybody picks that winner),
but you only expanded the map by two entries, and you're going to return
the same pid nr to two people.
So the cmpxchg only protects "last_pid". It does _not_ in any way protect
the pid we're actually going to return.
Or am I missing something?
Linus
On Wed, 9 Jun 2010, Linus Torvalds wrote:
>
> Otherwise you have three threads, two of which pick the same pid (because
> the test-and-set isn't atomic), and a third of which picks a new one.
In fact, I don't think you need three threads at all. It's perfectly ok to
just have two threads, and they'd both end up picking the same 'pid'
without the atomicity guarantees of that 'test_and_set()' bitmap access.
And they'd both be perfectly fine setting last_pid to that (shared) pid if
I read that cmpxchg loop right. No?
Linus
On Wed, Jun 09, 2010 at 10:25:50AM -0700, Linus Torvalds wrote:
>
>
> On Wed, 9 Jun 2010, Linus Torvalds wrote:
> >
> > Otherwise you have three threads, two of which pick the same pid (because
> > the test-and-set isn't atomic), and a third of which picks a new one.
>
> In fact, I don't think you need three threads at all. It's perfectly ok to
> just have two threads, and they'd both end up picking the same 'pid'
> without the atomicity guarantees of that 'test_and_set()' bitmap access.
>
> And they'd both be perfectly fine setting last_pid to that (shared) pid if
> I read that cmpxchg loop right. No?
Well, I was thinking about something like this:
while (1) {
last = pid_ns->last_pid;
pid = last + 1;
if (pid >= pid_max)
pid = RESERVED_PIDS;
if (cmpxchg(&pid_ns->last_pid, last, pid) == last)
return pid;
}
Which I don't think is racy, unless I'm missing something. Both might
end up picking the same pid, but only one will successfully set
last_pid, and the other will just loop and try again.
There appears to be some interesting uses of the bitmap by
find_ge_pid() and next_pidmap() that I haven't completely grokked yet,
especially as to why they're needed, though. Assuming they are
needed, we might end up needing the bitmap after all, though.
- Ted
On Wed, 9 Jun 2010, [email protected] wrote:
>
> Well, I was thinking about something like this:
No, this is wrong for the case where you end up having to allocate a new
pidmap and/or overflow max_pid.
It also doesn't set the bit at all.
> There appears to be some interesting uses of the bitmap by
> find_ge_pid() and next_pidmap() that I haven't completely grokked yet,
We need that bitmap to handle the overflow max_pid case. We are _not_
returning just increasing pid numbers.
Linus
On Wed, Jun 09, 2010 at 10:43:33AM -0700, Linus Torvalds wrote:
>
> We need that bitmap to handle the overflow max_pid case. We are _not_
> returning just increasing pid numbers.
Doh! I knew I was forgetting something obvious. I was hoping we
could get rid of the bitmap entirely, but I guess not....
(Unless users would stand for 64-bit pid numbers... no? Dang. :-)
- Ted
On Wed, Jun 9, 2010 at 10:47 AM, <[email protected]> wrote:
> On Wed, Jun 09, 2010 at 10:43:33AM -0700, Linus Torvalds wrote:
>>
>> We need that bitmap to handle the overflow max_pid case. We are _not_
>> returning just increasing pid numbers.
>
> Doh! ?I knew I was forgetting something obvious. ?I was hoping we
> could get rid of the bitmap entirely, but I guess not....
>
> (Unless users would stand for 64-bit pid numbers... no? ?Dang. :-)
>
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?- Ted
>
(sorry about the previous message, to those who got it... my mail
client silently switched to HTML mode)
I am working on a new version of the change taking into account
comments (both about substance and style) by Michel, Ted and Linus. I
agree with Michel in that I am not sure that the rare case of same
last_pid being set by two threads is worth fixing.