2010-06-11 17:18:18

by Salman Qazi

Subject: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.

Fixed the whitespace issue that Michel pointed out.

Btw., who is responsible for ACKing this?

--

A program that repeatedly forks and waits is susceptible to having the
same pid repeated, especially when it competes with another instance of the
same program. This is really bad for the bash implementation. Furthermore,
many shell scripts assume that pid numbers will not be reused for some length
of time.

Race Description:

        A                                    B

// pid == offset == n                  // pid == offset == n + 1
test_and_set_bit(offset, map->page)
                                       test_and_set_bit(offset, map->page);
                                       pid_ns->last_pid = pid;
pid_ns->last_pid = pid;
                                       // pid == n + 1 is freed (wait())

                                       // Next fork()...
                                       last = pid_ns->last_pid; // == n
                                       pid = last + 1;

Code to reproduce it (Running multiple instances is more effective):

#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

// The distance mod 32768 between two pids, where the first pid is expected
// to be smaller than the second.
int PidDistance(pid_t first, pid_t second) {
  return (second + 32768 - first) % 32768;
}

int main(int argc, char* argv[]) {
  int failed = 0;
  pid_t last_pid = 0;
  int i;
  printf("%zu\n", sizeof(pid_t));
  for (i = 0; i < 10000000; ++i) {
    if (i % 32768 == 0)
      printf("Iter: %d\n", i / 32768);
    int child_exit_code = i % 256;
    pid_t pid = fork();
    if (pid == -1) {
      fprintf(stderr, "fork failed, iteration %d, errno=%d\n", i, errno);
      exit(1);
    }
    if (pid == 0) {
      // Child
      exit(child_exit_code);
    } else {
      // Parent
      if (i > 0) {
        int distance = PidDistance(last_pid, pid);
        if (distance == 0 || distance > 30000) {
          fprintf(stderr,
                  "Unexpected pid sequence: previous fork: pid=%d, "
                  "current fork: pid=%d for iteration=%d.\n",
                  last_pid, pid, i);
          failed = 1;
        }
      }
      last_pid = pid;
      int status;
      pid_t reaped = wait(&status);
      if (reaped != pid) {
        fprintf(stderr,
                "Wait return value: expected pid=%d, "
                "got %d, iteration %d\n",
                pid, reaped, i);
        failed = 1;
      } else if (WEXITSTATUS(status) != child_exit_code) {
        fprintf(stderr,
                "Unexpected exit status %x, iteration %d\n",
                WEXITSTATUS(status), i);
        failed = 1;
      }
    }
  }
  exit(failed);
}


Thanks to Ted Ts'o for the key ideas of this implementation.

Signed-off-by: Salman Qazi <[email protected]>
---
kernel/pid.c | 40 +++++++++++++++++++++++++++++++++++++++-
1 files changed, 39 insertions(+), 1 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index e9fd8c1..8e9067c 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,6 +122,20 @@ static void free_pidmap(struct upid *upid)
         atomic_inc(&map->nr_free);
 }
 
+/*
+ * If we started walking pids at 'base', is 'a' seen before 'b'?
+ */
+static int pid_before(int base, int a, int b)
+{
+        /*
+         * This is the same as saying
+         *
+         * (a - base + MAXUINT) % MAXUINT < (b - base + MAXUINT) % MAXUINT
+         * and that mapping orders 'a' and 'b' with respect to 'base'.
+         */
+        return (unsigned)(a - base) < (unsigned)(b - base);
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
         int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -153,8 +167,32 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
         if (likely(atomic_read(&map->nr_free))) {
                 do {
                         if (!test_and_set_bit(offset, map->page)) {
+                                int prev;
+                                int last_write = last;
                                 atomic_dec(&map->nr_free);
-                                pid_ns->last_pid = pid;
+
+                                /*
+                                 * We might be racing with someone else
+                                 * trying to set pid_ns->last_pid.
+                                 * We want the winner to have the
+                                 * "later" value, because if the
+                                 * "earlier" value prevails, then
+                                 * a pid may get reused immediately.
+                                 *
+                                 * Since pids rollover, it is not
+                                 * sufficient to just pick the bigger
+                                 * value. We have to consider
+                                 * where we started counting from.
+                                 */
+                                do {
+                                        prev = last_write;
+                                        last_write = cmpxchg(
+                                                &pid_ns->last_pid,
+                                                prev, pid);
+                                } while ((prev != last_write) &&
+                                         (pid_before(last, last_write,
+                                                      pid)));
+
                                 return pid;
                         }
                         offset = find_next_offset(map, offset);
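
As a standalone illustration of the wraparound ordering that pid_before() relies on (this is not part of the patch, and the values below are made up), a small userspace program:

#include <stdio.h>

/* Same comparison as in the patch: counting up from 'base' (and wrapping),
 * is 'a' visited before 'b'? */
static int pid_before(int base, int a, int b)
{
        return (unsigned)(a - base) < (unsigned)(b - base);
}

int main(void)
{
        /* Hypothetical values: the walk started at base == 32765 and wrapped. */
        int base = 32765;
        int stale = 32766;      /* pid handed out just before the wrap */
        int fresh = 301;        /* pid handed out just after the wrap  */

        /* A plain "bigger value wins" rule would keep 32766... */
        printf("plain compare keeps stale pid: %d\n", stale > fresh);   /* 1 */

        /* ...but relative to where the walk started, 301 came later,
         * so it is the one that should end up in last_pid. */
        printf("pid_before(base, stale, fresh): %d\n",
               pid_before(base, stale, fresh));                         /* 1 */
        return 0;
}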


2010-06-11 17:44:42

by Linus Torvalds

Subject: Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.

On Fri, Jun 11, 2010 at 10:17 AM, Salman <[email protected]> wrote:
> Fixed the whitespace issue that Michel pointed out.
>
> Btw., who is responsible for ACKing this?

I don't know about "responsible", but I'll Ack it. Much of that code
is really old. And exactly since this is a really old issue, I think
I'll leave it unapplied for now, and let it simmer in some test-queue
(get it into next somehow?) until I get back.

Also, now that I look at that complex end-condition for the while-loop
and the big comment, I do start to think that Ingo was right, and it
would be better to make that thing a helper function of its own,
called something like "set_last_pid(pid_ns, last, pid);"

That would lessen the indentation a lot too, and with the commentary,
it would all look fairly pretty.

So Ack-with-cleanup-suggestion. And maybe Andrew can take it into -mm
while I'm gone? Or Ingo into some core-branch? You fight it out, but I
think this is already acceptable, just perhaps still open to
improvement.

Linus

2010-06-11 22:50:17

by Salman Qazi

Subject: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.

A program that repeatedly forks and waits is susceptible to having the
same pid repeated, especially when it competes with another instance of the
same program. This is really bad for the bash implementation. Furthermore,
many shell scripts assume that pid numbers will not be reused for some length
of time.

Race Description:

        A                                    B

// pid == offset == n                  // pid == offset == n + 1
test_and_set_bit(offset, map->page)
                                       test_and_set_bit(offset, map->page);
                                       pid_ns->last_pid = pid;
pid_ns->last_pid = pid;
                                       // pid == n + 1 is freed (wait())

                                       // Next fork()...
                                       last = pid_ns->last_pid; // == n
                                       pid = last + 1;

Code to reproduce it (Running multiple instances is more effective):

#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

// The distance mod 32768 between two pids, where the first pid is expected
// to be smaller than the second.
int PidDistance(pid_t first, pid_t second) {
  return (second + 32768 - first) % 32768;
}

int main(int argc, char* argv[]) {
  int failed = 0;
  pid_t last_pid = 0;
  int i;
  printf("%zu\n", sizeof(pid_t));
  for (i = 0; i < 10000000; ++i) {
    if (i % 32768 == 0)
      printf("Iter: %d\n", i / 32768);
    int child_exit_code = i % 256;
    pid_t pid = fork();
    if (pid == -1) {
      fprintf(stderr, "fork failed, iteration %d, errno=%d\n", i, errno);
      exit(1);
    }
    if (pid == 0) {
      // Child
      exit(child_exit_code);
    } else {
      // Parent
      if (i > 0) {
        int distance = PidDistance(last_pid, pid);
        if (distance == 0 || distance > 30000) {
          fprintf(stderr,
                  "Unexpected pid sequence: previous fork: pid=%d, "
                  "current fork: pid=%d for iteration=%d.\n",
                  last_pid, pid, i);
          failed = 1;
        }
      }
      last_pid = pid;
      int status;
      pid_t reaped = wait(&status);
      if (reaped != pid) {
        fprintf(stderr,
                "Wait return value: expected pid=%d, "
                "got %d, iteration %d\n",
                pid, reaped, i);
        failed = 1;
      } else if (WEXITSTATUS(status) != child_exit_code) {
        fprintf(stderr,
                "Unexpected exit status %x, iteration %d\n",
                WEXITSTATUS(status), i);
        failed = 1;
      }
    }
  }
  exit(failed);
}


Thanks to Ted Ts'o for the key ideas of this implementation.

Signed-off-by: Salman Qazi <[email protected]>
---
kernel/pid.c | 39 ++++++++++++++++++++++++++++++++++++++-
1 files changed, 38 insertions(+), 1 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index e9fd8c1..fbbd5f6 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,6 +122,43 @@ static void free_pidmap(struct upid *upid)
         atomic_inc(&map->nr_free);
 }
 
+/*
+ * If we started walking pids at 'base', is 'a' seen before 'b'?
+ */
+static int pid_before(int base, int a, int b)
+{
+        /*
+         * This is the same as saying
+         *
+         * (a - base + MAXUINT) % MAXUINT < (b - base + MAXUINT) % MAXUINT
+         * and that mapping orders 'a' and 'b' with respect to 'base'.
+         */
+        return (unsigned)(a - base) < (unsigned)(b - base);
+}
+
+/*
+ * We might be racing with someone else trying to set pid_ns->last_pid.
+ * We want the winner to have the "later" value, because if the
+ * "earlier" value prevails, then a pid may get reused immediately.
+ *
+ * Since pids rollover, it is not sufficient to just pick the bigger
+ * value. We have to consider where we started counting from.
+ *
+ * 'base' is the value of pid_ns->last_pid that we observed when
+ * we started looking for a pid.
+ *
+ * 'pid' is the pid that we eventually found.
+ */
+static void set_last_pid(struct pid_namespace *pid_ns, int base, int pid)
+{
+        int prev;
+        int last_write = base;
+        do {
+                prev = last_write;
+                last_write = cmpxchg(&pid_ns->last_pid, prev, pid);
+        } while ((prev != last_write) && (pid_before(base, last_write, pid)));
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
         int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -154,7 +191,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
                 do {
                         if (!test_and_set_bit(offset, map->page)) {
                                 atomic_dec(&map->nr_free);
-                                pid_ns->last_pid = pid;
+                                set_last_pid(pid_ns, last, pid);
                                 return pid;
                         }
                         offset = find_next_offset(map, offset);
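
For readers who want to poke at the retry loop outside the kernel, below is a rough userspace model of set_last_pid() built on C11 atomics and pthreads instead of the kernel's cmpxchg(); the global variable, the thread harness and the concrete pid values are invented for illustration only.

#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

/* Stand-in for pid_ns->last_pid. */
static _Atomic int last_pid = 5;

static int pid_before(int base, int a, int b)
{
        return (unsigned)(a - base) < (unsigned)(b - base);
}

/* Model of set_last_pid(): retry until either our compare-and-swap
 * succeeds or the value already stored is "later" (with respect to
 * 'base') than the pid we allocated. */
static void set_last_pid(int base, int pid)
{
        int prev = base;

        while (!atomic_compare_exchange_weak(&last_pid, &prev, pid)) {
                /* 'prev' now holds whatever another thread wrote. */
                if (!pid_before(base, prev, pid))
                        break;  /* the other writer's pid is later; let it stand */
        }
}

static void *alloc_one(void *arg)
{
        /* Both threads observed base == 5 before racing to update last_pid. */
        set_last_pid(5, (int)(long)arg);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, alloc_one, (void *)6L);   /* "allocated" pid 6 */
        pthread_create(&b, NULL, alloc_one, (void *)7L);   /* "allocated" pid 7 */
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        /* Whatever the interleaving, the later pid (7) wins. */
        printf("last_pid = %d\n", atomic_load(&last_pid));
        return 0;
}

Built with something like cc -pthread, this should always print last_pid = 7: if the thread with pid 7 writes first, the other thread's CAS fails and pid_before() tells it to back off; if the thread with pid 6 writes first, the other thread retries and overwrites it.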

2010-06-11 23:08:29

by Linus Torvalds

Subject: Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.

Yup. Much prettier. Thanks.

Ingo, Andrew, you can now have a cage-match on who actually takes it.

Linus

2010-06-14 23:59:54

by Andrew Morton

Subject: Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.

On Fri, 11 Jun 2010 15:49:54 -0700
Salman <[email protected]> wrote:

> A program that repeatedly forks and waits is susceptible to having the
> same pid repeated, especially when it competes with another instance of the
> same program. This is really bad for the bash implementation. Furthermore,
> many shell scripts assume that pid numbers will not be reused for some length
> of time.
>
> Race Description:
>
> ...
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index e9fd8c1..fbbd5f6 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -122,6 +122,43 @@ static void free_pidmap(struct upid *upid)
> atomic_inc(&map->nr_free);
> }
>
> +/*
> + * If we started walking pids at 'base', is 'a' seen before 'b'?
> + */
> +static int pid_before(int base, int a, int b)
> +{
> + /*
> + * This is the same as saying
> + *
> + * (a - base + MAXUINT) % MAXUINT < (b - base + MAXUINT) % MAXUINT
> + * and that mapping orders 'a' and 'b' with respect to 'base'.
> + */
> + return (unsigned)(a - base) < (unsigned)(b - base);
> +}

pid.c uses an exotic mix of `int' and `pid_t' to represent pids. `int'
seems to preponderate.

> +/*
> + * We might be racing with someone else trying to set pid_ns->last_pid.
> + * We want the winner to have the "later" value, because if the
> + * "earlier" value prevails, then a pid may get reused immediately.
> + *
> + * Since pids rollover, it is not sufficient to just pick the bigger
> + * value. We have to consider where we started counting from.
> + *
> + * 'base' is the value of pid_ns->last_pid that we observed when
> + * we started looking for a pid.
> + *
> + * 'pid' is the pid that we eventually found.
> + */
> +static void set_last_pid(struct pid_namespace *pid_ns, int base, int pid)
> +{
> + int prev;
> + int last_write = base;
> + do {
> + prev = last_write;
> + last_write = cmpxchg(&pid_ns->last_pid, prev, pid);
> + } while ((prev != last_write) && (pid_before(base, last_write, pid)));
> +}

<gets distracted>

hm. For a long time cmpxchg() wasn't available on all architectures.
That _seems_ to have been fixed.

arch/score assumes that cmpxchg() operates on unsigned longs.

arch/powerpc plays the necessary games to make 4- and 8-byte scalars work.

ia64 handles 1, 2, 4 and 8-byte quantities.

arm handles 1, 2 and 4-byte scalars.

as does blackfin.

So from the few architectures I looked at, it seems that we do indeed
handle cmpxchg() on all architectures although not very consistently.
arch/score will blow up if someone tries to use cmpxchg() on 1- or
2-byte scalars.

<looks at the consumers>

infiniband does cmpxchg() on u64*'s, which will blow up on many
architectures.

Using

grep -r '[ ]cmpxchg[^_]' . | grep -v /arch/

I can't see any cmpxchg() callers in truly generic code. lockdep and
kernel/trace/ring_buffer.c aren't used on the more remote
architectures, I think.

Traditionally, atomic_cmpxchg() was the safe and portable one to use.
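
For comparison, here is what a set_last_pid() written against that interface might look like; this is purely hypothetical, since it assumes pid_ns->last_pid were declared as an atomic_t, which the posted patch does not do:

/* Hypothetical variant: only valid if struct pid_namespace used
 * 'atomic_t last_pid;' instead of a plain int. */
static void set_last_pid(struct pid_namespace *pid_ns, int base, int pid)
{
        int prev, last_write = base;

        do {
                prev = last_write;
                /* atomic_cmpxchg() returns the value that was actually
                 * in the variable, just like cmpxchg() does. */
                last_write = atomic_cmpxchg(&pid_ns->last_pid, prev, pid);
        } while ((prev != last_write) && pid_before(base, last_write, pid));
}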

2010-06-15 00:56:25

by Theodore Ts'o

Subject: Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.

On Mon, Jun 14, 2010 at 04:58:51PM -0700, Andrew Morton wrote:
> Using
>
> grep -r '[ ]cmpxchg[^_]' . | grep -v /arch/
>
> I can't see any cmpxchg() callers in truly generic code. lockdep and
> kernel/trace/ring_buffer.c aren't used on the more remote
> architectures, I think.

What about:

drivers/gpu/drm/drm_lock.c: prev = cmpxchg(lock, old, new);
kernel/lockdep.c: n = cmpxchg(&nr_chain_hlocks, cn, cn + chain->de
kernel/sched_clock.c: if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc
fs/btrfs/inode.c: if (cmpxchg(&root->orphan_cleanup_state, 0, ORPHAN_CLEAN
fs/ext4/inode.c: } while (cmpxchg(&ei->i_flags, old_fl, new_fl) != old_fl

The last is quite new --- I had just recently done a similar set of
research to yours before accepting the patch that added cmpxchg into
ext4 (during the last merge window), and I thought cmpxchg() had
entered the "supported by all architectures" category. It looked like
it had only recently reached that state, but I had reached the
conclusion that it was safe to use.

- Ted

2010-06-15 01:58:23

by Andrew Morton

Subject: Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.

On Mon, 14 Jun 2010 20:56:19 -0400 [email protected] wrote:

> On Mon, Jun 14, 2010 at 04:58:51PM -0700, Andrew Morton wrote:
> > Using
> >
> > grep -r '[ ]cmpxchg[^_]' . | grep -v /arch/
> >
> > I can't see any cmpxchg() callers in truly generic code. lockdep and
> > kernel/trace/ring_buffer.c aren't used on the more remote
> > architectures, I think.
>
> What about:
>
> drivers/gpu/drm/drm_lock.c: prev = cmpxchg(lock, old, new);
> kernel/lockdep.c: n = cmpxchg(&nr_chain_hlocks, cn, cn + chain->de

I put these in the not-used-on-weird-architectures bucket.

> kernel/sched_clock.c: if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc

I guess that'll flush out any stragglers.

I suspect sched_clock.c might be generating fair amounts of code which
UP builds don't need.

> fs/btrfs/inode.c: if (cmpxchg(&root->orphan_cleanup_state, 0, ORPHAN_CLEAN
> fs/ext4/inode.c: } while (cmpxchg(&ei->i_flags, old_fl, new_fl) != old_fl
>
> The last is quite new --- I had just recently done a similar set of
> research as you did before accepting the patch that added cmpxchg into
> ext4 (during the last merge window), and I thought cmpxchg() had
> entered the "supported by all architectures" category. It looked like
> it had only recently reached that state, but I had reached the conclusion
> that it was safe to use.

I think you're probably right, as long as one sticks with 4-byte
scalars. The cmpxchg-is-now-generic change snuck in under the radar
(mine, at least).

2010-06-15 03:26:18

by Paul Mackerras

Subject: Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.

On Mon, Jun 14, 2010 at 06:55:56PM -0700, Andrew Morton wrote:

> > kernel/sched_clock.c: if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc
>
> I guess that'll flush out any stragglers.

And break most non-x86 32-bit architectures, including 32-bit powerpc.
Fortunately that code is only used if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
is set, and it looks like only x86 and ia64 set it.

Paul.

2010-06-15 04:23:10

by Andrew Morton

Subject: Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.

On Tue, 15 Jun 2010 13:26:08 +1000 Paul Mackerras <[email protected]> wrote:

> On Mon, Jun 14, 2010 at 06:55:56PM -0700, Andrew Morton wrote:
>
> > > kernel/sched_clock.c: if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc
> >
> > I guess that'll flush out any stragglers.
>
> And break most non-x86 32-bit architectures, including 32-bit powerpc.

If CONFIG_SMP=y, yes. On UP there's a generic implementation
(include/asm-generic/cmpxchg-local.h, include/asm-generic/cmpxchg.h)

> Fortunately that code is only used if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
> is set, and it looks like only x86 and ia64 set it.
>

If that happens then the best fix is for those architectures to get
themselves a cmpxchg64(). Unless for some reason it's simply
unimplementable? Worst case I guess one could use a global spinlock.
Second-worst-case: hashed spinlocks.

2010-06-15 04:38:54

by Eric Dumazet

Subject: Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.

On Monday, 14 June 2010 at 21:21 -0700, Andrew Morton wrote:
> On Tue, 15 Jun 2010 13:26:08 +1000 Paul Mackerras <[email protected]> wrote:
>
> > On Mon, Jun 14, 2010 at 06:55:56PM -0700, Andrew Morton wrote:
> >
> > > > kernel/sched_clock.c: if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc
> > >
> > > I guess that'll flush out any stragglers.
> >
> > And break most non-x86 32-bit architectures, including 32-bit powerpc.
>
> If CONFIG_SMP=y, yes. On UP there's a generic implementation
> (include/asm-generic/cmpxchg-local.h, include/asm-generic/cmpxchg.h)
>
> > Fortunately that code is only used if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
> > is set, and it looks like only x86 and ia64 set it.
> >
>
> If that happens then the best fix is for those architectures to get
> themselves a cmpxchg64(). Unless for some reason it's simply
> unimplementable? Worst case I guess one could use a global spinlock.
> Second-worst-case: hashed spinlocks.

Hmm, this reminds me of a patch I had somewhere, not yet sent to David :)

1) cmpxchg() was not available for all arches; maybe
atomic_long_cmpxchg() is?

2) I am not sure atomic_long_t guarantees that all 32 or 64 bits of the
underlying long container are available.

Thanks!

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a291edb..335ca89 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1268,17 +1268,18 @@ skip_hashing:
 
 void rt_bind_peer(struct rtable *rt, int create)
 {
-        static DEFINE_SPINLOCK(rt_peer_lock);
         struct inet_peer *peer;
 
         peer = inet_getpeer(rt->rt_dst, create);
 
-        spin_lock_bh(&rt_peer_lock);
-        if (rt->peer == NULL) {
-                rt->peer = peer;
+#if defined(__HAVE_ARCH_CMPXCHG)
+        if (!cmpxchg(&rt->peer, NULL, peer))
                 peer = NULL;
-        }
-        spin_unlock_bh(&rt_peer_lock);
+#else
+        BUILD_BUG_ON(sizeof(rt->peer) != sizeof(atomic_long_t));
+        if (!atomic_long_cmpxchg((atomic_long_t *)&rt->peer, 0, (long)peer))
+                peer = NULL;
+#endif
         if (peer)
                 inet_putpeer(peer);
 }

2010-06-15 06:59:48

by Benjamin Herrenschmidt

Subject: Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.

On Mon, 2010-06-14 at 21:21 -0700, Andrew Morton wrote:
> If that happens then the best fix is for those architectures to get
> themselves a cmpxchg64(). Unless for some reason it's simply
> unimplementable? Worst case I guess one could use a global spinlock.
> Second-worst-case: hashed spinlocks.

Right, ppc32 at least can't, so that would be spinlocks...

Cheers,
Ben.

2010-06-15 07:25:45

by Paul Mackerras

Subject: Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.

On Mon, Jun 14, 2010 at 09:21:50PM -0700, Andrew Morton wrote:

> If that happens then the best fix is for those architectures to get
> themselves a cmpxchg64(). Unless for some reason it's simply
> unimplementable? Worst case I guess one could use a global spinlock.
> Second-worst-case: hashed spinlocks.

Spinlocks won't help as long as you can use cmpxchg64 on any old u64
that is also accessed directly. UP can disable interrupts, of course,
but for SMP we would have to restrict cmpxchg64 to some type
(atomic64_t, maybe) for which you have to use an accessor function to
read it, so we can take the spinlock around the reads.

I suspect it isn't worth the trouble. Note that I'm talking
specifically about cmpxchg64 here, not cmpxchg.

Paul.

2010-06-15 12:56:54

by Theodore Ts'o

Subject: Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.

On Mon, Jun 14, 2010 at 06:55:56PM -0700, Andrew Morton wrote:
>
> I think you're probably right, as long as one sticks with 4-byte
> scalars. The cmpxchg-is-now-generic change snuck in under the radar
> (mine, at least).

Hmmm, what about unsigned longs? (Which might be 8 bytes on some
architectures....)

- Ted

2010-06-15 13:06:46

by Kyle McMartin

Subject: Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.

On Tue, Jun 15, 2010 at 08:56:50AM -0400, [email protected] wrote:
> > I think you're probably right, as long as one sticks with 4-byte
> > scalars. The cmpxchg-is-now-generic change snuck in under the radar
> > (mine, at least).
>
> Hmmm, what about unsigned longs? (Which might be 8 bytes on some
> architectures....)
>

Longs are fine, since Linux only supports LP64 (and would need major work
to support anything else.)

The problem documented above is that on 32-bit, a 64-bit read is
non-atomic, so even if you use a hashed spinlock to protect a u64
variable on 32-bit, reads will be non-atomic, and so must take the same
lock in order to be safe. Hence, you need accessor functions.

This is what the generic atomic code does; perhaps we could add a new
API that gives us hooks to do proper hashed spinlocks on crap
architectures but falls back to simple assignment and real cmpxchg on
real platforms.
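
Concretely, the accessor pattern looks something like the sketch below. This is loosely modeled on the spinlock-based fallback for 64-bit atomics; the type and function names are made up for illustration, and a real implementation would hash the address to one of several locks rather than use a single global one.

typedef struct {
        u64 counter;
} my_u64_t;

static DEFINE_SPINLOCK(my_u64_lock);

static u64 my_u64_read(my_u64_t *v)
{
        unsigned long flags;
        u64 val;

        /* Readers must take the same lock, or they can observe a torn
         * value made of one old and one new 32-bit half. */
        spin_lock_irqsave(&my_u64_lock, flags);
        val = v->counter;
        spin_unlock_irqrestore(&my_u64_lock, flags);
        return val;
}

static u64 my_u64_cmpxchg(my_u64_t *v, u64 old, u64 new)
{
        unsigned long flags;
        u64 prev;

        spin_lock_irqsave(&my_u64_lock, flags);
        prev = v->counter;
        if (prev == old)
                v->counter = new;
        spin_unlock_irqrestore(&my_u64_lock, flags);
        return prev;
}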

--Kyle

2010-06-15 14:36:36

by Peter Zijlstra

Subject: Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.

On Mon, 2010-06-14 at 18:55 -0700, Andrew Morton wrote:
> > kernel/sched_clock.c: if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc
>
> I guess that'll flush out any stragglers.

cmpxchg64() is special; at the time, i386 didn't handle the 8-byte
cmpxchg(), although we could easily make it do so today.

> I suspect sched_clock.c might be generating fair amounts of code which
> UP builds don't need.

Only sched_clock_remote() and its caller, something like the below, not
much code..

UP machines can still have utterly sucky TSC, although the
inter-cpu-drift thing isn't much of an issue ;-)

---
kernel/sched_clock.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
index 52f1a14..7ff5b56 100644
--- a/kernel/sched_clock.c
+++ b/kernel/sched_clock.c
@@ -170,6 +170,7 @@ again:
         return clock;
 }
 
+#ifdef CONFIG_SMP
 static u64 sched_clock_remote(struct sched_clock_data *scd)
 {
         struct sched_clock_data *my_scd = this_scd();
@@ -205,6 +206,7 @@ again:
 
         return val;
 }
+#endif
 
 /*
  * Similar to cpu_clock(), but requires local IRQs to be disabled.
@@ -226,9 +228,11 @@ u64 sched_clock_cpu(int cpu)
 
         scd = cpu_sdc(cpu);
 
+#ifdef CONFIG_SMP
         if (cpu != smp_processor_id())
                 clock = sched_clock_remote(scd);
         else
+#endif
                 clock = sched_clock_local(scd);
 
         return clock;

2010-06-15 18:37:19

by Oleg Nesterov

Subject: [PATCH 0/1] (Was: Fix a race in pid generation that causes pids to be reused immediately.)

On 06/11, Salman wrote:
>
> A program that repeatedly forks and waits is susceptible to having the
> same pid repeated,

I agree this patch should fix this.

But the kernel can still use the same pid twice in a row, and I am wondering
if there is a simple solution to prevent this. Probably not, and this case
is unlikely.

While we are here, and given that many reviewers recently inspected
alloc_pidmap(), perhaps someone can explain the mysterious checks
in the do/while loop?

See the patch I am going to send; it is on top of Salman's fix but doesn't
really depend on it.

Oleg.

2010-06-15 18:37:47

by Oleg Nesterov

Subject: [PATCH 1/1] pids: alloc_pidmap: remove the unnecessary boundary checks

alloc_pidmap() calculates max_scan so that if the initial offset != 0
we inspect the first map->page twice. This is correct: we want to find
the unused bits < offset in this bitmap block. Add a comment.

But it doesn't make any sense to stop the find_next_offset() loop when
we are looking into this map->page for the second time. We have already
checked the bits >= offset during the first attempt, so it is fine
to do this again, no matter whether we succeed this time or not.

Remove this hard-to-understand code. It optimizes the very unlikely case
when we are going to fail, but slows down the more likely case.

Signed-off-by: Oleg Nesterov <[email protected]>
---

kernel/pid.c | 17 +++++++----------
1 file changed, 7 insertions(+), 10 deletions(-)

--- 35-rc2/kernel/pid.c~PID_CLEANUP 2010-06-15 19:33:14.000000000 +0200
+++ 35-rc2/kernel/pid.c 2010-06-15 19:52:03.000000000 +0200
@@ -169,7 +169,12 @@ static int alloc_pidmap(struct pid_names
                 pid = RESERVED_PIDS;
         offset = pid & BITS_PER_PAGE_MASK;
         map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
-        max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+        /*
+         * If last_pid points into the middle of the map->page we
+         * want to scan this bitmap block twice, the second time
+         * we start with offset == 0 (or RESERVED_PIDS).
+         */
+        max_scan = DIV_ROUND_UP(pid_max, BITS_PER_PAGE) - !offset;
         for (i = 0; i <= max_scan; ++i) {
                 if (unlikely(!map->page)) {
                         void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
@@ -196,15 +201,7 @@ static int alloc_pidmap(struct pid_names
                                 }
                                 offset = find_next_offset(map, offset);
                                 pid = mk_pid(pid_ns, map, offset);
-                        /*
-                         * find_next_offset() found a bit, the pid from it
-                         * is in-bounds, and if we fell back to the last
-                         * bitmap block and the final block was the same
-                         * as the starting point, pid is before last_pid.
-                         */
-                        } while (offset < BITS_PER_PAGE && pid < pid_max &&
-                                 (i != max_scan || pid < last ||
-                                  !((last+1) & BITS_PER_PAGE_MASK)));
+                        } while (offset < BITS_PER_PAGE && pid < pid_max);
                 }
                 if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
                         ++map;

2010-06-15 19:37:51

by Andrew Morton

Subject: Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.

On Tue, 15 Jun 2010 16:35:52 +0200
Peter Zijlstra <[email protected]> wrote:

> > I suspect sched_clock.c might be generating fair amounts of code which
> > UP builds don't need.
>
> Only sched_clock_remote() and its caller, something like the below, not
> much code..
>
> UP machines can still have utterly sucky TSC, although the
> inter-cpu-drift thing isn't much of an issue ;-)
>
> ---
> kernel/sched_clock.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
> index 52f1a14..7ff5b56 100644
> --- a/kernel/sched_clock.c
> +++ b/kernel/sched_clock.c
> @@ -170,6 +170,7 @@ again:
> return clock;
> }
>
> +#ifdef CONFIG_SMP
> static u64 sched_clock_remote(struct sched_clock_data *scd)
> {
> struct sched_clock_data *my_scd = this_scd();
> @@ -205,6 +206,7 @@ again:
>
> return val;
> }
> +#endif
>
> /*
> * Similar to cpu_clock(), but requires local IRQs to be disabled.
> @@ -226,9 +228,11 @@ u64 sched_clock_cpu(int cpu)
>
> scd = cpu_sdc(cpu);
>
> +#ifdef CONFIG_SMP
> if (cpu != smp_processor_id())
> clock = sched_clock_remote(scd);
> else
> +#endif
> clock = sched_clock_local(scd);
>
> return clock;

hm, OK, I was actually looking at sched_clock_local() at the time. Can
clocks go backwards on UP hardware? What breaks if we do #define
sched_clock_local sched_clock?

I've mentioned this before, but sched_clock.c is really opaque - it
would be a formidable task for anyone to get in there and work on the
code if they hadn't already been working on it for years.