2004-09-12 08:59:22

by Anton Blanchard

Subject: /proc/sys/kernel/pid_max issues


Hi,

I tried creating 100,000 threads just for the hell of it. I was
surprised that it appears to have worked even with pid_max set at 32k.

It seems if we are above pid_max we wrap back to RESERVED_PIDS at the
start of alloc_pidmap but do not enforce this upper limit. I guess every
call of alloc_pidmap above 32k was wrapping back to RESERVED_PIDS,
walking the allocated space then allocating off the end.

Just as an aside, does it make sense to remove the pidmap allocator and
use the IDR allocator now that it's there?
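
For the record, a rough sketch of what a pid allocator sitting on top of
the 2.6-era IDR interface (idr_pre_get()/idr_get_new()) might look like;
the function name, the locking, and the pid_max handling below are
illustrative guesses, not anything actually posted in this thread:

#include <linux/idr.h>
#include <linux/gfp.h>
#include <linux/spinlock.h>

static struct idr pid_idr;      /* idr_init(&pid_idr) must run at boot */
static spinlock_t pid_idr_lock = SPIN_LOCK_UNLOCKED;

/* hypothetical IDR-based replacement for alloc_pidmap() */
static int alloc_pid_idr(void *task)
{
        int id, err;

        do {
                if (!idr_pre_get(&pid_idr, GFP_KERNEL))
                        return -1;              /* out of memory */
                spin_lock(&pid_idr_lock);
                err = idr_get_new(&pid_idr, task, &id);
                spin_unlock(&pid_idr_lock);
        } while (err == -EAGAIN);

        if (err)
                return -1;
        if (id >= pid_max) {                    /* ran past the sysctl limit */
                spin_lock(&pid_idr_lock);
                idr_remove(&pid_idr, id);
                spin_unlock(&pid_idr_lock);
                return -1;
        }
        return id;
}

One semantic difference worth noting: idr_get_new() hands out the lowest
free id rather than cycling upward from last_pid, so pid reuse would be
far more aggressive than with the current bitmap walk.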

Now once I had managed to allocate those 100,000 threads, I noticed
this:

18446744071725383682 dr-xr-xr-x 3 root root 0 Sep 12 08:10 100796

Strange huh. Turns out we allocate inodes in proc via:

#define fake_ino(pid,ino) (((pid)<<16)|(ino))

With 32-bit inodes we are screwed once pids go over 64k, aren't we?
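
As a quick userspace illustration of the overflow behind the listing
above (the 2 in the low bits is only a stand-in for the real /proc
directory entry number, so treat the exact low bits as approximate):

#include <stdio.h>

#define fake_ino(pid,ino) (((pid)<<16)|(ino))

int main(void)
{
        int pid = 100796;
        /* The shift happens in 32-bit int arithmetic, so the high bits
         * are lost and the value goes negative (formally undefined,
         * but this is what happens in practice)... */
        int ino = fake_ino(pid, 2);
        /* ...and then sign-extends when widened to the 64-bit i_ino: */
        unsigned long long i_ino = ino;

        printf("pid %d -> %d -> %llu\n", pid, ino, i_ino);
        /* prints: pid 100796 -> -1984167934 -> 18446744071725383682 */
        return 0;
}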

Anton


2004-09-12 09:38:13

by William Lee Irwin III

Subject: Re: /proc/sys/kernel/pid_max issues

On Sun, Sep 12, 2004 at 06:56:09PM +1000, Anton Blanchard wrote:
> I tried creating 100,000 threads just for the hell of it. I was
> surprised that it appears to have worked even with pid_max set at 32k.
> It seems if we are above pid_max we wrap back to RESERVED_PIDS at the
> start of alloc_pidmap but do not enforce this upper limit. I guess every
> call of alloc_pidmap above 32k was wrapping back to RESERVED_PIDS,
> walking the allocated space then allocating off the end.

Well, it looks like things blindly move to the next bitmap block and
don't check that the offset within it is valid, so the patch below
(not runtime or compile-time tested) may help. It relies on
next_free_map() returning NULL after having cycled through enough of
the bitmap blocks.


On Sun, Sep 12, 2004 at 06:56:09PM +1000, Anton Blanchard wrote:
> Just as an aside, does it make sense to remove the pidmap allocator and
> use the IDR allocator now that it's there?
> Now once I had managed to allocate those 100,000 threads, I noticed
> this:
> 18446744071725383682 dr-xr-xr-x 3 root root 0 Sep 12 08:10 100796
> Strange huh. Turns out we allocate inodes in proc via:
> #define fake_ino(pid,ino) (((pid)<<16)|(ino))
> With 32-bit inodes we are screwed once pids go over 64k, aren't we?

The /proc/ code is rather messy around this area so I don't have any
immediate ideas there.


-- wli


Index: mm4-2.6.9-rc1/kernel/pid.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/pid.c 2004-09-08 05:46:18.000000000 -0700
+++ mm4-2.6.9-rc1/kernel/pid.c 2004-09-12 02:25:37.679451472 -0700
@@ -75,6 +75,8 @@
while (--*max_steps) {
if (++map == map_limit)
map = pidmap_array;
+ if (map > &pidmap_array[pid_max/BITS_PER_PAGE])
+ map = pidmap_array;
if (unlikely(!map->page)) {
unsigned long page = get_zeroed_page(GFP_KERNEL);
/*
@@ -135,13 +137,12 @@
*/
scan_more:
offset = find_next_zero_bit(map->page, BITS_PER_PAGE, offset);
- if (offset >= BITS_PER_PAGE)
+ pid = (map - pidmap_array) * BITS_PER_PAGE + offset;
+ if (offset >= BITS_PER_PAGE || pid >= pid_max)
goto next_map;
if (test_and_set_bit(offset, map->page))
goto scan_more;
-
/* we got the PID: */
- pid = (map - pidmap_array) * BITS_PER_PAGE + offset;
goto return_pid;

failure:

2004-09-12 09:41:12

by Ingo Molnar

Subject: Re: /proc/sys/kernel/pid_max issues


* Anton Blanchard <[email protected]> wrote:

> I tried creating 100,000 threads just for the hell of it. I was
> surprised that it appears to have worked even with pid_max set at 32k.
>
> It seems if we are above pid_max we wrap back to RESERVED_PIDS at the
> start of alloc_pidmap but do not enforce this upper limit. I guess
> every call of alloc_pidmap above 32k was wrapping back to
> RESERVED_PIDS, walking the allocated space then allocating off the
> end.

yeah. Does the attached patch fix it?

> Just as an aside, does it make sense to remove the pidmap allocator
> and use the IDR allocator now its there?

might make sense - needs benchmarking. In particular the performance of
kill(pid, 0) [PID lookup] should be benchmarked on the cycle level, and
the combined performance of pthread_create()+pthread_exit().
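
A rough userspace sketch of the kind of cycle-level measurement meant
here, timing kill(pid, 0) in a tight loop with rdtsc; x86-only and
purely illustrative:

#include <stdio.h>
#include <stdint.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;

        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
        enum { ITERS = 1000000 };
        pid_t pid = getpid();
        uint64_t t0, t1;
        int i;

        t0 = rdtsc();
        for (i = 0; i < ITERS; i++)
                kill(pid, 0);           /* pure PID lookup, no signal sent */
        t1 = rdtsc();
        printf("%.1f cycles per kill(pid, 0)\n", (double)(t1 - t0) / ITERS);
        return 0;
}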

> Now once I had managed to allocate those 100,000 threads, I noticed
> this:
>
> 18446744071725383682 dr-xr-xr-x 3 root root 0 Sep 12 08:10 100796
>
> Strange huh. Turns out we allocate inodes in proc via:
>
> #define fake_ino(pid,ino) (((pid)<<16)|(ino))
>
> With 32-bit inodes we are screwed once pids go over 64k, aren't we?

indeed.

i'm wondering, don't we have a similar problem with PROC_TID_FD_DIR
already? Running some simple code that opens 1 million files gives:

[root@saturn root]# ulimit -n 1000000
[root@saturn root]# ./open-fds 1000000
999997 fds opened
[root@saturn root]# cd /proc/2333/fd/
[root@saturn fd]# ls -li | grep 153028253
153028253 lrwx------ 1 root root 64 Sep 12 11:18 165533 -> /dev/pts/0
153028253 lrwx------ 1 root root 64 Sep 12 11:18 362141 -> /dev/pts/0
153028253 lrwx------ 1 root root 64 Sep 12 11:18 427677 -> /dev/pts/0
153028253 lrwx------ 1 root root 64 Sep 12 11:18 624285 -> /dev/pts/0
153028253 lrwx------ 1 root root 64 Sep 12 11:19 689821 -> /dev/pts/0
153028253 lrwx------ 1 root root 64 Sep 12 11:18 99997 -> /dev/pts/0
[...]

plenty of overlap in the #ino space.

Ingo


Attachments:
pid-max-fix.patch (515.00 B)

2004-09-12 09:43:43

by William Lee Irwin III

Subject: Re: /proc/sys/kernel/pid_max issues

On Sun, Sep 12, 2004 at 11:39:43AM +0200, Ingo Molnar wrote:
> --- linux/kernel/pid.c.orig
> +++ linux/kernel/pid.c
> @@ -103,7 +103,7 @@ int alloc_pidmap(void)
> pidmap_t *map;
>
> pid = last_pid + 1;
> - if (pid >= pid_max)
> + if (unlikely(pid >= pid_max))
> pid = RESERVED_PIDS;

Well, this part won't change the wrapping behavior.

On Sun, Sep 12, 2004 at 11:39:43AM +0200, Ingo Molnar wrote:
> offset = pid & BITS_PER_PAGE_MASK;
> @@ -116,6 +116,10 @@ int alloc_pidmap(void)
> * slowpath and that fixes things up.
> */
> return_pid:
> + if (unlikely(pid >= pid_max)) {
> + clear_bit(offset, map->page);
> + goto failure;
> + }
> atomic_dec(&map->nr_free);
> last_pid = pid;
> return pid;

This is too late; it hands back a hard failure without resetting last_pid.


-- wli

2004-09-12 09:54:50

by Ingo Molnar

Subject: Re: /proc/sys/kernel/pid_max issues


* William Lee Irwin III <[email protected]> wrote:

> + if (map > &pidmap_array[pid_max/BITS_PER_PAGE])
> + map = pidmap_array;

> - if (offset >= BITS_PER_PAGE)
> + pid = (map - pidmap_array) * BITS_PER_PAGE + offset;
> + if (offset >= BITS_PER_PAGE || pid >= pid_max)
> goto next_map;
> if (test_and_set_bit(offset, map->page))
> goto scan_more;
> -
> /* we got the PID: */
> - pid = (map - pidmap_array) * BITS_PER_PAGE + offset;
> goto return_pid;

i missed the wrapping, so your patch is the right one:

Signed-off-by: Ingo Molnar <[email protected]>

Ingo

2004-09-12 09:58:12

by William Lee Irwin III

Subject: Re: /proc/sys/kernel/pid_max issues

On Sun, Sep 12, 2004 at 02:36:05AM -0700, William Lee Irwin III wrote:
> + if (map > &pidmap_array[pid_max/BITS_PER_PAGE])
> + map = pidmap_array;
> if (unlikely(!map->page)) {
> unsigned long page = get_zeroed_page(GFP_KERNEL);

If pid_max == BITS_PER_PAGE*n, none of &pidmap_array[pid_max/BITS_PER_PAGE]
is usable, so if we must complete a full revolution around pidmap_array[]
to discover a free pid slightly less than last_pid we will miss it. Hence:


-- wli

Index: mm4-2.6.9-rc1/kernel/pid.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/pid.c 2004-09-08 05:46:18.000000000 -0700
+++ mm4-2.6.9-rc1/kernel/pid.c 2004-09-12 02:46:17.426981200 -0700
@@ -75,6 +75,8 @@
while (--*max_steps) {
if (++map == map_limit)
map = pidmap_array;
+ if (map > &pidmap_array[(pid_max-1)/BITS_PER_PAGE])
+ map = pidmap_array;
if (unlikely(!map->page)) {
unsigned long page = get_zeroed_page(GFP_KERNEL);
/*
@@ -135,13 +137,12 @@
*/
scan_more:
offset = find_next_zero_bit(map->page, BITS_PER_PAGE, offset);
- if (offset >= BITS_PER_PAGE)
+ pid = (map - pidmap_array) * BITS_PER_PAGE + offset;
+ if (offset >= BITS_PER_PAGE || pid >= pid_max)
goto next_map;
if (test_and_set_bit(offset, map->page))
goto scan_more;
-
/* we got the PID: */
- pid = (map - pidmap_array) * BITS_PER_PAGE + offset;
goto return_pid;

failure:

2004-09-12 10:11:34

by William Lee Irwin III

Subject: Re: /proc/sys/kernel/pid_max issues

On Sun, Sep 12, 2004 at 02:36:05AM -0700, William Lee Irwin III wrote:
>> + if (map > &pidmap_array[pid_max/BITS_PER_PAGE])
>> + map = pidmap_array;
>> if (unlikely(!map->page)) {
>> unsigned long page = get_zeroed_page(GFP_KERNEL);

On Sun, Sep 12, 2004 at 02:58:05AM -0700, William Lee Irwin III wrote:
> If pid_max == BITS_PER_PAGE*n, none of &pidmap_array[pid_max/BITS_PER_PAGE]
> is usable, so if we must complete a full revolution around pidmap_array[]
> to discover a free pid slightly less than last_pid we will miss it. Hence:

That could only happen if max_steps were initialized to PIDMAP_ENTRIES
instead of PIDMAP_ENTRIES + 1, so this more accurate upper bound is not
strictly necessary. With it in place, though, we could probably reduce
max_steps to just PIDMAP_ENTRIES so that we cycle no further than the
block we began in; I'll not bother with that.


-- wli

2004-09-12 10:13:38

by Ingo Molnar

Subject: Re: /proc/sys/kernel/pid_max issues


* William Lee Irwin III <[email protected]> wrote:

> If pid_max == BITS_PER_PAGE*n, none of
> &pidmap_array[pid_max/BITS_PER_PAGE] is usable, so if we must complete
> a full revolution around pidmap_array[] to discover a free pid
> slightly less than last_pid we will miss it. Hence:

yeah. Patch needs testing ...

> if (++map == map_limit)
> map = pidmap_array;
> + if (map > &pidmap_array[(pid_max-1)/BITS_PER_PAGE])
> + map = pidmap_array;

in fact we can now merge the map_limit and pid_max checks - see the
attached updated patch.

Ingo


Attachments:
pid-max-fix.patch (1.34 kB)

2004-09-12 10:43:21

by William Lee Irwin III

Subject: Re: /proc/sys/kernel/pid_max issues

On Sun, Sep 12, 2004 at 12:13:50PM +0200, Ingo Molnar wrote:
> in fact we can now merge the map_limit and pid_max checks - see the
> attached updated patch.
> fix pid_max handling. Wrap around correctly.
> Signed-off-by: William Lee Irwin III <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>

I like the update. But I see other issues. For instance (also untested):


pid wrapping doesn't honor RESERVED_PIDS.

Index: mm4-2.6.9-rc1/kernel/pid.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/pid.c 2004-09-12 03:26:07.781592064 -0700
+++ mm4-2.6.9-rc1/kernel/pid.c 2004-09-12 03:26:50.063164288 -0700
@@ -128,7 +128,10 @@
map = next_free_map(map, &max_steps);
if (!map)
goto failure;
- offset = 0;
+ else if (map != pidmap_array)
+ offset = 0;
+ else
+ offset = RESERVED_PIDS;
}
/*
* Find the next zero bit:

2004-09-12 10:45:32

by William Lee Irwin III

Subject: Re: /proc/sys/kernel/pid_max issues

On Sun, Sep 12, 2004 at 03:43:14AM -0700, William Lee Irwin III wrote:
> I like the update. But I see other issues. For instance (also untested):
> pid wrapping doesn't honor RESERVED_PIDS.

Also:

last_pid is not honored because next_free_map(map - 1, ...) may return
the same map and so restart with a lesser offset.

Index: mm4-2.6.9-rc1/kernel/pid.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/pid.c 2004-09-12 03:26:50.063164288 -0700
+++ mm4-2.6.9-rc1/kernel/pid.c 2004-09-12 03:32:11.501298264 -0700
@@ -120,10 +120,12 @@
last_pid = pid;
return pid;
}
-
- if (!offset || !atomic_read(&map->nr_free)) {
- if (!offset)
- map--;
+ if (!offset) {
+ if (!atomic_read(&map->nr_free))
+ goto next_map;
+ else
+ goto scan_more;
+ } else if (!atomic_read(&map->nr_free)) {
next_map:
map = next_free_map(map, &max_steps);
if (!map)

2004-09-12 11:08:26

by William Lee Irwin III

Subject: Re: /proc/sys/kernel/pid_max issues

On Sun, Sep 12, 2004 at 03:43:14AM -0700, William Lee Irwin III wrote:
>> I like the update. But I see other issues. For instance (also untested):
>> pid wrapping doesn't honor RESERVED_PIDS.

On Sun, Sep 12, 2004 at 03:45:24AM -0700, William Lee Irwin III wrote:
> last_pid is not honored because next_free_map(map - 1, ...) may return
> the same map and so restart with a lesser offset.

Forgot to check map->page in the first spin:

last_pid is not honored because next_free_map(map - 1, ...) may return
the same map and so restart with a lesser offset.

Index: mm4-2.6.9-rc1/kernel/pid.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/pid.c 2004-09-12 03:26:50.063164288 -0700
+++ mm4-2.6.9-rc1/kernel/pid.c 2004-09-12 04:00:03.230156848 -0700
@@ -64,6 +64,21 @@
atomic_inc(&map->nr_free);
}

+static void alloc_pidmap_page(pidmap_t *map)
+{
+ unsigned long page = get_zeroed_page(GFP_KERNEL);
+ /*
+ * Free the page if someone raced with us
+ * installing it:
+ */
+ spin_lock(&pidmap_lock);
+ if (map->page)
+ free_page(page);
+ else
+ map->page = (void *)page;
+ spin_unlock(&pidmap_lock);
+}
+
/*
* Here we search for the next map that has free bits left.
* Normally the next map has free PIDs.
@@ -76,18 +91,7 @@
if (++map > map_limit)
map = pidmap_array;
if (unlikely(!map->page)) {
- unsigned long page = get_zeroed_page(GFP_KERNEL);
- /*
- * Free the page if someone raced with us
- * installing it:
- */
- spin_lock(&pidmap_lock);
- if (map->page)
- free_page(page);
- else
- map->page = (void *)page;
- spin_unlock(&pidmap_lock);
-
+ alloc_pidmap_page(map);
if (!map->page)
break;
}
@@ -119,11 +123,20 @@
atomic_dec(&map->nr_free);
last_pid = pid;
return pid;
- }
-
- if (!offset || !atomic_read(&map->nr_free)) {
- if (!offset)
- map--;
+ } else if (!offset) {
+ if (map->page) {
+ if (atomic_read(&map->nr_free))
+ goto scan_more;
+ else
+ goto next_map;
+ } else {
+ alloc_pidmap_page(map);
+ if (map->page)
+ goto scan_more;
+ else
+ goto failure;
+ }
+ } else if (!atomic_read(&map->nr_free)) {
next_map:
map = next_free_map(map, &max_steps);
if (!map)

2004-09-12 12:18:13

by Arjan van de Ven

Subject: Re: /proc/sys/kernel/pid_max issues

On Sun, 2004-09-12 at 10:56, Anton Blanchard wrote:
> Hi,
>
> I tried creating 100,000 threads just for the hell of it. I was
> surprised that it appears to have worked even with pid_max set at 32k.

there are a lot of other reasons why you can't go over 64k threads ;)
(esp on a 32 bit machine)

such as all the 16 bit counters in rwsems etc etc...
Just Say No(tm) :)



2004-09-12 12:32:07

by Anton Blanchard

Subject: Re: /proc/sys/kernel/pid_max issues


> there are a lot of other reasons why you can't go over 64k threads ;)
> (esp on a 32 bit machine)

After all the effort of going to 4kB stacks on x86? :)

> such as all the 16 bit counters in rwsems etc etc...
> Just Say No(tm) :)

Hmm can you point the 16bit counter out? I can create 1 million NPTL
threads on ppc64 easily, so why not?

As Ingo points out the same proc inode overflow happens with
applications with lots of FDs.

Anton

2004-09-12 11:19:49

by Ingo Molnar

Subject: Re: /proc/sys/kernel/pid_max issues


* William Lee Irwin III <[email protected]> wrote:

> Forgot to check map->page in the first spin:
>
> last_pid is not honored because next_free_map(map - 1, ...) may return
> the same map and so restart with a lesser offset.

it's getting quite spaghetti ... do we really want to handle
RESERVED_PID? There's no guarantee that any root daemon won't stray out
of the 1...300 PID range anyway, so if it has an exploitable PID race
bug then it's probably exploitable even without the RESERVED_PID
protection.

Ingo

2004-09-12 12:44:54

by Arjan van de Ven

Subject: Re: /proc/sys/kernel/pid_max issues

On Sun, Sep 12, 2004 at 10:30:00PM +1000, Anton Blanchard wrote:
>
> Hmm can you point the 16bit counter out? I can create 1 million NPTL
> threads on ppc64 easily, so why not?

/*
* the semaphore definition
*/
struct rw_semaphore {
/* XXX this should be able to be an atomic_t -- paulus */
signed int count;
#define RWSEM_UNLOCKED_VALUE 0x00000000
#define RWSEM_ACTIVE_BIAS 0x00000001
#define RWSEM_ACTIVE_MASK 0x0000ffff

that counter is split in 2 16 bit entities....
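
For context, the remaining bias definitions in that era's rwsem.h
(quoted from memory, so treat the exact spelling as approximate) make
the split explicit: each active reader or writer adds 1 in the low 16
bits, while waiters are accounted with a -0x10000 bias in the upper
half, so the two fields start colliding once either count approaches
64k:

#define RWSEM_WAITING_BIAS              (-0x00010000)
#define RWSEM_ACTIVE_READ_BIAS          RWSEM_ACTIVE_BIAS
#define RWSEM_ACTIVE_WRITE_BIAS         (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)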



2004-09-12 13:38:56

by Anton Blanchard

Subject: Re: /proc/sys/kernel/pid_max issues


> /*
> * the semaphore definition
> */
> struct rw_semaphore {
> /* XXX this should be able to be an atomic_t -- paulus */
> signed int count;
> #define RWSEM_UNLOCKED_VALUE 0x00000000
> #define RWSEM_ACTIVE_BIAS 0x00000001
> #define RWSEM_ACTIVE_MASK 0x0000ffff
>
> that counter is split in 2 16 bit entities....

Yuck, 64k waiters is asking for trouble. BTW x86-64 mentions it can only
handle 32k writers, that probably wants looking at.

Anton

2004-09-12 13:40:44

by Ingo Molnar

Subject: Re: /proc/sys/kernel/pid_max issues


* Arjan van de Ven <[email protected]> wrote:

> On Sun, Sep 12, 2004 at 10:30:00PM +1000, Anton Blanchard wrote:
> >
> > Hmm can you point the 16bit counter out? I can create 1 million NPTL
> > threads on ppc64 easily, so why not?
>
> /*
> * the semaphore definition
> */
> struct rw_semaphore {
> /* XXX this should be able to be an atomic_t -- paulus */
> signed int count;
> #define RWSEM_UNLOCKED_VALUE 0x00000000
> #define RWSEM_ACTIVE_BIAS 0x00000001
> #define RWSEM_ACTIVE_MASK 0x0000ffff
>
> that counter is split in 2 16 bit entities....

the generic semaphore code can handle up to ~2^31 waiters. I once tried
it on x86, it seems to work fine.

Ingo

2004-09-12 17:13:58

by William Lee Irwin III

Subject: Re: /proc/sys/kernel/pid_max issues

* William Lee Irwin III <[email protected]> wrote:
>> Forgot to check map->page in the first spin:
>> last_pid is not honored because next_free_map(map - 1, ...) may return
>> the same map and so restart with a lesser offset.

On Sun, Sep 12, 2004 at 01:20:26PM +0200, Ingo Molnar wrote:
> it's getting quite spaghetti ... do we really want to handle
> RESERVED_PID? There's no guarantee that any root daemon won't stray out
> of the 1...300 PID range anyway, so if it has an exploitable PID race
> bug then it's probably exploitable even without the RESERVED_PID
> protection.

I presumed it was merely cosmetic, so daemons around system startup
will get low pid numbers recognizable by sysadmins. Maybe filtering
process listings for pids < 300 is/was used to find daemons that may
have crashed? I'm not particularly attached to the feature, and have
never used it myself, but merely noticed its implementation was off.

Spaghetti may mean it's time to rewrite things. The following is a
goto-less alloc_pidmap() vs. 2.6.9-rc1-mm4 addressing all stated
concerns, but otherwise changing none of the semantics. Untested.


-- wli

Index: mm4-2.6.9-rc1/kernel/pid.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/pid.c 2004-09-08 05:46:18.000000000 -0700
+++ mm4-2.6.9-rc1/kernel/pid.c 2004-09-12 09:44:52.586896544 -0700
@@ -38,6 +38,9 @@
#define PIDMAP_ENTRIES (PID_MAX_LIMIT/PAGE_SIZE/8)
#define BITS_PER_PAGE (PAGE_SIZE*8)
#define BITS_PER_PAGE_MASK (BITS_PER_PAGE-1)
+#define mk_pid(map, off) (((map) - pidmap_array)*BITS_PER_PAGE + (off))
+#define find_next_offset(map, off) \
+ find_next_zero_bit((map)->page, BITS_PER_PAGE, off)

/*
* PID-map pages start out as NULL, they get allocated upon
@@ -53,8 +56,6 @@
static pidmap_t pidmap_array[PIDMAP_ENTRIES] =
{ [ 0 ... PIDMAP_ENTRIES-1 ] = { ATOMIC_INIT(BITS_PER_PAGE), NULL } };

-static pidmap_t *map_limit = pidmap_array + PIDMAP_ENTRIES;
-
static spinlock_t pidmap_lock __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;

fastcall void free_pidmap(int pid)
@@ -66,15 +67,16 @@
atomic_inc(&map->nr_free);
}

-/*
- * Here we search for the next map that has free bits left.
- * Normally the next map has free PIDs.
- */
-static inline pidmap_t *next_free_map(pidmap_t *map, int *max_steps)
+int alloc_pidmap(void)
{
- while (--*max_steps) {
- if (++map == map_limit)
- map = pidmap_array;
+ int i, offset, pid = last_pid + 1;
+ pidmap_t *map;
+
+ if (pid >= pid_max)
+ pid = RESERVED_PIDS;
+ offset = pid & BITS_PER_PAGE_MASK;
+ map = &pidmap_array[pid/BITS_PER_PAGE];
+ for (i = 0; i < PIDMAP_ENTRIES; ++i) {
if (unlikely(!map->page)) {
unsigned long page = get_zeroed_page(GFP_KERNEL);
/*
@@ -87,64 +89,29 @@
else
map->page = (void *)page;
spin_unlock(&pidmap_lock);
-
- if (!map->page)
+ if (unlikely(!map->page))
break;
}
- if (atomic_read(&map->nr_free))
- return map;
- }
- return NULL;
-}
-
-int alloc_pidmap(void)
-{
- int pid, offset, max_steps = PIDMAP_ENTRIES + 1;
- pidmap_t *map;
-
- pid = last_pid + 1;
- if (pid >= pid_max)
- pid = RESERVED_PIDS;
-
- offset = pid & BITS_PER_PAGE_MASK;
- map = pidmap_array + pid / BITS_PER_PAGE;
-
- if (likely(map->page && !test_and_set_bit(offset, map->page))) {
- /*
- * There is a small window for last_pid updates to race,
- * but in that case the next allocation will go into the
- * slowpath and that fixes things up.
- */
-return_pid:
- atomic_dec(&map->nr_free);
- last_pid = pid;
- return pid;
- }
-
- if (!offset || !atomic_read(&map->nr_free)) {
- if (!offset)
- map--;
-next_map:
- map = next_free_map(map, &max_steps);
- if (!map)
- goto failure;
- offset = 0;
+ if (likely(atomic_read(&map->nr_free))) {
+ do {
+ if (!test_and_set_bit(offset, map->page)) {
+ atomic_dec(&map->nr_free);
+ last_pid = pid;
+ return pid;
+ }
+ offset = find_next_offset(map, offset);
+ pid = mk_pid(map, offset);
+ } while (offset < BITS_PER_PAGE && pid < pid_max);
+ }
+ if (map < &pidmap_array[(pid_max-1)/BITS_PER_PAGE]) {
+ ++map;
+ offset = 0;
+ } else {
+ map = &pidmap_array[0];
+ offset = RESERVED_PIDS;
+ }
+ pid = mk_pid(map, offset);
}
- /*
- * Find the next zero bit:
- */
-scan_more:
- offset = find_next_zero_bit(map->page, BITS_PER_PAGE, offset);
- if (offset >= BITS_PER_PAGE)
- goto next_map;
- if (test_and_set_bit(offset, map->page))
- goto scan_more;
-
- /* we got the PID: */
- pid = (map - pidmap_array) * BITS_PER_PAGE + offset;
- goto return_pid;
-
-failure:
return -1;
}

2004-09-12 18:02:47

by Chris Wedgwood

Subject: Re: /proc/sys/kernel/pid_max issues

On Sun, Sep 12, 2004 at 10:13:19AM -0700, William Lee Irwin III wrote:

> I presumed it was merely cosmetic, so daemons around system startup
> will get low pid numbers recognizable by sysadmins. Maybe filtering
> process listings for pids < 300 is/was used to find daemons that may
> have crashed? I'm not particularly attached to the feature, and have
> never used it myself, but merely noticed its implementation was off.

I always assumed it was an optimization when looking for a new PID
after a wrap by trying to skip over the kernel threads. Arguably 300
is way too small for larger systems (which might have several thousand
kernel threads) and should probably be sized on boot (or when starting
userspace) if anyone really cares.

2004-09-12 23:06:39

by William Lee Irwin III

Subject: Re: /proc/sys/kernel/pid_max issues

On Sun, Sep 12, 2004 at 10:13:19AM -0700, William Lee Irwin III wrote:
>> I presumed it was merely cosmetic, so daemons around system startup
>> will get low pid numbers recognizable by sysadmins. Maybe filtering
>> process listings for pids < 300 is/was used to find daemons that may
>> have crashed? I'm not particularly attached to the feature, and have
>> never used it myself, but merely noticed its implementation was off.

On Sun, Sep 12, 2004 at 11:02:29AM -0700, Chris Wedgwood wrote:
> I always assumed it was an optimization when looking for a new PID
> after a wrap by trying to skip over the kernel threads. Arguably 300
> is way too small for larger systems (which might have several thousand
> kernel threads) and should probably be sized on boot (or when starting
> userspace) if anyone really cares.

There's no reason it couldn't be made tunable, though we may want to
place restrictions on what values are allowed, e.g. reserved_pids > 0
and reserved_pids < min(BITS_PER_PAGE, pid_max). For that matter, we
should likely be using proc_dointvec_minmax() for pid_max, or otherwise
a custom strategy function if we need to update the bounds on pid_max
and/or reserved_pids in tandem. I suspect this is obscure enough that I
should leave it alone unless someone develops a strong opinion about it.
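
For concreteness, a sketch of what the proc_dointvec_minmax() route
would look like in kernel/sysctl.c; the bound variables and the values
chosen for them are illustrative, not something proposed in this thread:

static int pid_max_min = RESERVED_PIDS + 1;     /* assumes the constant is visible here */
static int pid_max_max = PID_MAX_LIMIT;

/* ... and in the kern_table[] ctl_table array: */
{
        .ctl_name       = KERN_PIDMAX,
        .procname       = "pid_max",
        .data           = &pid_max,
        .maxlen         = sizeof(int),
        .mode           = 0644,
        .proc_handler   = &proc_dointvec_minmax,
        .extra1         = &pid_max_min,
        .extra2         = &pid_max_max,
},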


-- wli

2004-09-13 01:47:00

by William Lee Irwin III

Subject: [pidhashing] rewrite alloc_pidmap()

On Sun, Sep 12, 2004 at 10:13:19AM -0700, William Lee Irwin III wrote:
> Spaghetti may mean it's time to rewrite things. The following is a
> goto-less alloc_pidmap() vs. 2.6.9-rc1-mm4 addressing all stated
> concerns, but otherwise changing none of the semantics. Untested.

The original version was broken wrt. wrapping.

It needs to allow enough iterations to return to the block where we
started, so that the fallback can reach as far as just behind last_pid
within the bitmap block covering last_pid when last_pid + 1 is not
BITS_PER_PAGE-aligned. It also removes redundant scanning by bounding
the loop according to pid_max, the saved last_pid, and the alignment of
offset; that prevents rapid-fire pid reuse by never scanning any part
of the bitmap more than once or out of cyclic order starting from
last_pid + 1, and by never handing back, within a single call, the
last_pid saved at the start of that call. The corner cases aren't that
easily reproducible, but it's best to find glaring errors early, so now
that I'm back at home I've compiled and booted this, checked that it
boots okay, generates reasonable pid lists, and doesn't fall over or
allocate pids below RESERVED_PIDS during pid wraps on x86-64.

vs. 2.6.9-rc1-mm4. Patch description follows signature.


-- wli

Rewrite alloc_pidmap() to clarify control flow by eliminating all usage
of goto, honor pid_max and first available pid after last_pid semantics,
make only a single pass over the used portion of the pid bitmap, and
update copyrights to reflect ongoing maintenance by Ingo and myself.

Signed-off-by: William Irwin <[email protected]>

Index: mm4-2.6.9-rc1/kernel/pid.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/pid.c 2004-09-08 05:46:18.000000000 -0700
+++ mm4-2.6.9-rc1/kernel/pid.c 2004-09-12 17:24:44.265323848 -0700
@@ -1,8 +1,8 @@
/*
* Generic pidhash and scalable, time-bounded PID allocator
*
- * (C) 2002 William Irwin, IBM
- * (C) 2002 Ingo Molnar, Red Hat
+ * (C) 2002-2004 William Irwin, Oracle
+ * (C) 2002-2004 Ingo Molnar, Red Hat
*
* pid-structures are backing objects for tasks sharing a given ID to chain
* against. There is very little to them aside from hashing them and
@@ -38,6 +38,9 @@
#define PIDMAP_ENTRIES (PID_MAX_LIMIT/PAGE_SIZE/8)
#define BITS_PER_PAGE (PAGE_SIZE*8)
#define BITS_PER_PAGE_MASK (BITS_PER_PAGE-1)
+#define mk_pid(map, off) (((map) - pidmap_array)*BITS_PER_PAGE + (off))
+#define find_next_offset(map, off) \
+ find_next_zero_bit((map)->page, BITS_PER_PAGE, off)

/*
* PID-map pages start out as NULL, they get allocated upon
@@ -53,8 +56,6 @@
static pidmap_t pidmap_array[PIDMAP_ENTRIES] =
{ [ 0 ... PIDMAP_ENTRIES-1 ] = { ATOMIC_INIT(BITS_PER_PAGE), NULL } };

-static pidmap_t *map_limit = pidmap_array + PIDMAP_ENTRIES;
-
static spinlock_t pidmap_lock __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;

fastcall void free_pidmap(int pid)
@@ -66,15 +67,18 @@
atomic_inc(&map->nr_free);
}

-/*
- * Here we search for the next map that has free bits left.
- * Normally the next map has free PIDs.
- */
-static inline pidmap_t *next_free_map(pidmap_t *map, int *max_steps)
+int alloc_pidmap(void)
{
- while (--*max_steps) {
- if (++map == map_limit)
- map = pidmap_array;
+ int i, offset, max_scan, pid, last = last_pid;
+ pidmap_t *map;
+
+ pid = last + 1;
+ if (pid >= pid_max)
+ pid = RESERVED_PIDS;
+ offset = pid & BITS_PER_PAGE_MASK;
+ map = &pidmap_array[pid/BITS_PER_PAGE];
+ max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+ for (i = 0; i <= max_scan; ++i) {
if (unlikely(!map->page)) {
unsigned long page = get_zeroed_page(GFP_KERNEL);
/*
@@ -87,64 +91,39 @@
else
map->page = (void *)page;
spin_unlock(&pidmap_lock);
-
- if (!map->page)
+ if (unlikely(!map->page))
break;
}
- if (atomic_read(&map->nr_free))
- return map;
- }
- return NULL;
-}
-
-int alloc_pidmap(void)
-{
- int pid, offset, max_steps = PIDMAP_ENTRIES + 1;
- pidmap_t *map;
-
- pid = last_pid + 1;
- if (pid >= pid_max)
- pid = RESERVED_PIDS;
-
- offset = pid & BITS_PER_PAGE_MASK;
- map = pidmap_array + pid / BITS_PER_PAGE;
-
- if (likely(map->page && !test_and_set_bit(offset, map->page))) {
- /*
- * There is a small window for last_pid updates to race,
- * but in that case the next allocation will go into the
- * slowpath and that fixes things up.
- */
-return_pid:
- atomic_dec(&map->nr_free);
- last_pid = pid;
- return pid;
- }
-
- if (!offset || !atomic_read(&map->nr_free)) {
- if (!offset)
- map--;
-next_map:
- map = next_free_map(map, &max_steps);
- if (!map)
- goto failure;
- offset = 0;
+ if (likely(atomic_read(&map->nr_free))) {
+ do {
+ if (!test_and_set_bit(offset, map->page)) {
+ atomic_dec(&map->nr_free);
+ last_pid = pid;
+ return pid;
+ }
+ offset = find_next_offset(map, offset);
+ pid = mk_pid(map, offset);
+ /*
+ * find_next_offset() found a bit, the pid from it
+ * is in-bounds, and if we fell back to the last
+ * bitmap block and the final block was the same
+ * as the starting point, pid is before last_pid.
+ */
+ } while (offset < BITS_PER_PAGE && pid < pid_max &&
+ (i != max_scan || pid < last ||
+ !((last+1) & BITS_PER_PAGE_MASK)));
+ }
+ if (map < &pidmap_array[(pid_max-1)/BITS_PER_PAGE]) {
+ ++map;
+ offset = 0;
+ } else {
+ map = &pidmap_array[0];
+ offset = RESERVED_PIDS;
+ if (unlikely(last == offset))
+ break;
+ }
+ pid = mk_pid(map, offset);
}
- /*
- * Find the next zero bit:
- */
-scan_more:
- offset = find_next_zero_bit(map->page, BITS_PER_PAGE, offset);
- if (offset >= BITS_PER_PAGE)
- goto next_map;
- if (test_and_set_bit(offset, map->page))
- goto scan_more;
-
- /* we got the PID: */
- pid = (map - pidmap_array) * BITS_PER_PAGE + offset;
- goto return_pid;
-
-failure:
return -1;
}

2004-09-13 03:24:08

by Albert Cahalan

Subject: Re: /proc/sys/kernel/pid_max issues

Ingo Molnar writes:

> it's getting quite spaghetti ... do we really want to handle
> RESERVED_PID? There's no guarantee that any root daemon won't stray out
> of the 1...300 PID range anyway, so if it has an exploitable PID race
> bug then it's probably exploitable even without the RESERVED_PID
> protection.

Purpose:

1. weak security enhancement
2. cosmetic (backwards, IMHO)
3. speed (avoid PIDs likely to be used)

I'd much prefer LRU allocation. There are
lots of system calls that take PID values.
All such calls are hazardous. They're pretty
much broken by design.

Better yet, make a random choice from
the 50% of PID space that has been least
recently used.

Another idea is to associate PIDs with users
to some extent. You keep getting back the same
set of PIDs unless the system runs low in some
global pool and has to steal from one user to
satisfy another.

BTW, since pid_max is now adjustable, reducing
the default to 4 digits would make sense. Try a
"ps j" to see the use. (column width changes if
you change max_pid)


2004-09-13 07:42:41

by William Lee Irwin III

Subject: Re: /proc/sys/kernel/pid_max issues

Ingo Molnar writes:
>> it's getting quite spaghetti ... do we really want to handle
>> RESERVED_PID? There's no guarantee that any root daemon won't stray out
>> of the 1...300 PID range anyway, so if it has an exploitable PID race
>> bug then it's probably exploitable even without the RESERVED_PID
>> protection.

On Sun, Sep 12, 2004 at 11:20:29PM -0400, Albert Cahalan wrote:
> Purpose:
> 1. weak security enhancement
> 2. cosmetic (backwards, IMHO)
> 3. speed (avoid PIDs likely to be used)

Well, weak security enhancement translates to "nop" in my book, but
I guess if that's really what people were trying to arrange...


On Sun, Sep 12, 2004 at 11:20:29PM -0400, Albert Cahalan wrote:
> I'd much prefer LRU allocation. There are
> lots of system calls that take PID values.
> All such calls are hazardous. They're pretty
> much broken by design.
> Better yet, make a random choice from
> the 50% of PID space that has been least
> recently used.

I'd favor fully pseudorandom allocation over LRU or approximate LRU
allocation, as at least pseudorandom is feasible without large impacts
on resource scalability. OTOH the cache characteristics of pseudorandom
allocation are usually poor; perhaps hierarchically pseudorandom
allocation where one probes a sequence of cachelines of the bitmap
according to one PRNG, and within each cacheline probes a random
sequence of bits according to some other PRNG, would resolve that.
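
A toy userspace model of that two-level probing idea -- one PRNG picks
cacheline-sized chunks of the bitmap, a second picks bit offsets within
the chunk. The PRNG, the sizes, and the names are all illustrative:

#include <stdint.h>

#define PID_MAX         32768
#define CACHELINE_BITS  (64 * 8)        /* 64-byte cachelines */
#define NR_LINES        (PID_MAX / CACHELINE_BITS)

static uint64_t bitmap[PID_MAX / 64];

static uint32_t xorshift(uint32_t *s)   /* stand-in PRNG; seed must be nonzero */
{
        *s ^= *s << 13; *s ^= *s >> 17; *s ^= *s << 5;
        return *s;
}

static int test_and_set(int bit)
{
        uint64_t mask = 1ULL << (bit & 63);
        int old = (bitmap[bit >> 6] & mask) != 0;

        bitmap[bit >> 6] |= mask;
        return old;
}

/*
 * Returns a pid chosen by cacheline-then-bit pseudorandom probing, or
 * -1.  Note that random probing can miss free bits a linear scan would
 * find, which is one cost of the approach.
 */
int alloc_random_pid(uint32_t *line_seed, uint32_t *bit_seed)
{
        int i, j;

        for (i = 0; i < NR_LINES; i++) {
                int line = xorshift(line_seed) % NR_LINES;

                for (j = 0; j < CACHELINE_BITS; j++) {
                        int bit = line * CACHELINE_BITS +
                                  xorshift(bit_seed) % CACHELINE_BITS;
                        if (!test_and_set(bit))
                                return bit;
                }
        }
        return -1;
}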


On Sun, Sep 12, 2004 at 11:20:29PM -0400, Albert Cahalan wrote:
> Another idea is to associate PIDs with users
> to some extent. You keep getting back the same
> set of PIDs unless the system runs low in some
> global pool and has to steal from one user to
> satisfy another.

The resource tracking and locking implications of this are disturbing.
Would fully pseudorandom allocation be acceptable?


On Sun, Sep 12, 2004 at 11:20:29PM -0400, Albert Cahalan wrote:
> BTW, since pid_max is now adjustable, reducing
> the default to 4 digits would make sense. Try a
> "ps j" to see the use. (column width changes if
> you change max_pid)

Is the maximum possible pid exported by the kernel somehow? Perhaps
it should be; the maximum number of decimal digits required to
represent PID_MAX_LIMIT (4*1024*1024) should suffice in all cases.
Perhaps you need to detect PID_MAX_LIMIT somehow?


-- wli

2004-09-13 07:58:11

by Ingo Molnar

Subject: Re: /proc/sys/kernel/pid_max issues


* Albert Cahalan <[email protected]> wrote:

> I'd much prefer LRU allocation. There are
> lots of system calls that take PID values.
> All such calls are hazardous. They're pretty
> much broken by design.

this is a pretty sweeping assertion. Would you
care to mention a few examples of such hazards?

> BTW, since pid_max is now adjustable, reducing
> the default to 4 digits would make sense. [...]

i'm not sure what you mean by 'now', pid_max has
been adjustable for quite some time.

Ingo

2004-09-13 13:57:19

by Albert Cahalan

Subject: Re: /proc/sys/kernel/pid_max issues

On Mon, 2004-09-13 at 03:57, Ingo Molnar wrote:
> * Albert Cahalan <[email protected]> wrote:
>
> > I'd much prefer LRU allocation. There are
> > lots of system calls that take PID values.
> > All such calls are hazardous. They're pretty
> > much broken by design.
>
> this is a pretty sweeping assertion. Would you
> care to mention a few examples of such hazards?

kill(12345,9)
setpriority(PRIO_PROCESS,12345,-20)
sched_setscheduler(12345, SCHED_FIFO, &sp)

Prior to the call being handled, the process may
die and be replaced. Some random innocent process,
or a not-so-innocent one, will get acted upon by
mistake. This is broken and dangerous.

Well, it's in the UNIX standard. The best one can
do is to make the race window hard to hit, with LRU.

> > BTW, since pid_max is now adjustable, reducing
> > the default to 4 digits would make sense. [...]
>
> i'm not sure what you mean by 'now', pid_max has
> been adjustable for quite some time.

2.6.x series I believe, not 2.4.xx series


2004-09-13 14:14:46

by Albert Cahalan

Subject: Re: /proc/sys/kernel/pid_max issues

On Mon, 2004-09-13 at 03:42, William Lee Irwin III wrote:
> On Sun, Sep 12, 2004 at 11:20:29PM -0400, Albert Cahalan wrote:
> > I'd much prefer LRU allocation. There are
> > lots of system calls that take PID values.
> > All such calls are hazardous. They're pretty
> > much broken by design.
> > Better yet, make a random choice from
> > the 50% of PID space that has been least
> > recently used.
>
> I'd favor fully pseudorandom allocation over LRU or approximate LRU
> allocation, as at least pseudorandom is feasible without large impacts
> on resource scalability. OTOH the cache characteristics of pseudorandom
> allocation are usually poor; perhaps hierarchically pseudorandom
> allocation where one probes a sequence of cachelines of the bitmap
> according to one PRNG, and within each cacheline probes a random
> sequence of bits according to some other PRNG, would resolve that.
>
>
> On Sun, Sep 12, 2004 at 11:20:29PM -0400, Albert Cahalan wrote:
> > Another idea is to associate PIDs with users
> > to some extent. You keep getting back the same
> > set of PIDs unless the system runs low in some
> > global pool and has to steal from one user to
> > satisfy another.
>
> The resource tracking and locking implications of this are disturbing.
> Would fully pseudorandom allocation be acceptable?

There's no point.

LRU reduces accidents that don't involve an attacker.

Strong crypto random can make some attacks a bit harder.
OpenBSD does this. It doesn't work well enough to bother
with if the implementation is problematic; there's not
much you can do while avoiding 64-bit or 128-bit PIDs.
Pseudorandom is 100% useless.

Per-user PID recycling would make it much harder for
an attacker to grab a specific PID. Perhaps the attacker
knows that a sched_setscheduler call is coming, and he
has a way to make the right process restart or crash.
Normally, this lets him get SCHED_FIFO or somesuch.
With per-user PID recycling, it would be difficult for
him to grab the desired PID.

> On Sun, Sep 12, 2004 at 11:20:29PM -0400, Albert Cahalan wrote:
> > BTW, since pid_max is now adjustable, reducing
> > the default to 4 digits would make sense. Try a
> > "ps j" to see the use. (column width changes if
> > you change max_pid)
>
> Is the maximum possible pid exported by the kernel somehow? Perhaps
> it should be; the maximum number of decimal digits required to
> represent PID_MAX_LIMIT (4*1024*1024) should suffice in all cases.
> Perhaps you need to detect PID_MAX_LIMIT somehow?

I do indeed detect pid_max via /proc/sys/kernel/pid_max
and adjust column widths as needed. (ps only, for now)
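
A minimal sketch of that kind of detection; the fallback value and the
snprintf trick here are illustrative choices, not necessarily how ps
does it:

#include <stdio.h>

/* number of columns needed to print any possible pid */
int pid_width(void)
{
        FILE *f = fopen("/proc/sys/kernel/pid_max", "r");
        long pid_max = 32768;                   /* fallback guess */
        char buf[32];

        if (f) {
                if (fscanf(f, "%ld", &pid_max) != 1)
                        pid_max = 32768;
                fclose(f);
        }
        return snprintf(buf, sizeof(buf), "%ld", pid_max - 1);
}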

Since we're not getting the benefits of strong crypto
PIDs anyway, we might as well have 4-digit PIDs be default.
Very few people would need to increase this.



2004-09-13 14:26:44

by William Lee Irwin III

Subject: Re: /proc/sys/kernel/pid_max issues

On Mon, 2004-09-13 at 03:57, Ingo Molnar wrote:
>> this is a pretty sweeping assertion. Would you
>> care to mention a few examples of such hazards?

On Mon, Sep 13, 2004 at 09:54:09AM -0400, Albert Cahalan wrote:
> kill(12345,9)
> setpriority(PRIO_PROCESS,12345,-20)
> sched_setscheduler(12345, SCHED_FIFO, &sp)
> Prior to the call being handled, the process may
> die and be replaced. Some random innocent process,
> or a not-so-innocent one, will get acted upon by
> mistake. This is broken and dangerous.
> Well, it's in the UNIX standard. The best one can
> do is to make the race window hard to hit, with LRU.

How do you propose to queue pid's? This is space constrained. I don't
believe it's feasible and/or desirable to attempt this, as there are
4 million objects to track independent of machine size. The general
tactic of cyclic order allocation is oriented toward making this rare
and/or hard to trigger by having a reuse period long enough that what
processes there are after a pid wrap are likely to have near-indefinite
lifetimes. i.e. it's the closest feasible approximation of LRU. If you
truly want/need reuse to be gone, 64-bit+ pid's are likely best.


-- wli

2004-09-13 14:30:32

by William Lee Irwin III

Subject: Re: /proc/sys/kernel/pid_max issues

On Mon, 2004-09-13 at 03:42, William Lee Irwin III wrote:
>> The resource tracking and locking implications of this are disturbing.
>> Would fully pseudorandom allocation be acceptable?

On Mon, Sep 13, 2004 at 10:11:29AM -0400, Albert Cahalan wrote:
> There's no point.
> LRU reduces accidents that don't involve an attacker.
> Strong crypto random can make some attacks a bit harder.
> OpenBSD does this. It doesn't work well enough to bother
> with if the implementation is problematic; there's not
> much you can do while avoiding 64-bit or 128-bit PIDs.
> Pseudorandom is 100% useless.
> Per-user PID recycling would make it much harder for
> an attacker to grab a specific PID. Perhaps the attacker
> knows that a sched_setscheduler call is coming, and he
> has a way to make the right process restart or crash.
> Normally, this lets him get SCHED_FIFO or somesuch.
> With per-user PID recycling, it would be difficult for
> him to grab the desired PID.

I'd suggest pushing for 64-bit+ pid's, then. IIRC most of the work
there is in userspace (the in-kernel part is trivial).


-- wli

2004-09-13 15:07:28

by Herbert Poetzl

Subject: Re: /proc/sys/kernel/pid_max issues

On Mon, Sep 13, 2004 at 07:27:52AM -0700, William Lee Irwin III wrote:
> On Mon, 2004-09-13 at 03:42, William Lee Irwin III wrote:
> >> The resource tracking and locking implications of this are disturbing.
> >> Would fully pseudorandom allocation be acceptable?
>
> On Mon, Sep 13, 2004 at 10:11:29AM -0400, Albert Cahalan wrote:
> > There's no point.
> > LRU reduces accidents that don't involve an attacker.
> > Strong crypto random can make some attacks a bit harder.
> > OpenBSD does this. It doesn't work well enough to bother
> > with if the implementation is problematic; there's not
> > much you can do while avoiding 64-bit or 128-bit PIDs.
> > Pseudorandom is 100% useless.
> > Per-user PID recycling would make it much harder for
> > an attacker to grab a specific PID. Perhaps the attacker
> > knows that a sched_setscheduler call is coming, and he
> > has a way to make the right process restart or crash.
> > Normally, this lets him get SCHED_FIFO or somesuch.
> > With per-user PID recycling, it would be difficult for
> > him to grab the desired PID.
>
> I'd suggest pushing for 64-bit+ pid's, then. IIRC most of the work
> there is in userspace (the in-kernel part is trivial).

except for the various 'assumptions' done in procfs
to create the inode numbers ... but that is a different
story ...

best,
Herbert

> -- wli

2004-09-13 15:17:09

by Albert Cahalan

Subject: Re: /proc/sys/kernel/pid_max issues

On Mon, 2004-09-13 at 10:24, William Lee Irwin III wrote:
> On Mon, 2004-09-13 at 03:57, Ingo Molnar wrote:
> >> this is a pretty sweeping assertion. Would you
> >> care to mention a few examples of such hazards?
>
> On Mon, Sep 13, 2004 at 09:54:09AM -0400, Albert Cahalan wrote:
> > kill(12345,9)
> > setpriority(PRIO_PROCESS,12345,-20)
> > sched_setscheduler(12345, SCHED_FIFO, &sp)
> > Prior to the call being handled, the process may
> > die and be replaced. Some random innocent process,
> > or a not-so-innocent one, will get acted upon by
> > mistake. This is broken and dangerous.
> > Well, it's in the UNIX standard. The best one can
> > do is to make the race window hard to hit, with LRU.
>
> How do you propose to queue pid's? This is space constrained. I don't
> believe it's feasible and/or desirable to attempt this, as there are
> 4 million objects to track independent of machine size.

As we've seen elsewhere in this thread, things break
when you go above 0xffff anyway. So 128 KiB of RAM
should do the job. With a 4-digit PID, 20000 bytes
would be enough.

Supposing you fix rwsem counts and /proc inodes and so on,
a large machine could handle 4 million objects easily.
A small machine has far, far less need to support that.

> The general
> tactic of cyclic order allocation is oriented toward making this rare
> and/or hard to trigger by having a reuse period long enough that what
> processes there are after a pid wrap are likely to have near-indefinite
> lifetimes. i.e. it's the closest feasible approximation of LRU. If you
> truly want/need reuse to be gone, 64-bit+ pid's are likely best.

That's too unwieldy for the users, it breaks glibc,
and you'll still hit the problems after wrap-around.
Besides, Linus vetoed this a year or two ago.

Reducing the dangers of a small PID space allows for
just the opposite size change, which is much nicer for
the users.


2004-09-14 02:05:36

by William Lee Irwin III

Subject: Re: /proc/sys/kernel/pid_max issues

On Mon, 2004-09-13 at 10:24, William Lee Irwin III wrote:
>> How do you propose to queue pid's? This is space constrained. I don't
>> believe it's feasible and/or desirable to attempt this, as there are
>> 4 million objects to track independent of machine size.

On Mon, Sep 13, 2004 at 10:54:04AM -0400, Albert Cahalan wrote:
> As we've seen elsewhere in this thread, things break
> when you go above 0xffff anyway. So 128 KiB of RAM
> should do the job. With a 4-digit PID, 20000 bytes
> would be enough.
> Supposing you fix rwsem counts and /proc inodes and so on,
> a large machine could handle 4 million objects easily.
> A small machine has far, far less need to support that.

The feature of supporting more threads is there and meant to be
supported, so "> 32K breaks anyway" doesn't really fly; it should
be fixed.

The rwsems just want ia32 to use generic rwsems. /proc/ is a much
bigger mess but far from insurmountable. I suppose someone has to
be willing to mess with it for that to be fixed, and it may need
other vfs-related help (ino64_t support?).


On Mon, 2004-09-13 at 10:24, William Lee Irwin III wrote:
>> The general
>> tactic of cyclic order allocation is oriented toward making this rare
>> and/or hard to trigger by having a reuse period long enough that what
>> processes there are after a pid wrap are likely to have near-indefinite
>> lifetimes. i.e. it's the closest feasible approximation of LRU. If you
>> truly want/need reuse to be gone, 64-bit+ pid's are likely best.

On Mon, Sep 13, 2004 at 10:54:04AM -0400, Albert Cahalan wrote:
> That's too unwieldy for the users, it breaks glibc,
> and you'll still hit the problems after wrap-around.
> Besides, Linus vetoed this a year or two ago.
> Reducing the dangers of a small PID space allows for
> just the opposite size change, which is much nicer for
> the users.

Reducing the PID space so the pid's can be formatted in 4 digits
doesn't sound particularly compelling. 64-bit pid's wrapping sounds
rather unlikely, but if so, 128-bit pids are more likely to withstand
various physical arguments against pid wrapping ever happening.

The middle ground is already where kernel/pid.c stands, so I'd prefer
to leave this alone unless you have deeper concerns about it than
4-column displays and pid reuse problems (which want larger pid spaces,
not smaller). The primary complaint I've heard about pid reuse is
actually that smaller pid spaces (e.g. 32K) wrap far too quickly,
sometimes numerous times per second on larger systems with parallel
process creation activity. It's unclear what the pid wrap speeds for
the workloads they reported would be with the now-permissible pid_max
(in fact, I've since forgotten who reported this).


-- wli

2004-09-14 02:16:56

by William Lee Irwin III

Subject: Re: /proc/sys/kernel/pid_max issues

On Mon, Sep 13, 2004 at 07:27:52AM -0700, William Lee Irwin III wrote:
>> I'd suggest pushing for 64-bit+ pid's, then. IIRC most of the work
>> there is in userspace (the in-kernel part is trivial).

On Mon, Sep 13, 2004 at 04:51:48PM +0200, Herbert Poetzl wrote:
> except for the various 'assumptions' done in procfs
> to create the inode numbers ... but that is a different
> story ...

The overflow conditions in there are ugly and need someone willing to
do more intensive work with that code to address them. It's not
difficult per se, merely a lot of grubbing around with ugly code.


-- wli

2004-09-14 15:33:08

by Ingo Molnar

Subject: Re: /proc/sys/kernel/pid_max issues


* Albert Cahalan <[email protected]> wrote:

> > > I'd much prefer LRU allocation. There are
> > > lots of system calls that take PID values.
> > > All such calls are hazardous. They're pretty
> > > much broken by design.
> >
> > this is a pretty sweeping assertion. Would you
> > care to mention a few examples of such hazards?
>
> kill(12345,9)
> setpriority(PRIO_PROCESS,12345,-20)
> sched_setscheduler(12345, SCHED_FIFO, &sp)
>
> Prior to the call being handled, the process may
> die and be replaced. Some random innocent process,
> or a not-so-innocent one, will get acted upon by
> mistake. This is broken and dangerous.

easy to fix: SIGSTOP the task, check it's really
the one you want and then do the setpriority /
setscheduler call and SIGCONT it. Any privileged
code that is about to spread some of its privileges
via asynchronous system-calls needs to be careful.
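
A rough userspace sketch of that sequence; verifying identity via the
start time in /proc/<pid>/stat (field 22) is an assumption here, the
mail doesn't say how the check should be done:

#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/types.h>

/* starttime (field 22) from /proc/<pid>/stat, or -1 on error */
static long long start_time(pid_t pid)
{
        char path[64], buf[1024], *p;
        unsigned long long st;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/stat", pid);
        f = fopen(path, "r");
        if (!f)
                return -1;
        p = fgets(buf, sizeof(buf), f) ? strrchr(buf, ')') : NULL;
        fclose(f);
        /* the comm field may contain spaces, so scan from the last ')' */
        if (!p || sscanf(p + 2,
                         "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u"
                         " %*u %*u %*d %*d %*d %*d %*d %*d %llu", &st) != 1)
                return -1;
        return (long long)st;
}

/* freeze, re-check identity, renice, resume -- in that order */
int renice_checked(pid_t pid, long long expected_start, int prio)
{
        int ret = -1;

        if (kill(pid, SIGSTOP))
                return -1;
        if (start_time(pid) == expected_start)
                ret = setpriority(PRIO_PROCESS, pid, prio);
        kill(pid, SIGCONT);
        return ret;
}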

Ingo

2004-09-18 18:33:44

by Pavel Machek

Subject: Re: /proc/sys/kernel/pid_max issues

Hi!

> > > > I'd much prefer LRU allocation. There are
> > > > lots of system calls that take PID values.
> > > > All such calls are hazardous. They're pretty
> > > > much broken by design.
> > >
> > > this is a pretty sweeping assertion. Would you
> > > care to mention a few examples of such hazards?
> >
> > kill(12345,9)
> > setpriority(PRIO_PROCESS,12345,-20)
> > sched_setscheduler(12345, SCHED_FIFO, &sp)
> >
> > Prior to the call being handled, the process may
> > die and be replaced. Some random innocent process,
> > or a not-so-innocent one, will get acted upon by
> > mistake. This is broken and dangerous.
>
> easy to fix: SIGSTOP the task, check it's really
> the one you want and then do the setpriority /
> setscheduler call and SIGCONT it. Any privileged
> code that is about to spread some of its privileges
> via asynchronous system-calls needs to be careful.

What if the OOM killer decides it wants that memory in between? An
attacker could probably help it...
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2004-09-24 14:16:20

by Pavel Machek

Subject: Re: /proc/sys/kernel/pid_max issues

Hi!

> > With per-user PID recycling, it would be difficult for
> > him to grab the desired PID.
>
> I'd suggest pushing for 64-bit+ pid's, then. IIRC most of the work
> there is in userspace (the in-kernel part is trivial).

Actually 64-bit pids would be very nice for clustering.
mj did that once, IIRC, maybe he still has a patch?
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2004-09-24 14:16:15

by Pavel Machek

Subject: Re: /proc/sys/kernel/pid_max issues

Hi!

> > 1. weak security enhancement
> > 2. cosmetic (backwards, IMHO)
> > 3. speed (avoid PIDs likely to be used)
>
> Well, weak security enhancement translates to "nop" in my book, but
> I guess if that's really what people were trying to arrange...
>

Well, how many times did you do kill <pid> from the command line after doing ps?
If you randomly kill some other process because pids wrapped too fast, that is bad.
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2004-09-24 14:18:36

by Pavel Machek

Subject: Re: /proc/sys/kernel/pid_max issues

Hi!

> > this is a pretty sweeping assertion. Would you
> > care to mention a few examples of such hazards?
>
> kill(12345,9)
> setpriority(PRIO_PROCESS,12345,-20)
> sched_setscheduler(12345, SCHED_FIFO, &sp)
>
> Prior to the call being handled, the process may
> die and be replaced. Some random innocent process,
> or a not-so-innocent one, will get acted upon by
> mistake. This is broken and dangerous.
>
> Well, it's in the UNIX standard. The best one can
> do is to make the race window hard to hit, with LRU.

Well, you could create a new state "DEAD" and enforce that every process stays there for
5 seconds after death. Throttle fork if no pids are free. Hide "DEAD" processes from ps/top.
Pavel
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2004-09-24 16:02:35

by Martin Mares

Subject: Re: /proc/sys/kernel/pid_max issues

Hi!

> Actually 64-bit pids would be very nice for clustering.
> mj did that once, IIRC, maybe he still has a patch?

No, it was 32-bit pids in the 16-bit pid times :)

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
IBM = Inside Black Magic