2006-08-13 16:30:22

by Eric W. Biederman

Subject: Re: [RFC] ps command race fix

KAMEZAWA Hiroyuki <[email protected]> writes:

> On Tue, 25 Jul 2006 11:50:04 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
>> BTW, how large pids and how many processes in a (heavy and big) system ?
>>
> I found
> /*
> * A maximum of 4 million PIDs should be enough for a while.
> * [NOTE: PID/TIDs are limited to 2^29 ~= 500+ million, see futex.h.]
> */
> #define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
> (sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))
>
> ...we have to manage 4 million tids.
>
> I'll try to make use of pidmap array which is already maintained in pid.c
> and to avoid using extra memory.

Has any progress been made on this front?

I know I was initially discouraging.

I looked up the posix definition for readdir and thought about this
some more and I do agree that this is a problem that needs to be fixed.

The basic posix/susv guarantee is that in readdir, if a directory
entry is neither deleted nor added between the opendir and closedir of
the directory, you should see that directory entry. I could not
quite tell what the rules were with regard to seekdir.

That is a sane set of guarantees to build a userspace implementation
on.
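As a concrete illustration, this is roughly the whole userspace pass that
guarantee is meant to support (just a plain opendir/readdir over /proc,
collecting the numeric entries); nothing kernel-specific is assumed here:

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR *dir = opendir("/proc");
    struct dirent *de;

    if (!dir) {
        perror("opendir /proc");
        return 1;
    }
    while ((de = readdir(dir)) != NULL) {
        /* pid directories are the all-numeric names */
        if (isdigit((unsigned char)de->d_name[0]))
            printf("%s\n", de->d_name);
    }
    closedir(dir);
    return 0;
}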

Since the current implementation does not provide those guarantees
it is hard to build a normal user space implementation, because
the guarantees provided are not the normal ones, and the current
guarantees (it works if you have a big enough readdir buffer)
aren't terribly useful, as they require several hundred megabytes
in the worst case.

There are also other reasons to change to a pid-based traversal
of /proc. It allows us to display information on process groups
and sessions whose original leader has died. If namespaces get
assigned a pid, traversal by pid looks like a good way to display
namespaces that are not used by any process but are still alive.
Albert, does that sound like a sane extension?

I also have always liked the idea of a different data structure,
because hash tables make it ugly when it comes to implementing
namespaces. But I believe the current hash table is generally better
than what I have seen proposed as replacements.

In the normal case the current hash has 4096 entries and pid_max is
set to 32768, which means we have a single hash table lookup. That is
very cache friendly since we only take the single cache line miss: not
always, but largely, we can assume any lookup in the data structure is
going to be a cache miss. Pid traversal might be different; I'm not
certain.

Our worst case with the current hash table with a pid_max of 32768 is
traversing a linked list 9 items long. (I just computed this by brute
force). Of course when you push pid_max to the maximum of
(4*1024*1024) the hash chains can get as long as 1024 entries which
is distressing, but at least uniformly distributed.
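For the curious, here is a userspace sketch of that brute-force
computation; the hash constant is assumed to be the 32-bit golden-ratio
multiply from include/linux/hash.h, so treat the exact chain lengths as
approximate (the in-kernel pid_hashfn goes through hash_long):

#include <stdio.h>
#include <stdint.h>

#define PIDHASH_BITS            12              /* 4096 buckets */
#define GOLDEN_RATIO_PRIME      0x9e370001UL    /* include/linux/hash.h */

static unsigned int pid_hashfn(uint32_t nr)
{
    return (uint32_t)(nr * GOLDEN_RATIO_PRIME) >> (32 - PIDHASH_BITS);
}

int main(void)
{
    static unsigned int chain[1 << PIDHASH_BITS];
    unsigned int pid, max = 0;

    for (pid = 1; pid < 32768; pid++)           /* every pid in use */
        chain[pid_hashfn(pid)]++;
    for (pid = 0; pid < (1 << PIDHASH_BITS); pid++)
        if (chain[pid] > max)
            max = chain[pid];
    printf("longest chain: %u\n", max);
    return 0;
}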

Compare that with a radix tree using 6 bits per level: for 32768 pids
we get a usual height of ceil(15/6) = 3, and in the worst case
ceil(22/6) = 4.

Our hash function distributes things well, which is good. It is a
known function, which means an attacker can predict it and, with a
series of fork/exit calls, can within some reasonable bounds pick which
pids are getting used. A cryptographic hash will not help because
the input space can be trivially brute-force searched.

So for systems that are going to be using a larger number of pid
values I think we need a better data structure, and containers are
likely to push us in that area, which means either an extensible
hash table or a radix tree looks like the sane choice.


Eric


2006-08-13 17:34:49

by Andrew Morton

Subject: Re: [RFC] ps command race fix

On Sun, 13 Aug 2006 10:29:51 -0600
[email protected] (Eric W. Biederman) wrote:

> So for systems that are going to be using a larger number of pid
> values I think we need a better data structure, and containers are
> likely to push us in that area, which means either an extensible
> hash table or a radix tree looks like the sane choice.

radix-trees are nice because you can efficiently traverse them in-order
while the contents are changing (albeit potentially missing newly-added
things, but that's inevitable).

radix-trees are not-nice because they allocate memory at insertion time.
If that's a problem then rbtrees could perhaps be used.
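To illustrate the no-allocation point: with an rbtree the rb_node can be
embedded in struct pid itself, so insertion needs no memory beyond the
struct pid we already have. A rough sketch (the struct layout and names
here are illustrative, not from the tree):

struct pid {
    atomic_t count;
    int nr;
    struct rb_node node;        /* hypothetical embedded node */
    /* ... existing members elided ... */
};

static struct rb_root pid_tree = RB_ROOT;

static void pid_tree_insert(struct pid *pid)
{
    struct rb_node **link = &pid_tree.rb_node, *parent = NULL;

    while (*link) {
        struct pid *entry = rb_entry(*link, struct pid, node);

        parent = *link;
        if (pid->nr < entry->nr)
            link = &(*link)->rb_left;
        else
            link = &(*link)->rb_right;
    }
    rb_link_node(&pid->node, parent, link);
    rb_insert_color(&pid->node, &pid_tree);
}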

idr-trees have similar characteristics to the radix-trees, except a) the
idr-tree find-next-above feature could perhaps be used for the core pid
allocation and b) idr-trees don't presently have suitable search functions
for performing the iteration.
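For (a), a rough sketch of what pid allocation on top of the existing idr
API might look like; RESERVED_PIDS, the don't-reuse-freed-pids-soon policy
and the real locking rules are all ignored here, this only shows the
find-next-above call:

static DEFINE_IDR(pid_idr);
static DEFINE_SPINLOCK(pid_idr_lock);
static int last_pid;

static int pid_idr_alloc(struct pid *pid)
{
    int nr, err;

    do {
        if (!idr_pre_get(&pid_idr, GFP_KERNEL))
            return -ENOMEM;
        spin_lock(&pid_idr_lock);
        err = idr_get_new_above(&pid_idr, pid, last_pid + 1, &nr);
        spin_unlock(&pid_idr_lock);
    } while (err == -EAGAIN);

    if (err)
        return err;
    last_pid = nr;
    return nr;
}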

At least we have plenty of choices ;)

2006-08-13 19:00:35

by Eric W. Biederman

Subject: Re: [RFC] ps command race fix

Andrew Morton <[email protected]> writes:

> On Sun, 13 Aug 2006 10:29:51 -0600
> [email protected] (Eric W. Biederman) wrote:
>
>> So for systems that are going to be using a larger number of pid
>> values I think we need a better data structure, and containers are
>> likely to push us in that area, which means either an extensible
>> hash table or a radix tree looks like the sane choice.
>
> radix-trees are nice because you can efficiently traverse them in-order
> while the contents are changing (albeit potentially missing newly-added
> things, but that's inevitable).

Actually, except when we can't find the process we were just at, the
current code doesn't miss any newly added processes. So there are
implementations for which missing new entries isn't inevitable.

> radix-trees are not-nice because they allocate memory at insertion time.
> If that's a problem then rbtrees could perhaps be used.

It isn't fundamental, as I do memory allocations on that path. But
it does appear to be an implementation conflict, as the current locking
is a spinlock with irqs disabled, and doing a GFP_ATOMIC allocation
to avoid that is fairly silly.

If we were to lose the potential to do rcu traversals when looking
up a single pid, that would make scaling the code much harder.

> idr-trees have similar characteristics to the radix-trees, except a) the
> idr-tree find-next-above feature could perhaps be used for the core pid
> allocation and b) idr-trees don't presently have suitable search functions
> for performing the iteration.

We have to be careful about changing the pid allocation algorithm.
We need to keep the logic where we ensure it will be a long time
before a newly freed pid is reused, which basically means keeping the
current implementation that walks through possible pids when allocating
new ones. The current implementation doesn't give us any guarantees,
but it is much better than the normal allocator algorithm of allocating
the next available pid.
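A sketch of that policy, reduced to its essentials (pidmap_test_and_set()
is a hypothetical helper standing in for the real bitmap manipulation in
alloc_pidmap()):

static int last_pid;

static int alloc_pid_nr_sketch(void)
{
    int pid = last_pid + 1;
    int scanned;

    for (scanned = 0; scanned < pid_max; scanned++, pid++) {
        if (pid >= pid_max)
            pid = RESERVED_PIDS;        /* wrap, skipping the low pids */
        if (!pidmap_test_and_set(pid)) {
            last_pid = pid;
            return pid;
        }
    }
    return -1;                          /* every usable pid is in use */
}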

> At least we have plenty of choices ;)

Yes, and I'm still not quite convinced we have a problem that needs a
new data structure. But with everything becoming multi-core there are
serious pressures in that direction.

Anyway the important part is to get a traversal by pid.

Eric

2006-08-13 19:12:35

by Paul Jackson

Subject: Re: [RFC] ps command race fix

Eric wrote:
> Actually except when we can't find the process we were just at
> the current code doesn't miss any newly added processes.

Random thought -- could we have file descriptors open on /proc put some
sort of 'hold' on whatever /proc entry they were just at, so it doesn't
disappear out from under them, even if that process has otherwise fully
exited?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-13 20:08:23

by Albert Cahalan

Subject: Re: [RFC] ps command race fix

On 8/13/06, Eric W. Biederman <[email protected]> wrote:
> KAMEZAWA Hiroyuki <[email protected]> writes:

> > /*
> > * A maximum of 4 million PIDs should be enough for a while.
> > * [NOTE: PID/TIDs are limited to 2^29 ~= 500+ million, see futex.h.]
> > */
> > #define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
> > (sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))
> >
> > ...we have to manage 4 million tids.

BTW, it looks like powerpc runs out of MMU contexts if
there are more than 32768 processes. Badness happens.

> The basic posix/susv guarantee is that in readdir, if a directory
> entry is neither deleted nor added between the opendir and closedir of
> the directory, you should see that directory entry. I could not
> quite tell what the rules were with regard to seekdir.

Never minding the bare minimum, I think the user-visible
offset should be the PID plus some constant for all PIDs.
(sadly, the constant can't be 0 because of ".." and init)
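A minimal sketch of that encoding, with TGID_OFFSET standing in for
whatever constant ends up covering ".", ".." and the other non-process
entries (the value below is purely illustrative):

#define TGID_OFFSET 256     /* illustrative; covers the non-process entries */

static inline loff_t tgid_to_pos(pid_t tgid)
{
    return tgid + TGID_OFFSET;      /* directory offset for a process */
}

static inline pid_t pos_to_tgid(loff_t pos)
{
    return pos - TGID_OFFSET;       /* pid to resume from after a seek */
}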

> There are also other reasons to change to a pid-based traversal
> of /proc. It allows us to display information on process groups
> and sessions whose original leader has died.

Huh?

> If namespaces get
> assigned a pid, traversal by pid looks like a good way to display
> namespaces that are not used by any process but are still alive.
> Albert, does that sound like a sane extension?

You mean /proc/42 could be non-process data?
That will surely break lots and lots of things.

In general, process namespaces are useful for:

1. silly marketing (see Sun and FreeBSD)

2. the very obscure case of "root" account providers
who are too clueless to use SE Linux or Xen

I don't think either case justifies the complexity.
I am not looking forward to the demands that I
support this mess in procps. I suspect I am not
alone; soon people will be asking for support in
pstools, gdb, fuser, killall... until every app which
interacts with other processes will need hacks.

If the cost were only an #ifdef in the kernel, there
would be no problem. Unfortunately, this is quite
a hack in the kernel and it has far-reaching
consequences in user space.

2006-08-16 01:20:39

by Kamezawa Hiroyuki

Subject: Re: [RFC] ps command race fix

On Sun, 13 Aug 2006 12:12:22 -0700
Paul Jackson <[email protected]> wrote:

> Eric wrote:
> > Actually except when we can't find the process we were just at
> > the current code doesn't miss any newly added processes.
>
> Random thought -- could we have file descriptors open on /proc put some
> sort of 'hold' on whatever /proc entry they were just at, so it doesn't
> disappear out from under them, even if that process has otherwise fully
> exited?
>

Sorry for the long absence, I was on vacation.

My colleague and I are still working on the ps command fix.

For holding, I have a patch that inserts a token into the list to remember
its position, but my colleague may have another thought.

-Kame

2006-08-16 02:20:32

by Kyle Moffett

Subject: Re: [RFC] ps command race fix

On Aug 13, 2006, at 16:08:20, Albert Cahalan wrote:
> In general, process namespaces are useful for:
>
> 1. silly marketing (see Sun and FreeBSD)
>
> 2. the very obscure case of "root" account providers who are too
> clueless to use SE Linux or Xen

3. Kernel developers who need to test their kernel changes on
multiple distributions can install those distributions unmodified in
containers (requires process namespaces)

4. Server admins who want an absolutely unbreakable way (well, as
close as one can get) to isolate virtual servers from each other and
from the hardware. This also doesn't have the overhead of running 2
stacked memory allocators (Xen allocator and kernel allocator) as
well as multiple copies of each kernel.

> I don't think either case justifies the complexity. I am not
> looking forward to the demands that I support this mess in procps.
> I suspect I am not alone; soon people will be asking for support in
> pstools, gdb, fuser, killall... until every app which interacts
> with other processes will need hacks.

IMHO support for PID namespaces should/would/will be done without
changing the existing /proc entries or underlying layout. For
example, one compatible solution would be to add a new "pidspace="
mount option to procfs which specifies which pidspace will be
represented. Another method (also compatible with the pidspace=
mount option, and enabling hierarchical pid spaces) would be to
represent child pidspaces in their parent pidspace as a single
process, such that sending signals to said process will send signals
to the "init" process in the child pidspace as though they came from
the kernel (i.e. they aren't blocked by default), and then add a
"pidspace" subdirectory that contains all of the proc entries for
processes in that space. Example:

# mount -t proc proc /proc
# create_new_pidspace /bin/sleep 100000 &
[1] 1234
# ls /proc
...
...
1234
...
...
# ls /proc/1234/pidspace
1

There are obviously still problems with this scenario (what numbering
schemes do pidspaces use and how is the pidspace= mount option
interpreted?), but it's an example of a few ways to preserve ABI
compatibility with existing userspace. I think the idea behind all
of this is to make it possible to run 6 different unmodified distros
on a single system (but you need a few extra tools in the parent
"boot" namespace).

> If the cost were only an #ifdef in the kernel, there would be no
> problem. Unfortunately, this is quite a hack in the kernel and it
> has far-reaching consequences in user space.

I believe it's been stated that rule #1 for this stuff is preserving
backwards compatibility and kernel<=>userspace ABI with every
change. We might add new APIs (which then have to remain the same in
future versions, barring outright showstopper bugs), but existing
APIs will still work correctly.

Cheers,
Kyle Moffett

2006-08-17 05:00:28

by Eric W. Biederman

Subject: Re: [RFC] ps command race fix


So I just threw something together that seems to work.

All of the pids are listed in order and the next used pid is found by
scanning the pid bitmap.

Scanning the pid bitmap might be a little heavy but actually is likely
quite cache friendly as the accesses are extremely predictable, and
everything should fit into 64 cache lines (a 32768-bit pid bitmap is
4KB, i.e. 64 64-byte cache lines), which is fewer cache lines than
walking a tree, and more predictable. Of course I turn around
and then do a hash table lookup which is at least one more cache line
and probably an unpredictable one at that.

The point is that despite the brute force nature I have no reason to
suspect this will run noticeably slower than the current implementation.

Look at this, try it out, and if this solves the problem we can push
this upstream.

Eric
---

fs/proc/base.c | 82 ++++++++--------------------------------------------
include/linux/pid.h | 1
kernel/pid.c | 35 ++++++++++++++++++++++
3 files changed, 50 insertions(+), 68 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9943527..6f7433c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1813,70 +1813,17 @@ out:
}

/*
- * Find the first tgid to return to user space.
+ * Find the first task with tgid >= tgid
*
- * Usually this is just whatever follows &init_task, but if the users
- * buffer was too small to hold the full list or there was a seek into
- * the middle of the directory we have more work to do.
- *
- * In the case of a short read we start with find_task_by_pid.
- *
- * In the case of a seek we start with &init_task and walk nr
- * threads past it.
*/
-static struct task_struct *first_tgid(int tgid, unsigned int nr)
+static struct task_struct *next_tgid(unsigned int tgid)
{
- struct task_struct *pos;
- rcu_read_lock();
- if (tgid && nr) {
- pos = find_task_by_pid(tgid);
- if (pos && thread_group_leader(pos))
- goto found;
- }
- /* If nr exceeds the number of processes get out quickly */
- pos = NULL;
- if (nr && nr >= nr_processes())
- goto done;
-
- /* If we haven't found our starting place yet start with
- * the init_task and walk nr tasks forward.
- */
- for (pos = next_task(&init_task); nr > 0; --nr) {
- pos = next_task(pos);
- if (pos == &init_task) {
- pos = NULL;
- goto done;
- }
- }
-found:
- get_task_struct(pos);
-done:
- rcu_read_unlock();
- return pos;
-}
-
-/*
- * Find the next task in the task list.
- * Return NULL if we loop or there is any error.
- *
- * The reference to the input task_struct is released.
- */
-static struct task_struct *next_tgid(struct task_struct *start)
-{
- struct task_struct *pos;
+ struct task_struct *task;
rcu_read_lock();
- pos = start;
- if (pid_alive(start))
- pos = next_task(start);
- if (pid_alive(pos) && (pos != &init_task)) {
- get_task_struct(pos);
- goto done;
- }
- pos = NULL;
-done:
+ task = get_pid_task(find_next_pid(tgid), PIDTYPE_PID);
rcu_read_unlock();
- put_task_struct(start);
- return pos;
+ return task;
+
}

static int proc_pid_fill_cache(struct file *filp, void *dirent, filldir_t filldir,
@@ -1888,6 +1835,8 @@ static int proc_pid_fill_cache(struct fi
proc_pid_instantiate, task, NULL);
}

+#define TGID_OFFSET (FIRST_PROCESS_ENTRY + ARRAY_SIZE(proc_base_stuff) - 1)
+
/* for the /proc/ directory itself, after non-process stuff has been done */
int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
{
@@ -1906,23 +1855,20 @@ int proc_pid_readdir(struct file * filp,
}
nr -= ARRAY_SIZE(proc_base_stuff) - 1;

- /* f_version caches the tgid value that the last readdir call couldn't
- * return. lseek aka telldir automagically resets f_version to 0.
- */
- tgid = filp->f_version;
- filp->f_version = 0;
- for (task = first_tgid(tgid, nr);
+ tgid = filp->f_pos - TGID_OFFSET;
+ for (task = next_tgid(tgid);
task;
- task = next_tgid(task), filp->f_pos++) {
+ task = next_tgid(tgid + 1)) {
tgid = task->pid;
+ filp->f_pos = tgid + TGID_OFFSET;
if (proc_pid_fill_cache(filp, dirent, filldir, task, tgid) < 0) {
/* returning this tgid failed, save it as the first
* pid for the next readir call */
- filp->f_version = tgid;
put_task_struct(task);
- break;
+ goto out;
}
}
+ filp->f_pos = PID_MAX_LIMIT + TGID_OFFSET;
out:
put_task_struct(reaper);
out_no_task:
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 9fd547f..d06d4ba 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -89,6 +89,7 @@ extern struct pid *FASTCALL(find_pid(int
* Lookup a PID in the hash table, and return with it's count elevated.
*/
extern struct pid *find_get_pid(int nr);
+extern struct pid *find_next_pid(int nr);

extern struct pid *alloc_pid(void);
extern void FASTCALL(free_pid(struct pid *pid));
diff --git a/kernel/pid.c b/kernel/pid.c
index 40e8e4d..112ff2a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -145,6 +145,23 @@ static int alloc_pidmap(void)
return -1;
}

+static int next_pidmap(int last)
+{
+ int offset;
+ pidmap_t *map;
+
+ offset = (last + 1) & BITS_PER_PAGE_MASK;
+ map = &pidmap_array[(last + 1)/BITS_PER_PAGE];
+ for (; map < &pidmap_array[PIDMAP_ENTRIES]; map++, offset = 0) {
+ if (unlikely(!map->page))
+ continue;
+ offset = find_next_bit((map)->page, BITS_PER_PAGE, offset);
+ if (offset < BITS_PER_PAGE)
+ return mk_pid(map, offset);
+ }
+ return -1;
+}
+
fastcall void put_pid(struct pid *pid)
{
if (!pid)
@@ -307,6 +324,24 @@ struct pid *find_get_pid(pid_t nr)
EXPORT_SYMBOL_GPL(find_get_pid);

/*
+ * Used by proc to find the pid with the next greater number.
+ * Specifying nr is used to handle the seek case.
+ */
+struct pid *find_next_pid(int nr)
+{
+ struct pid *next;
+
+ next = find_pid(nr);
+ while (!next) {
+ nr = next_pidmap(nr);
+ if (nr <= 0)
+ break;
+ next = find_pid(nr);
+ }
+ return next;
+}
+
+/*
* The pid hash table is scaled according to the amount of memory in the
* machine. From a minimum of 16 slots up to 4096 slots at one gigabyte or
* more.

2006-08-17 06:30:21

by Kamezawa Hiroyuki

Subject: Re: [RFC] ps command race fix

On Wed, 16 Aug 2006 22:59:50 -0600
[email protected] (Eric W. Biederman) wrote:

>
> So I just threw something together that seems to work.
>
> All of the pids are listed in order and the next used pid is found by
> scanning the pid bitmap.
>
> Scanning the pid bitmap might be a little heavy but actually is likely
> quite cache friendly as the accesses are extremely predictable, and
> everything should fit into 64 cache lines, which is fewer cache lines
> than walking a tree, and more predictable. Of course I turn around
> and then do a hash table lookup which is at least one more cache line
> and probably an unpredictable one at that.
>
> The point is that despite the brute force nature I have no reason to
> suspect this will run noticeably slower than the current implementation.
>
> Look at this, try it out, and if this solves the problem we can push
> this upstream.
>
At first, Thanks.

question:

task = get_pid_task(find_next_pid(tgid), PIDTYPE_PID);

Does this return only "task/process" ? and never return "thread" ?

Another concern of mine is that processes created while ps is running
cannot be caught by this kind of "table walking" approach (as with my
previous work). But if people say that's O.K., I have no complaint.

I(we)'ll post another list-based one in the next week, anyway.
(I can't go-ahead this week...)

-Kame

2006-08-17 13:39:56

by Eric W. Biederman

Subject: Re: [RFC] ps command race fix

KAMEZAWA Hiroyuki <[email protected]> writes:

> On Wed, 16 Aug 2006 22:59:50 -0600
> [email protected] (Eric W. Biederman) wrote:
>
>>
>> So I just threw something together that seems to work.
>>
>> All of the pids are listed in order and the next used pid is found by
>> scanning the pid bitmap.
>>
>> Scanning the pid bitmap might be a little heavy but actually is likely
>> quite cache friendly as the accesses are extremely predictable, and
>> everything should fit into 64 cache lines, which is fewer cache lines
>> than walking a tree, and more predictable. Of course I turn around
>> and then do a hash table lookup which is at least one more cache line
>> and probably an unpredictable one at that.
>>
>> The point is that despite the brute force nature I have no reason to
>> suspect this will run noticeably slower than the current implementation.
>>
>> Look at this, try it out, and if this solves the problem we can push
>> this upstream.
>>
> At first, Thanks.
>
> question:
>
> task = get_pid_task(find_next_pid(tgid), PIDTYPE_PID);
>
> Does this return only "task/process" ? and never return "thread" ?
>
> Another concern of mine is that processes created while ps is running
> cannot be caught by this kind of "table walking" approach (as with my
> previous work). But if people say that's O.K., I have no complaint.
>
> I(we)'ll post another list-based one in the next week, anyway.
> (I can't go-ahead this week...)
>

Here is a respin with a fixed version of next_tgid. I believe
I have all of the silly corner cases handled. So if the pid
I have found is only a process group or session id or not
a thread_group leader it should not be ignored.

> static struct task_struct *next_tgid(unsigned int tgid)
> {
> struct task_struct *task;
> struct pid *pid;
>
> task = NULL;
> rcu_read_lock();
> retry:
> pid = find_next_pid(tgid);
> if (pid) {
> tgid = pid->nr + 1;
> task = pid_task(pid, PIDTYPE_PID);
> if (!task || !thread_group_leader(task))
> goto retry;
> get_task_struct(task);
> }
> rcu_read_unlock();
> return task;
>
> }

Seeking should now just work.

---
fs/proc/base.c | 89 +++++++++++++---------------------------------------
include/linux/pid.h | 1
kernel/pid.c | 35 ++++++++++++++++++++
3 files changed, 59 insertions(+), 66 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9943527..05dc244 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1813,70 +1813,28 @@ out:
}

/*
- * Find the first tgid to return to user space.
+ * Find the first task with tgid >= tgid
*
- * Usually this is just whatever follows &init_task, but if the users
- * buffer was too small to hold the full list or there was a seek into
- * the middle of the directory we have more work to do.
- *
- * In the case of a short read we start with find_task_by_pid.
- *
- * In the case of a seek we start with &init_task and walk nr
- * threads past it.
*/
-static struct task_struct *first_tgid(int tgid, unsigned int nr)
+static struct task_struct *next_tgid(unsigned int tgid)
{
- struct task_struct *pos;
- rcu_read_lock();
- if (tgid && nr) {
- pos = find_task_by_pid(tgid);
- if (pos && thread_group_leader(pos))
- goto found;
- }
- /* If nr exceeds the number of processes get out quickly */
- pos = NULL;
- if (nr && nr >= nr_processes())
- goto done;
-
- /* If we haven't found our starting place yet start with
- * the init_task and walk nr tasks forward.
- */
- for (pos = next_task(&init_task); nr > 0; --nr) {
- pos = next_task(pos);
- if (pos == &init_task) {
- pos = NULL;
- goto done;
- }
- }
-found:
- get_task_struct(pos);
-done:
- rcu_read_unlock();
- return pos;
-}
+ struct task_struct *task;
+ struct pid *pid;

-/*
- * Find the next task in the task list.
- * Return NULL if we loop or there is any error.
- *
- * The reference to the input task_struct is released.
- */
-static struct task_struct *next_tgid(struct task_struct *start)
-{
- struct task_struct *pos;
+ task = NULL;
rcu_read_lock();
- pos = start;
- if (pid_alive(start))
- pos = next_task(start);
- if (pid_alive(pos) && (pos != &init_task)) {
- get_task_struct(pos);
- goto done;
+retry:
+ pid = find_next_pid(tgid);
+ if (pid) {
+ tgid = pid->nr + 1;
+ task = pid_task(pid, PIDTYPE_PID);
+ if (!task || !thread_group_leader(task))
+ goto retry;
+ get_task_struct(task);
}
- pos = NULL;
-done:
rcu_read_unlock();
- put_task_struct(start);
- return pos;
+ return task;
+
}

static int proc_pid_fill_cache(struct file *filp, void *dirent, filldir_t filldir,
@@ -1888,6 +1846,8 @@ static int proc_pid_fill_cache(struct fi
proc_pid_instantiate, task, NULL);
}

+#define TGID_OFFSET (FIRST_PROCESS_ENTRY + ARRAY_SIZE(proc_base_stuff) - 1)
+
/* for the /proc/ directory itself, after non-process stuff has been done */
int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
{
@@ -1906,23 +1866,20 @@ int proc_pid_readdir(struct file * filp,
}
nr -= ARRAY_SIZE(proc_base_stuff) - 1;

- /* f_version caches the tgid value that the last readdir call couldn't
- * return. lseek aka telldir automagically resets f_version to 0.
- */
- tgid = filp->f_version;
- filp->f_version = 0;
- for (task = first_tgid(tgid, nr);
+ tgid = filp->f_pos - TGID_OFFSET;
+ for (task = next_tgid(tgid);
task;
- task = next_tgid(task), filp->f_pos++) {
+ task = next_tgid(tgid + 1)) {
tgid = task->pid;
+ filp->f_pos = tgid + TGID_OFFSET;
if (proc_pid_fill_cache(filp, dirent, filldir, task, tgid) < 0) {
/* returning this tgid failed, save it as the first
* pid for the next readir call */
- filp->f_version = tgid;
put_task_struct(task);
- break;
+ goto out;
}
}
+ filp->f_pos = PID_MAX_LIMIT + TGID_OFFSET;
out:
put_task_struct(reaper);
out_no_task:
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 9fd547f..d06d4ba 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -89,6 +89,7 @@ extern struct pid *FASTCALL(find_pid(int
* Lookup a PID in the hash table, and return with it's count elevated.
*/
extern struct pid *find_get_pid(int nr);
+extern struct pid *find_next_pid(int nr);

extern struct pid *alloc_pid(void);
extern void FASTCALL(free_pid(struct pid *pid));
diff --git a/kernel/pid.c b/kernel/pid.c
index 40e8e4d..112ff2a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -145,6 +145,23 @@ static int alloc_pidmap(void)
return -1;
}

+static int next_pidmap(int last)
+{
+ int offset;
+ pidmap_t *map;
+
+ offset = (last + 1) & BITS_PER_PAGE_MASK;
+ map = &pidmap_array[(last + 1)/BITS_PER_PAGE];
+ for (; map < &pidmap_array[PIDMAP_ENTRIES]; map++, offset = 0) {
+ if (unlikely(!map->page))
+ continue;
+ offset = find_next_bit((map)->page, BITS_PER_PAGE, offset);
+ if (offset < BITS_PER_PAGE)
+ return mk_pid(map, offset);
+ }
+ return -1;
+}
+
fastcall void put_pid(struct pid *pid)
{
if (!pid)
@@ -307,6 +324,24 @@ struct pid *find_get_pid(pid_t nr)
EXPORT_SYMBOL_GPL(find_get_pid);

/*
+ * Used by proc to find the pid with the next greater number.
+ * Specifying nr is used to handle the seek case.
+ */
+struct pid *find_next_pid(int nr)
+{
+ struct pid *next;
+
+ next = find_pid(nr);
+ while (!next) {
+ nr = next_pidmap(nr);
+ if (nr <= 0)
+ break;
+ next = find_pid(nr);
+ }
+ return next;
+}
+
+/*
* The pid hash table is scaled according to the amount of memory in the
* machine. From a minimum of 16 slots up to 4096 slots at one gigabyte or
* more.

2006-08-17 18:16:33

by Jean Delvare

Subject: Re: [RFC] ps command race fix

Eric,

On Thursday 17 August 2006 15:39, Eric W. Biederman wrote:
> Here is a respin with a fixed version of next_tgid. I believe
> I have all of the silly corner cases handled. So if the pid
> I have found is only a process group or session id or not
> a thread_group leader it should not be ignored.

On what tree is this patch based? It doesn't appear to apply to
2.6.18-rc4, nor 2.6.18-rc4-mm1.

Thanks,
--
Jean Delvare

2006-08-18 00:18:49

by Kamezawa Hiroyuki

Subject: Re: [RFC] ps command race fix

On Thu, 17 Aug 2006 07:39:05 -0600
[email protected] (Eric W. Biederman) wrote:

> > static struct task_struct *next_tgid(unsigned int tgid)
> > {
> > struct task_struct *task;
> > struct pid *pid;
> >
> > task = NULL;
> > rcu_read_lock();
> > retry:
> > pid = find_next_pid(tgid);
> > if (pid) {
> > tgid = pid->nr + 1;
> > task = pid_task(pid, PIDTYPE_PID);
> > if (!task || !thread_group_leader(task))
> > goto retry;
> > get_task_struct(task);
> > }
> > rcu_read_unlock();
> > return task;
> >
> > }
>
> Seeking should now just work.
>

The above function looks good. Is this patch based on the 'struct pid' patch?

BTW, the whole story is:
==
last_read_pid = find_next_pid(x);
possible_next_pid = last_read_pid + 1;
find_next_pid(possible_next_pid)
<call>-> if (find_pid(possible_next_pid) == 0) {
next = next_pidmap(possible_next_pid);
} else {
next = possible_next_pid;
}
==

Why not just do this?
==
next = next_pidmap(last_read_pid);
==

I don't think existing pids are contiguous in a usual system.

-Kame


> ---
> fs/proc/base.c | 89 +++++++++++++---------------------------------------
> include/linux/pid.h | 1
> kernel/pid.c | 35 ++++++++++++++++++++
> 3 files changed, 59 insertions(+), 66 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 9943527..05dc244 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1813,70 +1813,28 @@ out:
> }
>
> /*
> - * Find the first tgid to return to user space.
> + * Find the first task with tgid >= tgid
> *
> - * Usually this is just whatever follows &init_task, but if the users
> - * buffer was too small to hold the full list or there was a seek into
> - * the middle of the directory we have more work to do.
> - *
> - * In the case of a short read we start with find_task_by_pid.
> - *
> - * In the case of a seek we start with &init_task and walk nr
> - * threads past it.
> */
> -static struct task_struct *first_tgid(int tgid, unsigned int nr)
> +static struct task_struct *next_tgid(unsigned int tgid)
> {
> - struct task_struct *pos;
> - rcu_read_lock();
> - if (tgid && nr) {
> - pos = find_task_by_pid(tgid);
> - if (pos && thread_group_leader(pos))
> - goto found;
> - }
> - /* If nr exceeds the number of processes get out quickly */
> - pos = NULL;
> - if (nr && nr >= nr_processes())
> - goto done;
> -
> - /* If we haven't found our starting place yet start with
> - * the init_task and walk nr tasks forward.
> - */
> - for (pos = next_task(&init_task); nr > 0; --nr) {
> - pos = next_task(pos);
> - if (pos == &init_task) {
> - pos = NULL;
> - goto done;
> - }
> - }
> -found:
> - get_task_struct(pos);
> -done:
> - rcu_read_unlock();
> - return pos;
> -}
> + struct task_struct *task;
> + struct pid *pid;
>
> -/*
> - * Find the next task in the task list.
> - * Return NULL if we loop or there is any error.
> - *
> - * The reference to the input task_struct is released.
> - */
> -static struct task_struct *next_tgid(struct task_struct *start)
> -{
> - struct task_struct *pos;
> + task = NULL;
> rcu_read_lock();
> - pos = start;
> - if (pid_alive(start))
> - pos = next_task(start);
> - if (pid_alive(pos) && (pos != &init_task)) {
> - get_task_struct(pos);
> - goto done;
> +retry:
> + pid = find_next_pid(tgid);
> + if (pid) {
> + tgid = pid->nr + 1;
> + task = pid_task(pid, PIDTYPE_PID);
> + if (!task || !thread_group_leader(task))
> + goto retry;
> + get_task_struct(task);
> }
> - pos = NULL;
> -done:
> rcu_read_unlock();
> - put_task_struct(start);
> - return pos;
> + return task;
> +
> }
>
> static int proc_pid_fill_cache(struct file *filp, void *dirent, filldir_t filldir,
> @@ -1888,6 +1846,8 @@ static int proc_pid_fill_cache(struct fi
> proc_pid_instantiate, task, NULL);
> }
>
> +#define TGID_OFFSET (FIRST_PROCESS_ENTRY + ARRAY_SIZE(proc_base_stuff) - 1)
> +
> /* for the /proc/ directory itself, after non-process stuff has been done */
> int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
> {
> @@ -1906,23 +1866,20 @@ int proc_pid_readdir(struct file * filp,
> }
> nr -= ARRAY_SIZE(proc_base_stuff) - 1;
>
> - /* f_version caches the tgid value that the last readdir call couldn't
> - * return. lseek aka telldir automagically resets f_version to 0.
> - */
> - tgid = filp->f_version;
> - filp->f_version = 0;
> - for (task = first_tgid(tgid, nr);
> + tgid = filp->f_pos - TGID_OFFSET;
> + for (task = next_tgid(tgid);
> task;
> - task = next_tgid(task), filp->f_pos++) {
> + task = next_tgid(tgid + 1)) {
> tgid = task->pid;
> + filp->f_pos = tgid + TGID_OFFSET;
> if (proc_pid_fill_cache(filp, dirent, filldir, task, tgid) < 0) {
> /* returning this tgid failed, save it as the first
> * pid for the next readir call */
> - filp->f_version = tgid;
> put_task_struct(task);
> - break;
> + goto out;
> }
> }
> + filp->f_pos = PID_MAX_LIMIT + TGID_OFFSET;
> out:
> put_task_struct(reaper);
> out_no_task:
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index 9fd547f..d06d4ba 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -89,6 +89,7 @@ extern struct pid *FASTCALL(find_pid(int
> * Lookup a PID in the hash table, and return with it's count elevated.
> */
> extern struct pid *find_get_pid(int nr);
> +extern struct pid *find_next_pid(int nr);
>
> extern struct pid *alloc_pid(void);
> extern void FASTCALL(free_pid(struct pid *pid));
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 40e8e4d..112ff2a 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -145,6 +145,23 @@ static int alloc_pidmap(void)
> return -1;
> }
>
> +static int next_pidmap(int last)
> +{
> + int offset;
> + pidmap_t *map;
> +
> + offset = (last + 1) & BITS_PER_PAGE_MASK;
> + map = &pidmap_array[(last + 1)/BITS_PER_PAGE];
> + for (; map < &pidmap_array[PIDMAP_ENTRIES]; map++, offset = 0) {
> + if (unlikely(!map->page))
> + continue;
> + offset = find_next_bit((map)->page, BITS_PER_PAGE, offset);
> + if (offset < BITS_PER_PAGE)
> + return mk_pid(map, offset);
> + }
> + return -1;
> +}
> +
> fastcall void put_pid(struct pid *pid)
> {
> if (!pid)
> @@ -307,6 +324,24 @@ struct pid *find_get_pid(pid_t nr)
> EXPORT_SYMBOL_GPL(find_get_pid);
>
> /*
> + * Used by proc to find the pid with the next greater number.
> + * Specifying nr is used to handle the seek case.
> + */
> +struct pid *find_next_pid(int nr)
> +{
> + struct pid *next;
> +
> + next = find_pid(nr);
> + while (!next) {
> + nr = next_pidmap(nr);
> + if (nr <= 0)
> + break;
> + next = find_pid(nr);
> + }
> + return next;
> +}
> +
> +/*
> * The pid hash table is scaled according to the amount of memory in the
> * machine. From a minimum of 16 slots up to 4096 slots at one gigabyte or
> * more.
>

2006-08-18 03:54:22

by Eric W. Biederman

Subject: Re: [RFC] ps command race fix


Hmm... I forgot to hit send.

KAMEZAWA Hiroyuki <[email protected]> writes:

> At first, Thanks.
>
> question:
>
> task = get_pid_task(find_next_pid(tgid), PIDTYPE_PID);
>
> Does this return only "task/process" ? and never return "thread" ?

Good point. I don't think I'm filtering for thread group leaders here.
That should take a couple more lines of code.

> Another concern of mine is that processes created while ps is running
> cannot be caught by this kind of "table walking" approach (as with my
> previous work). But if people say that's O.K., I have no complaint.

Well it can, but only if the newly created processes have a higher pid.

The guarantee that POSIX readdir makes is that you will see all directory
entries that are neither added nor deleted during the readdir.

> I(we)'ll post another list-based one in the next week, anyway.
> (I can't go-ahead this week...)

Where I see what I'm doing as being superior to that is that I'm
not writing to any global data structures, which means what I'm doing
should scale to large machines without problem.

Eric