2006-08-25 09:26:44

by Kamezawa Hiroyuki

Subject: [RFC][PATCH] ps command race fix take 4 [4/4] proc root open/release/llseek

Implements the open/release/llseek ops needed by the new proc_pid_readdir().

llseek()'s offset is specified in bytes, but /proc root doesn't handle
file->f_pos as bytes, so I disabled llseek for /proc for now.

Signed-Off-By: KAMEZAWA Hiroyuki <[email protected]>

fs/proc/root.c | 34 ++++++++++++++++++++++++++++++++++
1 files changed, 34 insertions(+)

Index: linux-2.6.18-rc4/fs/proc/root.c
===================================================================
--- linux-2.6.18-rc4.orig/fs/proc/root.c
+++ linux-2.6.18-rc4/fs/proc/root.c
@@ -16,6 +16,7 @@
#include <linux/module.h>
#include <linux/bitops.h>
#include <linux/smp_lock.h>
+#include <linux/adaptive_pointer.h>

#include "internal.h"

@@ -118,14 +119,47 @@ static int proc_root_readdir(struct file
return ret;
}

+static int proc_root_open(struct inode *inode, struct file *filp)
+{
+ struct adaptive_pointer *ap;
+ ap = kzalloc(sizeof(*ap), GFP_KERNEL);
+ if (!ap)
+ return -ENOMEM;
+ init_adaptive_pointer(ap);
+ filp->private_data = ap;
+ return 0;
+}
+
+static int proc_root_release(struct inode *inode, struct file *filp)
+{
+ struct adaptive_pointer *ap;
+ ap = filp->private_data;
+ rcu_read_lock();
+ __ap_get_remove_pointer(ap);
+ rcu_read_unlock();
+ kfree(ap);
+ filp->private_data = NULL;
+ return 0;
+}
+
+static loff_t proc_root_llseek(struct file *file, loff_t off, int pos)
+{
+ /* llseek offsets are given in bytes, but we don't use file->f_pos as bytes... */
+ return -ENOTSUPP;
+}
+
/*
* The root /proc directory is special, as it has the
* <pid> directories. Thus we don't use the generic
* directory handling functions for that..
*/
static struct file_operations proc_root_operations = {
+ .open = proc_root_open,
.read = generic_read_dir,
.readdir = proc_root_readdir,
+ .release = proc_root_release,
+ .llseek = proc_root_llseek,
};

/*


2006-09-04 23:14:14

by Eric W. Biederman

Subject: [PATCH] proc: readdir race fix.


The problem: An opendir, readdir, closedir sequence can fail to report
process ids that are continually in use throughout the sequence of
system calls. For this race to trigger the process that proc_pid_readdir
stops at must exit before readdir is called again.

This can cause ps to fail to report processes, and it is in violation
of POSIX guarantees and normal application expectations with respect
to readdir.

Currently there is no way to work around this problem in user space
short of providing a gargantuan buffer to user space so the directory
read all happens in one system call.
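
For illustration (this sketch is not part of the patch), the access
pattern that loses pids is just the ordinary directory walk every
ps-like tool performs:

    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
        DIR *dir = opendir("/proc");
        struct dirent *de;

        if (!dir) {
            perror("opendir");
            return 1;
        }
        /* Each readdir() batch resumes from where the previous one
         * stopped; on the old kernel, if that process exits in
         * between, later pids can be skipped entirely. */
        while ((de = readdir(dir)) != NULL) {
            if (de->d_name[0] >= '1' && de->d_name[0] <= '9')
                printf("%s\n", de->d_name);
        }
        closedir(dir);
        return 0;
    }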

This patch implements the normal directory semantics for proc,
which guarantee that a directory entry that is neither created nor
destroyed while reading the directory will be returned. Directories
that are either created or destroyed during the readdir may or may
not be seen. Furthermore you may seek to a directory offset you have
previously seen.

These are the guarantees that ext[23] provides and that POSIX
requires, and more importantly that user space expects. Plus it is a
simple semantic on which to build a reliable service: it is just a
matter of calling readdir a second time if you are wondering whether
something new has shown up.

These better semantics are implemented by scanning through the
pids in numerical order and by making the file offset a pid
plus a fixed offset.
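
Concretely, the mapping between f_pos and a pid is a single
add/subtract of a fixed constant. A trivial sketch (the offset value
here is illustrative; the real TGID_OFFSET counts the non-pid entries
at the front of /proc):

    #include <stdio.h>

    #define TGID_OFFSET 3    /* illustrative value only */

    int main(void)
    {
        int tgid = 42;
        long f_pos = tgid + TGID_OFFSET;   /* stored by readdir */
        printf("resume from tgid %ld\n", f_pos - TGID_OFFSET);
        return 0;
    }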

The pid scan happens on the pid bitmap, which when you look at it is
remarkably efficient for a brute force algorithm. Given that a typical
cache line is 64 bytes, it covers space for 64*8 == 512 pids, so there
are only 64 cache lines for the entire 32K pid space. A typical system
will have 100 pids or more, so this is actually fewer cache lines than
we would have to touch to scan a linked list of tasks, and the worst
case of having to scan the entire pid bitmap is pretty reasonable.
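
The arithmetic is easy to check (a trivial stand-alone sketch; 32768
is the 32K pid space):

    #include <stdio.h>

    int main(void)
    {
        const int cache_line_bytes = 64;
        const int pids_per_line = cache_line_bytes * 8;  /* 512 bits */
        const int pid_space = 32768;                     /* 32K pids */

        printf("%d pids per cache line, %d cache lines in all\n",
               pids_per_line, pid_space / pids_per_line);
        return 0;
    }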

If we need something more efficient we can go to a more efficient data
structure for indexing the pids, but for now what we have should be
sufficient.

In addition this takes no additional locks and is actually less
code than what we are doing now.

This patch is against 2.6.18-rc6 and it should be relatively
straightforward to backport to older kernels as well.

Thanks to KAMEZAWA Hiroyuki <[email protected]> for
providing the first fix, pointing this out and working on it.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/proc/base.c | 93 +++++++++++++--------------------------------------
include/linux/pid.h | 1 +
kernel/pid.c | 37 ++++++++++++++++++++
3 files changed, 62 insertions(+), 69 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index fe8d55f..7b93454 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2141,72 +2141,32 @@ out_no_task:
}

/*
- * Find the first tgid to return to user space.
+ * Find the first task with tgid >= tgid
*
- * Usually this is just whatever follows &init_task, but if the users
- * buffer was too small to hold the full list or there was a seek into
- * the middle of the directory we have more work to do.
- *
- * In the case of a short read we start with find_task_by_pid.
- *
- * In the case of a seek we start with &init_task and walk nr
- * threads past it.
*/
-static struct task_struct *first_tgid(int tgid, unsigned int nr)
+static struct task_struct *next_tgid(unsigned int tgid)
{
- struct task_struct *pos;
- rcu_read_lock();
- if (tgid && nr) {
- pos = find_task_by_pid(tgid);
- if (pos && thread_group_leader(pos))
- goto found;
- }
- /* If nr exceeds the number of processes get out quickly */
- pos = NULL;
- if (nr && nr >= nr_processes())
- goto done;
-
- /* If we haven't found our starting place yet start with
- * the init_task and walk nr tasks forward.
- */
- for (pos = next_task(&init_task); nr > 0; --nr) {
- pos = next_task(pos);
- if (pos == &init_task) {
- pos = NULL;
- goto done;
- }
- }
-found:
- get_task_struct(pos);
-done:
- rcu_read_unlock();
- return pos;
-}
+ struct task_struct *task;
+ struct pid *pid;

-/*
- * Find the next task in the task list.
- * Return NULL if we loop or there is any error.
- *
- * The reference to the input task_struct is released.
- */
-static struct task_struct *next_tgid(struct task_struct *start)
-{
- struct task_struct *pos;
+ task = NULL;
rcu_read_lock();
- pos = start;
- if (pid_alive(start))
- pos = next_task(start);
- if (pid_alive(pos) && (pos != &init_task)) {
- get_task_struct(pos);
- goto done;
+retry:
+ pid = find_next_pid(tgid);
+ if (pid) {
+ tgid = pid->nr + 1;
+ task = pid_task(pid, PIDTYPE_PID);
+ if (!task || !thread_group_leader(task))
+ goto retry;
+ get_task_struct(task);
}
- pos = NULL;
-done:
rcu_read_unlock();
- put_task_struct(start);
- return pos;
+ return task;
+
}

+#define TGID_OFFSET ((FIRST_PROCESS_ENTRY + 1) - 1)
+
/* for the /proc/ directory itself, after non-process stuff has been done */
int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
{
@@ -2222,29 +2182,24 @@ int proc_pid_readdir(struct file * filp,
filp->f_pos++;
nr++;
}
- nr -= 1;

- /* f_version caches the tgid value that the last readdir call couldn't
- * return. lseek aka telldir automagically resets f_version to 0.
- */
- tgid = filp->f_version;
- filp->f_version = 0;
- for (task = first_tgid(tgid, nr);
+ tgid = filp->f_pos - TGID_OFFSET;
+ for (task = next_tgid(tgid);
task;
- task = next_tgid(task), filp->f_pos++) {
+ task = next_tgid(tgid + 1)) {
int len;
ino_t ino;
tgid = task->pid;
+ filp->f_pos = tgid + TGID_OFFSET;
len = snprintf(buf, sizeof(buf), "%d", tgid);
ino = fake_ino(tgid, PROC_TGID_INO);
if (filldir(dirent, buf, len, filp->f_pos, ino, DT_DIR) < 0) {
- /* returning this tgid failed, save it as the first
- * pid for the next readir call */
- filp->f_version = tgid;
put_task_struct(task);
- break;
+ goto out;
}
}
+ filp->f_pos = PID_MAX_LIMIT + TGID_OFFSET;
+out:
return 0;
}

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 29960b0..a24db52 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -87,6 +87,7 @@ extern struct pid *FASTCALL(find_pid(int
* Lookup a PID in the hash table, and return with it's count elevated.
*/
extern struct pid *find_get_pid(int nr);
+extern struct pid *find_next_pid(int nr);

extern struct pid *alloc_pid(void);
extern void FASTCALL(free_pid(struct pid *pid));
diff --git a/kernel/pid.c b/kernel/pid.c
index 93e212f..53d4159 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -145,6 +145,23 @@ static int alloc_pidmap(void)
return -1;
}

+static int next_pidmap(int last)
+{
+ int offset;
+ pidmap_t *map;
+
+ offset = (last + 1) & BITS_PER_PAGE_MASK;
+ map = &pidmap_array[(last + 1)/BITS_PER_PAGE];
+ for (; map < &pidmap_array[PIDMAP_ENTRIES]; map++, offset = 0) {
+ if (unlikely(!map->page))
+ continue;
+ offset = find_next_bit((map)->page, BITS_PER_PAGE, offset);
+ if (offset < BITS_PER_PAGE)
+ return mk_pid(map, offset);
+ }
+ return -1;
+}
+
fastcall void put_pid(struct pid *pid)
{
if (!pid)
@@ -297,6 +314,26 @@ struct pid *find_get_pid(pid_t nr)
}

/*
+ * Used by proc to find the first pid that is
+ * greater than or equal to number.
+ *
+ * If there is a pid at nr this function is exactly the same as find_pid.
+ */
+struct pid *find_next_pid(int nr)
+{
+ struct pid *next;
+
+ next = find_pid(nr);
+ while (!next) {
+ nr = next_pidmap(nr);
+ if (nr <= 0)
+ break;
+ next = find_pid(nr);
+ }
+ return next;
+}
+
+/*
* The pid hash table is scaled according to the amount of memory in the
* machine. From a minimum of 16 slots up to 4096 slots at one gigabyte or
* more.
--
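
To see the new iteration pattern outside the kernel tree, here is a
stand-alone model of it (a sketch over a toy bitmap, not the kernel
code): readdir can resume from any pid, and an exit between batches
cannot make it lose its place.

    #include <stdio.h>

    #define PID_MAX 256
    static int used[PID_MAX];    /* toy stand-in for the pid bitmap */

    /* First in-use pid >= nr, or -1: the role find_next_pid() plays. */
    static int find_next_pid(int nr)
    {
        for (; nr > 0 && nr < PID_MAX; nr++)
            if (used[nr])
                return nr;
        return -1;
    }

    /* One readdir "batch": emit up to 'batch' pids starting at tgid,
     * and return the resume point (the role f_pos plays). */
    static int readdir_batch(int tgid, int batch)
    {
        int pid;

        while ((pid = find_next_pid(tgid)) > 0 && batch-- > 0) {
            printf("%d\n", pid);
            tgid = pid + 1;
        }
        return tgid;
    }

    int main(void)
    {
        int resume;

        used[1] = used[42] = used[43] = used[200] = 1;
        resume = readdir_batch(1, 2);   /* emits pids 1 and 42 */
        used[42] = 0;                   /* pid 42 exits between batches */
        readdir_batch(resume, 100);     /* still emits 43 and 200 */
        return 0;
    }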


2006-09-05 01:27:37

by Kamezawa Hiroyuki

Subject: Re: [PATCH] proc: readdir race fix.

Hi,

On Mon, 04 Sep 2006 17:13:10 -0600
[email protected] (Eric W. Biederman) wrote:
> These better semantics are implemented by scanning through the
> pids in numerical order and by making the file offset a pid
> plus a fixed offset.
I think this is a very sane/solid approach.
Maybe this is the way to go. I'll test and ack later, thank you.

> The pid scan happens on the pid bitmap, which when you look at it is
> remarkably efficient for a brute force algorithm. Given that a typical
> cache line is 64 bytes, it covers space for 64*8 == 512 pids, so there
> are only 64 cache lines for the entire 32K pid space. A typical system
> will have 100 pids or more, so this is actually fewer cache lines than
> we would have to touch to scan a linked list of tasks, and the worst
> case of having to scan the entire pid bitmap is pretty reasonable.

I agree with you, but..
Because this approach has to access *all* task structs in a system,
and has to scan the pidhash many times, I'm not sure that this scan &
lookup is good for a future implementation. But this patch is obviously
better than the current implementation.



> /*
> + * Used by proc to find the first pid that is
> + * greater than or equal to number.
> + *
> + * If there is a pid at nr this function is exactly the same as find_pid.
> + */
> +struct pid *find_next_pid(int nr)
> +{

How about find_first_used_pid(int nr) ?

-Kame

2006-09-05 02:23:01

by Kamezawa Hiroyuki

Subject: Re: [PATCH] proc: readdir race fix.

On Mon, 04 Sep 2006 17:13:10 -0600
[email protected] (Eric W. Biederman) wrote:
Hi, I hit the OOM killer because of a memory leak of task structs.

A patch is attached.
-Kame

The task struct should always be put.

fs/proc/base.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.18-psfix/fs/proc/base.c
===================================================================
--- linux-2.6.18-psfix.orig/fs/proc/base.c 2006-09-05 10:42:40.000000000 +0900
+++ linux-2.6.18-psfix/fs/proc/base.c 2006-09-05 11:11:42.000000000 +0900
@@ -2193,8 +2193,8 @@
filp->f_pos = tgid + TGID_OFFSET;
len = snprintf(buf, sizeof(buf), "%d", tgid);
ino = fake_ino(tgid, PROC_TGID_INO);
+ put_task_struct(task);
if (filldir(dirent, buf, len, filp->f_pos, ino, DT_DIR) < 0) {
- put_task_struct(task);
goto out;
}
}

2006-09-05 02:32:01

by Eric W. Biederman

Subject: Re: [PATCH] proc: readdir race fix.

KAMEZAWA Hiroyuki <[email protected]> writes:

> Hi,
>
> On Mon, 04 Sep 2006 17:13:10 -0600
> [email protected] (Eric W. Biederman) wrote:
>> These better semantics are implemented by scanning through the
>> pids in numerical order and by making the file offset a pid
>> plus a fixed offset.
> I think this is a very sane/solid approach.
> Maybe this is the way to go. I'll test and ack later, thank you.
>
>> The pid scan happens on the pid bitmap, which when you look at it is
>> remarkably efficient for a brute force algorithm. Given that a typical
>> cache line is 64 bytes, it covers space for 64*8 == 512 pids, so there
>> are only 64 cache lines for the entire 32K pid space. A typical system
>> will have 100 pids or more, so this is actually fewer cache lines than
>> we would have to touch to scan a linked list of tasks, and the worst
>> case of having to scan the entire pid bitmap is pretty reasonable.
>
> I agree with you, but..
> Because this approach has to access *all* task structs in a system,
> and has to scan the pidhash many times, I'm not sure that this scan &
> lookup is good for a future implementation. But this patch is obviously
> better than the current implementation.

Quite possibly. But where this approach falls down is not in the basics
but in the data structures. And it isn't that hard to fix the data
structures; it's just something I don't want to do in the first pass.

>> /*
>> + * Used by proc to find the first pid that is
>> + * greater than or equal to number.
>> + *
>> + * If there is a pid at nr this function is exactly the same as find_pid.
>> + */
>> +struct pid *find_next_pid(int nr)
>> +{
>
> How about find_first_used_pid(int nr) ?

Maybe. The best name I can think of is
find_greater_or_equal_pid(int nr), and that is a little
clumsy. On the other hand there is no confusion about what the
function does.

Any other suggestions?

Eric

2006-09-05 02:38:28

by Kamezawa Hiroyuki

Subject: Re: [PATCH] proc: readdir race fix.

On Tue, 5 Sep 2006 10:30:10 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> Hi,
>
> On Mon, 04 Sep 2006 17:13:10 -0600
> [email protected] (Eric W. Biederman) wrote:
> > These better semantics are implemented by scanning through the
> > pids in numerical order and by making the file offset a pid
> > plus a fixed offset.
> I think this is a very sane/solid approach.
> Maybe this is the way to go. I'll test and ack later, thank you.
>
Just a memo:

One interesting aspect of this patch is..

- the default result of ps with the current (old) kernel is naturally sorted
by start time; the ps command itself then comes at the bottom of the result,
as the newest process.

- the default result of ps with this patch is sorted by pid, regardless of
start time.

==
kamezawa 3770 0.0 0.0 5156 996 tty1 S+ 11:17 0:00 /bin/bash ./testp
kawamura 3989 0.0 0.0 2168 832 pts/3 S+ 11:32 0:00 sh -c (cd /usr/sh
kawamura 3990 0.0 0.0 2168 408 pts/3 S+ 11:32 0:00 sh -c (cd /usr/sh
kawamura 4002 0.0 0.0 1924 680 pts/3 S+ 11:32 0:00 /usr/bin/less -iR
root 10772 0.0 0.0 6912 2188 ? Ss 11:31 0:00 sshd: kawamura [p
kamezawa 12341 0.6 0.0 5336 1476 tty2 Ss 11:17 0:06 -bash
kamezawa 15308 0.0 0.0 4788 544 tty1 S+ 11:32 0:00 sleep 1
kamezawa 15315 0.0 0.0 2380 756 pts/0 R+ 11:32 0:00 ps aux
kamezawa 17322 0.0 0.0 5332 1472 tty3 Ss+ 11:18 0:00 -bash
root 22194 0.0 0.0 1760 744 ? Ss 11:20 0:00 in.telnetd: awork
root 22198 0.0 0.0 2992 1476 ? Ss 11:20 0:00 login -- kamezawa
kawamura 24526 0.0 0.0 6912 1544 ? S 11:31 0:00 sshd: kawamura@pt
kawamura 24529 0.0 0.0 5336 1456 pts/3 Ss 11:31 0:00 -bash
satoshi 26239 2.2 0.3 40292 13696 ? Sl 11:30 0:02 /usr/bin/gnome-te
==

But 'ps --sort start_time' is available.

-Kame

2006-09-05 02:55:24

by Eric W. Biederman

Subject: Re: [PATCH] proc: readdir race fix.

KAMEZAWA Hiroyuki <[email protected]> writes:

> On Mon, 04 Sep 2006 17:13:10 -0600
> [email protected] (Eric W. Biederman) wrote:
> Hi, I hit the OOM killer because of a memory leak of task structs.
>
> A patch is attached.
> -Kame
>
> The task struct should always be put.

Good catch. I actually have a pending cleanup that works better if the
task struct is held across the filldir, so the incremental patch should look like:

I also noticed a benign typo in TGID_OFFSET (I subtract one that I shouldn't,
so ((FIRST_PROCESS_ENTRY + 1) - 1) collapses back to FIRST_PROCESS_ENTRY and
the /proc/self entry goes unaccounted for).

Complete patch in just a moment.

Eric

diff --git a/fs/proc/base.c b/fs/proc/base.c
index b7650b9..03f6680 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2186,7 +2186,7 @@ int proc_pid_readdir(struct file * filp,
tgid = filp->f_pos - TGID_OFFSET;
for (task = next_tgid(tgid);
task;
- task = next_tgid(tgid + 1)) {
+ put_task_struct(task), task = next_tgid(tgid + 1)) {
int len;
ino_t ino;
tgid = task->pid;

2006-09-05 03:08:31

by Eric W. Biederman

Subject: [PATCH] proc: readdir race fix


The problem: An opendir, readdir, closedir sequence can fail to report
process ids that are continually in use throughout the sequence of
system calls. For this race to trigger the process that proc_pid_readdir
stops at must exit before readdir is called again.

This can cause ps to fail to report processes, and it is in violation
of POSIX guarantees and normal application expectations with respect
to readdir.

Currently there is no way to work around this problem in user space
short of providing a gargantuan buffer to user space so the directory
read all happens in one system call.

This patch implements the normal directory semantics for proc,
which guarantee that a directory entry that is neither created nor
destroyed while reading the directory will be returned. Directories
that are either created or destroyed during the readdir may or may
not be seen. Furthermore you may seek to a directory offset you have
previously seen.

These are the guarantees that ext[23] provides and that POSIX
requires, and more importantly that user space expects. Plus it is a
simple semantic on which to build a reliable service: it is just a
matter of calling readdir a second time if you are wondering whether
something new has shown up.

These better semantics are implemented by scanning through the
pids in numerical order and by making the file offset a pid
plus a fixed offset.

The pid scan happens on the pid bitmap, which when you look at it is
remarkably efficient for a brute force algorithm. Given that a typical
cache line is 64 bytes, it covers space for 64*8 == 512 pids, so there
are only 64 cache lines for the entire 32K pid space. A typical system
will have 100 pids or more, so this is actually fewer cache lines than
we would have to touch to scan a linked list of tasks, and the worst
case of having to scan the entire pid bitmap is pretty reasonable.

If we need something more efficient we can go to a more efficient data
structure for indexing the pids, but for now what we have should be
sufficient.

In addition this takes no additional locks and is actually less
code than what we are doing now.

This patch is against 2.6.18-rc6 and it should be relatively
straightforward to backport to older kernels as well.

Thanks to KAMEZAWA Hiroyuki <[email protected]> for
providing the first fix, pointing this out and working on it.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/proc/base.c | 93 +++++++++++++--------------------------------------
include/linux/pid.h | 1 +
kernel/pid.c | 37 ++++++++++++++++++++
3 files changed, 62 insertions(+), 69 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index fe8d55f..03f6680 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2141,72 +2141,32 @@ out_no_task:
}

/*
- * Find the first tgid to return to user space.
+ * Find the first task with tgid >= tgid
*
- * Usually this is just whatever follows &init_task, but if the users
- * buffer was too small to hold the full list or there was a seek into
- * the middle of the directory we have more work to do.
- *
- * In the case of a short read we start with find_task_by_pid.
- *
- * In the case of a seek we start with &init_task and walk nr
- * threads past it.
*/
-static struct task_struct *first_tgid(int tgid, unsigned int nr)
+static struct task_struct *next_tgid(unsigned int tgid)
{
- struct task_struct *pos;
- rcu_read_lock();
- if (tgid && nr) {
- pos = find_task_by_pid(tgid);
- if (pos && thread_group_leader(pos))
- goto found;
- }
- /* If nr exceeds the number of processes get out quickly */
- pos = NULL;
- if (nr && nr >= nr_processes())
- goto done;
-
- /* If we haven't found our starting place yet start with
- * the init_task and walk nr tasks forward.
- */
- for (pos = next_task(&init_task); nr > 0; --nr) {
- pos = next_task(pos);
- if (pos == &init_task) {
- pos = NULL;
- goto done;
- }
- }
-found:
- get_task_struct(pos);
-done:
- rcu_read_unlock();
- return pos;
-}
+ struct task_struct *task;
+ struct pid *pid;

-/*
- * Find the next task in the task list.
- * Return NULL if we loop or there is any error.
- *
- * The reference to the input task_struct is released.
- */
-static struct task_struct *next_tgid(struct task_struct *start)
-{
- struct task_struct *pos;
+ task = NULL;
rcu_read_lock();
- pos = start;
- if (pid_alive(start))
- pos = next_task(start);
- if (pid_alive(pos) && (pos != &init_task)) {
- get_task_struct(pos);
- goto done;
+retry:
+ pid = find_next_pid(tgid);
+ if (pid) {
+ tgid = pid->nr + 1;
+ task = pid_task(pid, PIDTYPE_PID);
+ if (!task || !thread_group_leader(task))
+ goto retry;
+ get_task_struct(task);
}
- pos = NULL;
-done:
rcu_read_unlock();
- put_task_struct(start);
- return pos;
+ return task;
+
}

+#define TGID_OFFSET (FIRST_PROCESS_ENTRY + (1 /* /proc/self */))
+
/* for the /proc/ directory itself, after non-process stuff has been done */
int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
{
@@ -2222,29 +2182,24 @@ int proc_pid_readdir(struct file * filp,
filp->f_pos++;
nr++;
}
- nr -= 1;

- /* f_version caches the tgid value that the last readdir call couldn't
- * return. lseek aka telldir automagically resets f_version to 0.
- */
- tgid = filp->f_version;
- filp->f_version = 0;
- for (task = first_tgid(tgid, nr);
+ tgid = filp->f_pos - TGID_OFFSET;
+ for (task = next_tgid(tgid);
task;
- task = next_tgid(task), filp->f_pos++) {
+ put_task_struct(task), task = next_tgid(tgid + 1)) {
int len;
ino_t ino;
tgid = task->pid;
+ filp->f_pos = tgid + TGID_OFFSET;
len = snprintf(buf, sizeof(buf), "%d", tgid);
ino = fake_ino(tgid, PROC_TGID_INO);
if (filldir(dirent, buf, len, filp->f_pos, ino, DT_DIR) < 0) {
- /* returning this tgid failed, save it as the first
- * pid for the next readir call */
- filp->f_version = tgid;
put_task_struct(task);
- break;
+ goto out;
}
}
+ filp->f_pos = PID_MAX_LIMIT + TGID_OFFSET;
+out:
return 0;
}

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 29960b0..a24db52 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -87,6 +87,7 @@ extern struct pid *FASTCALL(find_pid(int
* Lookup a PID in the hash table, and return with it's count elevated.
*/
extern struct pid *find_get_pid(int nr);
+extern struct pid *find_next_pid(int nr);

extern struct pid *alloc_pid(void);
extern void FASTCALL(free_pid(struct pid *pid));
diff --git a/kernel/pid.c b/kernel/pid.c
index 93e212f..53d4159 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -145,6 +145,23 @@ static int alloc_pidmap(void)
return -1;
}

+static int next_pidmap(int last)
+{
+ int offset;
+ pidmap_t *map;
+
+ offset = (last + 1) & BITS_PER_PAGE_MASK;
+ map = &pidmap_array[(last + 1)/BITS_PER_PAGE];
+ for (; map < &pidmap_array[PIDMAP_ENTRIES]; map++, offset = 0) {
+ if (unlikely(!map->page))
+ continue;
+ offset = find_next_bit((map)->page, BITS_PER_PAGE, offset);
+ if (offset < BITS_PER_PAGE)
+ return mk_pid(map, offset);
+ }
+ return -1;
+}
+
fastcall void put_pid(struct pid *pid)
{
if (!pid)
@@ -297,6 +314,26 @@ struct pid *find_get_pid(pid_t nr)
}

/*
+ * Used by proc to find the first pid that is
+ * greater than or equal to number.
+ *
+ * If there is a pid at nr this function is exactly the same as find_pid.
+ */
+struct pid *find_next_pid(int nr)
+{
+ struct pid *next;
+
+ next = find_pid(nr);
+ while (!next) {
+ nr = next_pidmap(nr);
+ if (nr <= 0)
+ break;
+ next = find_pid(nr);
+ }
+ return next;
+}
+
+/*
* The pid hash table is scaled according to the amount of memory in the
* machine. From a minimum of 16 slots up to 4096 slots at one gigabyte or
* more.
--


2006-09-05 05:24:26

by Kamezawa Hiroyuki

Subject: Re: [PATCH] proc: readdir race fix.

On Mon, 04 Sep 2006 17:13:10 -0600
[email protected] (Eric W. Biederman) wrote:

> Signed-off-by: Eric W. Biederman <[email protected]>

Worked well with my (short) test set, and looks good.
Thank you.

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2006-09-05 05:36:28

by Kamezawa Hiroyuki

Subject: Re: [PATCH] proc: readdir race fix

On Mon, 04 Sep 2006 21:07:29 -0600
[email protected] (Eric W. Biederman) wrote:

> Thanks to KAMEZAWA Hiroyuki <[email protected]> for
> providing the first fix, pointing this out and working on it.
>
> Signed-off-by: Eric W. Biederman <[email protected]>

Worked well with my (short) test set, and looks good.
Thank you.

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2006-09-05 05:38:23

by Kamezawa Hiroyuki

Subject: Re: [PATCH] proc: readdir race fix.

On Tue, 5 Sep 2006 14:26:38 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Mon, 04 Sep 2006 17:13:10 -0600
> [email protected] (Eric W. Biederman) wrote:
>
> > Signed-off-by: Eric W. Biederman <[email protected]>
>
> Worked well with my (short) test set, and looks good.
> Thank you.
>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
>
Sorry (>_<). I meant to ack the other e-mail; ignore this one.
Hmm, please change the subject of the e-mail when the patch is changed.

-Kame

2006-09-05 10:10:53

by Oleg Nesterov

Subject: Re: [PATCH] proc: readdir race fix

On 09/04, Eric W. Biederman wrote:
>
> -static struct task_struct *next_tgid(struct task_struct *start)
> -{
> - struct task_struct *pos;
> + task = NULL;
> rcu_read_lock();
> - pos = start;
> - if (pid_alive(start))
> - pos = next_task(start);
> - if (pid_alive(pos) && (pos != &init_task)) {
> - get_task_struct(pos);
> - goto done;
> +retry:
> + pid = find_next_pid(tgid);
> + if (pid) {
> + tgid = pid->nr + 1;
> + task = pid_task(pid, PIDTYPE_PID);
> + if (!task || !thread_group_leader(task))
^^^^^^^^^^^^^^^^^^^
There is a window while de_thread() switches leadership, so next_tgid()
may skip a task doing exec. What do you think about

// needs a comment
if (!task || task->pid != task->tgid)
goto retry;

instead? Currently first_tgid() has the same (very minor) problem.

> + goto retry;
> + get_task_struct(task);
> }
> - pos = NULL;
> -done:
> rcu_read_unlock();
> - put_task_struct(start);
> - return pos;
> + return task;
> +
> }

Empty line before '}'.

> +struct pid *find_next_pid(int nr)
> +{
> + struct pid *next;
> +
> + next = find_pid(nr);
> + while (!next) {
> + nr = next_pidmap(nr);
> + if (nr <= 0)
> + break;
> + next = find_pid(nr);
> + }
> + return next;
> +}

It is strange that we are doing find_pid() both before the loop and at
the end of it; I'd suggest this code:

struct pid *find_next_pid(int nr)
{
struct pid *pid;

do {
pid = find_pid(nr);
if (pid != NULL)
break;
nr = next_pidmap(nr);
} while (nr > 0);

return pid;
}

Imho, a bit easier to read.
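
Both shapes return the same pid for every starting nr. A quick
user-space harness (a sketch with find_pid()/next_pidmap() stubbed out
over a toy bitmap) that checks the two variants against each other:

    #include <assert.h>
    #include <stdio.h>

    #define PID_MAX 256
    static int used[PID_MAX];    /* toy stand-in for the pidmap */

    static int find_pid(int nr)  /* nr if in use, else 0 */
    {
        return (nr > 0 && nr < PID_MAX && used[nr]) ? nr : 0;
    }

    static int next_pidmap(int last)  /* first used pid > last, or -1 */
    {
        for (int nr = last + 1; nr < PID_MAX; nr++)
            if (used[nr])
                return nr;
        return -1;
    }

    static int variant_while(int nr)  /* the original shape */
    {
        int next = find_pid(nr);
        while (!next) {
            nr = next_pidmap(nr);
            if (nr <= 0)
                break;
            next = find_pid(nr);
        }
        return next;
    }

    static int variant_do(int nr)     /* the do/while shape */
    {
        int pid;
        do {
            pid = find_pid(nr);
            if (pid != 0)
                break;
            nr = next_pidmap(nr);
        } while (nr > 0);
        return pid;
    }

    int main(void)
    {
        used[1] = used[42] = used[200] = 1;
        for (int nr = 1; nr < PID_MAX; nr++)
            assert(variant_while(nr) == variant_do(nr));
        printf("both variants agree\n");
        return 0;
    }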

Oleg.

2006-09-05 11:37:57

by Eric W. Biederman

Subject: Re: [PATCH] proc: readdir race fix

Oleg Nesterov <[email protected]> writes:

> On 09/04, Eric W. Biederman wrote:
>>
>> -static struct task_struct *next_tgid(struct task_struct *start)
>> -{
>> - struct task_struct *pos;
>> + task = NULL;
>> rcu_read_lock();
>> - pos = start;
>> - if (pid_alive(start))
>> - pos = next_task(start);
>> - if (pid_alive(pos) && (pos != &init_task)) {
>> - get_task_struct(pos);
>> - goto done;
>> +retry:
>> + pid = find_next_pid(tgid);
>> + if (pid) {
>> + tgid = pid->nr + 1;
>> + task = pid_task(pid, PIDTYPE_PID);
>> + if (!task || !thread_group_leader(task))
> ^^^^^^^^^^^^^^^^^^^
> There is a window while de_thread() switches leadership, so next_tgid()
> may skip a task doing exec. What do you think about
>
> // needs a comment
> if (!task || task->pid != task->tgid)
> goto retry;
>
> instead? Currently first_tgid() has the same (very minor) problem.

I see the problem, and your test will certainly alleviate the symptom.
You are making the test 'has this process ever been a thread group leader?'.

I guess alleviating the symptom is all that is necessary there.

Grumble. I hate that entire pid transfer case, too bad glibc cares.

If I could, in the fix for this I would like to add a clean concept
that we are testing for, wrapped in a helper function. Otherwise,
even with a big fat comment, this will be easy to break the next time
someone refactors the code.
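
For reference, the helper that materializes in the take 3 patch
further down wraps exactly that concept (quoted here from that patch):

    static inline int has_group_leader_pid(struct task_struct *p)
    {
        return p->pid == p->tgid;
    }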


>> + goto retry;
>> + get_task_struct(task);
>> }
>> - pos = NULL;
>> -done:
>> rcu_read_unlock();
>> - put_task_struct(start);
>> - return pos;
>> + return task;
>> +
>> }
>
> Empty line before '}'.
>
>> +struct pid *find_next_pid(int nr)
>> +{
>> + struct pid *next;
>> +
>> + next = find_pid(nr);
>> + while (!next) {
>> + nr = next_pidmap(nr);
>> + if (nr <= 0)
>> + break;
>> + next = find_pid(nr);
>> + }
>> + return next;
>> +}
>
> It is strange that we are doing find_pid() both before the loop and at
> the end of it; I'd suggest this code:
>
> struct pid *find_next_pid(int nr)
> {
> struct pid *pid;
>
> do {
> pid = find_pid(nr);
> if (pid != NULL)
> break;
> nr = next_pidmap(nr);
> } while (nr > 0);
>
> return pid;
> }
>
> Imho, a bit easier to read.
It is at least not worse, so it is probably worth doing.

Eric

2006-09-05 14:53:28

by Eric W. Biederman

Subject: [PATCH] proc: readdir race fix (take 3)


The problem: An opendir, readdir, closedir sequence can fail to report
process ids that are continually in use throughout the sequence of
system calls. For this race to trigger the process that proc_pid_readdir
stops at must exit before readdir is called again.

This can cause ps to fail to report processes, and it is in violation
of POSIX guarantees and normal application expectations with respect
to readdir.

Currently there is no way to work around this problem in user space
short of providing a gargantuan buffer to user space so the directory
read all happens in one system call.

This patch implements the normal directory semantics for proc,
which guarantee that a directory entry that is neither created nor
destroyed while reading the directory will be returned. Directories
that are either created or destroyed during the readdir may or may
not be seen. Furthermore you may seek to a directory offset you have
previously seen.

These are the guarantees that ext[23] provides and that POSIX
requires, and more importantly that user space expects. Plus it is a
simple semantic on which to build a reliable service: it is just a
matter of calling readdir a second time if you are wondering whether
something new has shown up.

These better semantics are implemented by scanning through the
pids in numerical order and by making the file offset a pid
plus a fixed offset.

The pid scan happens on the pid bitmap, which when you look at it is
remarkably efficient for a brute force algorithm. Given that a typical
cache line is 64 bytes, it covers space for 64*8 == 512 pids, so there
are only 64 cache lines for the entire 32K pid space. A typical system
will have 100 pids or more, so this is actually fewer cache lines than
we would have to touch to scan a linked list of tasks, and the worst
case of having to scan the entire pid bitmap is pretty reasonable.

If we need something more efficient we can go to a more efficient data
structure for indexing the pids, but for now what we have should be
sufficient.

In addition this takes no additional locks and is actually less
code than what we are doing now.

Also another very subtle bug in this area has been fixed. It is
possible to catch a task in the middle of de_thread, where a thread is
assuming the identity of its thread group leader. This patch carefully
handles that case, so that if we hit it we don't fail to return the
pid that is undergoing the de_thread dance.

This patch is against 2.6.18-rc6 and it should be relatively
straightforward to backport to older kernels as well.

Thanks to KAMEZAWA Hiroyuki <[email protected]> for
providing the first fix, pointing this out and working on it.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/proc/base.c | 104 ++++++++++++++++---------------------------------
include/linux/pid.h | 1
include/linux/sched.h | 11 +++++
kernel/pid.c | 36 +++++++++++++++++
4 files changed, 83 insertions(+), 69 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index fe8d55f..28e56c3 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2141,72 +2141,43 @@ out_no_task:
}

/*
- * Find the first tgid to return to user space.
+ * Find the first task with tgid >= tgid
*
- * Usually this is just whatever follows &init_task, but if the users
- * buffer was too small to hold the full list or there was a seek into
- * the middle of the directory we have more work to do.
- *
- * In the case of a short read we start with find_task_by_pid.
- *
- * In the case of a seek we start with &init_task and walk nr
- * threads past it.
*/
-static struct task_struct *first_tgid(int tgid, unsigned int nr)
+static struct task_struct *next_tgid(unsigned int tgid)
{
- struct task_struct *pos;
- rcu_read_lock();
- if (tgid && nr) {
- pos = find_task_by_pid(tgid);
- if (pos && thread_group_leader(pos))
- goto found;
- }
- /* If nr exceeds the number of processes get out quickly */
- pos = NULL;
- if (nr && nr >= nr_processes())
- goto done;
-
- /* If we haven't found our starting place yet start with
- * the init_task and walk nr tasks forward.
- */
- for (pos = next_task(&init_task); nr > 0; --nr) {
- pos = next_task(pos);
- if (pos == &init_task) {
- pos = NULL;
- goto done;
- }
- }
-found:
- get_task_struct(pos);
-done:
- rcu_read_unlock();
- return pos;
-}
+ struct task_struct *task;
+ struct pid *pid;

-/*
- * Find the next task in the task list.
- * Return NULL if we loop or there is any error.
- *
- * The reference to the input task_struct is released.
- */
-static struct task_struct *next_tgid(struct task_struct *start)
-{
- struct task_struct *pos;
+ task = NULL;
rcu_read_lock();
- pos = start;
- if (pid_alive(start))
- pos = next_task(start);
- if (pid_alive(pos) && (pos != &init_task)) {
- get_task_struct(pos);
- goto done;
+retry:
+ pid = find_ge_pid(tgid);
+ if (pid) {
+ tgid = pid->nr + 1;
+ task = pid_task(pid, PIDTYPE_PID);
+ /* What we want to know is if the pid we have found is the
+ * pid of a thread_group_leader. Testing for task
+ * being a thread_group_leader is the obvious thing to
+ * do, but there is a window when it fails, due to
+ * the pid transfer logic in de_thread.
+ *
+ * So we perform the straightforward test of seeing
+ * if the pid we have found is the pid of a thread
+ * group leader, and don't worry if the task we have
+ * found doesn't happen to be a thread group leader,
+ * as we don't care in the case of readdir.
+ */
+ if (!task || !has_group_leader_pid(task))
+ goto retry;
+ get_task_struct(task);
}
- pos = NULL;
-done:
rcu_read_unlock();
- put_task_struct(start);
- return pos;
+ return task;
}

+#define TGID_OFFSET (FIRST_PROCESS_ENTRY + (1 /* /proc/self */))
+
/* for the /proc/ directory itself, after non-process stuff has been done */
int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
{
@@ -2222,29 +2193,24 @@ int proc_pid_readdir(struct file * filp,
filp->f_pos++;
nr++;
}
- nr -= 1;

- /* f_version caches the tgid value that the last readdir call couldn't
- * return. lseek aka telldir automagically resets f_version to 0.
- */
- tgid = filp->f_version;
- filp->f_version = 0;
- for (task = first_tgid(tgid, nr);
+ tgid = filp->f_pos - TGID_OFFSET;
+ for (task = next_tgid(tgid);
task;
- task = next_tgid(task), filp->f_pos++) {
+ put_task_struct(task), task = next_tgid(tgid + 1)) {
int len;
ino_t ino;
tgid = task->pid;
+ filp->f_pos = tgid + TGID_OFFSET;
len = snprintf(buf, sizeof(buf), "%d", tgid);
ino = fake_ino(tgid, PROC_TGID_INO);
if (filldir(dirent, buf, len, filp->f_pos, ino, DT_DIR) < 0) {
- /* returning this tgid failed, save it as the first
- * pid for the next readir call */
- filp->f_version = tgid;
put_task_struct(task);
- break;
+ goto out;
}
}
+ filp->f_pos = PID_MAX_LIMIT + TGID_OFFSET;
+out:
return 0;
}

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 29960b0..525af8b 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -87,6 +87,7 @@ extern struct pid *FASTCALL(find_pid(int
* Lookup a PID in the hash table, and return with it's count elevated.
*/
extern struct pid *find_get_pid(int nr);
+extern struct pid *find_ge_pid(int nr);

extern struct pid *alloc_pid(void);
extern void FASTCALL(free_pid(struct pid *pid));
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 34ed0d9..a5dea85 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1332,6 +1332,17 @@ #define while_each_thread(g, t) \
/* de_thread depends on thread_group_leader not being a pid based check */
#define thread_group_leader(p) (p == p->group_leader)

+/* Due to the insanities of de_thread it is possible for a process
+ * to have the pid of the thread group leader without actually being
+ * the thread group leader. For iteration through the pids in proc
+ * all we care about is that we have a task with the appropriate
+ * pid; we don't actually care if we have the right task.
+ */
+static inline int has_group_leader_pid(struct task_struct *p)
+{
+ return p->pid == p->tgid;
+}
+
static inline struct task_struct *next_thread(const struct task_struct *p)
{
return list_entry(rcu_dereference(p->thread_group.next),
diff --git a/kernel/pid.c b/kernel/pid.c
index 93e212f..f396796 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -145,6 +145,23 @@ static int alloc_pidmap(void)
return -1;
}

+static int next_pidmap(int last)
+{
+ int offset;
+ pidmap_t *map;
+
+ offset = (last + 1) & BITS_PER_PAGE_MASK;
+ map = &pidmap_array[(last + 1)/BITS_PER_PAGE];
+ for (; map < &pidmap_array[PIDMAP_ENTRIES]; map++, offset = 0) {
+ if (unlikely(!map->page))
+ continue;
+ offset = find_next_bit((map)->page, BITS_PER_PAGE, offset);
+ if (offset < BITS_PER_PAGE)
+ return mk_pid(map, offset);
+ }
+ return -1;
+}
+
fastcall void put_pid(struct pid *pid)
{
if (!pid)
@@ -297,6 +314,25 @@ struct pid *find_get_pid(pid_t nr)
}

/*
+ * Used by proc to find the first pid that is greater than or equal to nr.
+ *
+ * If there is a pid at nr this function is exactly the same as find_pid.
+ */
+struct pid *find_ge_pid(int nr)
+{
+ struct pid *pid;
+
+ do {
+ pid = find_pid(nr);
+ if (pid)
+ break;
+ nr = next_pidmap(nr);
+ } while (nr > 0);
+
+ return pid;
+}
+
+/*
* The pid hash table is scaled according to the amount of memory in the
* machine. From a minimum of 16 slots up to 4096 slots at one gigabyte or
* more.
--
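
A toy model (deliberately simplified, not kernel code) of why
has_group_leader_pid() is the safer test during the de_thread() window:
the exec'ing thread can already own the leader's pid while its
group_leader pointer still points at the old leader, so
thread_group_leader() rejects it but the pid test does not.

    #include <assert.h>

    struct task {
        int pid, tgid;
        struct task *group_leader;
    };

    int main(void)
    {
        struct task leader = { .pid = 100, .tgid = 100 };
        struct task thread = { .pid = 101, .tgid = 100 };
        leader.group_leader = &leader;
        thread.group_leader = &leader;

        /* Mid-de_thread snapshot (illustrative): the exec'ing thread
         * has taken over the leader's pid, but the group_leader
         * pointer has not been switched yet. */
        thread.pid = 100;
        leader.pid = 101;

        /* thread_group_leader(p) is p == p->group_leader: rejects it. */
        assert(thread.group_leader != &thread);
        /* has_group_leader_pid(p) is p->pid == p->tgid: accepts it,
         * so readdir still reports pid 100. */
        assert(thread.pid == thread.tgid);
        return 0;
    }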


2006-09-06 09:00:40

by Jean Delvare

Subject: Re: [PATCH] proc: readdir race fix (take 3)

On Tuesday 5 September 2006 16:52, Eric W. Biederman wrote:
> The problem: An opendir, readdir, closedir sequence can fail to report
> process ids that are continually in use throughout the sequence of
> system calls. For this race to trigger the process that
> proc_pid_readdir stops at must exit before readdir is called again.
>
> This can cause ps to fail to report processes, and it is in violation
> of POSIX guarantees and normal application expectations with respect
> to readdir.
>
> Currently there is no way to work around this problem in user space
> short of providing a gargantuan buffer to user space so the directory
> read all happens in one system call.
>
> This patch implements the normal directory semantics for proc,
> which guarantee that a directory entry that is neither created nor
> destroyed while reading the directory will be returned. Directories
> that are either created or destroyed during the readdir may or may
> not be seen. Furthermore you may seek to a directory offset you have
> previously seen.
>
> These are the guarantees that ext[23] provides and that POSIX
> requires, and more importantly that user space expects. Plus it is a
> simple semantic on which to build a reliable service: it is just a
> matter of calling readdir a second time if you are wondering whether
> something new has shown up.
>
> These better semantics are implemented by scanning through the
> pids in numerical order and by making the file offset a pid
> plus a fixed offset.
>
> The pid scan happens on the pid bitmap, which when you look at it is
> remarkably efficient for a brute force algorithm. Given that a typical
> cache line is 64 bytes, it covers space for 64*8 == 512 pids, so there
> are only 64 cache lines for the entire 32K pid space. A typical system
> will have 100 pids or more, so this is actually fewer cache lines than
> we would have to touch to scan a linked list of tasks, and the worst
> case of having to scan the entire pid bitmap is pretty reasonable.
>
> If we need something more efficient we can go to a more efficient data
> structure for indexing the pids, but for now what we have should be
> sufficient.
>
> In addition this takes no additional locks and is actually less
> code than what we are doing now.
>
> Also another very subtle bug in this area has been fixed. It is
> possible to catch a task in the middle of de_thread, where a thread is
> assuming the identity of its thread group leader. This patch carefully
> handles that case, so that if we hit it we don't fail to return the
> pid that is undergoing the de_thread dance.
>
> This patch is against 2.6.18-rc6 and it should be relatively
> straightforward to backport to older kernels as well.
>
> Thanks to KAMEZAWA Hiroyuki <[email protected]> for
> providing the first fix, pointing this out and working on it.

Eric, Kame, thanks a lot for working on this. I'll be giving some good
testing to this patch today, and will report back to you when I'm done.

--
Jean Delvare

2006-09-06 21:11:56

by Jean Delvare

Subject: Re: [PATCH] proc: readdir race fix (take 3)

On Wednesday 6 September 2006 11:01, Jean Delvare wrote:
> Eric, Kame, thanks a lot for working on this. I'll be giving some good
> testing to this patch today, and will report back to you when I'm done.

The original issue is indeed fixed, but there's a problem with the patch.
When stressing /proc (to verify the bug was fixed), my test machine ended
up crashing. Here are the 2 traces I found in the logs:

Sep 6 12:06:00 arrakis kernel: BUG: warning at
kernel/fork.c:113/__put_task_struct()
Sep 6 12:06:00 arrakis kernel: [<c0115f93>] __put_task_struct+0xf3/0x100
Sep 6 12:06:00 arrakis kernel: [<c019666a>] proc_pid_readdir+0x13a/0x150
Sep 6 12:06:00 arrakis kernel: [<c01745f0>] vfs_readdir+0x80/0xa0
Sep 6 12:06:00 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
Sep 6 12:06:00 arrakis kernel: [<c017488c>] sys_getdents+0x6c/0xb0
Sep 6 12:06:00 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
Sep 6 12:06:00 arrakis kernel: [<c0102fb7>] syscall_call+0x7/0xb
Sep 6 12:06:05 arrakis kernel: BUG: warning at
kernel/fork.c:113/__put_task_struct()
Sep 6 12:06:05 arrakis kernel: [<c0115f93>] __put_task_struct+0xf3/0x100
Sep 6 12:06:05 arrakis kernel: [<c019666a>] proc_pid_readdir+0x13a/0x150
Sep 6 12:06:05 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
Sep 6 12:06:05 arrakis kernel: [<c01745f0>] vfs_readdir+0x80/0xa0
Sep 6 12:06:05 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
Sep 6 12:06:05 arrakis kernel: [<c017488c>] sys_getdents+0x6c/0xb0
Sep 6 12:06:05 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
Sep 6 12:06:05 arrakis kernel: [<c0102fb7>] syscall_call+0x7/0xb
Sep 6 12:06:05 arrakis kernel: BUG: unable to handle kernel NULL pointer
dereference at virtual address 00000004
Sep 6 12:06:05 arrakis kernel: printing eip:
Sep 6 12:06:05 arrakis kernel: c0115ede
Sep 6 12:06:05 arrakis kernel: *pde = 00000000
Sep 6 12:06:05 arrakis kernel: Oops: 0002 [#1]
Sep 6 12:06:05 arrakis kernel: PREEMPT
Sep 6 12:06:05 arrakis kernel: Modules linked in: button battery ac
thermal processor cpufreq_powersave speedstep_ich speedstep_lib
freq_table snd_pcm_oss snd_mixer_oss smbfs sg sd_mod nls_iso8859_1
nls_cp437 vfat fat radeon drm usb_storage scsi_mod eth1394 usbhid
snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm snd_timer ide_cd
uhci_hcd snd intel_rng psmouse cdrom usbcore evdev soundcore ohci1394
intel_agp i2c_i801 snd_page_alloc ieee1394 e100 mii unix
Sep 6 12:06:05 arrakis kernel: CPU: 0
Sep 6 12:06:05 arrakis kernel: EIP: 0060:[<c0115ede>] Not tainted
VLI
Sep 6 12:06:05 arrakis kernel: EFLAGS: 00010282 (2.6.18-rc6 #113)
Sep 6 12:06:05 arrakis kernel: EIP is at __put_task_struct+0x3e/0x100
Sep 6 12:06:05 arrakis kernel: eax: 00000000 ebx: c7ba7580 ecx:
c0316bd4 edx: 00000000
Sep 6 12:06:05 arrakis kernel: esi: 00000000 edi: cc4b2d40 ebp:
c953ef48 esp: c953ef14
Sep 6 12:06:05 arrakis kernel: ds: 007b es: 007b ss: 0068
Sep 6 12:06:05 arrakis kernel: Process bug (pid: 28719, ti=c953e000
task=c953da90 task.ti=c953e000)
Sep 6 12:06:05 arrakis kernel: Stack: 00000000 c02ddbf5 00000071 c02c8401
c7ba7580 c019666a c7ba7580 c953ef48
Sep 6 12:06:05 arrakis kernel: 00000001 00000101 00000000 00000002
00000004 39350030 cc4b0038 c953ef98
Sep 6 12:06:05 arrakis kernel: c0174750 cc4b2d40 c137bdf0 fffffffe
c137be60 c01745f0 cc4b2d40 c953ef98
Sep 6 12:06:05 arrakis kernel: Call Trace:
Sep 6 12:06:05 arrakis kernel: [<c019666a>] proc_pid_readdir+0x13a/0x150
Sep 6 12:06:05 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
Sep 6 12:06:05 arrakis kernel: [<c01745f0>] vfs_readdir+0x80/0xa0
Sep 6 12:06:05 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
Sep 6 12:06:05 arrakis kernel: [<c017488c>] sys_getdents+0x6c/0xb0
Sep 6 12:06:05 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
Sep 6 12:06:05 arrakis kernel: [<c0102fb7>] syscall_call+0x7/0xb
Sep 6 12:06:05 arrakis kernel: Code: 84 af 00 00 00 8b 43 08 85 c0 75 77
89 e0 25 00 f0 ff ff 3b 18 74 3e 8b 83 84 01 00 00 89 04 24 e8 68 e0 00
00 8b 83 70 01 00 00 <ff> 48 04 0f 94 c2 84 d2 75 10 89 5c 24 18 8b 5c 24
10 83 c4 14
Sep 6 12:06:05 arrakis kernel: EIP: [<c0115ede>]
__put_task_struct+0x3e/0x100 SS:ESP 0068:c953ef14


Sep 6 19:38:23 arrakis kernel: BUG: warning at
kernel/fork.c:113/__put_task_struct()
Sep 6 19:38:23 arrakis kernel: [<c0115f93>] __put_task_struct+0xf3/0x100
Sep 6 19:38:23 arrakis kernel: [<c019667e>] proc_pid_readdir+0x14e/0x150
Sep 6 19:38:23 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
Sep 6 19:38:23 arrakis kernel: [<c01745f0>] vfs_readdir+0x80/0xa0
Sep 6 19:38:23 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
Sep 6 19:38:23 arrakis kernel: [<c017488c>] sys_getdents+0x6c/0xb0
Sep 6 19:38:23 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
Sep 6 19:38:23 arrakis kernel: [<c0102fb7>] syscall_call+0x7/0xb
Sep 6 19:38:26 arrakis kernel: BUG: unable to handle kernel NULL pointer
dereference at virtual address 0000000c
Sep 6 19:38:26 arrakis kernel: printing eip:
Sep 6 19:38:26 arrakis kernel: c01d2a69
Sep 6 19:38:26 arrakis kernel: *pde = 00000000
Sep 6 19:38:26 arrakis kernel: Oops: 0000 [#1]
Sep 6 19:38:26 arrakis kernel: PREEMPT
Sep 6 19:38:26 arrakis kernel: Modules linked in: button battery ac
thermal processor cpufreq_powersave speedstep_ich speedstep_lib
freq_table snd_pcm_oss snd_mixer_oss smbfs nls_iso8859_1 nls_cp437 vfat
fat radeon drm sg sd_mod usb_storage scsi_mod eth1394 usbhid snd_intel8x0
snd_ac97_codec snd_ac97_bus snd_pcm snd_timer intel_rng uhci_hcd snd
ide_cd intel_agp e100 usbcore i2c_i801 cdrom ohci1394 psmouse soundcore
evdev mii ieee1394 snd_page_alloc unix
Sep 6 19:38:26 arrakis kernel: CPU: 0
Sep 6 19:38:26 arrakis kernel: EIP: 0060:[<c01d2a69>] Not tainted
VLI
Sep 6 19:38:26 arrakis kernel: EFLAGS: 00010017 (2.6.18-rc6 #115)
Sep 6 19:38:26 arrakis kernel: EIP is at rwsem_wake+0x99/0x140
Sep 6 19:38:26 arrakis kernel: eax: 00000000 ebx: d6e483b4 ecx:
cd522f64 edx: 00000001
Sep 6 19:38:26 arrakis kernel: esi: cd522f64 edi: b393a000 ebp:
d6e48380 esp: d34dff58
Sep 6 19:38:26 arrakis kernel: ds: 007b es: 007b ss: 0068
Sep 6 19:38:26 arrakis kernel: Process firefox-bin (pid: 2400,
ti=d34df000 task=d6e6c580 task.ti=d34df000)
Sep 6 19:38:26 arrakis kernel: Stack: cf2d94e8 d6e483b8 00000292 cd784df4
d6e483b4 b393a000 d6e48380 c013219e
Sep 6 19:38:26 arrakis kernel: 00000025 c0112b99 d6e483b4 cd784df4
b393a000 00000000 00030002 00000000
Sep 6 19:38:26 arrakis kernel: b393a000 d6e6c580 00000004 d34dffbc
b77d4b94 09473ed0 c0112810 bfa3ea00
Sep 6 19:38:26 arrakis kernel: Call Trace:
Sep 6 19:38:26 arrakis kernel: [<c013219e>] .text.lock.rwsem+0x21/0x43
Sep 6 19:38:26 arrakis kernel: [<c0112b99>] do_page_fault+0x389/0x580
Sep 6 19:38:26 arrakis kernel: [<c0112810>] do_page_fault+0x0/0x580
Sep 6 19:38:26 arrakis kernel: [<c01039f1>] error_code+0x39/0x40
Sep 6 19:38:26 arrakis kernel: Code: c3 e8 1c 0b 0f 00 eb ef 8b 4b 04 89
ce f6 41 0c 02 75 7c 31 d2 8d b6 00 00 00 00 8d bc 27 00 00 00 00 8b 01
42 3b 44 24 04 74 08 <f6> 40 0c 01 89 c1 75 ef 89 d0 89 d5 c1 e0 10 8d 54
02 ff 01 13
Sep 6 19:38:26 arrakis kernel: EIP: [<c01d2a69>] rwsem_wake+0x99/0x140
SS:ESP 0068:d34dff58
Sep 6 19:38:26 arrakis kernel: <6>note: firefox-bin[2400] exited with
preempt_count 1
Sep 6 19:38:26 arrakis kernel: BUG: scheduling while atomic:
firefox-bin/0x00000001/2400
Sep 6 19:38:26 arrakis kernel: [<c02c3552>] schedule+0x662/0x670
Sep 6 19:38:26 arrakis kernel: [<c02c4ebd>]
rwsem_down_read_failed+0xbd/0x1a0
Sep 6 19:38:26 arrakis kernel: [<c0237435>] vt_console_print+0x85/0x2a0
Sep 6 19:38:26 arrakis kernel: [<c0132184>] .text.lock.rwsem+0x7/0x43
Sep 6 19:38:26 arrakis kernel: [<c0133585>] futex_wake+0x25/0xf0
Sep 6 19:38:26 arrakis kernel: [<c0135210>] do_futex+0x70/0x110
Sep 6 19:38:26 arrakis kernel: [<c0119a34>]
release_console_sem+0x104/0x110
Sep 6 19:38:26 arrakis kernel: [<c013530c>] sys_futex+0x5c/0x130
Sep 6 19:38:26 arrakis kernel: [<c011633b>] mm_release+0x8b/0x90
Sep 6 19:38:26 arrakis kernel: [<c011b0e5>] exit_mm+0x25/0x110
Sep 6 19:38:26 arrakis kernel: [<c011b999>] do_exit+0xf9/0x530
Sep 6 19:38:26 arrakis kernel: [<c01042cc>] die+0x20c/0x210
Sep 6 19:38:26 arrakis kernel: [<c0112ab4>] do_page_fault+0x2a4/0x580
Sep 6 19:38:26 arrakis kernel: [<c0112810>] do_page_fault+0x0/0x580
Sep 6 19:38:26 arrakis kernel: [<c01039f1>] error_code+0x39/0x40
Sep 6 19:38:26 arrakis kernel: [<c01d2a69>] rwsem_wake+0x99/0x140
Sep 6 19:38:26 arrakis kernel: [<c013219e>] .text.lock.rwsem+0x21/0x43
Sep 6 19:38:26 arrakis kernel: [<c0112b99>] do_page_fault+0x389/0x580
Sep 6 19:38:26 arrakis kernel: [<c0112810>] do_page_fault+0x0/0x580
Sep 6 19:38:26 arrakis kernel: [<c01039f1>] error_code+0x39/0x40
Sep 6 19:38:27 arrakis kernel: BUG: warning at
kernel/fork.c:113/__put_task_struct()
Sep 6 19:38:27 arrakis kernel: [<c0115f93>] __put_task_struct+0xf3/0x100
Sep 6 19:38:27 arrakis kernel: [<c019666a>] proc_pid_readdir+0x13a/0x150
Sep 6 19:38:27 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
Sep 6 19:38:27 arrakis kernel: [<c01745f0>] vfs_readdir+0x80/0xa0
Sep 6 19:38:27 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
Sep 6 19:38:27 arrakis kernel: [<c017488c>] sys_getdents+0x6c/0xb0
Sep 6 19:38:27 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
Sep 6 19:38:27 arrakis kernel: [<c0102fb7>] syscall_call+0x7/0xb
Sep 6 19:38:27 arrakis kernel: BUG: unable to handle kernel paging
request at virtual address 61672d78
Sep 6 19:38:27 arrakis kernel: printing eip:
Sep 6 19:38:27 arrakis kernel: c01cf5ff
Sep 6 19:38:27 arrakis kernel: *pde = 00000000
Sep 6 19:38:27 arrakis kernel: Oops: 0002 [#2]
Sep 6 19:38:27 arrakis kernel: PREEMPT
Sep 6 19:38:27 arrakis kernel: Modules linked in: button battery ac
thermal processor cpufreq_powersave speedstep_ich speedstep_lib
freq_table snd_pcm_oss snd_mixer_oss smbfs nls_iso8859_1 nls_cp437 vfat
fat radeon drm sg sd_mod usb_storage scsi_mod eth1394 usbhid snd_intel8x0
snd_ac97_codec snd_ac97_bus snd_pcm snd_timer intel_rng uhci_hcd snd
ide_cd intel_agp e100 usbcore i2c_i801 cdrom ohci1394 psmouse soundcore
evdev mii ieee1394 snd_page_alloc unix
Sep 6 19:38:27 arrakis kernel: CPU: 0
Sep 6 19:38:27 arrakis kernel: EIP: 0060:[<c01cf5ff>] Not tainted
VLI
Sep 6 19:38:27 arrakis kernel: EFLAGS: 00010013 (2.6.18-rc6 #115)
Sep 6 19:38:27 arrakis kernel: EIP is at _atomic_dec_and_lock+0xf/0x50
Sep 6 19:38:27 arrakis kernel: eax: cd772000 ebx: 61672d78 ecx:
00000000 edx: 00000001
Sep 6 19:38:27 arrakis kernel: esi: 61672d78 edi: cd1f84e0 ebp:
cd772f48 esp: cd772ef8
Sep 6 19:38:27 arrakis kernel: ds: 007b es: 007b ss: 0068
Sep 6 19:38:27 arrakis kernel: Process bug (pid: 10459, ti=cd772000
task=cd75a030 task.ti=cd772000)
Sep 6 19:38:27 arrakis kernel: Stack: 00000206 c0123f67 61672d78 c03af980
cd521ab0 00000000 c0115ed8 61672d78
Sep 6 19:38:27 arrakis kernel: c02ddbd5 00000071 c02c8401 cd521ab0
c019666a cd521ab0 cd772f48 00000001
Sep 6 19:38:27 arrakis kernel: 00000101 00000000 00000002 00000004
35310030 cd1f8400 cd772f98 c0174750
Sep 6 19:38:27 arrakis kernel: Call Trace:
Sep 6 19:38:27 arrakis kernel: [<c0123f67>] free_uid+0x27/0xa0
Sep 6 19:38:27 arrakis kernel: [<c0115ed8>] __put_task_struct+0x38/0x100
Sep 6 19:38:27 arrakis kernel: [<c019666a>] proc_pid_readdir+0x13a/0x150
Sep 6 19:38:27 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
Sep 6 19:38:27 arrakis kernel: [<c01745f0>] vfs_readdir+0x80/0xa0
Sep 6 19:38:27 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
Sep 6 19:38:27 arrakis kernel: [<c017488c>] sys_getdents+0x6c/0xb0
Sep 6 19:38:27 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
Sep 6 19:38:27 arrakis kernel: [<c0102fb7>] syscall_call+0x7/0xb
Sep 6 19:38:27 arrakis kernel: Code: 01 c1 e2 0a 89 06 8b 5c 24 0c 89 d0
89 ca 8b 74 24 10 83 c4 14 c3 90 90 90 90 90 90 53 b8 01 00 00 00 8b 5c
24 08 e8 21 52 f4 ff <ff> 0b 0f 94 c0 84 c0 ba 01 00 00 00 74 04 5b 89 d0
c3 b8 01 00
Sep 6 19:38:27 arrakis kernel: EIP: [<c01cf5ff>]
_atomic_dec_and_lock+0xf/0x50 SS:ESP 0068:cd772ef8
Sep 6 19:38:27 arrakis kernel: <6>note: bug[10459] exited with
preempt_count 1

Sometimes the machine just hung, with nothing in the logs. The machine is
a Sony laptop (i686).

I have been testing the patch on another machine (x86_64) and had no
problem at all, so the reproducibility of the bug might depend on the
arch or some config option. I'll help nail down this issue if I can;
just tell me what to do.

Thanks,
--
Jean Delvare

2006-09-06 22:26:27

by Oleg Nesterov

Subject: Re: [PATCH] proc: readdir race fix (take 3)

On 09/06, Jean Delvare wrote:
>
> On Wednesday 6 September 2006 11:01, Jean Delvare wrote:
> > Eric, Kame, thanks a lot for working on this. I'll be giving some good
> > testing to this patch today, and will report back to you when I'm done.
>
> The original issue is indeed fixed, but there's a problem with the patch.
> When stressing /proc (to verify the bug was fixed), my test machine ended
> up crashing. Here are the 2 traces I found in the logs:
>
> Sep 6 12:06:00 arrakis kernel: BUG: warning at
> kernel/fork.c:113/__put_task_struct()
> Sep 6 12:06:00 arrakis kernel: [<c0115f93>] __put_task_struct+0xf3/0x100
> Sep 6 12:06:00 arrakis kernel: [<c019666a>] proc_pid_readdir+0x13a/0x150
> Sep 6 12:06:00 arrakis kernel: [<c01745f0>] vfs_readdir+0x80/0xa0
> Sep 6 12:06:00 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
> Sep 6 12:06:00 arrakis kernel: [<c017488c>] sys_getdents+0x6c/0xb0
> Sep 6 12:06:00 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
> Sep 6 12:06:00 arrakis kernel: [<c0102fb7>] syscall_call+0x7/0xb

I think there is a bug in next_tgid(),

> -static struct task_struct *next_tgid(struct task_struct *start)
> -{
> - struct task_struct *pos;
> + task = NULL;
> rcu_read_lock();
> - pos = start;
> - if (pid_alive(start))
> - pos = next_task(start);
> - if (pid_alive(pos) && (pos != &init_task)) {
> - get_task_struct(pos);
> - goto done;
> +retry:
> + pid = find_ge_pid(tgid);
> + if (pid) {
> + tgid = pid->nr + 1;
> + task = pid_task(pid, PIDTYPE_PID);
> + /* What we want to know is if the pid we have found is the
> + * pid of a thread_group_leader. Testing for task
> + * being a thread_group_leader is the obvious thing to
> + * do, but there is a window when it fails, due to
> + * the pid transfer logic in de_thread.
> + *
> + * So we perform the straightforward test of seeing
> + * if the pid we have found is the pid of a thread
> + * group leader, and don't worry if the task we have
> + * found doesn't happen to be a thread group leader,
> + * as we don't care in the case of readdir.
> + */
> + if (!task || !has_group_leader_pid(task))
> + goto retry;
> + get_task_struct(task);
> }
> - pos = NULL;
> -done:
> rcu_read_unlock();
> - put_task_struct(start);
> - return pos;
> + return task;
> }

If the task found is not a group leader, we go to retry, but
the task != NULL.

Now, if find_ge_pid(tgid) returns NULL, we return that wrong
task, and it was not get_task_struct()'ed.
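
IOW, the flow with the patch applied is roughly this (a simplified
sketch, not the patched source verbatim; the signature is cut off in
the hunk header, so the "unsigned int tgid" parameter is my
assumption):

static struct task_struct *next_tgid(unsigned int tgid)
{
	struct task_struct *task;
	struct pid *pid;

	task = NULL;			/* set once, before any retry */
	rcu_read_lock();
retry:
	pid = find_ge_pid(tgid);
	if (pid) {
		tgid = pid->nr + 1;
		task = pid_task(pid, PIDTYPE_PID);
		if (!task || !has_group_leader_pid(task))
			goto retry;	/* task keeps whatever value it
					 * got on this pass */
		get_task_struct(task);
	}
	rcu_read_unlock();
	return task;			/* if the retry's find_ge_pid()
					 * returned NULL, this is stale and
					 * carries no reference */
}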

Oleg.


2006-09-06 22:39:30

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH] proc-readdir-race-fix-take-3-fix-3

On 09/07, Oleg Nesterov wrote:
>
> On 09/06, Jean Delvare wrote:
> >
> > On Wednesday 6 September 2006 11:01, Jean Delvare wrote:
> > > Eric, Kame, thanks a lot for working on this. I'll be giving some good
> > > testing to this patch today, and will return back to you when I'm done.
> >
> > The original issue is indeed fixed, but there's a problem with the patch.
> > When stressing /proc (to verify the bug was fixed), my test machine ended
> > up crashing. Here are the 2 traces I found in the logs:
> >
> > Sep 6 12:06:00 arrakis kernel: BUG: warning at
> > kernel/fork.c:113/__put_task_struct()
> > Sep 6 12:06:00 arrakis kernel: [<c0115f93>] __put_task_struct+0xf3/0x100
> > Sep 6 12:06:00 arrakis kernel: [<c019666a>] proc_pid_readdir+0x13a/0x150
> > Sep 6 12:06:00 arrakis kernel: [<c01745f0>] vfs_readdir+0x80/0xa0
> > Sep 6 12:06:00 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
> > Sep 6 12:06:00 arrakis kernel: [<c017488c>] sys_getdents+0x6c/0xb0
> > Sep 6 12:06:00 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
> > Sep 6 12:06:00 arrakis kernel: [<c0102fb7>] syscall_call+0x7/0xb
>
> If the task found is not a group leader, we go to retry, but
> the task != NULL.
>
> Now, if find_ge_pid(tgid) returns NULL, we return that wrong
> task, and it was not get_task_struct()'ed.

Signed-off-by: Oleg Nesterov <[email protected]>

--- t/fs/proc/base.c~ 2006-09-07 02:33:26.000000000 +0400
+++ t/fs/proc/base.c 2006-09-07 02:34:19.000000000 +0400
@@ -2149,9 +2149,9 @@ static struct task_struct *next_tgid(uns
struct task_struct *task;
struct pid *pid;

- task = NULL;
rcu_read_lock();
retry:
+ task = NULL;
pid = find_ge_pid(tgid);
if (pid) {
tgid = pid->nr + 1;
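
With task reset inside the loop, every retry starts from a clean
slate; the function then reads roughly as follows (again only a
sketch, parameter type assumed from the abbreviated hunk header):

static struct task_struct *next_tgid(unsigned int tgid)
{
	struct task_struct *task;
	struct pid *pid;

	rcu_read_lock();
retry:
	task = NULL;			/* reset before every lookup */
	pid = find_ge_pid(tgid);
	if (pid) {
		tgid = pid->nr + 1;
		task = pid_task(pid, PIDTYPE_PID);
		if (!task || !has_group_leader_pid(task))
			goto retry;
		get_task_struct(task);	/* reference taken only on success */
	}
	rcu_read_unlock();
	return task;			/* NULL, or a referenced group leader */
}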

2006-09-06 22:44:16

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] proc: readdir race fix (take 3)

Jean Delvare <[email protected]> writes:

> On Wednesday 6 September 2006 11:01, Jean Delvare wrote:
>> Eric, Kame, thanks a lot for working on this. I'll be giving some good
>> testing to this patch today, and will return back to you when I'm done.
>
> The original issue is indeed fixed, but there's a problem with the patch.
> When stressing /proc (to verify the bug was fixed), my test machine ended
> up crashing. Here are the 2 traces I found in the logs:

Ugh.

So the death in __put_task_struct() is from:
WARN_ON(!(tsk->exit_state & (EXIT_DEAD | EXIT_ZOMBIE)));
So it appears something is decrementing the usage count on the
task_struct without a matching increment.
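
That is, the rule the WARN_ON is enforcing is just this pairing (a
sketch of the intended discipline, not code from the patch):

	rcu_read_lock();
	task = pid_task(pid, PIDTYPE_PID);
	if (task)
		get_task_struct(task);	/* +1 while still RCU-protected */
	rcu_read_unlock();

	/* ... use task ... */

	if (task)
		put_task_struct(task);	/* -1; a put without the matching
					 * get drives the count to zero too
					 * early and trips the WARN_ON */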

Now what is interesting is that there are a couple of other failure
modes present here:
free_uid() called from __put_task_struct() is failing.


And you seem to have a recursive page fault going on somewhere.

I suspect the triggering of this bug is the result of an earlier oops
that left some process half cleaned up.

Have you tested 2.6.18-rc6 without my patch?
If not, can you please test the same 2.6.18-rc6 configuration with my patch?

> Sometimes the machine just hung, with nothing in the logs. The machine is
> a Sony laptop (i686).
>
> I have been testing the patch on another machine (x86_64) and had no
> problem at all, so the reproducibility of the bug might depend on the
> arch or some config option. I'll help nail down this issue if I can,
> just tell me what to do.

So I don't know what is going on with your laptop. It feels nasty.

I think my patch is just tripping on the problem, rather than causing
it. The previous version of fs/proc/base.c should have tripped over
this problem as well if it happened to have hit the same process.

I'm staring at the patch and I cannot think of anything that would
explain your problem. The reference counting is simple and the only
bug I had in a posted version was a failure to decrement the count
on the task_struct.

I guess the practical question is what was your test methodology to
reproduce this problem? A couple more people running the same
test on a few more machines might at least give us confidence in what
is going on.

Eric

2006-09-06 23:00:15

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] proc-readdir-race-fix-take-3-fix-3

Oleg Nesterov <[email protected]> writes:

> On 09/07, Oleg Nesterov wrote:
>>
>> On 09/06, Jean Delvare wrote:
>> >
>> > On Wednesday 6 September 2006 11:01, Jean Delvare wrote:
>> > > Eric, Kame, thanks a lot for working on this. I'll be giving some good
>> > > testing to this patch today, and will return back to you when I'm done.
>> >
>> > The original issue is indeed fixed, but there's a problem with the patch.
>> > When stressing /proc (to verify the bug was fixed), my test machine ended
>> > up crashing. Here are the 2 traces I found in the logs:
>> >
>> > Sep 6 12:06:00 arrakis kernel: BUG: warning at
>> > kernel/fork.c:113/__put_task_struct()
>> > Sep 6 12:06:00 arrakis kernel: [<c0115f93>] __put_task_struct+0xf3/0x100
>> > Sep 6 12:06:00 arrakis kernel: [<c019666a>] proc_pid_readdir+0x13a/0x150
>> > Sep 6 12:06:00 arrakis kernel: [<c01745f0>] vfs_readdir+0x80/0xa0
>> > Sep 6 12:06:00 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
>> > Sep 6 12:06:00 arrakis kernel: [<c017488c>] sys_getdents+0x6c/0xb0
>> > Sep 6 12:06:00 arrakis kernel: [<c0174750>] filldir+0x0/0xd0
>> > Sep 6 12:06:00 arrakis kernel: [<c0102fb7>] syscall_call+0x7/0xb
>>
>> If the task found is not a group leader, we go to retry, but
>> the task != NULL.
>>
>> Now, if find_ge_pid(tgid) returns NULL, we return that wrong
>> task, and it was not get_task_struct()'ed.

Yep. That would do it. And of course having written the
code it was very hard for me to step back far enough to see that.

The other two failure modes still don't make sense to me but
they may have been a side effect of this.

And triggering this would have taken a thread that was the highest pid
in the system exiting while we were calling filldir.
Wow. That was a good stress test.


Signed-off-by: Eric W. Biederman <[email protected]>

> Signed-off-by: Oleg Nesterov <[email protected]>
>
> --- t/fs/proc/base.c~ 2006-09-07 02:33:26.000000000 +0400
> +++ t/fs/proc/base.c 2006-09-07 02:34:19.000000000 +0400
> @@ -2149,9 +2149,9 @@ static struct task_struct *next_tgid(uns
> struct task_struct *task;
> struct pid *pid;
>
> - task = NULL;
> rcu_read_lock();
> retry:
> + task = NULL;
> pid = find_ge_pid(tgid);
> if (pid) {
> tgid = pid->nr + 1;

2006-09-07 08:30:05

by Jean Delvare

[permalink] [raw]
Subject: Re: [PATCH] proc: readdir race fix (take 3)

On Thursday 7 September 2006 00:43, Eric W. Biederman wrote:
> Have you tested 2.6.18-rc6 without my patch?

Yes I did; it didn't crash after a couple of hours. Of course it doesn't
prove anything, as the crash appears to be the result of a race.

I'll now apply Oleg's fix and see if things get better.

> I guess the practical question is what was your test methodology to
> reproduce this problem? A couple more people running the same
> test on a few more machines might at least give us confidence in what
> is going on.

"My" test program forks 1000 children who sleep for 1 second then look for
themselves in /proc, warn if they can't find themselves, and exit. So
basically the idea is that the process list will shrink very rapidly at
the same moment every child does readdir(/proc).

I attached the test program; I take no credit (nor shame) for it. It was
provided to me by IBM (possibly on behalf of one of their own customers)
as a way to demonstrate and reproduce the original readdir(/proc) race
bug.
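
In short it does something like this (a rough sketch of the logic, not
the attached test.c itself):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <dirent.h>
#include <sys/types.h>
#include <sys/wait.h>

#define NCHILDREN 1000

int main(void)
{
	int i;

	for (i = 0; i < NCHILDREN; i++) {
		pid_t pid = fork();
		if (pid < 0) {
			perror("fork");
			exit(1);
		}
		if (pid == 0) {
			/* child: sleep so all children hit /proc at about
			 * the same moment, then look for our own pid in
			 * the readdir output */
			char me[32];
			struct dirent *de;
			DIR *dir;
			int found = 0;

			sleep(1);
			snprintf(me, sizeof(me), "%d", getpid());
			dir = opendir("/proc");
			if (!dir)
				exit(1);
			while ((de = readdir(dir)) != NULL) {
				if (strcmp(de->d_name, me) == 0) {
					found = 1;
					break;
				}
			}
			closedir(dir);
			if (!found)
				fprintf(stderr, "pid %s missing from /proc\n", me);
			exit(0);
		}
	}
	for (i = 0; i < NCHILDREN; i++)
		wait(NULL);
	return 0;
}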

--
Jean Delvare


Attachments:
test.c (1.12 kB)

2006-09-07 13:59:10

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] proc: readdir race fix (take 3)

Jean Delvare <[email protected]> writes:

> On Thursday 7 September 2006 00:43, Eric W. Biederman wrote:
>> Have you tested 2.6.18-rc6 without my patch?
>
> Yes I did; it didn't crash after a couple of hours. Of course it doesn't
> prove anything, as the crash appears to be the result of a race.
>
> I'll now apply Oleg's fix and see if things get better.
>
>> I guess the practical question is what was your test methodology to
>> reproduce this problem? A couple more people running the same
>> test on a few more machines might at least give us confidence in what
>> is going on.
>
> "My" test program forks 1000 children who sleep for 1 second then look for
> themselves in /proc, warn if they can't find themselves, and exit. So
> basically the idea is that the process list will shrink very rapidly at
> the same moment every child does readdir(/proc).
>
> I attached the test program; I take no credit (nor shame) for it. It was
> provided to me by IBM (possibly on behalf of one of their own customers)
> as a way to demonstrate and reproduce the original readdir(/proc) race
> bug.

Ok. So whatever is creating lots of child threads that tripped you
up is probably peculiar to the environment on your laptop.

Eric

2006-09-07 18:05:38

by Jean Delvare

[permalink] [raw]
Subject: Re: [PATCH] proc: readdir race fix (take 3)

On Thursday 7 September 2006 15:57, Eric W. Biederman wrote:
> Jean Delvare <[email protected]> writes:
> > I'll now apply Oleg's fix and see if things get better.

After 8 hours of stress testing on two machines, no crash and no freeze.
So Oleg's fix seems to do the trick. Thanks Oleg :)

I'll keep the patches applied on both machines; even without stress tests,
it is still better to make sure nothing bad happens in the long run.

> > "My" test program forks 1000 children who sleep for 1 second then
> > look for themselves in /proc, warn if they can't find themselves, and
> > exit. So basically the idea is that the process list will shrink very
> > rapidly at the same moment every child does readdir(/proc).
> >
> > I attached the test program; I take no credit (nor shame) for it. It
> > was provided to me by IBM (possibly on behalf of one of their own
> > customers) as a way to demonstrate and reproduce the original
> > readdir(/proc) race bug.
>
> Ok. So whatever is creating lots of child threads that tripped you
> up is probably peculiar to the environment on your laptop.

There's nothing really special running on this laptop: Slackware 10.2 with
xterm, firefox, sylpheed, xchat, and that's about it. At least one of the
crashes I had yesterday happened when I was actively using firefox; I
can't tell for the other one.

The difference with the system where no problem was observed may be that
the laptop has a preemptible kernel, and the desktop doesn't.

--
Jean Delvare

2006-09-07 18:41:50

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] proc: readdir race fix (take 3)

Jean Delvare <[email protected]> writes:

> On Thursday 7 September 2006 15:57, Eric W. Biederman wrote:
>> Jean Delvare <[email protected]> writes:
>> > I'll now apply Oleg's fix and see if things get better.
>
> After 8 hours of stress testing on two machines, no crash and no freeze.
> So Oleg's fix seems to do the trick. Thanks Oleg :)
>
> I'll keep the patches applied on both machines; even without stress tests,
> it is still better to make sure nothing bad happens in the long run.
>
>> > "My" test program forks 1000 children who sleep for 1 second then
>> > look for themselves in /proc, warn if they can't find themselves, and
>> > exit. So basically the idea is that the process list will shrink very
>> > rapidly at the same moment every child does readdir(/proc).
>> >
>> > I attached the test program; I take no credit (nor shame) for it. It
>> > was provided to me by IBM (possibly on behalf of one of their own
>> > customers) as a way to demonstrate and reproduce the original
>> > readdir(/proc) race bug.
>>
>> Ok. So whatever is creating lots of child threads that tripped you
>> up is probably peculiar to the environment on your laptop.
>
> There's nothing really special running on this laptop: Slackware 10.2 with
> xterm, firefox, sylpheed, xchat, and that's about it. At least one of the
> crashes I had yesterday happened when I was actively using firefox; I
> can't tell for the other one.

Well, firefox is threaded, so that may be enough. It takes a threaded
program to be able to trigger it.
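
For instance, something as small as this keeps short-lived non-leader
threads around that can end up holding the highest pid in the system
(a hypothetical reproducer sketch, untested, not Jean's test program):

#include <pthread.h>
#include <unistd.h>

static void *worker(void *arg)
{
	/* a thread gets its own pid, but shares the tgid of its leader */
	sleep(1);
	return NULL;
}

int main(void)
{
	pthread_t t;

	for (;;) {
		/* on a quiet system the new thread's pid is often the
		 * highest one, and it exits while ps-style readers may
		 * still be walking /proc */
		pthread_create(&t, NULL, worker, NULL);
		pthread_join(t, NULL);
	}
}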

> The difference with the system where no problem was observed may be that
> the laptop has a preemptible kernel, and the desktop doesn't.

I suspect that just has the potential to make the window bigger.

Eric

2006-09-08 06:39:23

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] proc-readdir-race-fix-take-3-fix-3


Ok. While we are on the topic of races in readdir in /proc,
I just wanted to note that I believe thread groups have the same
potential problem.

I don't think it is severe enough to cause any real-world problems.

But it makes sense to note it.

Something to think about.


Eric