2009-03-16 06:51:40

by ZHANG, Le

[permalink] [raw]
Subject: [PATCH] filp->f_pos not correctly updated in proc_task_readdir

filp->f_pos only get updated at the end of the function. Thus d_off of those
dirents who are in the middle will be 0, and this will cause a problem in
glibc's readdir implementation, specifically endless loop. Because when overflow
occurs, f_pos will be set to next dirent to read, however it will be 0, unless
the next one is the last one. So it will start over again and again.

There is a sample program in man 2 gendents. This is the output of the program
running on a multithread program's task dir before this patch is applied:

$ ./a.out /proc/3807/task
--------------- nread=128 ---------------
i-node# file type d_reclen d_off d_name
506442 directory 16 1 .
506441 directory 16 0 ..
506443 directory 16 0 3807
506444 directory 16 0 3809
506445 directory 16 0 3812
506446 directory 16 0 3861
506447 directory 16 0 3862
506448 directory 16 8 3863

This is the output after this patch is applied

$ ./a.out /proc/3807/task
--------------- nread=128 ---------------
i-node# file type d_reclen d_off d_name
506442 directory 16 1 .
506441 directory 16 2 ..
506443 directory 16 3 3807
506444 directory 16 4 3809
506445 directory 16 5 3812
506446 directory 16 6 3861
506447 directory 16 7 3862
506448 directory 16 8 3863

Signed-off-by: Zhang Le <[email protected]>
---
fs/proc/base.c | 16 +++++++---------
1 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 0c9de19..cc6ea23 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3066,7 +3066,6 @@ static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldi
int retval = -ENOENT;
ino_t ino;
int tid;
- unsigned long pos = filp->f_pos; /* avoiding "long long" filp->f_pos */
struct pid_namespace *ns;

task = get_proc_task(inode);
@@ -3083,18 +3082,18 @@ static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldi
goto out_no_task;
retval = 0;

- switch (pos) {
+ switch (filp->f_pos) {
case 0:
ino = inode->i_ino;
- if (filldir(dirent, ".", 1, pos, ino, DT_DIR) < 0)
+ if (filldir(dirent, ".", 1, filp->f_pos, ino, DT_DIR) < 0)
goto out;
- pos++;
+ filp->f_pos++;
/* fall through */
case 1:
ino = parent_ino(dentry);
- if (filldir(dirent, "..", 2, pos, ino, DT_DIR) < 0)
+ if (filldir(dirent, "..", 2, filp->f_pos, ino, DT_DIR) < 0)
goto out;
- pos++;
+ filp->f_pos++;
/* fall through */
}

@@ -3104,9 +3103,9 @@ static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldi
ns = filp->f_dentry->d_sb->s_fs_info;
tid = (int)filp->f_version;
filp->f_version = 0;
- for (task = first_tid(leader, tid, pos - 2, ns);
+ for (task = first_tid(leader, tid, filp->f_pos - 2, ns);
task;
- task = next_tid(task), pos++) {
+ task = next_tid(task), filp->f_pos++) {
tid = task_pid_nr_ns(task, ns);
if (proc_task_fill_cache(filp, dirent, filldir, task, tid) < 0) {
/* returning this tgid failed, save it as the first
@@ -3117,7 +3116,6 @@ static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldi
}
}
out:
- filp->f_pos = pos;
put_task_struct(leader);
out_no_task:
return retval;
--
1.6.2


2009-03-16 08:35:19

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] filp->f_pos not correctly updated in proc_task_readdir

On Mon, Mar 16, 2009 at 02:44:31PM +0800, Zhang Le wrote:
> filp->f_pos only get updated at the end of the function. Thus d_off of those
> dirents who are in the middle will be 0, and this will cause a problem in
> glibc's readdir implementation, specifically endless loop. Because when overflow
> occurs, f_pos will be set to next dirent to read, however it will be 0, unless
> the next one is the last one. So it will start over again and again.

Eh... Here's what's really going on:

proc_..._fill_cache() API is rather kludgy and far too convoluted. In
particular, it calls filldir() and passes file->f_pos to it, expecting
the caller to update file->f_pos between the calls. proc_task_readdir()
uses that sucker, but doesn't update ->f_pos until the very end. As
the result, d_off of direntries produced by it gets screwed.
Broken-by: commit 61a28784028e6d55755e4d0f39bee8d9bf2ee8d9
Author: Eric W. Biederman <[email protected]>
Date: Mon Oct 2 02:18:49 2006 -0700

[PATCH] proc: Remove the hard coded inode numbers

Patch is fine, but I'd rather rip the layers of ..._fill_cache() abstractions
out and see what falls out. Anyway, for -rc your variant is definitely the
way to go.

2009-03-16 16:35:06

by ZHANG, Le

[permalink] [raw]
Subject: Re: [PATCH] filp->f_pos not correctly updated in proc_task_readdir

On 08:34 Mon 16 Mar , Al Viro wrote:
> On Mon, Mar 16, 2009 at 02:44:31PM +0800, Zhang Le wrote:
> > filp->f_pos only get updated at the end of the function. Thus d_off of those
> > dirents who are in the middle will be 0, and this will cause a problem in
> > glibc's readdir implementation, specifically endless loop. Because when overflow
> > occurs, f_pos will be set to next dirent to read, however it will be 0, unless
> > the next one is the last one. So it will start over again and again.
>
> Eh... Here's what's really going on:
>
> proc_..._fill_cache() API is rather kludgy and far too convoluted. In
> particular, it calls filldir() and passes file->f_pos to it, expecting
> the caller to update file->f_pos between the calls. proc_task_readdir()
> uses that sucker, but doesn't update ->f_pos until the very end. As
> the result, d_off of direntries produced by it gets screwed.
> Broken-by: commit 61a28784028e6d55755e4d0f39bee8d9bf2ee8d9
> Author: Eric W. Biederman <[email protected]>
> Date: Mon Oct 2 02:18:49 2006 -0700
>
> [PATCH] proc: Remove the hard coded inode numbers
>

Thanks for sharing the insights.

> Patch is fine, but I'd rather rip the layers of ..._fill_cache() abstractions
> out and see what falls out. Anyway, for -rc your variant is definitely the
> way to go.

So, will you take care of it? Or who else should I ask? Thanks in advance!

Zhang, Le
http://zhangle.is-a-geek.org

2009-03-17 10:30:47

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [PATCH] filp->f_pos not correctly updated in proc_task_readdir

On Mon, Mar 16, 2009 at 08:34:57AM +0000, Al Viro wrote:
> On Mon, Mar 16, 2009 at 02:44:31PM +0800, Zhang Le wrote:
> > filp->f_pos only get updated at the end of the function. Thus d_off of those
> > dirents who are in the middle will be 0, and this will cause a problem in
> > glibc's readdir implementation, specifically endless loop. Because when overflow
> > occurs, f_pos will be set to next dirent to read, however it will be 0, unless
> > the next one is the last one. So it will start over again and again.
>
> Eh... Here's what's really going on:
>
> proc_..._fill_cache() API is rather kludgy and far too convoluted. In
> particular, it calls filldir() and passes file->f_pos to it, expecting
> the caller to update file->f_pos between the calls. proc_task_readdir()
> uses that sucker, but doesn't update ->f_pos until the very end. As
> the result, d_off of direntries produced by it gets screwed.
> Broken-by: commit 61a28784028e6d55755e4d0f39bee8d9bf2ee8d9
> Author: Eric W. Biederman <[email protected]>
> Date: Mon Oct 2 02:18:49 2006 -0700
>
> [PATCH] proc: Remove the hard coded inode numbers
>
> Patch is fine, but I'd rather rip the layers of ..._fill_cache() abstractions
> out and see what falls out. Anyway, for -rc your variant is definitely the
> way to go.

Except this bit:

fs/built-in.o: In function `proc_task_readdir':
compr_zlib.c:(.text+0x434b8): undefined reference to `__cmpdi2'
compr_zlib.c:(.text+0x434d4): undefined reference to `__cmpdi2'

2009-03-17 12:54:39

by ZHANG, Le

[permalink] [raw]
Subject: Re: [PATCH] filp->f_pos not correctly updated in proc_task_readdir

On 13:37 Tue 17 Mar , Alexey Dobriyan wrote:
> On Mon, Mar 16, 2009 at 08:34:57AM +0000, Al Viro wrote:
> > On Mon, Mar 16, 2009 at 02:44:31PM +0800, Zhang Le wrote:
> > > filp->f_pos only get updated at the end of the function. Thus d_off of those
> > > dirents who are in the middle will be 0, and this will cause a problem in
> > > glibc's readdir implementation, specifically endless loop. Because when overflow
> > > occurs, f_pos will be set to next dirent to read, however it will be 0, unless
> > > the next one is the last one. So it will start over again and again.
> >
> > Eh... Here's what's really going on:
> >
> > proc_..._fill_cache() API is rather kludgy and far too convoluted. In
> > particular, it calls filldir() and passes file->f_pos to it, expecting
> > the caller to update file->f_pos between the calls. proc_task_readdir()
> > uses that sucker, but doesn't update ->f_pos until the very end. As
> > the result, d_off of direntries produced by it gets screwed.
> > Broken-by: commit 61a28784028e6d55755e4d0f39bee8d9bf2ee8d9
> > Author: Eric W. Biederman <[email protected]>
> > Date: Mon Oct 2 02:18:49 2006 -0700
> >
> > [PATCH] proc: Remove the hard coded inode numbers
> >
> > Patch is fine, but I'd rather rip the layers of ..._fill_cache() abstractions
> > out and see what falls out. Anyway, for -rc your variant is definitely the
> > way to go.
>
> Except this bit:
>
> fs/built-in.o: In function `proc_task_readdir':
> compr_zlib.c:(.text+0x434b8): undefined reference to `__cmpdi2'
> compr_zlib.c:(.text+0x434d4): undefined reference to `__cmpdi2'

Hi, I am sorry but I can't reproduce this.
Are you sure this only happens after applying my patch?
Would you please provide your fs/jffs2/compr_zlib.o and fs/proc/base.o?
And what is your gcc version? I am using gcc 4.4 snapshot rev. 144705
Thanks!

Zhang, Le
http://zhangle.is-a-geek.org

2009-03-17 13:41:39

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [PATCH] filp->f_pos not correctly updated in proc_task_readdir

On Tue, Mar 17, 2009 at 08:53:47PM +0800, Zhang Le wrote:
> > fs/built-in.o: In function `proc_task_readdir':
> > compr_zlib.c:(.text+0x434b8): undefined reference to `__cmpdi2'
> > compr_zlib.c:(.text+0x434d4): undefined reference to `__cmpdi2'
>
> Hi, I am sorry but I can't reproduce this.

It's on arm, I fixed it up already.

2009-03-17 18:11:25

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] filp->f_pos not correctly updated in proc_task_readdir

On Tue, Mar 17, 2009 at 08:53:47PM +0800, Zhang Le wrote:
> > > way to go.
> >
> > Except this bit:
> >
> > fs/built-in.o: In function `proc_task_readdir':
> > compr_zlib.c:(.text+0x434b8): undefined reference to `__cmpdi2'
> > compr_zlib.c:(.text+0x434d4): undefined reference to `__cmpdi2'
>
> Hi, I am sorry but I can't reproduce this.
> Are you sure this only happens after applying my patch?
> Would you please provide your fs/jffs2/compr_zlib.o and fs/proc/base.o?
> And what is your gcc version? I am using gcc 4.4 snapshot rev. 144705

The critical part is target architecture, not version... switch on
long long ends up spewing library calls. It's switch (pos) that
became switch (file->f_pos).

2009-03-17 18:13:10

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] filp->f_pos not correctly updated in proc_task_readdir

Zhang Le <[email protected]> writes:

> filp->f_pos only get updated at the end of the function. Thus d_off of those
> dirents who are in the middle will be 0, and this will cause a problem in
> glibc's readdir implementation, specifically endless loop. Because when overflow
> occurs, f_pos will be set to next dirent to read, however it will be 0, unless
> the next one is the last one. So it will start over again and again.
>
> There is a sample program in man 2 gendents. This is the output of the program
> running on a multithread program's task dir before this patch is applied:
>
> $ ./a.out /proc/3807/task
> --------------- nread=128 ---------------
> i-node# file type d_reclen d_off d_name
> 506442 directory 16 1 .
> 506441 directory 16 0 ..
> 506443 directory 16 0 3807
> 506444 directory 16 0 3809
> 506445 directory 16 0 3812
> 506446 directory 16 0 3861
> 506447 directory 16 0 3862
> 506448 directory 16 8 3863
>
> This is the output after this patch is applied
>
> $ ./a.out /proc/3807/task
> --------------- nread=128 ---------------
> i-node# file type d_reclen d_off d_name
> 506442 directory 16 1 .
> 506441 directory 16 2 ..
> 506443 directory 16 3 3807
> 506444 directory 16 4 3809
> 506445 directory 16 5 3812
> 506446 directory 16 6 3861
> 506447 directory 16 7 3862
> 506448 directory 16 8 3863

I'm trying to understand what the problem that glibc
runs into. Is this the glibc seekdir madness?

Under which conditions have you seen this problem?

The reason I ask is if this is triggered by seekdir and people really
care then the current proc_task_readdir for the /proc/<pid>/task/
directories is not quite sufficient. As threads come and go the d_off
associated with a specific thread will change.

Which means we probably should be returning:

$ ./a.out /proc/3807/task
--------------- nread=128 ---------------
i-node# file type d_reclen d_off d_name
506442 directory 16 1 .
506441 directory 16 2 ..
506443 directory 16 3809 3807
506444 directory 16 3811 3809
506445 directory 16 3814 3812
506446 directory 16 3863 3861
506447 directory 16 3864 3862
506448 directory 16 3865 3863

Which is slightly better unfortunately it doesn't give us the
guarantee that the d_off will be continually increasing.

Eric

2009-03-18 07:27:47

by ZHANG, Le

[permalink] [raw]
Subject: Re: [PATCH] filp->f_pos not correctly updated in proc_task_readdir

On 11:12 Tue 17 Mar , Eric W. Biederman wrote:
> Zhang Le <[email protected]> writes:
>
> > filp->f_pos only get updated at the end of the function. Thus d_off of those
> > dirents who are in the middle will be 0, and this will cause a problem in
> > glibc's readdir implementation, specifically endless loop. Because when overflow
> > occurs, f_pos will be set to next dirent to read, however it will be 0, unless
> > the next one is the last one. So it will start over again and again.
> >
> > There is a sample program in man 2 gendents. This is the output of the program
> > running on a multithread program's task dir before this patch is applied:
> >
> > $ ./a.out /proc/3807/task
> > --------------- nread=128 ---------------
> > i-node# file type d_reclen d_off d_name
> > 506442 directory 16 1 .
> > 506441 directory 16 0 ..
> > 506443 directory 16 0 3807
> > 506444 directory 16 0 3809
> > 506445 directory 16 0 3812
> > 506446 directory 16 0 3861
> > 506447 directory 16 0 3862
> > 506448 directory 16 8 3863
> >
> > This is the output after this patch is applied
> >
> > $ ./a.out /proc/3807/task
> > --------------- nread=128 ---------------
> > i-node# file type d_reclen d_off d_name
> > 506442 directory 16 1 .
> > 506441 directory 16 2 ..
> > 506443 directory 16 3 3807
> > 506444 directory 16 4 3809
> > 506445 directory 16 5 3812
> > 506446 directory 16 6 3861
> > 506447 directory 16 7 3862
> > 506448 directory 16 8 3863
>
> I'm trying to understand what the problem that glibc
> runs into. Is this the glibc seekdir madness?
>
> Under which conditions have you seen this problem?

Please see my explanation at the bottom.

>
> The reason I ask is if this is triggered by seekdir and people really
> care then the current proc_task_readdir for the /proc/<pid>/task/
> directories is not quite sufficient. As threads come and go the d_off
> associated with a specific thread will change.
>
> Which means we probably should be returning:
>
> $ ./a.out /proc/3807/task
> --------------- nread=128 ---------------
> i-node# file type d_reclen d_off d_name
> 506442 directory 16 1 .
> 506441 directory 16 2 ..

Without my patch here, d_off will be 0. And so will be the following d_off
except the last one.

> 506443 directory 16 3809 3807
> 506444 directory 16 3811 3809
> 506445 directory 16 3814 3812
> 506446 directory 16 3863 3861
> 506447 directory 16 3864 3862
> 506448 directory 16 3865 3863
>
> Which is slightly better unfortunately it doesn't give us the
> guarantee that the d_off will be continually increasing.

Unfortunately, unlike directories on normal filesystem, there will never be any
other files in /proc/xxxx/task. So, to my understanding, d_off should be
increasing continuously, no?

OK, here is what is going on in glibc. I left it out intentionally so as not to
make the discuss unnecessarily complicated. So here it is:

First of all, we can see that the d_off does not get correctly updated.

Then, let's take a look at glibc's __GETDENTS function. In this function, there
are 3 implementations:
http://sources.redhat.com/git/gitweb.cgi?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/getdents.c;h=b33d1789adff11a04cbb1f6f5616bc8eed59418f;hb=cvs/master
Two of them will detect overflow, and lseek if necessary.

This means, although d_off in /proc/xxxx/task does not get updated correctly
everywhere, the problem of endless loop only occurs on certain platforms.
One of them is MIPS N32 ABI system.

Of course, there are other ways to work around this. However, to me, fixing this
would be the easiest and the most straight forward one. :)

Zhang, Le
http://zhangle.is-a-geek.org

2009-03-18 10:36:52

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] filp->f_pos not correctly updated in proc_task_readdir

Zhang Le <[email protected]> writes:

> Unfortunately, unlike directories on normal filesystem, there will never be any
> other files in /proc/xxxx/task. So, to my understanding, d_off should be
> increasing continuously, no?
>
> OK, here is what is going on in glibc. I left it out intentionally so as not to
> make the discuss unnecessarily complicated. So here it is:
>
> First of all, we can see that the d_off does not get correctly updated.

Linus has merged that fix and I have no problems with fixing d_off.
The fact that it was not set is a bug plain and simple.


> Then, let's take a look at glibc's __GETDENTS function. In this function, there
> are 3 implementations:
> http://sources.redhat.com/git/gitweb.cgi?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/getdents.c;h=b33d1789adff11a04cbb1f6f5616bc8eed59418f;hb=cvs/master
> Two of them will detect overflow, and lseek if necessary.
>
> This means, although d_off in /proc/xxxx/task does not get updated correctly
> everywhere, the problem of endless loop only occurs on certain platforms.
> One of them is MIPS N32 ABI system.

Ok. glibc is performing a conversion and if it doesn't have enough
space to convert all of the entries it read it does a seek, so it
will reread from the kernel the unread entries.

Not friendly (as seeks are notoriously hard to get right in directories)
but not the worst behavior in the world.

> Of course, there are other ways to work around this. However, to me, fixing this
> would be the easiest and the most straight forward one. :)

I'm not looking for other work arounds. There is another bug, that I wondering if
we can fix.

The sequence:
opendir
readdir
readdir
readdir
closedir

Is supposed to guarantee that all directory entries that exist from
the time of openddir to the time of closedir are returned.

If threads are being created and destroyed between opendir and
closedir we can fail to show threads that exist the entire time. A
seek is particularly bad.

So I am wondering if there is a way we can change first_tid and next_tid
so that we can do better.

The fact that people have large enough thread directories that they
are hitting an overflow case suggests that eventually we will also have
problems with missing threads, like we have saw with missing processes
in the root directory of /proc, before that problem was fixed.

One possibility is to encode the pid in the offset. That is slightly
better but it doesn't fix the restart case where that pid has existed
as the thread list is not stored in pid order. We add new threads
to the tail of the thread list. Which seems like the right thing
to do.

The only way I can see to actually make this robust is to ignore
the thread list altogether. Have an encoded form of the pid in
the offset, and then walk the pid bitmap. Only returning
for entries that are in the proper thread group. It will
likely be a bit slower but if we need to support seeks there
appears not helping it.

Eric

2009-03-18 13:40:09

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH] proc: Fix proc_tid_readdir so it handles the weird cases.


Fix the logic in proc_task_readdir so that we provide the
guarantee that in the sequence:
opendir
readdir
readdir
....
readdir
closedir
If a thread exists from the beginning until the end we are
guaranteed to see it, even if some threads are created or
worse are destroyed during the readdir.

This guarantee is provided by reusing the same logic used
by proc_pid_readdir for the root directory of /proc. The
pid is encoded into the offset, and we walk the pid bitmap
of all pids in the system, and test each of them to see if
it belongs to the specified thread group.

If we seek to a bad location or if the task we say last
exits it is ok because we can numerically find the next
task we would have returned.

This is slower that the previous version, but it should
not be too slow as this techinique is already use for
the root directory of /proc without problems.

Signed-off-by: Eric Biederman <[email protected]>
---
fs/proc/base.c | 141 +++++++++++++++++++++-----------------------------------
1 files changed, 53 insertions(+), 88 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index beaa0ce..2d8e9c9 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2754,13 +2754,13 @@ retry:
if (pid) {
iter.tgid = pid_nr_ns(pid, ns);
iter.task = pid_task(pid, PIDTYPE_PID);
- /* What we to know is if the pid we have find is the
- * pid of a thread_group_leader. Testing for task
- * being a thread_group_leader is the obvious thing
- * todo but there is a window when it fails, due to
- * the pid transfer logic in de_thread.
+ /* What we want to know is if the pid we have found is
+ * the pid of a thread_group_leader. Testing for task
+ * being a thread_group_leader is the obvious thing to
+ * do but there is a window when it fails, due to the
+ * pid transfer logic in de_thread.
*
- * So we perform the straight forward test of seeing
+ * So we perform the straight forward test, of seeing
* if the pid we have found is the pid of a thread
* group leader, and don't worry if the task we have
* found doesn't happen to be a thread group leader.
@@ -2978,82 +2978,54 @@ out_no_task:
return result;
}

+
/*
- * Find the first tid of a thread group to return to user space.
- *
- * Usually this is just the thread group leader, but if the users
- * buffer was too small or there was a seek into the middle of the
- * directory we have more work todo.
- *
- * In the case of a short read we start with find_task_by_pid.
+ * Find the next task in the thread group with tid >= tid.
+ * Return iter.task == NULL if ther is an error or no next task.
*
- * In the case of a seek we start with the leader and walk nr
- * threads past it.
+ * The reference to the input task_struct is released.
*/
-static struct task_struct *first_tid(struct task_struct *leader,
- int tid, int nr, struct pid_namespace *ns)
+struct tid_iter {
+ struct task_struct *task;
+ unsigned int tid;
+};
+static struct tid_iter next_tid(struct pid_namespace *ns, struct pid *tgid,
+ struct tid_iter iter)
{
- struct task_struct *pos;
+ struct pid *pid;

+ if (iter.task)
+ put_task_struct(iter.task);
rcu_read_lock();
- /* Attempt to start with the pid of a thread */
- if (tid && (nr > 0)) {
- pos = find_task_by_pid_ns(tid, ns);
- if (pos && (pos->group_leader == leader))
- goto found;
- }
-
- /* If nr exceeds the number of threads there is nothing todo */
- pos = NULL;
- if (nr && nr >= get_nr_threads(leader))
- goto out;
-
- /* If we haven't found our starting place yet start
- * with the leader and walk nr threads forward.
- */
- for (pos = leader; nr > 0; --nr) {
- pos = next_thread(pos);
- if (pos == leader) {
- pos = NULL;
- goto out;
+retry:
+ iter.task = NULL;
+ pid = find_ge_pid(iter.tid, ns);
+ if (pid) {
+ iter.tid = pid_nr_ns(pid, ns);
+ iter.task = pid_task(pid, PIDTYPE_PID);
+ /* As the tgid does not change during de_thread
+ * this test if a task is in a thread group
+ * should always be accurate.
+ */
+ if (!iter.task || task_tgid(iter.task) != tgid) {
+ iter.tid += 1;
+ goto retry;
}
+ get_task_struct(iter.task);
}
-found:
- get_task_struct(pos);
-out:
rcu_read_unlock();
- return pos;
+ return iter;
}

-/*
- * Find the next thread in the thread list.
- * Return NULL if there is an error or no next thread.
- *
- * The reference to the input task_struct is released.
- */
-static struct task_struct *next_tid(struct task_struct *start)
-{
- struct task_struct *pos = NULL;
- rcu_read_lock();
- if (pid_alive(start)) {
- pos = next_thread(start);
- if (thread_group_leader(pos))
- pos = NULL;
- else
- get_task_struct(pos);
- }
- rcu_read_unlock();
- put_task_struct(start);
- return pos;
-}
+#define TID_OFFSET 2

static int proc_task_fill_cache(struct file *filp, void *dirent, filldir_t filldir,
- struct task_struct *task, int tid)
+ struct tid_iter iter)
{
char name[PROC_NUMBUF];
- int len = snprintf(name, sizeof(name), "%d", tid);
+ int len = snprintf(name, sizeof(name), "%d", iter.tid);
return proc_fill_cache(filp, dirent, filldir, name, len,
- proc_task_instantiate, task, NULL);
+ proc_task_instantiate, iter.task, NULL);
}

/* for the /proc/TGID/task/ directories */
@@ -3061,24 +3033,22 @@ static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldi
{
struct dentry *dentry = filp->f_path.dentry;
struct inode *inode = dentry->d_inode;
- struct task_struct *leader = NULL;
+ struct pid *tgid = NULL;
struct task_struct *task;
int retval = -ENOENT;
ino_t ino;
- int tid;
+ struct tid_iter iter;
struct pid_namespace *ns;

task = get_proc_task(inode);
if (!task)
goto out_no_task;
rcu_read_lock();
- if (pid_alive(task)) {
- leader = task->group_leader;
- get_task_struct(leader);
- }
+ if (pid_alive(task))
+ tgid = get_pid(task_tgid(task));
rcu_read_unlock();
put_task_struct(task);
- if (!leader)
+ if (!tgid)
goto out_no_task;
retval = 0;

@@ -3097,26 +3067,21 @@ static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldi
/* fall through */
}

- /* f_version caches the tgid value that the last readdir call couldn't
- * return. lseek aka telldir automagically resets f_version to 0.
- */
ns = filp->f_dentry->d_sb->s_fs_info;
- tid = (int)filp->f_version;
- filp->f_version = 0;
- for (task = first_tid(leader, tid, filp->f_pos - 2, ns);
- task;
- task = next_tid(task), filp->f_pos++) {
- tid = task_pid_nr_ns(task, ns);
- if (proc_task_fill_cache(filp, dirent, filldir, task, tid) < 0) {
- /* returning this tgid failed, save it as the first
- * pid for the next readir call */
- filp->f_version = (u64)tid;
+ iter.task = NULL;
+ iter.tid = filp->f_pos - TID_OFFSET;
+ for (iter = next_tid(ns, tgid, iter);
+ iter.task;
+ iter.tid += 1, iter = next_tid(ns, tgid, iter)) {
+ filp->f_pos = iter.tid + TID_OFFSET;
+ if (proc_task_fill_cache(filp, dirent, filldir, iter) < 0) {
put_task_struct(task);
- break;
+ goto out;
}
}
+ filp->f_pos = PID_MAX_LIMIT + TID_OFFSET;
out:
- put_task_struct(leader);
+ put_pid(tgid);
out_no_task:
return retval;
}
--
1.6.1.2.350.g88cc

2009-03-18 14:03:20

by Louis Rilling

[permalink] [raw]
Subject: Re: [PATCH] proc: Fix proc_tid_readdir so it handles the weird cases.

On 18/03/09 6:39 -0700, Eric W. Biederman wrote:
>
> Fix the logic in proc_task_readdir so that we provide the
> guarantee that in the sequence:
> opendir
> readdir
> readdir
> ....
> readdir
> closedir
> If a thread exists from the beginning until the end we are
> guaranteed to see it, even if some threads are created or
> worse are destroyed during the readdir.
>
> This guarantee is provided by reusing the same logic used
> by proc_pid_readdir for the root directory of /proc. The
> pid is encoded into the offset, and we walk the pid bitmap
> of all pids in the system, and test each of them to see if
> it belongs to the specified thread group.
>
> If we seek to a bad location or if the task we say last
> exits it is ok because we can numerically find the next
> task we would have returned.
>
> This is slower that the previous version, but it should
> not be too slow as this techinique is already use for
> the root directory of /proc without problems.

This makes 'ps -T x' an O(n^2) thing, compared to O(n) before the patch, right?
It would be good to, at least, check for thread_group_empty().

Thanks,

Louis

>
> Signed-off-by: Eric Biederman <[email protected]>
> ---
> fs/proc/base.c | 141 +++++++++++++++++++++-----------------------------------
> 1 files changed, 53 insertions(+), 88 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index beaa0ce..2d8e9c9 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2754,13 +2754,13 @@ retry:
> if (pid) {
> iter.tgid = pid_nr_ns(pid, ns);
> iter.task = pid_task(pid, PIDTYPE_PID);
> - /* What we to know is if the pid we have find is the
> - * pid of a thread_group_leader. Testing for task
> - * being a thread_group_leader is the obvious thing
> - * todo but there is a window when it fails, due to
> - * the pid transfer logic in de_thread.
> + /* What we want to know is if the pid we have found is
> + * the pid of a thread_group_leader. Testing for task
> + * being a thread_group_leader is the obvious thing to
> + * do but there is a window when it fails, due to the
> + * pid transfer logic in de_thread.
> *
> - * So we perform the straight forward test of seeing
> + * So we perform the straight forward test, of seeing
> * if the pid we have found is the pid of a thread
> * group leader, and don't worry if the task we have
> * found doesn't happen to be a thread group leader.
> @@ -2978,82 +2978,54 @@ out_no_task:
> return result;
> }
>
> +
> /*
> - * Find the first tid of a thread group to return to user space.
> - *
> - * Usually this is just the thread group leader, but if the users
> - * buffer was too small or there was a seek into the middle of the
> - * directory we have more work todo.
> - *
> - * In the case of a short read we start with find_task_by_pid.
> + * Find the next task in the thread group with tid >= tid.
> + * Return iter.task == NULL if ther is an error or no next task.
> *
> - * In the case of a seek we start with the leader and walk nr
> - * threads past it.
> + * The reference to the input task_struct is released.
> */
> -static struct task_struct *first_tid(struct task_struct *leader,
> - int tid, int nr, struct pid_namespace *ns)
> +struct tid_iter {
> + struct task_struct *task;
> + unsigned int tid;
> +};
> +static struct tid_iter next_tid(struct pid_namespace *ns, struct pid *tgid,
> + struct tid_iter iter)
> {
> - struct task_struct *pos;
> + struct pid *pid;
>
> + if (iter.task)
> + put_task_struct(iter.task);
> rcu_read_lock();
> - /* Attempt to start with the pid of a thread */
> - if (tid && (nr > 0)) {
> - pos = find_task_by_pid_ns(tid, ns);
> - if (pos && (pos->group_leader == leader))
> - goto found;
> - }
> -
> - /* If nr exceeds the number of threads there is nothing todo */
> - pos = NULL;
> - if (nr && nr >= get_nr_threads(leader))
> - goto out;
> -
> - /* If we haven't found our starting place yet start
> - * with the leader and walk nr threads forward.
> - */
> - for (pos = leader; nr > 0; --nr) {
> - pos = next_thread(pos);
> - if (pos == leader) {
> - pos = NULL;
> - goto out;
> +retry:
> + iter.task = NULL;
> + pid = find_ge_pid(iter.tid, ns);
> + if (pid) {
> + iter.tid = pid_nr_ns(pid, ns);
> + iter.task = pid_task(pid, PIDTYPE_PID);
> + /* As the tgid does not change during de_thread
> + * this test if a task is in a thread group
> + * should always be accurate.
> + */
> + if (!iter.task || task_tgid(iter.task) != tgid) {
> + iter.tid += 1;
> + goto retry;
> }
> + get_task_struct(iter.task);
> }
> -found:
> - get_task_struct(pos);
> -out:
> rcu_read_unlock();
> - return pos;
> + return iter;
> }
>
> -/*
> - * Find the next thread in the thread list.
> - * Return NULL if there is an error or no next thread.
> - *
> - * The reference to the input task_struct is released.
> - */
> -static struct task_struct *next_tid(struct task_struct *start)
> -{
> - struct task_struct *pos = NULL;
> - rcu_read_lock();
> - if (pid_alive(start)) {
> - pos = next_thread(start);
> - if (thread_group_leader(pos))
> - pos = NULL;
> - else
> - get_task_struct(pos);
> - }
> - rcu_read_unlock();
> - put_task_struct(start);
> - return pos;
> -}
> +#define TID_OFFSET 2
>
> static int proc_task_fill_cache(struct file *filp, void *dirent, filldir_t filldir,
> - struct task_struct *task, int tid)
> + struct tid_iter iter)
> {
> char name[PROC_NUMBUF];
> - int len = snprintf(name, sizeof(name), "%d", tid);
> + int len = snprintf(name, sizeof(name), "%d", iter.tid);
> return proc_fill_cache(filp, dirent, filldir, name, len,
> - proc_task_instantiate, task, NULL);
> + proc_task_instantiate, iter.task, NULL);
> }
>
> /* for the /proc/TGID/task/ directories */
> @@ -3061,24 +3033,22 @@ static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldi
> {
> struct dentry *dentry = filp->f_path.dentry;
> struct inode *inode = dentry->d_inode;
> - struct task_struct *leader = NULL;
> + struct pid *tgid = NULL;
> struct task_struct *task;
> int retval = -ENOENT;
> ino_t ino;
> - int tid;
> + struct tid_iter iter;
> struct pid_namespace *ns;
>
> task = get_proc_task(inode);
> if (!task)
> goto out_no_task;
> rcu_read_lock();
> - if (pid_alive(task)) {
> - leader = task->group_leader;
> - get_task_struct(leader);
> - }
> + if (pid_alive(task))
> + tgid = get_pid(task_tgid(task));
> rcu_read_unlock();
> put_task_struct(task);
> - if (!leader)
> + if (!tgid)
> goto out_no_task;
> retval = 0;
>
> @@ -3097,26 +3067,21 @@ static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldi
> /* fall through */
> }
>
> - /* f_version caches the tgid value that the last readdir call couldn't
> - * return. lseek aka telldir automagically resets f_version to 0.
> - */
> ns = filp->f_dentry->d_sb->s_fs_info;
> - tid = (int)filp->f_version;
> - filp->f_version = 0;
> - for (task = first_tid(leader, tid, filp->f_pos - 2, ns);
> - task;
> - task = next_tid(task), filp->f_pos++) {
> - tid = task_pid_nr_ns(task, ns);
> - if (proc_task_fill_cache(filp, dirent, filldir, task, tid) < 0) {
> - /* returning this tgid failed, save it as the first
> - * pid for the next readir call */
> - filp->f_version = (u64)tid;
> + iter.task = NULL;
> + iter.tid = filp->f_pos - TID_OFFSET;
> + for (iter = next_tid(ns, tgid, iter);
> + iter.task;
> + iter.tid += 1, iter = next_tid(ns, tgid, iter)) {
> + filp->f_pos = iter.tid + TID_OFFSET;
> + if (proc_task_fill_cache(filp, dirent, filldir, iter) < 0) {
> put_task_struct(task);
> - break;
> + goto out;
> }
> }
> + filp->f_pos = PID_MAX_LIMIT + TID_OFFSET;
> out:
> - put_task_struct(leader);
> + put_pid(tgid);
> out_no_task:
> return retval;
> }
> --
> 1.6.1.2.350.g88cc
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes


Attachments:
(No filename) (8.11 kB)
signature.asc (189.00 B)
Digital signature
Download all attachments

2009-03-19 03:40:40

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] proc: Fix proc_tid_readdir so it handles the weird cases.

Louis Rilling <[email protected]> writes:

> On 18/03/09 6:39 -0700, Eric W. Biederman wrote:
>>
>> Fix the logic in proc_task_readdir so that we provide the
>> guarantee that in the sequence:
>> opendir
>> readdir
>> readdir
>> ....
>> readdir
>> closedir
>> If a thread exists from the beginning until the end we are
>> guaranteed to see it, even if some threads are created or
>> worse are destroyed during the readdir.
>>
>> This guarantee is provided by reusing the same logic used
>> by proc_pid_readdir for the root directory of /proc. The
>> pid is encoded into the offset, and we walk the pid bitmap
>> of all pids in the system, and test each of them to see if
>> it belongs to the specified thread group.
>>
>> If we seek to a bad location or if the task we say last
>> exits it is ok because we can numerically find the next
>> task we would have returned.
>>
>> This is slower that the previous version, but it should
>> not be too slow as this techinique is already use for
>> the root directory of /proc without problems.
>
> This makes 'ps -T x' an O(n^2) thing, compared to O(n) before the patch, right?
> It would be good to, at least, check for thread_group_empty().

Sort of. It is more like it adds a constant to the traversing each thread
directory. With everything in cache I don't expect it to be a big constant,
as I have reports that when I changed the pid directory it sped things up.

We definitely don't hold any locks so only the process doing the traversal will pay
the price.

Since this is actually a correctness issue I'm willing to pay a little more to get
the right result.

Eric