2007-11-17 18:16:19

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH] task_pid_nr_ns() breaks proc_pid_readdir()

proc_pid_readdir:

for (...; ...; task = next_tgid(tgid + 1, ns)) {
tgid = task_pid_nr_ns(task, ns);
... use tgid ...

The first problem is that task_pid_nr_ns() can race with RCU and read the
freed memory.

However, rcu_read_lock() can't help. next_tgid() returns a pinned task_struct,
but the task can be released (and it's pid detached) before task_pid_nr_ns()
reads the pid_t value. In that case task_pid_nr_ns() returns 0 thus breaking
the whole logic.

Make sure that task_pid_nr_ns() returns !0 before updating tgid. Note that
next_tgid(tgid + 1) can find the same "struct pid" again, but we shouldn't
go into the endless loop because pid_task(PIDTYPE_PID) must return NULL in
this case, so next_tgid() can't return the same task.

Signed-off-by: Oleg Nesterov <[email protected]>

--- 24/fs/proc/base.c~pprd 2007-10-25 16:22:11.000000000 +0400
+++ 24/fs/proc/base.c 2007-11-17 20:58:14.000000000 +0300
@@ -2481,7 +2481,15 @@ int proc_pid_readdir(struct file * filp,
for (task = next_tgid(tgid, ns);
task;
put_task_struct(task), task = next_tgid(tgid + 1, ns)) {
- tgid = task_pid_nr_ns(task, ns);
+ int nr;
+
+ rcu_read_lock();
+ nr = task_pid_nr_ns(task, ns);
+ rcu_read_unlock();
+ if (!nr)
+ continue;
+
+ tgid = nr;
filp->f_pos = tgid + TGID_OFFSET;
if (proc_pid_fill_cache(filp, dirent, filldir, task, tgid) < 0) {
put_task_struct(task);


2007-11-17 20:36:22

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] task_pid_nr_ns() breaks proc_pid_readdir()

Oleg Nesterov <[email protected]> writes:

> proc_pid_readdir:
>
> for (...; ...; task = next_tgid(tgid + 1, ns)) {
> tgid = task_pid_nr_ns(task, ns);
> ... use tgid ...
>
> The first problem is that task_pid_nr_ns() can race with RCU and read the
> freed memory.
>
> However, rcu_read_lock() can't help. next_tgid() returns a pinned task_struct,
> but the task can be released (and it's pid detached) before task_pid_nr_ns()
> reads the pid_t value. In that case task_pid_nr_ns() returns 0 thus breaking
> the whole logic.
>
> Make sure that task_pid_nr_ns() returns !0 before updating tgid. Note that
> next_tgid(tgid + 1) can find the same "struct pid" again, but we shouldn't
> go into the endless loop because pid_task(PIDTYPE_PID) must return NULL in
> this case, so next_tgid() can't return the same task.
>
> Signed-off-by: Oleg Nesterov <[email protected]>

Oleg I think I would rather update next_tgid to return the tgid (which
removes the need to call task_pid_nr_ns). This keeps all of the task
iteration logic together in next_tgid.

Although looking at this in more detail, I'm half wondering if
proc_pid_make_inode() should take a struct pid instead of a task.
At first glance that looks like it might be a little simple and more
race free. Although it doesn't do any favors to:
> inode->i_gid = 0;
> if (task_dumpable(task)) {
> inode->i_uid = task->euid;
> inode->i_gid = task->egid;
> }
> security_task_to_inode(task, inode);

Anyway short of rewriting the world this is what updating next_tgid
looks like. Opinions?

Eric


diff --git a/fs/proc/base.c b/fs/proc/base.c
index a17c268..5d9328d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2411,7 +2411,7 @@ out:
* Find the first task with tgid >= tgid
*
*/
-static struct task_struct *next_tgid(unsigned int tgid,
+static struct task_struct *next_tgid(unsigned int *tgid,
struct pid_namespace *ns)
{
struct task_struct *task;
@@ -2420,9 +2420,9 @@ static struct task_struct *next_tgid(unsigned int tgid,
rcu_read_lock();
retry:
task = NULL;
- pid = find_ge_pid(tgid, ns);
+ pid = find_ge_pid(*tgid, ns);
if (pid) {
- tgid = pid_nr_ns(pid, ns) + 1;
+ *tgid = pid_nr_ns(pid, ns);
task = pid_task(pid, PIDTYPE_PID);
/* What we to know is if the pid we have find is the
* pid of a thread_group_leader. Testing for task
@@ -2436,8 +2436,10 @@ retry:
* found doesn't happen to be a thread group leader.
* As we don't care in the case of readdir.
*/
- if (!task || !has_group_leader_pid(task))
+ if (!task || !has_group_leader_pid(task)) {
+ *tgid += 1;
goto retry;
+ }
get_task_struct(task);
}
rcu_read_unlock();
@@ -2475,10 +2477,9 @@ int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)

ns = filp->f_dentry->d_sb->s_fs_info;
tgid = filp->f_pos - TGID_OFFSET;
- for (task = next_tgid(tgid, ns);
+ for (task = next_tgid(&tgid, ns);
task;
- put_task_struct(task), task = next_tgid(tgid + 1, ns)) {
- tgid = task_pid_nr_ns(task, ns);
+ put_task_struct(task), tgid += 1, task = next_tgid(&tgid, ns)) {
filp->f_pos = tgid + TGID_OFFSET;
if (proc_pid_fill_cache(filp, dirent, filldir, task, tgid) < 0) {
put_task_struct(task);

2007-11-18 14:20:53

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] task_pid_nr_ns() breaks proc_pid_readdir()

On 11/17, Eric W. Biederman wrote:
>
> Oleg Nesterov <[email protected]> writes:
>
> > Make sure that task_pid_nr_ns() returns !0 before updating tgid. Note that
> > next_tgid(tgid + 1) can find the same "struct pid" again, but we shouldn't
> > go into the endless loop because pid_task(PIDTYPE_PID) must return NULL in
> > this case, so next_tgid() can't return the same task.
> >
> Oleg I think I would rather update next_tgid to return the tgid (which
> removes the need to call task_pid_nr_ns). This keeps all of the task
> iteration logic together in next_tgid.

Yes sure, I think your patch is also correct, please use it.

<offtopic>

Personally, I hate the functions which use pointers to return another value.
(yes, yes, I know, my taste is perverted). Why don't we return structure in
this case? We can even make a common helper struct, say,

struct pair {
union {
long val1;
void *ptr1;
};
union {
long val2;
void *ptr2;
};
};

#define PAIR(x1, x2) (struct pair){{ . x1 }, { . x2 }}

Now, next_tgid() can do

return PAIR(ptr1 = task, val2 = tgid);

With -freg-struct-return the generated code is nice.

Of course, another option is to rewrite the kernle in perl, in that case
proc_pid_readdir() can just do

(task, tgid) = next_tgid();

</offtopic>

> Although looking at this in more detail, I'm half wondering if
> proc_pid_make_inode() should take a struct pid instead of a task.

Yes, I also thought about this. Needs more changes, and still not perfect.

I am starting to think we need a more generic change. How about the patch
below? With this change the stable task_struct implies we have the stable
pids, this allows us to do a lot of cleanups.

Oleg.

--- kernel/pid.c 2007-10-25 16:22:12.000000000 +0400
+++ - 2007-11-18 16:56:30.682555454 +0300
@@ -323,7 +323,7 @@ int fastcall attach_pid(struct task_stru
struct pid_link *link;

link = &task->pids[type];
- link->pid = pid;
+ link->pid = get_pid(pid);
hlist_add_head_rcu(&link->node, &pid->tasks[type]);

return 0;
@@ -339,7 +339,6 @@ void fastcall detach_pid(struct task_str
pid = link->pid;

hlist_del_rcu(&link->node);
- link->pid = NULL;

for (tmp = PIDTYPE_MAX; --tmp >= 0; )
if (!hlist_empty(&pid->tasks[tmp]))
@@ -348,6 +347,14 @@ void fastcall detach_pid(struct task_str
free_pid(pid);
}

+void task_put_pids(struct pid_link *pids)
+{
+ int type = PIDTYPE_MAX;
+
+ while (type--)
+ put_pid(pids[type].pid);
+}
+
/* transfer_pid is an optimization of attach_pid(new), detach_pid(old) */
void fastcall transfer_pid(struct task_struct *old, struct task_struct *new,
enum pid_type type)
--- kernel/fork.c 2007-11-09 12:57:31.000000000 +0300
+++ - 2007-11-18 16:57:34.037105563 +0300
@@ -121,6 +121,7 @@ void __put_task_struct(struct task_struc
WARN_ON(atomic_read(&tsk->usage));
WARN_ON(tsk == current);

+ task_put_pids(tsk->pids);
security_task_free(tsk);
free_uid(tsk->user);
put_group_info(tsk->group_info);

2007-11-19 18:16:16

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] task_pid_nr_ns() breaks proc_pid_readdir()

Oleg Nesterov <[email protected]> writes:

> On 11/17, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <[email protected]> writes:
>>
>> > Make sure that task_pid_nr_ns() returns !0 before updating tgid. Note that
>> > next_tgid(tgid + 1) can find the same "struct pid" again, but we shouldn't
>> > go into the endless loop because pid_task(PIDTYPE_PID) must return NULL in
>> > this case, so next_tgid() can't return the same task.
>> >
>> Oleg I think I would rather update next_tgid to return the tgid (which
>> removes the need to call task_pid_nr_ns). This keeps all of the task
>> iteration logic together in next_tgid.
>
> Yes sure, I think your patch is also correct, please use it.
>
> <offtopic>
>
> Personally, I hate the functions which use pointers to return another value.
> (yes, yes, I know, my taste is perverted). Why don't we return structure in
> this case? We can even make a common helper struct, say,
>
> Of course, another option is to rewrite the kernle in perl, in that case
> proc_pid_readdir() can just do
>
> (task, tgid) = next_tgid();

I wish that last syntax worked in C. That would make returning
structures so much easier. From a compiler optimization standpoint
returning structures is ever so much nicer. Seeing through pointers
or references to optimize things is tricky.

> </offtopic>
>
>> Although looking at this in more detail, I'm half wondering if
>> proc_pid_make_inode() should take a struct pid instead of a task.
>
> Yes, I also thought about this. Needs more changes, and still not perfect.
>
> I am starting to think we need a more generic change. How about the patch
> below? With this change the stable task_struct implies we have the stable
> pids, this allows us to do a lot of cleanups.

I don't see the huge pile of opportunities to clean up. But otherwise
I am in favor of it. In the normal case it should only delay freeing
of struct pid (and the corresponding namespace) by an rcu interval so it
isn't a big deal.

I suspect it will be a help with killing things like the global pids
in task_struct.

Regardless I'm going to pass on the keeping struct pid on the task
struct until we free it and reference counting it there for the
immediate present as I think we can solve the immediate issues cleanly
without it, and we are pretty much in bug fixing territory now.

Eric

2007-11-19 18:32:03

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] task_pid_nr_ns() breaks proc_pid_readdir()

On 11/19, Eric W. Biederman wrote:
>
> I think we can solve the immediate issues cleanly
> without it, and we are pretty much in bug fixing territory now.

Yes sure. Besides, the "patch" I showed for illustration is not complete,
and it is not easy to solve the problems with exec().

Could you re-send your patch with changelog to Andrew? I agree, it is better
to pass a pointer than to add the horrible hack to proc_pid_readdir().

Oleg.

2007-11-19 18:51:41

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] task_pid_nr_ns() breaks proc_pid_readdir()

Oleg Nesterov <[email protected]> writes:

> On 11/19, Eric W. Biederman wrote:
>>
>> I think we can solve the immediate issues cleanly
>> without it, and we are pretty much in bug fixing territory now.
>
> Yes sure. Besides, the "patch" I showed for illustration is not complete,
> and it is not easy to solve the problems with exec().
>
> Could you re-send your patch with changelog to Andrew? I agree, it is better
> to pass a pointer than to add the horrible hack to proc_pid_readdir().

Definitely. I'm just testing now to make certain my code actually works.
I have respun my patch to return a structure as I figured out how to
do that cleanly.

Eric

2007-11-19 21:46:16

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH] proc: Remove races from proc_id_readdir()


Oleg noticed that the call of task_pid_nr_ns() in proc_pid_readdir
is racy with respect to tasks exiting.

After a bit of examination it also appears that the call itself
is completely unnecessary.

So to fix the problem this patch modifies next_tgid() to return
both a tgid and the task struct in question.

A structure is introduced to return these values because it is
slightly cleaner and easier to optimize, and the resulting code
is a little shorter.

Signed-off-by: "Eric W. Biederman" <[email protected]>

---
fs/proc/base.c | 51 ++++++++++++++++++++++++++++-----------------------
1 files changed, 28 insertions(+), 23 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 5b92b79..7e8eca4 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2454,19 +2454,23 @@ out:
* Find the first task with tgid >= tgid
*
*/
-static struct task_struct *next_tgid(unsigned int tgid,
- struct pid_namespace *ns)
-{
+struct tgid_iter {
+ unsigned int tgid;
struct task_struct *task;
+};
+static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter)
+{
struct pid *pid;

+ if (iter.task)
+ put_task_struct(iter.task);
rcu_read_lock();
retry:
- task = NULL;
- pid = find_ge_pid(tgid, ns);
+ iter.task = NULL;
+ pid = find_ge_pid(iter.tgid, ns);
if (pid) {
- tgid = pid_nr_ns(pid, ns) + 1;
- task = pid_task(pid, PIDTYPE_PID);
+ iter.tgid = pid_nr_ns(pid, ns);
+ iter.task = pid_task(pid, PIDTYPE_PID);
/* What we to know is if the pid we have find is the
* pid of a thread_group_leader. Testing for task
* being a thread_group_leader is the obvious thing
@@ -2479,23 +2483,25 @@ retry:
* found doesn't happen to be a thread group leader.
* As we don't care in the case of readdir.
*/
- if (!task || !has_group_leader_pid(task))
+ if (!iter.task || !has_group_leader_pid(iter.task)) {
+ iter.tgid += 1;
goto retry;
- get_task_struct(task);
+ }
+ get_task_struct(iter.task);
}
rcu_read_unlock();
- return task;
+ return iter;
}

#define TGID_OFFSET (FIRST_PROCESS_ENTRY + ARRAY_SIZE(proc_base_stuff))

static int proc_pid_fill_cache(struct file *filp, void *dirent, filldir_t filldir,
- struct task_struct *task, int tgid)
+ struct tgid_iter iter)
{
char name[PROC_NUMBUF];
- int len = snprintf(name, sizeof(name), "%d", tgid);
+ int len = snprintf(name, sizeof(name), "%d", iter.tgid);
return proc_fill_cache(filp, dirent, filldir, name, len,
- proc_pid_instantiate, task, NULL);
+ proc_pid_instantiate, iter.task, NULL);
}

/* for the /proc/ directory itself, after non-process stuff has been done */
@@ -2503,8 +2509,7 @@ int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
{
unsigned int nr = filp->f_pos - FIRST_PROCESS_ENTRY;
struct task_struct *reaper = get_proc_task(filp->f_path.dentry->d_inode);
- struct task_struct *task;
- int tgid;
+ struct tgid_iter iter;
struct pid_namespace *ns;

if (!reaper)
@@ -2517,14 +2522,14 @@ int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
}

ns = filp->f_dentry->d_sb->s_fs_info;
- tgid = filp->f_pos - TGID_OFFSET;
- for (task = next_tgid(tgid, ns);
- task;
- put_task_struct(task), task = next_tgid(tgid + 1, ns)) {
- tgid = task_pid_nr_ns(task, ns);
- filp->f_pos = tgid + TGID_OFFSET;
- if (proc_pid_fill_cache(filp, dirent, filldir, task, tgid) < 0) {
- put_task_struct(task);
+ iter.task = NULL;
+ iter.tgid = filp->f_pos - TGID_OFFSET;
+ for (iter = next_tgid(ns, iter);
+ iter.task;
+ iter.tgid += 1, iter = next_tgid(ns, iter)) {
+ filp->f_pos = iter.tgid + TGID_OFFSET;
+ if (proc_pid_fill_cache(filp, dirent, filldir, iter) < 0) {
+ put_task_struct(iter.task);
goto out;
}
}
--
1.5.3.rc6.17.g1911

2007-11-20 16:34:57

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH] proc-remove-races-from-proc_id_readdir-factor-out-tgid-increment

On 11/19, Eric W. Biederman wrote:
>
> So to fix the problem this patch modifies next_tgid() to return
> both a tgid and the task struct in question.
>
> A structure is introduced to return these values because it is
> slightly cleaner and easier to optimize, and the resulting code
> is a little shorter.

Nice. Perhaps we can also factor out the incrementing of iter.tgid? Lessens
the generated code, and (imho) cleanups the source a little bit.

(untested, 2.6.24-rc2-mm1 doesn't boot on my machine)

Signed-off-by: Oleg Nesterov <[email protected]>

--- PT/fs/proc/base.c~ 2007-11-20 19:00:05.000000000 +0300
+++ PT/fs/proc/base.c 2007-11-20 19:06:28.000000000 +0300
@@ -2378,6 +2378,7 @@ static struct tgid_iter next_tgid(struct
rcu_read_lock();
retry:
iter.task = NULL;
+ iter.tgid++;
pid = find_ge_pid(iter.tgid, ns);
if (pid) {
iter.tgid = pid_nr_ns(pid, ns);
@@ -2394,10 +2395,8 @@ retry:
* found doesn't happen to be a thread group leader.
* As we don't care in the case of readdir.
*/
- if (!iter.task || !has_group_leader_pid(iter.task)) {
- iter.tgid += 1;
+ if (!iter.task || !has_group_leader_pid(iter.task))
goto retry;
- }
get_task_struct(iter.task);
}
rcu_read_unlock();
@@ -2434,10 +2433,8 @@ int proc_pid_readdir(struct file * filp,

ns = filp->f_dentry->d_sb->s_fs_info;
iter.task = NULL;
- iter.tgid = filp->f_pos - TGID_OFFSET;
- for (iter = next_tgid(ns, iter);
- iter.task;
- iter.tgid += 1, iter = next_tgid(ns, iter)) {
+ iter.tgid = filp->f_pos - TGID_OFFSET - 1;
+ while ((iter = next_tgid(ns, iter)).task) {
filp->f_pos = iter.tgid + TGID_OFFSET;
if (proc_pid_fill_cache(filp, dirent, filldir, iter) < 0) {
put_task_struct(iter.task);

2007-11-20 20:46:27

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] proc-remove-races-from-proc_id_readdir-factor-out-tgid-increment

Oleg Nesterov <[email protected]> writes:

> On 11/19, Eric W. Biederman wrote:
>>
>> So to fix the problem this patch modifies next_tgid() to return
>> both a tgid and the task struct in question.
>>
>> A structure is introduced to return these values because it is
>> slightly cleaner and easier to optimize, and the resulting code
>> is a little shorter.
>
> Nice. Perhaps we can also factor out the incrementing of iter.tgid? Lessens
> the generated code, and (imho) cleanups the source a little bit.

Well the name doesn't imply what the exact sematic is and we
only have one caller, so we might as well. Looking at the
code I don't see any problems.

Acked-by: "Eric W. Biederman" <[email protected]>

> (untested, 2.6.24-rc2-mm1 doesn't boot on my machine)
>
> Signed-off-by: Oleg Nesterov <[email protected]>
>
> --- PT/fs/proc/base.c~ 2007-11-20 19:00:05.000000000 +0300
> +++ PT/fs/proc/base.c 2007-11-20 19:06:28.000000000 +0300
> @@ -2378,6 +2378,7 @@ static struct tgid_iter next_tgid(struct
> rcu_read_lock();
> retry:
> iter.task = NULL;
> + iter.tgid++;
> pid = find_ge_pid(iter.tgid, ns);
> if (pid) {
> iter.tgid = pid_nr_ns(pid, ns);
> @@ -2394,10 +2395,8 @@ retry:
> * found doesn't happen to be a thread group leader.
> * As we don't care in the case of readdir.
> */
> - if (!iter.task || !has_group_leader_pid(iter.task)) {
> - iter.tgid += 1;
> + if (!iter.task || !has_group_leader_pid(iter.task))
> goto retry;
> - }
> get_task_struct(iter.task);
> }
> rcu_read_unlock();
> @@ -2434,10 +2433,8 @@ int proc_pid_readdir(struct file * filp,
>
> ns = filp->f_dentry->d_sb->s_fs_info;
> iter.task = NULL;
> - iter.tgid = filp->f_pos - TGID_OFFSET;
> - for (iter = next_tgid(ns, iter);
> - iter.task;
> - iter.tgid += 1, iter = next_tgid(ns, iter)) {
> + iter.tgid = filp->f_pos - TGID_OFFSET - 1;
> + while ((iter = next_tgid(ns, iter)).task) {
> filp->f_pos = iter.tgid + TGID_OFFSET;
> if (proc_pid_fill_cache(filp, dirent, filldir, iter) < 0) {
> put_task_struct(iter.task);