2008-12-12 09:55:33

by Li Zefan

[permalink] [raw]
Subject: [PATCH] sched: fix another race when reading /proc/sched_debug

I fixed an oops with the following commit:

| commit 24eb089950ce44603b30a3145a2c8520e2b55bb1
| Author: Li Zefan <[email protected]>
| Date: Thu Nov 6 12:53:32 2008 -0800
|
| cgroups: fix invalid cgrp->dentry before cgroup has been completely removed
|
| This fixes an oops when reading /proc/sched_debug.

The above commit fixed a race that reading /proc/sched_debug may access
NULL cgrp->dentry if a cgroup is being removed.

But I found there's another different race, in that reading sched_debug
may access a cgroup which hasn't been completely created, and thus
dereference NULL cgrp->dentry!

cgroup_create()
cpu_cgroup_create()
register_fair_sched_group()
list_add_rcu(...)
print_cfs_stats()
for_each_leaf_cfs_rq()
print_cfs_rq()
cgroup_path()
cgroup->dentry = dentry;

The task_group is added to the global list before the cgroup has been
completely created; if print_cfs_stats() is called at this point, it will
access the half-created cgroup.

This patch fixes the bug by holding cgroup_lock() to wait for the cgroup
to be completely created before calling cgroup_path().

Signed-off-by: Li Zefan <[email protected]>
---

The patch is based on Linus's git tree, and should go into 2.6.28,
but it conflicts with the cleanup patch in sched/core:
0a0db8f5c9d4bbb9bbfcc2b6cb6bce2d0ef4d73d

---
kernel/sched_debug.c | 10 ++++++++--
1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 26ed8e3..01abf5b 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -127,8 +127,11 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
if (tg)
cgroup = tg->css.cgroup;

- if (cgroup)
+ if (cgroup) {
+ cgroup_lock();
cgroup_path(cgroup, path, sizeof(path));
+ cgroup_unlock();
+ }

SEQ_printf(m, "\ncfs_rq[%d]:%s\n", cpu, path);
#else
@@ -181,8 +184,11 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
if (tg)
cgroup = tg->css.cgroup;

- if (cgroup)
+ if (cgroup) {
+ cgroup_lock();
cgroup_path(cgroup, path, sizeof(path));
+ cgroup_unlock();
+ }

SEQ_printf(m, "\nrt_rq[%d]:%s\n", cpu, path);
#else
--
1.5.4.rc3


2008-12-12 10:01:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug


* Li Zefan <[email protected]> wrote:

> I fixed an oops with the following commit:
>
> | commit 24eb089950ce44603b30a3145a2c8520e2b55bb1
> | Author: Li Zefan <[email protected]>
> | Date: Thu Nov 6 12:53:32 2008 -0800
> |
> | cgroups: fix invalid cgrp->dentry before cgroup has been completely removed
> |
> | This fixes an oops when reading /proc/sched_debug.
>
> The above commit fixed a race that reading /proc/sched_debug may access
> NULL cgrp->dentry if a cgroup is being removed.
>
> But I found there's another different race, in that reading sched_debug
> may access a cgroup which hasn't been completely created, and thus
> dereference NULL cgrp->dentry!
>
> cgroup_create()
> cpu_cgroup_create()
> register_fair_sched_group()
> list_add_rcu(...)
> print_cfs_stats()
> for_each_leaf_cfs_rq()
> print_cfs_rq()
> cgroup_path()
> cgroup->dentry = dentry;
>
> The task_group is added to the global list before the cgroup has been
> completely created; if print_cfs_stats() is called at this point, it will
> access the half-created cgroup.
>
> This patch fixes the bug by holding cgroup_lock() to wait for the cgroup
> to be completely created before calling cgroup_path().
>
> Signed-off-by: Li Zefan <[email protected]>

applied to tip/sched/urgent, thanks!

> The patch is based on linus's git tree, and should go into 2.6.28,
> but it conflicts with the cleanup patch in sched/core:
> 0a0db8f5c9d4bbb9bbfcc2b6cb6bce2d0ef4d73d

i merged it up in tip/master, could you please check whether it's ok?

Ingo

2008-12-12 11:38:18

by Bharata B Rao

[permalink] [raw]
Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

On Fri, Dec 12, 2008 at 3:23 PM, Li Zefan <[email protected]> wrote:
> kernel/sched_debug.c | 10 ++++++++--
> 1 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
> index 26ed8e3..01abf5b 100644
> --- a/kernel/sched_debug.c
> +++ b/kernel/sched_debug.c
> @@ -127,8 +127,11 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> if (tg)
> cgroup = tg->css.cgroup;
>
> - if (cgroup)
> + if (cgroup) {
> + cgroup_lock();
> cgroup_path(cgroup, path, sizeof(path));
> + cgroup_unlock();
> + }
>
> SEQ_printf(m, "\ncfs_rq[%d]:%s\n", cpu, path);
> #else
> @@ -181,8 +184,11 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
> if (tg)
> cgroup = tg->css.cgroup;
>
> - if (cgroup)
> + if (cgroup) {
> + cgroup_lock();
> cgroup_path(cgroup, path, sizeof(path));
> + cgroup_unlock();
> + }
>
> SEQ_printf(m, "\nrt_rq[%d]:%s\n", cpu, path);
> #else

The comment in cgroup_path() routine says that it needs to be called
with cgroup_mutex held. With the above fix, print_task() in
sched_debug.c remains the last caller of cgroup_path() which calls it
without holding cgroup_mutex. Does this also need a fix ?

Regards,
Bharata.
--
http://bharata.sulekha.com/blog/posts.htm

2008-12-13 08:23:57

by Li Zefan

[permalink] [raw]
Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

Bharata B Rao wrote:
> On Fri, Dec 12, 2008 at 3:23 PM, Li Zefan <[email protected]> wrote:
>> kernel/sched_debug.c | 10 ++++++++--
>> 1 files changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
>> index 26ed8e3..01abf5b 100644
>> --- a/kernel/sched_debug.c
>> +++ b/kernel/sched_debug.c
>> @@ -127,8 +127,11 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
>> if (tg)
>> cgroup = tg->css.cgroup;
>>
>> - if (cgroup)
>> + if (cgroup) {
>> + cgroup_lock();
>> cgroup_path(cgroup, path, sizeof(path));
>> + cgroup_unlock();
>> + }
>>
>> SEQ_printf(m, "\ncfs_rq[%d]:%s\n", cpu, path);
>> #else
>> @@ -181,8 +184,11 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
>> if (tg)
>> cgroup = tg->css.cgroup;
>>
>> - if (cgroup)
>> + if (cgroup) {
>> + cgroup_lock();
>> cgroup_path(cgroup, path, sizeof(path));
>> + cgroup_unlock();
>> + }
>>
>> SEQ_printf(m, "\nrt_rq[%d]:%s\n", cpu, path);
>> #else
>
> The comment in cgroup_path() routine says that it needs to be called
> with cgroup_mutex held. With the above fix, print_task() in
> sched_debug.c remains the last caller of cgroup_path() which calls it
> without holding cgroup_mutex. Does this also need a fix ?
>

You mean:

print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
{
...
cgroup_path(task_group(p)->css.cgroup, path, sizeof(path));
...
}

Hmm...Normally we have to take task_lock() or rcu_read_lock() to retrieve
the cgroup from the task, and as long as we hold either lock, we don't need
to take cgroup_lock().

I noticed that neither task_lock() nor rcu_read_lock() is held before calling
cgroup_path(), so I wrote a test program to see if I could trigger a bug here,
but it didn't happen. I'll dig more.

2008-12-14 02:55:37

by Li Zefan

[permalink] [raw]
Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

> i merged it up in tip/master, could you please check whether it's ok?
>

Sorry, though this patch avoids accessing a half-created cgroup, I found
the current code may still access a cgroup which has been destroyed.

The simplest fix is to take cgroup_lock() before for_each_leaf_cfs_rq.

Could you revert this patch and apply the following new one? My box has
survived for 16 hours with it applied.

==========

From: Li Zefan <[email protected]>
Date: Sun, 14 Dec 2008 09:53:28 +0800
Subject: [PATCH] sched: fix another race when reading /proc/sched_debug

I fixed an oops with the following commit:

| commit 24eb089950ce44603b30a3145a2c8520e2b55bb1
| Author: Li Zefan <[email protected]>
| Date: Thu Nov 6 12:53:32 2008 -0800
|
| cgroups: fix invalid cgrp->dentry before cgroup has been completely removed
|
| This fixes an oops when reading /proc/sched_debug.

The above commit fixed a race that reading /proc/sched_debug may access
NULL cgrp->dentry if a cgroup is being removed (via cgroup_rmdir), but
hasn't been destroyed (via cgroup_diput).

But I found there's another different race, in that reading sched_debug
may access a cgroup which is being created or has been destroyed, and thus
dereference NULL cgrp->dentry!

The task_group is added to the global list while the cgroup is being created,
and is removed from the global list while the cgroup is under destruction.
So running through the list must be protected by cgroup_lock() if cgroup
data will be accessed (here, by calling cgroup_path()).

Signed-off-by: Li Zefan <[email protected]>
---
kernel/sched_fair.c | 6 ++++++
kernel/sched_rt.c | 6 ++++++
2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 98345e4..8b2965b 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1737,9 +1737,15 @@ static void print_cfs_stats(struct seq_file *m, int cpu)
{
struct cfs_rq *cfs_rq;

+ /*
+ * Hold cgroup_lock() to avoid calling cgroup_path() with
+ * invalid cgroup.
+ */
+ cgroup_lock();
rcu_read_lock();
for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
print_cfs_rq(m, cpu, cfs_rq);
rcu_read_unlock();
+ cgroup_unlock();
}
#endif
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index d9ba9d5..e84480d 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1538,9 +1538,15 @@ static void print_rt_stats(struct seq_file *m, int cpu)
{
struct rt_rq *rt_rq;

+ /*
+ * Hold cgroup_lock() to avoid calling cgroup_path() with
+ * invalid cgroup.
+ */
+ cgroup_lock();
rcu_read_lock();
for_each_leaf_rt_rq(rt_rq, cpu_rq(cpu))
print_rt_rq(m, cpu, rt_rq);
rcu_read_unlock();
+ cgroup_unlock();
}
#endif /* CONFIG_SCHED_DEBUG */
--
1.5.4.rc3

2008-12-14 12:48:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

On Sun, 2008-12-14 at 10:54 +0800, Li Zefan wrote:
> > i merged it up in tip/master, could you please check whether it's ok?
> >
>
> Sorry, though this patch avoids accessing a half-created cgroup, I found
> the current code may still access a cgroup which has been destroyed.
>
> The simplest fix is to take cgroup_lock() before for_each_leaf_cfs_rq.
>
> Could you revert this patch and apply the following new one? My box has
> survived for 16 hours with it applied.
>
> ==========
>
> From: Li Zefan <[email protected]>
> Date: Sun, 14 Dec 2008 09:53:28 +0800
> Subject: [PATCH] sched: fix another race when reading /proc/sched_debug
>
> I fixed an oops with the following commit:
>
> | commit 24eb089950ce44603b30a3145a2c8520e2b55bb1
> | Author: Li Zefan <[email protected]>
> | Date: Thu Nov 6 12:53:32 2008 -0800
> |
> | cgroups: fix invalid cgrp->dentry before cgroup has been completely removed
> |
> | This fixes an oops when reading /proc/sched_debug.
>
> The above commit fixed a race that reading /proc/sched_debug may access
> NULL cgrp->dentry if a cgroup is being removed (via cgroup_rmdir), but
> hasn't been destroyed (via cgroup_diput).
>
> But I found there's another different race, in that reading sched_debug
> may access a cgroup which is being created or has been destroyed, and thus
> dereference NULL cgrp->dentry!
>
> task_group is added to the global list while the cgroup is being created,
> and is removed from the global list while the cgroup is under destruction.
> So running through the list should be protected by cgroup_lock(), if
> cgroup data will be accessed (here by calling cgroup_path).

Can't we detect a dead task-group and skip those instead of adding this
global lock?

> Signed-off-by: Li Zefan <[email protected]>
> ---
> kernel/sched_fair.c | 6 ++++++
> kernel/sched_rt.c | 6 ++++++
> 2 files changed, 12 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 98345e4..8b2965b 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1737,9 +1737,15 @@ static void print_cfs_stats(struct seq_file *m, int cpu)
> {
> struct cfs_rq *cfs_rq;
>
> + /*
> + * Hold cgroup_lock() to avoid calling cgroup_path() with
> + * invalid cgroup.
> + */
> + cgroup_lock();
> rcu_read_lock();
> for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
> print_cfs_rq(m, cpu, cfs_rq);
> rcu_read_unlock();
> + cgroup_unlock();
> }
> #endif
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index d9ba9d5..e84480d 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -1538,9 +1538,15 @@ static void print_rt_stats(struct seq_file *m, int cpu)
> {
> struct rt_rq *rt_rq;
>
> + /*
> + * Hold cgroup_lock() to avoid calling cgroup_path() with
> + * invalid cgroup.
> + */
> + cgroup_lock();
> rcu_read_lock();
> for_each_leaf_rt_rq(rt_rq, cpu_rq(cpu))
> print_rt_rq(m, cpu, rt_rq);
> rcu_read_unlock();
> + cgroup_unlock();
> }
> #endif /* CONFIG_SCHED_DEBUG */

2008-12-15 01:26:53

by Li Zefan

[permalink] [raw]
Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

Peter Zijlstra wrote:
> On Sun, 2008-12-14 at 10:54 +0800, Li Zefan wrote:
>>> i merged it up in tip/master, could you please check whether it's ok?
>>>
>> Sorry, though this patch avoids accessing a half-created cgroup, I found
>> the current code may still access a cgroup which has been destroyed.
>>
>> The simplest fix is to take cgroup_lock() before for_each_leaf_cfs_rq.
>>
>> Could you revert this patch and apply the following new one? My box has
>> survived for 16 hours with it applied.
>>
>> ==========
>>
>> From: Li Zefan <[email protected]>
>> Date: Sun, 14 Dec 2008 09:53:28 +0800
>> Subject: [PATCH] sched: fix another race when reading /proc/sched_debug
>>
>> I fixed an oops with the following commit:
>>
>> | commit 24eb089950ce44603b30a3145a2c8520e2b55bb1
>> | Author: Li Zefan <[email protected]>
>> | Date: Thu Nov 6 12:53:32 2008 -0800
>> |
>> | cgroups: fix invalid cgrp->dentry before cgroup has been completely removed
>> |
>> | This fixes an oops when reading /proc/sched_debug.
>>
>> The above commit fixed a race that reading /proc/sched_debug may access
>> NULL cgrp->dentry if a cgroup is being removed (via cgroup_rmdir), but
>> hasn't been destroyed (via cgroup_diput).
>>
>> But I found there's another different race, in that reading sched_debug
>> may access a cgroup which is being created or has been destroyed, and thus
>> dereference NULL cgrp->dentry!
>>
>> task_group is added to the global list while the cgroup is being created,
>> and is removed from the global list while the cgroup is under destruction.
>> So running through the list should be protected by cgroup_lock(), if
>> cgroup data will be accessed (here by calling cgroup_path).
>
> Can't we detect a dead task-group and skip those instead of adding this
> global lock?
>

I tried it, but I don't think it's feasible without lock synchronization:

| print_cfs_rq()
| check task_group is dead
cgroup_diput() |
.. |
mark task_group as dead |
.. |
kfree(cgrp) |
| call cgroup_path()

>> Signed-off-by: Li Zefan <[email protected]>
>> ---
>> kernel/sched_fair.c | 6 ++++++
>> kernel/sched_rt.c | 6 ++++++
>> 2 files changed, 12 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
>> index 98345e4..8b2965b 100644
>> --- a/kernel/sched_fair.c
>> +++ b/kernel/sched_fair.c
>> @@ -1737,9 +1737,15 @@ static void print_cfs_stats(struct seq_file *m, int cpu)
>> {
>> struct cfs_rq *cfs_rq;
>>
>> + /*
>> + * Hold cgroup_lock() to avoid calling cgroup_path() with
>> + * invalid cgroup.
>> + */
>> + cgroup_lock();
>> rcu_read_lock();
>> for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
>> print_cfs_rq(m, cpu, cfs_rq);
>> rcu_read_unlock();
>> + cgroup_unlock();
>> }
>> #endif
>> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
>> index d9ba9d5..e84480d 100644
>> --- a/kernel/sched_rt.c
>> +++ b/kernel/sched_rt.c
>> @@ -1538,9 +1538,15 @@ static void print_rt_stats(struct seq_file *m, int cpu)
>> {
>> struct rt_rq *rt_rq;
>>
>> + /*
>> + * Hold cgroup_lock() to avoid calling cgroup_path() with
>> + * invalid cgroup.
>> + */
>> + cgroup_lock();
>> rcu_read_lock();
>> for_each_leaf_rt_rq(rt_rq, cpu_rq(cpu))
>> print_rt_rq(m, cpu, rt_rq);
>> rcu_read_unlock();
>> + cgroup_unlock();
>> }
>> #endif /* CONFIG_SCHED_DEBUG */
>
>
>

2008-12-15 01:40:50

by Li Zefan

[permalink] [raw]
Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

Paul Menage wrote:
> I sent out some patches last week (search for hierarchy_mutex) that would
> mean that you'd only need to take a subsystem-local lock to keep a cgroup
> alive. People seemed to like them so I'll tweak them based on feedback and
> send them on to Andrew.
>

Unfortunately, AFAICS the proposed hierarchy_mutex can't solve this bug. :(

> Paul
>
> On Dec 14, 2008 4:48 AM, "Peter Zijlstra" <[email protected]> wrote:
>
> On Sun, 2008-12-14 at 10:54 +0800, Li Zefan wrote: > > i merged it up in
> tip/master, could you pleas...
> Can't we detect a dead task-group and skip those instead of adding this
> global lock?
>
>> Signed-off-by: Li Zefan <[email protected]> > --- > kernel/sched_fair.c
> | 6 ++++++ > kerne...
>

2008-12-15 01:52:06

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

On Mon, 15 Dec 2008 09:39:18 +0800
Li Zefan <[email protected]> wrote:

> Paul Menage wrote:
> > I sent out some patches last week (search for hierarchy_mutex) that would
> > mean that you'd only need to take a subsystem-local lock to keep a cgroup
> > alive. People seemed to like them so I'll tweak them based on feedback and
> > send them on to Andrew.
> >
>
> Unfortunately, AFAICS the proposed hierarchy_mutex can't solve this bug. :(
>
Hmm? How about this approach, if cgroup->dentry is the problem?

at creation:
- cpu_cgroup_populate() should record in "tg" that "this cgroup has a valid dentry"
at deletion:
- css_tryget() will be useful to avoid the race.

thx,
-Kame

> > Paul
> >
> > On Dec 14, 2008 4:48 AM, "Peter Zijlstra" <[email protected]> wrote:
> >
> > On Sun, 2008-12-14 at 10:54 +0800, Li Zefan wrote: > > i merged it up in
> > tip/master, could you pleas...
> > Can't we detect a dead task-group and skip those instead of adding this
> > global lock?
> >
> >> Signed-off-by: Li Zefan <[email protected]> > --- > kernel/sched_fair.c
> > | 6 ++++++ > kerne...
> >
>
>

2008-12-15 02:13:29

by Li Zefan

[permalink] [raw]
Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

KAMEZAWA Hiroyuki wrote:
> On Mon, 15 Dec 2008 09:39:18 +0800
> Li Zefan <[email protected]> wrote:
>
>> Paul Menage wrote:
>>> I sent out some patches last week (search for hierarchy_mutex) that would
>>> mean that you'd only need to take a subsystem-local lock to keep a cgroup
>>> alive. People seemed to like them so I'll tweak them based on feedback and
>>> send them on to Andrew.
>>>
>> Unfortunately, AFAICS the proposed hierarchy_mutex can't solve this bug. :(
>>
> Hmm? How about this approach, if cgroup->dentry is the problem?
>

Sounds feasible, though the resulting code might be a bit tricky.

> at creation:
> - cpu_cgroup_populate() should record in "tg" that "this cgroup has a valid dentry"

This avoids /proc/sched_debug accessing an invalid cgrp dentry while a cgroup
is being created.

> at deletion
> - css_tryget() will be useful to avoid the race.

And this avoids accessing a dead task_group.

Since css_tryget() is 2.6.29 material, I think we can have this bug fixed for
2.6.28 by holding cgroup_lock(), and then try to remove the global lock for 2.6.29?

2008-12-15 08:13:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

On Mon, 2008-12-15 at 09:25 +0800, Li Zefan wrote:
> Peter Zijlstra wrote:
> > On Sun, 2008-12-14 at 10:54 +0800, Li Zefan wrote:
> >>> i merged it up in tip/master, could you please check whether it's ok?
> >>>
> >> Sorry, though this patch avoids accessing a half-created cgroup, I found
> >> the current code may still access a cgroup which has been destroyed.
> >>
> >> The simplest fix is to take cgroup_lock() before for_each_leaf_cfs_rq.
> >>
> >> Could you revert this patch and apply the following new one? My box has
> >> survived for 16 hours with it applied.
> >>
> >> ==========
> >>
> >> From: Li Zefan <[email protected]>
> >> Date: Sun, 14 Dec 2008 09:53:28 +0800
> >> Subject: [PATCH] sched: fix another race when reading /proc/sched_debug
> >>
> >> I fixed an oops with the following commit:
> >>
> >> | commit 24eb089950ce44603b30a3145a2c8520e2b55bb1
> >> | Author: Li Zefan <[email protected]>
> >> | Date: Thu Nov 6 12:53:32 2008 -0800
> >> |
> >> | cgroups: fix invalid cgrp->dentry before cgroup has been completely removed
> >> |
> >> | This fixes an oops when reading /proc/sched_debug.
> >>
> >> The above commit fixed a race that reading /proc/sched_debug may access
> >> NULL cgrp->dentry if a cgroup is being removed (via cgroup_rmdir), but
> >> hasn't been destroyed (via cgroup_diput).
> >>
> >> But I found there's another different race, in that reading sched_debug
> >> may access a cgroup which is being created or has been destroyed, and thus
> >> dereference NULL cgrp->dentry!
> >>
> >> task_group is added to the global list while the cgroup is being created,
> >> and is removed from the global list while the cgroup is under destruction.
> >> So running through the list should be protected by cgroup_lock(), if
> >> cgroup data will be accessed (here by calling cgroup_path).
> >
> > Can't we detect a dead task-group and skip those instead of adding this
> > global lock?
> >
>
> I tried it, but I don't think it's feasible without lock synchronization:
>
> | print_cfs_rq()
> | check task_group is dead
> cgroup_diput() |
> .. |
> mark task_group as dead |
> .. |
> kfree(cgrp) |
> | call cgroup_path()

rcu free cgrp


2008-12-15 09:52:45

by Li Zefan

[permalink] [raw]
Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

>>> Can't we detect a dead task-group and skip those instead of adding this
>>> global lock?
>>>
>> I tried it, but I don't think it's feasible without lock synchronization:
>>
>> | print_cfs_rq()
>> | check task_group is dead
>> cgroup_diput() |
>> .. |
>> mark task_group as dead |
>> .. |
>> kfree(cgrp) |
>> | call cgroup_path()
>
> rcu free cgrp
>

I got your point, thanks.

Another way is to use css_tryget(), which avoids touching cgroup.c and adding
synchronize_rcu(). css_tryget() was proposed by Kamezawa, but I think it won't
be available until 2.6.29.

Anyway, here is the fix. I'll post a complete version with changelog when we
agree on how to fix it.

Signed-off-by: Li Zefan <[email protected]>
---
kernel/cgroup.c | 6 ++++++
kernel/sched_debug.c | 17 +++++++++++++++--
2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index fe00b3b..3c54d1b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -624,6 +624,12 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
* created the cgroup */
deactivate_super(cgrp->root->sb);

+ /*
+ * Some subsystems (cpu cgroup) might still be
+ * accessing the cgroup in an rcu section.
+ */
+ synchronize_rcu();
+
kfree(cgrp);
}
iput(inode);
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 26ed8e3..174c072 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -127,8 +127,14 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
if (tg)
cgroup = tg->css.cgroup;

- if (cgroup)
+ if (cgroup) {
+ /*
+ * This task_group is dead, or we raced with cgroup creation.
+ */
+ if (cgroup_is_removed(cgroup) || !cgroup->dentry)
+ return;
cgroup_path(cgroup, path, sizeof(path));
+ }

SEQ_printf(m, "\ncfs_rq[%d]:%s\n", cpu, path);
#else
@@ -181,8 +187,15 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
if (tg)
cgroup = tg->css.cgroup;

- if (cgroup)
+ if (cgroup) {
+ /*
+ * This task_group is dead, or we raced with cgroup creation.
+ */
+ if (cgroup_is_removed(cgroup) || !cgroup->dentry)
+ return;
+
cgroup_path(cgroup, path, sizeof(path));
+ }

SEQ_printf(m, "\nrt_rq[%d]:%s\n", cpu, path);
#else
--
1.5.4.rc3

2008-12-15 10:44:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

On Mon, 2008-12-15 at 17:51 +0800, Li Zefan wrote:
> >>> Can't we detect a dead task-group and skip those instead of adding this
> >>> global lock?
> >>>
> >> I tried it, but I don't think it's feasible without lock synchronization:
> >>
> >> | print_cfs_rq()
> >> | check task_group is dead
> >> cgroup_diput() |
> >> .. |
> >> mark task_group as dead |
> >> .. |
> >> kfree(cgrp) |
> >> | call cgroup_path()
> >
> > rcu free cgrp
> >
>
> I got your point, thanks.
>
> Another way is to use css_tryget(), which avoids touching cgroup.c and adding
> synchronize_rcu(). css_tryget() was proposed by Kamezawa, but I think it won't
> be available until 2.6.29.
>
> Anyway, here is the fix. I'll post a complete version with changelog when we
> agree on how to fix it.

Yeah, the below looks like a suitable fix.

> Signed-off-by: Li Zefan <[email protected]>
> ---
> kernel/cgroup.c | 6 ++++++
> kernel/sched_debug.c | 17 +++++++++++++++--
> 2 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index fe00b3b..3c54d1b 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -624,6 +624,12 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
> * created the cgroup */
> deactivate_super(cgrp->root->sb);
>
> + /*
> + * Some subsystems (cpu cgroup) might still be
> + * accessing the cgroup in an rcu section.
> + */
> + synchronize_rcu();
> +
> kfree(cgrp);
> }
> iput(inode);

Is there any reason we cannot use call_rcu() ?

> diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
> index 26ed8e3..174c072 100644
> --- a/kernel/sched_debug.c
> +++ b/kernel/sched_debug.c
> @@ -127,8 +127,14 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> if (tg)
> cgroup = tg->css.cgroup;
>
> - if (cgroup)
> + if (cgroup) {
> + /*
> + * This task_group is dead, or we raced with cgroup creation.
> + */
> + if (cgroup_is_removed(cgroup) || !cgroup->dentry)
> + return;
> cgroup_path(cgroup, path, sizeof(path));
> + }

Perhaps wrap that check in a cgroup_*() helper? That would avoid the
duplication, be clearer and not open code the ->dentry assumption.

cgroup_is_active() perhaps?

> SEQ_printf(m, "\ncfs_rq[%d]:%s\n", cpu, path);
> #else
> @@ -181,8 +187,15 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
> if (tg)
> cgroup = tg->css.cgroup;
>
> - if (cgroup)
> + if (cgroup) {
> + /*
> + * This task_group is dead, or we raced with cgroup creation.
> + */
> + if (cgroup_is_removed(cgroup) || !cgroup->dentry)
> + return;
> +
> cgroup_path(cgroup, path, sizeof(path));
> + }
>
> SEQ_printf(m, "\nrt_rq[%d]:%s\n", cpu, path);
> #else

2008-12-15 11:10:00

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

On Mon, 15 Dec 2008 11:43:49 +0100
Peter Zijlstra <[email protected]> wrote:

> > + if (cgroup_is_removed(cgroup) || !cgroup->dentry)
> > + return;
> > cgroup_path(cgroup, path, sizeof(path));
> > + }
>
> Perhaps wrap that check in a cgroup_*() helper? That would avoid the
> duplication, be clearer and not open code the ->dentry assumption.
>
> cgroup_is_active() perhaps?
>

I vote for cgroup_is_populated().

Thx,
-Kame

2008-12-16 05:50:20

by Li Zefan

[permalink] [raw]
Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

KAMEZAWA Hiroyuki wrote:
> On Mon, 15 Dec 2008 11:43:49 +0100
> Peter Zijlstra <[email protected]> wrote:
>
>>> + if (cgroup_is_removed(cgroup) || !cgroup->dentry)
>>> + return;
>>> cgroup_path(cgroup, path, sizeof(path));
>>> + }
>> Perhaps wrap that check in a cgroup_*() helper? That would avoid the
>> duplication, be clearer and not open code the ->dentry assumption.
>>
>> cgroup_is_active() perhaps?
>>

We are fixing this particular race, so open coding it with a comment documents
the problem more clearly.

And this wrapper shouldn't be used by other cgroup users; if someone needs to
do these checks, he should have a reason, and a comment is needed to explain
what's happening.

Also, I can't think of an appropriate name..

>
> I vote for cgroup_is_populated().
>

But non-populated means the directory isn't filled up with control files;
this function name doesn't indicate whether the dir dentry itself is valid..

2008-12-16 07:00:46

by Li Zefan

[permalink] [raw]
Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

How about this:

====================

I fixed an oops with the following commit:

| commit 24eb089950ce44603b30a3145a2c8520e2b55bb1
| Author: Li Zefan <[email protected]>
| Date: Thu Nov 6 12:53:32 2008 -0800
|
| cgroups: fix invalid cgrp->dentry before cgroup has been completely removed
|
| This fixes an oops when reading /proc/sched_debug.

The above commit fixed a race that reading /proc/sched_debug may access
NULL cgrp->dentry if a cgroup is being removed (via cgroup_rmdir), but
hasn't been destroyed (via cgroup_diput).

But I found there's another different race, in that reading sched_debug
may access a cgroup which is being created or has been destroyed, and thus
dereference NULL cgrp->dentry!

We fix the former issue by checking whether the cgroup has been completely
created, and fix the latter by synchronizing the freeing of the cgroup with rcu.

Signed-off-by: Li Zefan <[email protected]>
---
include/linux/cgroup.h | 12 +++++++++++-
kernel/cgroup.c | 14 ++++++++------
kernel/sched_debug.c | 18 ++++++++++++++----
3 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 1164963..23854ec 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -91,6 +91,8 @@ static inline void css_put(struct cgroup_subsys_state *css)

/* bits in struct cgroup flags field */
enum {
+ /* Control Group is completely created */
+ CGRP_CREATED,
/* Control Group is dead */
CGRP_REMOVED,
/* Control Group has previously had a child cgroup or a task,
@@ -303,7 +305,15 @@ int cgroup_add_files(struct cgroup *cgrp,
const struct cftype cft[],
int count);

-int cgroup_is_removed(const struct cgroup *cgrp);
+static inline int cgroup_is_removed(const struct cgroup *cgrp)
+{
+ return test_bit(CGRP_REMOVED, &cgrp->flags);
+}
+
+static inline int cgroup_is_created(const struct cgroup *cgrp)
+{
+ return test_bit(CGRP_CREATED, &cgrp->flags);
+}

int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen);

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index fe00b3b..364e8a3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -118,12 +118,6 @@ static int root_count;
static int need_forkexit_callback __read_mostly;
static int need_mm_owner_callback __read_mostly;

-/* convenient tests for these bits */
-inline int cgroup_is_removed(const struct cgroup *cgrp)
-{
- return test_bit(CGRP_REMOVED, &cgrp->flags);
-}
-
/* bits in struct cgroupfs_root flags field */
enum {
ROOT_NOPREFIX, /* mounted subsystems have no named prefix */
@@ -624,6 +618,12 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
* created the cgroup */
deactivate_super(cgrp->root->sb);

+ /*
+ * Some subsystems (cpu cgroup) might still be
+ * accessing the cgroup in an rcu section.
+ */
+ synchronize_rcu();
+
kfree(cgrp);
}
iput(inode);
@@ -2391,6 +2391,8 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
err = cgroup_populate_dir(cgrp);
/* If err < 0, we have a half-filled directory - oh well ;) */

+ set_bit(CGRP_CREATED, &cgrp->flags);
+
mutex_unlock(&cgroup_mutex);
mutex_unlock(&cgrp->dentry->d_inode->i_mutex);

diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 26ed8e3..ae35d1a 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -127,8 +127,13 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
if (tg)
cgroup = tg->css.cgroup;

- if (cgroup)
- cgroup_path(cgroup, path, sizeof(path));
+ /*
+ * The task_group is dead, or we raced with cgroup creation.
+ */
+ if (!cgroup || !cgroup_is_created(cgroup) || cgroup_is_removed(cgroup))
+ return;
+
+ cgroup_path(cgroup, path, sizeof(path));

SEQ_printf(m, "\ncfs_rq[%d]:%s\n", cpu, path);
#else
@@ -181,8 +186,13 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
if (tg)
cgroup = tg->css.cgroup;

- if (cgroup)
- cgroup_path(cgroup, path, sizeof(path));
+ /*
+ * The task_group is dead, or we raced with cgroup creation.
+ */
+ if (!cgroup || !cgroup_is_created(cgroup) || cgroup_is_removed(cgroup))
+ return;
+
+ cgroup_path(cgroup, path, sizeof(path));

SEQ_printf(m, "\nrt_rq[%d]:%s\n", cpu, path);
#else

2008-12-16 08:02:45

by Li Zefan

Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

Li Zefan wrote:
>> i merged it up in tip/master, could you please check whether it's ok?
>>
>
> Sorry, though this patch avoids accessing a half-created cgroup, I found
> the current code may access a cgroup which has been destroyed.
>
> The simplest fix is to take cgroup_lock() before for_each_leaf_cfs_rq.
>
> Could you revert this patch and apply the following new one? My box has
> survived for 16 hours with it applied.
>

Hi, Ingo

Can we have this bug fixed for 2.6.28 using this patch? It is the
simplest fix and has been fully tested.

Since it's not a hot path, it's acceptable to hold cgroup_lock, and I've
checked the call chains to make sure it's deadlock free.

For 2.6.29 I'll remove this cgroup_lock in a way that is a bit tricky and
needs more testing.

This bug can trigger oops quite easily:

BUG: unable to handle kernel NULL pointer dereference at 00000038
IP: [<c0457c66>] cgroup_path+0x34/0x8b
...
Call Trace:
[<c041f12a>] ? print_cfs_rq+0x72/0x4f0
[<c041ff09>] ? sched_debug_show+0x961/0xe52
[<c04a6728>] ? seq_read+0x25/0x2a8
[<c05f79bd>] ? mutex_lock_nested+0x20a/0x212
[<c0447c60>] ? trace_hardirqs_on+0xb/0xd
[<c04a6755>] ? seq_read+0x52/0x2a8
[<c04a67f8>] ? seq_read+0xf5/0x2a8
[<c04a6703>] ? seq_read+0x0/0x2a8
[<c04bfcee>] ? proc_reg_read+0x60/0x74
[<c04bfc8e>] ? proc_reg_read+0x0/0x74
[<c04938fc>] ? vfs_read+0x8a/0x131
[<c0493c5d>] ? sys_read+0x3b/0x60
[<c04039a5>] ? sysenter_do_call+0x12/0x31
Code: ec 08 3d b4 d3 b5 c0 89 55 ec 75 0f 8b 45 ec ba 5e ec 6a c0 e8 dc 22 0a 00 eb 57 03 4d ec 89 c8 89 4d f0 48 c6 41 ff 00 8b 73 1c <8b> 56 38 29 d0 3b 45 ec 72 41 89 d1 8b 76 3c 89 c7 c1 e9 02 f3
EIP: [<c0457c66>] cgroup_path+0x34/0x8b SS:ESP 0068:e187fd84


> ==========
>
> From: Li Zefan <[email protected]>
> Date: Sun, 14 Dec 2008 09:53:28 +0800
> Subject: [PATCH] sched: fix another race when reading /proc/sched_debug
>
> I fixed an oops with the following commit:
>
> | commit 24eb089950ce44603b30a3145a2c8520e2b55bb1
> | Author: Li Zefan <[email protected]>
> | Date: Thu Nov 6 12:53:32 2008 -0800
> |
> | cgroups: fix invalid cgrp->dentry before cgroup has been completely removed
> |
> | This fixes an oops when reading /proc/sched_debug.
>
> The above commit fixed a race that reading /proc/sched_debug may access
> NULL cgrp->dentry if a cgroup is being removed (via cgroup_rmdir), but
> hasn't been destroyed (via cgroup_diput).
>
> But I found there's another different race, in that reading sched_debug
> may access a cgroup which is being created or has been destroyed, and thus
> dereference NULL cgrp->dentry!
>
> task_group is added to the global list while the cgroup is being created,
> and is removed from the global list while the cgroup is under destruction.
> So running through the list should be protected by cgroup_lock(), if
> cgroup data will be accessed (here by calling cgroup_path).
>
> Signed-off-by: Li Zefan <[email protected]>
> ---
> kernel/sched_fair.c | 6 ++++++
> kernel/sched_rt.c | 6 ++++++
> 2 files changed, 12 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 98345e4..8b2965b 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1737,9 +1737,15 @@ static void print_cfs_stats(struct seq_file *m, int cpu)
> {
> struct cfs_rq *cfs_rq;
>
> + /*
> + * Hold cgroup_lock() to avoid calling cgroup_path() with
> + * invalid cgroup.
> + */
> + cgroup_lock();
> rcu_read_lock();
> for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
> print_cfs_rq(m, cpu, cfs_rq);
> rcu_read_unlock();
> + cgroup_unlock();
> }
> #endif
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index d9ba9d5..e84480d 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -1538,9 +1538,15 @@ static void print_rt_stats(struct seq_file *m, int cpu)
> {
> struct rt_rq *rt_rq;
>
> + /*
> + * Hold cgroup_lock() to avoid calling cgroup_path() with
> + * invalid cgroup.
> + */
> + cgroup_lock();
> rcu_read_lock();
> for_each_leaf_rt_rq(rt_rq, cpu_rq(cpu))
> print_rt_rq(m, cpu, rt_rq);
> rcu_read_unlock();
> + cgroup_unlock();
> }
> #endif /* CONFIG_SCHED_DEBUG */

2008-12-16 09:24:01

by Paul Menage

Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

On Sun, Dec 14, 2008 at 5:50 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>
> at creation:
> - cpu_cgroup_populate() should record "tg" that "this cgroup has valid dentry"
> at deletion

Wouldn't moving the call to cgroup_create_dir() to before doing
subsystem initialization fix this problem more straightforwardly? Then
by the time the cpu subsystem create() callback is called, the dentry will
already be valid.

Paul

2008-12-16 09:41:26

by Li Zefan

Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

Paul Menage wrote:
> On Sun, Dec 14, 2008 at 5:50 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
>> at creation:
>> - cpu_cgroup_populate() should record "tg" that "this cgroup has valid dentry"
>> at deletion
>
> Wouldn't moving the call to cgroup_create_dir() to before doing
> subsystem initialization fix this problem more straightforwardly? Then
> by the time the cpu subsystem create() callback is called, the dentry will
> already be valid.
>

Yes, that's true, and it was actually the first idea that came to my mind when I
started to fix this bug, but I was not sure whether it was an appropriate approach.
Now I think it's ok, and we'd better add a comment to indicate that we want
cgrp->dentry to be valid before calling the subsystem create() method.

Note that in print_cfs_rq(), due to the race, tg->css.cgroup can be NULL.
Fortunately the code checks for this condition, but I think that is just by
accident, not out of awareness that there is a race.

2008-12-16 09:41:59

by Paul Menage

Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

On Mon, Dec 15, 2008 at 2:43 AM, Peter Zijlstra <[email protected]> wrote:
>>
>> + /*
>> + * Some subsystems (cpu cgroup) might still be able to
>> + * accessing the cgroup in rcu section.
>> + */
>> + synchronize_rcu();
>> +
>> kfree(cgrp);
>> }
>> iput(inode);
>
> Is there any reason we cannot use call_rcu() ?

It should be OK to rcu-free cgrp and have it still safe to call
cgroup_path() on it - as long as we're confident that the dentry (and
all its parents) won't be released until after the RCU section as
well, since that's where we get the path elements from. And that does
seem to be the case from looking at dcache.c

It's certainly nicer than having two calls to synchronize_rcu() in
quick succession in cgroup_diput().
>
> Perhaps wrap that check in a cgroup_*() helper? That would avoid the
> duplication, be clearer and not open code the ->dentry assumption.

I think it would be simpler to not have the cgroup exposed to the
subsystem until the dentry and other structures have been created. The
current order (create subsystems, then create fs structures) is more
or less a carry-over from when I extended cpusets, rather than an
explicit design choice.

Paul

2008-12-16 12:24:24

by Ingo Molnar

Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug


* Li Zefan <[email protected]> wrote:

> Li Zefan wrote:
> >> i merged it up in tip/master, could you please check whether it's ok?
> >>
> >
> > Sorry, though this patch avoids accessing a half-created cgroup, I found
> > the current code may access a cgroup which has been destroyed.
> >
> > The simplest fix is to take cgroup_lock() before for_each_leaf_cfs_rq.
> >
> > Could you revert this patch and apply the following new one? My box has
> > survived for 16 hours with it applied.
> >
>
> Hi, Ingo
>
> Can we have this bug fixed for 2.6.28 using this patch? It is the
> simplest fix and has been fully tested.

the mutex used by cgroup_lock() is pretty crappy to nest inside the
runqueue lock:

BUG: sleeping function called from invalid context at kernel/mutex7
in_atomic(): 0, irqs_disabled(): 1, pid: 1790, name: cat
2 locks held by cat/1790:
#0: (&p->lock){--..}, at: [<c02a6328>] seq_read+0x25/0x2b8
#1: (tasklist_lock){..--}, at: [<c021f5bc>]
sched_debug_show+0xaa7/0xe90
Call Trace:
[<c0222612>] __might_sleep+0xd6/0xdd
[<c05ff1eb>] mutex_lock_nested+0x1d/0x245
[<c021f88b>] ? sched_debug_show+0xd76/0xe90
[<c0256223>] cgroup_lock+0xf/0x11
[<c021f893>] sched_debug_show+0xd7e/0xe90
[<c0248696>] ? __lock_acquire+0x637/0x69d
[<c028e7cd>] ? check_object+0x111/0x18c
[<c028f435>] ? kmem_cache_alloc+0x70/0xa5
[<c02a6355>] ? seq_read+0x52/0x2b8
[<c0246ee2>] ? trace_hardirqs_on_caller+0x105/0x13d
[<c0246f25>] ? trace_hardirqs_on+0xb/0xd
[<c02a6355>] ? seq_read+0x52/0x2b8
[<c02a63f7>] seq_read+0xf4/0x2b8
[<c02a6303>] ? seq_read+0x0/0x2b8
[<c02c411a>] proc_reg_read+0x60/0x74

so i had to remove your patch.

also, while looking at what cgroup_lock does - it's a trivial wrapper:

void cgroup_lock(void)
{
mutex_lock(&cgroup_mutex);
}

why isn't that done explicitly at all usage sites? If cgroup-unaware code
ever has to take the cgroup lock outside of CONFIG_CGROUPS, that's a code
structure problem.

Ingo

2008-12-16 12:42:42

by Paul Menage

Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

On Tue, Dec 16, 2008 at 1:41 AM, Paul Menage <[email protected]> wrote:
>
> It should be OK to rcu-free cgrp and have it still safe to call
> cgroup_path() on it - as long as we're confident that the dentry (and
> all its parents) won't be released until after the RCU section as
> well, since that's where we get the path elements from. And that does
> seem to be the case from looking at dcache.c
>
> It's certainly nicer than having two calls to synchronize_rcu() in
> quick succession in cgroup_diput().

How about something like this?

Paul

include/linux/cgroup.h | 9 +++++++++
kernel/cgroup.c | 49 ++++++++++++++++++++++++++++---------------------
2 files changed, 37 insertions(+), 21 deletions(-)

Index: cgroup-rcu-mmotm-2008-12-09/include/linux/cgroup.h
===================================================================
--- cgroup-rcu-mmotm-2008-12-09.orig/include/linux/cgroup.h
+++ cgroup-rcu-mmotm-2008-12-09/include/linux/cgroup.h
@@ -145,6 +145,9 @@ struct cgroup {
int pids_use_count;
/* Length of the current tasks_pids array */
int pids_length;
+
+ /* For RCU-protected deletion */
+ struct rcu_head rcu_head;
};

/* A css_set is a structure holding pointers to a set of
@@ -305,6 +308,12 @@ int cgroup_add_files(struct cgroup *cgrp

int cgroup_is_removed(const struct cgroup *cgrp);

+/*
+ * cgroup_path() fills in a filesystem-like path for the given cgroup
+ * into "buf", up to "buflen" characters. Should be called with
+ * cgroup_mutex held, or else in an RCU section with an RCU-protected
+ * cgroup reference
+ */
int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen);

int cgroup_task_count(const struct cgroup *cgrp);
Index: cgroup-rcu-mmotm-2008-12-09/kernel/cgroup.c
===================================================================
--- cgroup-rcu-mmotm-2008-12-09.orig/kernel/cgroup.c
+++ cgroup-rcu-mmotm-2008-12-09/kernel/cgroup.c
@@ -271,7 +271,7 @@ static void __put_css_set(struct css_set

rcu_read_lock();
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
- struct cgroup *cgrp = cg->subsys[i]->cgroup;
+ struct cgroup *cgrp = rcu_dereference(cg->subsys[i]->cgroup);
if (atomic_dec_and_test(&cgrp->count) &&
notify_on_release(cgrp)) {
if (taskexit)
@@ -595,6 +595,18 @@ static void cgroup_call_pre_destroy(stru
return;
}

+static void __free_cgroup_rcu(struct rcu_head *obj)
+{
+ struct cgroup *cgrp = container_of(obj, struct cgroup, rcu_head);
+ /*
+ * Drop the active superblock reference that we took when we
+ * created the cgroup
+ */
+ deactivate_super(cgrp->root->sb);
+
+ kfree(cgrp);
+}
+
static void cgroup_diput(struct dentry *dentry, struct inode *inode)
{
/* is dentry a directory ? if so, kfree() associated cgroup */
@@ -602,13 +614,6 @@ static void cgroup_diput(struct dentry *
struct cgroup *cgrp = dentry->d_fsdata;
struct cgroup_subsys *ss;
BUG_ON(!(cgroup_is_removed(cgrp)));
- /* It's possible for external users to be holding css
- * reference counts on a cgroup; css_put() needs to
- * be able to access the cgroup after decrementing
- * the reference count in order to know if it needs to
- * queue the cgroup to be handled by the release
- * agent */
- synchronize_rcu();

mutex_lock(&cgroup_mutex);
/*
@@ -620,11 +625,7 @@ static void cgroup_diput(struct dentry *
cgrp->root->number_of_cgroups--;
mutex_unlock(&cgroup_mutex);

- /* Drop the active superblock reference that we took when we
- * created the cgroup */
- deactivate_super(cgrp->root->sb);
-
- kfree(cgrp);
+ call_rcu(&cgrp->rcu_head, __free_cgroup_rcu);
}
iput(inode);
}
@@ -1134,8 +1135,9 @@ static inline struct cftype *__d_cft(str
* @buf: the buffer to write the path into
* @buflen: the length of the buffer
*
- * Called with cgroup_mutex held. Writes path of cgroup into buf.
- * Returns 0 on success, -errno on error.
+ * Called with cgroup_mutex held or else with an RCU-protected cgroup
+ * reference. Writes path of cgroup into buf. Returns 0 on success,
+ * -errno on error.
*/
int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
{
@@ -2440,12 +2442,16 @@ static int cgroup_has_css_refs(struct cg
if (ss->root != cgrp->root)
continue;
css = cgrp->subsys[ss->subsys_id];
- /* When called from check_for_release() it's possible
- * that by this point the cgroup has been removed
- * and the css deleted. But a false-positive doesn't
- * matter, since it can only happen if the cgroup
- * has been deleted and hence no longer needs the
- * release agent to be called anyway. */
+ /*
+ * When called from check_for_release() it's possible
+ * that by this point the cgroup has been removed and
+ * the css deleted (since css objects are not
+ * necessarily RCU-protected). But a false-positive
+ * doesn't matter, since it can only happen if the
+ * cgroup (which *is* RCU-protected) has been removed
+ * and hence no longer needs the release agent to be
+ * called anyway.
+ */
if (css && atomic_read(&css->refcnt))
return 1;
}
@@ -2487,6 +2493,7 @@ static int cgroup_rmdir(struct inode *un
return -EBUSY;
}

+ /* Synchronize with check_for_release() */
spin_lock(&release_list_lock);
set_bit(CGRP_REMOVED, &cgrp->flags);
if (!list_empty(&cgrp->release_list))

2008-12-16 12:56:51

by Li Zefan

Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

Paul Menage wrote:
> On Tue, Dec 16, 2008 at 1:41 AM, Paul Menage <[email protected]> wrote:
>> It should be OK to rcu-free cgrp and have it still safe to call
>> cgroup_path() on it - as long as we're confident that the dentry (and
>> all its parents) won't be released until after the RCU section as
>> well, since that's where we get the path elements from. And that does
>> seem to be the case from looking at dcache.c
>>
>> It's certainly nicer than having two calls to synchronize_rcu() in
>> quick succession in cgroup_diput().
>
> How about something like this?
>

This solves half of the problem.

This avoids accessing a destroyed cgroup, but it still races with
cgroup_create(). You suggested calling cgroup_create_dir() before calling the
subsystems' create() method.

> Paul
>
> include/linux/cgroup.h | 9 +++++++++
> kernel/cgroup.c | 49 ++++++++++++++++++++++++++++---------------------
> 2 files changed, 37 insertions(+), 21 deletions(-)
>

2008-12-16 18:35:26

by Paul Menage

Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

On Tue, Dec 16, 2008 at 4:55 AM, Li Zefan <[email protected]> wrote:
>
> This avoids accessing a destroyed cgroup, but it still races with
> cgroup_create(). You suggested calling cgroup_create_dir() before calling
> the subsystems' create() method.
>

I think that if we need a fix for 2.6.28 I'll just make cgroup_path()
check for a NULL dentry in the passed cgroup, rather than shuffling
any code around. That combined with a simple RCU free in
cgroup_diput() should do the trick - I think I'll leave the existing
synchronize_rcu() at the beginning of cgroup_diput() in case we need
it for more reasons than I originally mentioned in the comment there.

Paul

2008-12-19 04:37:25

by Balbir Singh

Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

* Peter Zijlstra <[email protected]> [2008-12-14 13:48:10]:

> On Sun, 2008-12-14 at 10:54 +0800, Li Zefan wrote:
> > > i merged it up in tip/master, could you please check whether it's ok?
> > >
> >
> > Sorry, though this patch avoids accessing a half-created cgroup, I found
> > the current code may access a cgroup which has been destroyed.
> >
> > The simplest fix is to take cgroup_lock() before for_each_leaf_cfs_rq.
> >
> > Could you revert this patch and apply the following new one? My box has
> > survived for 16 hours with it applied.
> >
> > ==========
> >
> > From: Li Zefan <[email protected]>
> > Date: Sun, 14 Dec 2008 09:53:28 +0800
> > Subject: [PATCH] sched: fix another race when reading /proc/sched_debug
> >
> > I fixed an oops with the following commit:
> >
> > | commit 24eb089950ce44603b30a3145a2c8520e2b55bb1
> > | Author: Li Zefan <[email protected]>
> > | Date: Thu Nov 6 12:53:32 2008 -0800
> > |
> > | cgroups: fix invalid cgrp->dentry before cgroup has been completely removed
> > |
> > | This fixes an oops when reading /proc/sched_debug.
> >
> > The above commit fixed a race that reading /proc/sched_debug may access
> > NULL cgrp->dentry if a cgroup is being removed (via cgroup_rmdir), but
> > hasn't been destroyed (via cgroup_diput).
> >
> > But I found there's another different race, in that reading sched_debug
> > may access a cgroup which is being created or has been destroyed, and thus
> > dereference NULL cgrp->dentry!
> >
> > task_group is added to the global list while the cgroup is being created,
> > and is removed from the global list while the cgroup is under destruction.
> > So running through the list should be protected by cgroup_lock(), if
> > cgroup data will be accessed (here by calling cgroup_path).
>
> Can't we detect a dead task-group and skip those instead of adding this
> global lock?
>

Now we can, there is a css_is_removed() function.

--
Balbir

2008-12-19 14:07:00

by Paul Menage

Subject: Re: [PATCH] sched: fix another race when reading /proc/sched_debug

On Thu, Dec 18, 2008 at 8:37 PM, Balbir Singh <[email protected]> wrote:
>>
>> Can't we detect a dead task-group and skip those instead of adding this
>> global lock?
>>
>
> Now we can, there is a css_is_removed() function.
>

With the patch that makes cgroup_path() RCU-safe, you shouldn't need
to do any additional skipping or locking.

Paul