2012-02-08 02:37:40

by Frederic Weisbecker

Subject: [RFC][PATCH 0/2] cgroup: Fix some races against css_set task links

Hi,

This series tries to fix some races in the way we link tasks to
their css_set. I hope some people can have a look at this, especially
as I'm definitely not an SMP ordering expert.

To give you the big picture: as long as nobody has ever called
cgroup_iter_start(), we don't link the tasks to their css_set (this
'link' is a simple list_add()).

But once somebody calls cgroup_iter_start(), we call
cgroup_enable_task_cg_lists(), which roughly does this:

cgroup_enable_task_cg_lists() {
        use_task_set_css_links = 1;
        do_each_thread(g, p) {
                link p to css_set
        } while_each_thread(g, p);
}

But this links only existing tasks; we also need to link all the tasks
created later, which is what cgroup_post_fork() does:

cgroup_post_fork() {
        if (use_task_set_css_links)
                link p to css_set
}
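
In both snippets, "link p to css_set" stands for roughly the following
(sketched from the kernel/cgroup.c code shown in the patches below):

        task_lock(p);
        if (list_empty(&p->cg_list))
                list_add(&p->cg_list, &p->cgroups->tasks);
        task_unlock(p);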

So we have some races here:

- cgroup_enable_task_cg_lists() iterates over the tasklist
without protection. The comments claim we are using RCU,
but we don't. And in fact RCU doesn't yet protect against
while_each_thread().

- Moreover, even with RCU there is a risk that we iterate the tasklist
without immediately seeing the latest updates. For example, if a task
forks and passes cgroup_post_fork() while use_task_set_css_links == 0,
then another CPU calling cgroup_enable_task_cg_lists() can miss the new
child while walking the tasklist with RCU, as the child doesn't appear
immediately.

- There is no ordering constraint on reads/writes of
use_task_set_css_links against tasklist traversal and modification.
cgroup_post_fork() may see a stale value.
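
To make the last point concrete, here is a simplified sketch of the two
unsynchronized paths (reduced to the essentials; link_task_to_css_set()
is a made-up name for the list_add() sketched above):

        int use_task_set_css_links;     /* plain int, no ordering imposed */

        void cgroup_enable_task_cg_lists(void)
        {
                use_task_set_css_links = 1;     /* plain store */
                /* ... walk the tasklist and link every task ... */
        }

        void cgroup_post_fork(struct task_struct *child)
        {
                /*
                 * Plain load: nothing orders it against the walk above,
                 * so it may still read 0 after the walk has started. If
                 * the walk also missed the child, nobody links it.
                 */
                if (use_task_set_css_links)
                        link_task_to_css_set(child);
        }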

The second patch in the series proposes a fix for the three points
above. Tell me what you think.

Thanks.

Frederic Weisbecker (2):
      cgroup: Remove wrong comment on cgroup_enable_task_cg_list()
      cgroup: Walk task list under tasklist_lock in
        cgroup_enable_task_cg_list

kernel/cgroup.c | 23 ++++++++++++++++++++---
1 files changed, 20 insertions(+), 3 deletions(-)

--
1.7.5.4


2012-02-08 02:37:44

by Frederic Weisbecker

Subject: [PATCH 1/2] cgroup: Remove wrong comment on cgroup_enable_task_cg_list()

Remove the stale comment about RCU protection. Many callers
(all of them?) of cgroup_enable_task_cg_lists() don't seem
to be in an RCU read-side critical section. Besides, RCU does
not help protect against while_each_thread().

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Mandeep Singh Baines <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Paul E. McKenney <[email protected]>
---
kernel/cgroup.c | 3 ---
1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 865d89a..6e4eb43 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2701,9 +2701,6 @@ static void cgroup_advance_iter(struct cgroup *cgrp,
  * using their cgroups capability, we don't maintain the lists running
  * through each css_set to its tasks until we see the list actually
  * used - in other words after the first call to cgroup_iter_start().
- *
- * The tasklist_lock is not held here, as do_each_thread() and
- * while_each_thread() are protected by RCU.
  */
 static void cgroup_enable_task_cg_lists(void)
 {
--
1.7.5.4

2012-02-08 02:37:47

by Frederic Weisbecker

Subject: [PATCH 2/2] cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list

Walking through the tasklist in cgroup_enable_task_cg_lists() inside
an RCU read-side critical section is not enough because:

- RCU is not (yet) safe against while_each_thread()

- If we use only RCU, a forking task that has passed cgroup_post_fork()
  without seeing use_task_css_set_links == 1 is not guaranteed to have
  its child immediately visible in the tasklist if we walk through it
  remotely with RCU. In this case the child will be missing from its
  css_set's task list.

Thus we need to traverse the list (unfortunately) under the
tasklist_lock. This makes us safe against while_each_thread() and also
makes sure we see all forked tasks that have been added to the tasklist.

As a secondary effect, reads and writes of use_task_css_set_links are
now well ordered against tasklist traversal and modification. The new
layout is:

CPU 0                                    CPU 1

use_task_css_set_links = 1               write_lock(tasklist_lock)
read_lock(tasklist_lock)                 add task to tasklist
do_each_thread() {                       write_unlock(tasklist_lock)
        add thread to css set links      if (use_task_css_set_links)
} while_each_thread()                            add thread to css set links
read_unlock(tasklist_lock)

If CPU 0 traverses the list after the task has been added to the
tasklist, the task is correctly added to the css set links. OTOH, if
CPU 0 traverses the tasklist before the new task had the opportunity to
be added to it because it was too early in the fork process, then CPU 1
catches up and adds the task to the css set links right after adding it
to the tasklist. The right value of use_task_css_set_links is guaranteed
to be visible from CPU 1 due to the implicit barrier properties of
LOCK/UNLOCK: the read_unlock() on CPU 0 commits the write to
use_task_css_set_links, and the write_lock() on CPU 1 ensures the read
of use_task_css_set_links that comes afterward returns the correct value.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Mandeep Singh Baines <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Paul E. McKenney <[email protected]>
---
kernel/cgroup.c | 20 ++++++++++++++++++++
1 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 6e4eb43..c6877fe 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2707,6 +2707,14 @@ static void cgroup_enable_task_cg_lists(void)
         struct task_struct *p, *g;
         write_lock(&css_set_lock);
         use_task_css_set_links = 1;
+        /*
+         * We need tasklist_lock because RCU is not safe against
+         * while_each_thread(). Besides, a forking task that has passed
+         * cgroup_post_fork() without seeing use_task_css_set_links = 1
+         * is not guaranteed to have its child immediately visible in the
+         * tasklist if we walk through it with RCU.
+         */
+        read_lock(&tasklist_lock);
         do_each_thread(g, p) {
                 task_lock(p);
                 /*
@@ -2718,6 +2726,7 @@ static void cgroup_enable_task_cg_lists(void)
                 list_add(&p->cg_list, &p->cgroups->tasks);
                 task_unlock(p);
         } while_each_thread(g, p);
+        read_unlock(&tasklist_lock);
         write_unlock(&css_set_lock);
 }

@@ -4522,6 +4531,17 @@ void cgroup_fork_callbacks(struct task_struct *child)
  */
 void cgroup_post_fork(struct task_struct *child)
 {
+        /*
+         * use_task_css_set_links is set to 1 before we walk the tasklist
+         * under the tasklist_lock and we read it here after we added the child
+         * to the tasklist under the tasklist_lock as well. If the child wasn't
+         * yet in the tasklist when we walked through it from
+         * cgroup_enable_task_cg_lists(), then use_task_css_set_links value
+         * should be visible now due to the paired locking and barriers implied
+         * by LOCK/UNLOCK: it is written before the tasklist_lock unlock
+         * in cgroup_enable_task_cg_lists() and read here after the tasklist_lock
+         * lock on fork.
+         */
         if (use_task_css_set_links) {
                 write_lock(&css_set_lock);
                 if (list_empty(&child->cg_list)) {
--
1.7.5.4
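
For reference, here is roughly how cgroup_enable_task_cg_lists() reads
with this patch applied. It is assembled from the hunks and context
above; the PF_EXITING test falls in the lines elided between the two
hunks and is reconstructed from the kernel tree of that era, so treat
this as a sketch rather than an authoritative copy:

        static void cgroup_enable_task_cg_lists(void)
        {
                struct task_struct *p, *g;

                write_lock(&css_set_lock);
                use_task_css_set_links = 1;
                /*
                 * See the comment added by the first hunk: tasklist_lock
                 * protects the walk against while_each_thread() and makes
                 * freshly forked children visible.
                 */
                read_lock(&tasklist_lock);
                do_each_thread(g, p) {
                        task_lock(p);
                        /*
                         * Don't link half-exited tasks: cgroup_exit()
                         * would not unlink them afterwards.
                         */
                        if (!(p->flags & PF_EXITING) && list_empty(&p->cg_list))
                                list_add(&p->cg_list, &p->cgroups->tasks);
                        task_unlock(p);
                } while_each_thread(g, p);
                read_unlock(&tasklist_lock);
                write_unlock(&css_set_lock);
        }

Note the resulting nesting: the tasklist_lock read-lock is taken inside
css_set_lock. New nesting like this is exactly the kind of lock
inversion surprise that motivates the "queue it for 3.4" caution voiced
later in the thread.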

2012-02-17 05:38:38

by Li Zefan

Subject: Re: [RFC][PATCH 0/2] cgroup: Fix some races against css_set task links

Sorry for the delayed reply. I had been off for quite a few days.

Frederic Weisbecker wrote:
> Hi,
>
> This series tries to fix some races in the way we link tasks to
> their css_set. I hope some people can have a look at this, especially
> as I'm definitely not an SMP ordering expert.

me neither.

>
> To give you the big picture: as long as nobody has ever called
> cgroup_iter_start(), we don't link the tasks to their css_set (this
> 'link' is a simple list_add()).
>
> But once somebody calls cgroup_iter_start(), we call
> cgroup_enable_task_cg_lists(), which roughly does this:
>
> cgroup_enable_task_cg_lists() {
>         use_task_set_css_links = 1;
>         do_each_thread(g, p) {
>                 link p to css_set
>         } while_each_thread(g, p);
> }
>
> But this links only existing tasks; we also need to link all the tasks
> created later, which is what cgroup_post_fork() does:
>
> cgroup_post_fork() {
>         if (use_task_set_css_links)
>                 link p to css_set
> }
>
> So we have some races here:
>
> - cgroup_enable_task_cg_lists() iterates over the tasklist
> without protection. The comments claim we are using RCU,
> but we don't. And in fact RCU doesn't yet protect against
> while_each_thread().
>
> - Moreover, even with RCU there is a risk that we iterate the tasklist
> without immediately seeing the latest updates. For example, if a task
> forks and passes cgroup_post_fork() while use_task_set_css_links == 0,
> then another CPU calling cgroup_enable_task_cg_lists() can miss the new
> child while walking the tasklist with RCU, as the child doesn't appear
> immediately.
>
> - There is no ordering constraint on reads/writes of
> use_task_set_css_links against tasklist traversal and modification.
> cgroup_post_fork() may see a stale value.
>
> The second patch in the series proposes a fix for the three points
> above. Tell me what you think.
>

The patch looks good to me.

As cgroup_enable_task_cg_lists() is a one-off function that won't be
called more than once, there's little chance the race can happen in
reality, so it should be OK to queue this for 3.4.

> Thanks.
>
> Frederic Weisbecker (2):
>       cgroup: Remove wrong comment on cgroup_enable_task_cg_list()
>       cgroup: Walk task list under tasklist_lock in
>         cgroup_enable_task_cg_list
>

for both patches

Acked-by: Li Zefan <[email protected]>

> kernel/cgroup.c | 23 ++++++++++++++++++++---
> 1 files changed, 20 insertions(+), 3 deletions(-)
>

2012-02-21 17:16:27

by Tejun Heo

Subject: Re: [RFC][PATCH 0/2] cgroup: Fix some races against css_set task links

Hello, Frederic, Li.

Sorry about the delay. I was (and still am) swamped with block cgroup
stuff.

As the patch is low-risk and fixes a race condition, I applied it to
for-3.3-fixes. I will push to Linus in a week or so.

Thanks.

--
tejun

2012-02-21 17:42:09

by Frederic Weisbecker

Subject: Re: [RFC][PATCH 0/2] cgroup: Fix some races against css_set task links

On Tue, Feb 21, 2012 at 09:16:21AM -0800, Tejun Heo wrote:
> Hello, Frederic, Li.
>
> Sorry about the delay. I was (and still am) swamped with block cgroup
> stuff.
>
> As the patch is low-risk and fixes a race condition, I applied it to
> for-3.3-fixes. I will push to Linus in a week or so.

IMHO, since this doesn't really fix a regression (I guess), I suspect
this should go into 3.4 to avoid any surprise like a missed lock
inversion.

But I'll let you choose.

>
> Thanks.
>
> --
> tejun

2012-02-21 17:46:15

by Tejun Heo

Subject: Re: [RFC][PATCH 0/2] cgroup: Fix some races against css_set task links

On Tue, Feb 21, 2012 at 06:42:03PM +0100, Frederic Weisbecker wrote:
> On Tue, Feb 21, 2012 at 09:16:21AM -0800, Tejun Heo wrote:
> > Hello, Frederic, Li.
> >
> > Sorry about the delay. I was (and still am) swamped with block cgroup
> > stuff.
> >
> > As the patch is low-risk and fixes a race condition, I applied it to
> > for-3.3-fixes. I will push to Linus in a week or so.
>
> IMHO, since this doesn't really fix a regression (I guess), I suspect
> this should go into 3.4 to avoid any surprise like a missed lock
> inversion.
>
> But I'll let you choose.

Heh, okay, moving them to for-3.4.

Thanks.

--
tejun

2012-02-21 22:23:59

by Mandeep Singh Baines

Subject: Re: [PATCH 2/2] cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list

Frederic Weisbecker ([email protected]) wrote:
> Walking through the tasklist in cgroup_enable_task_cg_lists() inside
> an RCU read-side critical section is not enough because:
>
> - RCU is not (yet) safe against while_each_thread()
>
> - If we use only RCU, a forking task that has passed cgroup_post_fork()
>   without seeing use_task_css_set_links == 1 is not guaranteed to have
>   its child immediately visible in the tasklist if we walk through it
>   remotely with RCU. In this case the child will be missing from its
>   css_set's task list.
>
> Thus we need to traverse the list (unfortunately) under the
> tasklist_lock. This makes us safe against while_each_thread() and also
> makes sure we see all forked tasks that have been added to the tasklist.
>
> As a secondary effect, reads and writes of use_task_css_set_links are
> now well ordered against tasklist traversal and modification. The new
> layout is:
>
> CPU 0                                    CPU 1
>
> use_task_css_set_links = 1               write_lock(tasklist_lock)
> read_lock(tasklist_lock)                 add task to tasklist
> do_each_thread() {                       write_unlock(tasklist_lock)
>         add thread to css set links      if (use_task_css_set_links)
> } while_each_thread()                            add thread to css set links
> read_unlock(tasklist_lock)
>
> If CPU 0 traverses the list after the task has been added to the
> tasklist, the task is correctly added to the css set links. OTOH, if
> CPU 0 traverses the tasklist before the new task had the opportunity to
> be added to it because it was too early in the fork process, then CPU 1
> catches up and adds the task to the css set links right after adding it
> to the tasklist. The right value of use_task_css_set_links is guaranteed
> to be visible from CPU 1 due to the implicit barrier properties of
> LOCK/UNLOCK: the read_unlock() on CPU 0 commits the write to
> use_task_css_set_links, and the write_lock() on CPU 1 ensures the read
> of use_task_css_set_links that comes afterward returns the correct value.
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Li Zefan <[email protected]>
> Cc: Mandeep Singh Baines <[email protected]>

Reviewed-by: Mandeep Singh Baines <[email protected]>

Sorry for being late. My feedback is really just comments.

> Cc: Oleg Nesterov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> ---
> kernel/cgroup.c | 20 ++++++++++++++++++++
> 1 files changed, 20 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 6e4eb43..c6877fe 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2707,6 +2707,14 @@ static void cgroup_enable_task_cg_lists(void)
> struct task_struct *p, *g;
> write_lock(&css_set_lock);

You might want to re-test use_task_css_set_links once you have the lock
in order to avoid an unnecessary do_each_thread()/while_each_thread() in
case you race between reading the value and entering the loop. This is
a potential optimization in a rare case so maybe not worth the LOC.
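
A sketch of what that re-test might look like (hypothetical, not part
of the applied patch):

        static void cgroup_enable_task_cg_lists(void)
        {
                struct task_struct *p, *g;

                write_lock(&css_set_lock);
                /* Someone else may have done the walk while we waited. */
                if (use_task_css_set_links) {
                        write_unlock(&css_set_lock);
                        return;
                }
                use_task_css_set_links = 1;
                /* ... tasklist walk as in the patch ... */
                write_unlock(&css_set_lock);
        }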

> use_task_css_set_links = 1;
> + /*
> + * We need tasklist_lock because RCU is not safe against
> + * while_each_thread(). Besides, a forking task that has passed
> + * cgroup_post_fork() without seeing use_task_css_set_links = 1
> + * is not guaranteed to have its child immediately visible in the
> + * tasklist if we walk through it with RCU.
> + */

Maybe add a TODO to remove the lock once do_each_thread()/while_each_thread()
is made RCU-safe. On a large system, it could take a while to iterate
over every thread in the system. That's a long time to hold a spinlock.
But it only happens once, so probably not that big a deal.

> + read_lock(&tasklist_lock);
> do_each_thread(g, p) {
> task_lock(p);
> /*
> @@ -2718,6 +2726,7 @@ static void cgroup_enable_task_cg_lists(void)
> list_add(&p->cg_list, &p->cgroups->tasks);
> task_unlock(p);
> } while_each_thread(g, p);
> + read_unlock(&tasklist_lock);
> write_unlock(&css_set_lock);
> }
>
> @@ -4522,6 +4531,17 @@ void cgroup_fork_callbacks(struct task_struct *child)
> */
> void cgroup_post_fork(struct task_struct *child)
> {
> + /*
> + * use_task_css_set_links is set to 1 before we walk the tasklist
> + * under the tasklist_lock and we read it here after we added the child
> + * to the tasklist under the tasklist_lock as well. If the child wasn't
> + * yet in the tasklist when we walked through it from
> + * cgroup_enable_task_cg_lists(), then use_task_css_set_links value
> + * should be visible now due to the paired locking and barriers implied
> + * by LOCK/UNLOCK: it is written before the tasklist_lock unlock
> + * in cgroup_enable_task_cg_lists() and read here after the tasklist_lock
> + * lock on fork.
> + */
> if (use_task_css_set_links) {
> write_lock(&css_set_lock);
> if (list_empty(&child->cg_list)) {
> --
> 1.7.5.4
>

2012-02-22 00:55:35

by Frederic Weisbecker

Subject: Re: [PATCH 2/2] cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list

On Tue, Feb 21, 2012 at 02:23:43PM -0800, Mandeep Singh Baines wrote:
> Frederic Weisbecker ([email protected]) wrote:
> > Walking through the tasklist in cgroup_enable_task_cg_lists() inside
> > an RCU read-side critical section is not enough because:
> >
> > - RCU is not (yet) safe against while_each_thread()
> >
> > - If we use only RCU, a forking task that has passed cgroup_post_fork()
> >   without seeing use_task_css_set_links == 1 is not guaranteed to have
> >   its child immediately visible in the tasklist if we walk through it
> >   remotely with RCU. In this case the child will be missing from its
> >   css_set's task list.
> >
> > Thus we need to traverse the list (unfortunately) under the
> > tasklist_lock. This makes us safe against while_each_thread() and also
> > makes sure we see all forked tasks that have been added to the tasklist.
> >
> > As a secondary effect, reads and writes of use_task_css_set_links are
> > now well ordered against tasklist traversal and modification. The new
> > layout is:
> >
> > CPU 0                                    CPU 1
> >
> > use_task_css_set_links = 1               write_lock(tasklist_lock)
> > read_lock(tasklist_lock)                 add task to tasklist
> > do_each_thread() {                       write_unlock(tasklist_lock)
> >         add thread to css set links      if (use_task_css_set_links)
> > } while_each_thread()                            add thread to css set links
> > read_unlock(tasklist_lock)
> >
> > If CPU 0 traverses the list after the task has been added to the
> > tasklist, the task is correctly added to the css set links. OTOH, if
> > CPU 0 traverses the tasklist before the new task had the opportunity to
> > be added to it because it was too early in the fork process, then CPU 1
> > catches up and adds the task to the css set links right after adding it
> > to the tasklist. The right value of use_task_css_set_links is guaranteed
> > to be visible from CPU 1 due to the implicit barrier properties of
> > LOCK/UNLOCK: the read_unlock() on CPU 0 commits the write to
> > use_task_css_set_links, and the write_lock() on CPU 1 ensures the read
> > of use_task_css_set_links that comes afterward returns the correct value.
> >
> > Signed-off-by: Frederic Weisbecker <[email protected]>
> > Cc: Tejun Heo <[email protected]>
> > Cc: Li Zefan <[email protected]>
> > Cc: Mandeep Singh Baines <[email protected]>
>
> Reviewed-by: Mandeep Singh Baines <[email protected]>
>
> Sorry for being late. My feedback is really just comments.
>
> > Cc: Oleg Nesterov <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Paul E. McKenney <[email protected]>
> > ---
> > kernel/cgroup.c | 20 ++++++++++++++++++++
> > 1 files changed, 20 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > index 6e4eb43..c6877fe 100644
> > --- a/kernel/cgroup.c
> > +++ b/kernel/cgroup.c
> > @@ -2707,6 +2707,14 @@ static void cgroup_enable_task_cg_lists(void)
> > struct task_struct *p, *g;
> > write_lock(&css_set_lock);
>
> You might want to re-test use_task_css_set_links once you have the lock
> in order to avoid an unnecessary do_each_thread()/while_each_thread() in
> case you race between reading the value and entering the loop. This is
> a potential optimization in a rare case so maybe not worth the LOC.

Makes sense. I'll do that in a separate patch.

>
> > use_task_css_set_links = 1;
> > + /*
> > + * We need tasklist_lock because RCU is not safe against
> > + * while_each_thread(). Besides, a forking task that has passed
> > + * cgroup_post_fork() without seeing use_task_css_set_links = 1
> > + * is not guaranteed to have its child immediately visible in the
> > + * tasklist if we walk through it with RCU.
> > + */
>
> Maybe add a TODO to remove the lock once do_each_thread()/while_each_thread()
> is made RCU-safe. On a large system, it could take a while to iterate
> over every thread in the system. That's a long time to hold a spinlock.
> But it only happens once, so probably not that big a deal.

I think that even if while_each_thread() were RCU-safe, that wouldn't
work here.

Unless I'm mistaken, we have no guarantee that a remote list_add_rcu()
is immediately visible to the local CPU if it walks the list under
rcu_read_lock() only.

Consider this ordering scenario:

CPU 0                                    CPU 1
---------------                          --------------

fork() {
        write_lock(tasklist_lock);
        add child to tasklist
        write_unlock(tasklist_lock);
        cgroup_post_fork()
}
                                         cgroup_enable_task_cg_lists() {
                                                 rcu_read_lock();
                                                 do_each_thread() {
                                                         ..... <-- find child ?
                                                 } while_each_thread()
                                                 rcu_read_unlock()

We have no guarantee here that the write on CPU 0 will be visible
in time to CPU 1.

But maybe I misunderstood the ordering and committing guarantees of RCU.
Perhaps Paul can confirm or correct me.

Paul?

2012-02-22 01:00:21

by Tejun Heo

Subject: Re: [PATCH 2/2] cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list

On Wed, Feb 22, 2012 at 01:55:28AM +0100, Frederic Weisbecker wrote:
> > > + * We need tasklist_lock because RCU is not safe against
> > > + * while_each_thread(). Besides, a forking task that has passed
> > > + * cgroup_post_fork() without seeing use_task_css_set_links = 1
> > > + * is not guaranteed to have its child immediately visible in the
> > > + * tasklist if we walk through it with RCU.
> > > + */
> >
> > Maybe add a TODO to remove the lock once do_each_thread()/while_each_thread()
> > is made RCU-safe. On a large system, it could take a while to iterate
> > over every thread in the system. That's a long time to hold a spinlock.
> > But it only happens once, so probably not that big a deal.
>
> I think that even if while_each_thread() were RCU-safe, that wouldn't
> work here.

Guys, this is a one-time thing. It happens *once* after boot and we're
already holding an exclusive lock. There's no reason to optimize this at
all. Let's just keep it simple.

Thanks.

--
tejun

2012-02-22 01:04:40

by Frederic Weisbecker

Subject: Re: [PATCH 2/2] cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list

On Tue, Feb 21, 2012 at 05:00:15PM -0800, Tejun Heo wrote:
> On Wed, Feb 22, 2012 at 01:55:28AM +0100, Frederic Weisbecker wrote:
> > > > + * We need tasklist_lock because RCU is not safe against
> > > > + * while_each_thread(). Besides, a forking task that has passed
> > > > + * cgroup_post_fork() without seeing use_task_css_set_links = 1
> > > > + * is not guaranteed to have its child immediately visible in the
> > > > + * tasklist if we walk through it with RCU.
> > > > + */
> > >
> > > Maybe add a TODO to remove the lock once do_each_thread()/while_each_thread()
> > > is made RCU-safe. On a large system, it could take a while to iterate
> > > over every thread in the system. That's a long time to hold a spinlock.
> > > But it only happens once, so probably not that big a deal.
> >
> > I think that even if while_each_thread() were RCU-safe, that wouldn't
> > work here.
>
> Guys, this is a one-time thing. It happens *once* after boot and we're
> already holding an exclusive lock. There's no reason to optimize this at
> all. Let's just keep it simple.

For now I agree. But one day the real-time guys might eye that thing.

2012-02-22 01:06:29

by Tejun Heo

Subject: Re: [PATCH 2/2] cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list

On Wed, Feb 22, 2012 at 02:04:34AM +0100, Frederic Weisbecker wrote:
> > Guys, this is a one-time thing. It happens *once* after boot and we're
> > already holding an exclusive lock. There's no reason to optimize this at
> > all. Let's just keep it simple.
>
> For now I agree. But one day the real-time guys might eye that thing.

If the latency issue ever comes up, I think the right thing to do is
just disable the lazy optimization, i.e. just chain tasks from the
beginning.
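
Roughly, that would make cgroup_post_fork() link the child
unconditionally, with the lazy flag and the one-off walk gone entirely.
A hypothetical sketch, not code proposed in this thread:

        void cgroup_post_fork(struct task_struct *child)
        {
                /* No lazy flag: link every child at fork. */
                write_lock(&css_set_lock);
                if (list_empty(&child->cg_list))
                        list_add(&child->cg_list, &child->cgroups->tasks);
                write_unlock(&css_set_lock);
        }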

Thanks.

--
tejun

2012-02-22 01:10:54

by Frederic Weisbecker

Subject: Re: [PATCH 2/2] cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list

On Tue, Feb 21, 2012 at 05:06:23PM -0800, Tejun Heo wrote:
> On Wed, Feb 22, 2012 at 02:04:34AM +0100, Frederic Weisbecker wrote:
> > > Guys, this is a one-time thing. It happens *once* after boot and we're
> > > already holding an exclusive lock. There's no reason to optimize this at
> > > all. Let's just keep it simple.
> >
> > For now I agree. But one day the real-time guys might eye that thing.
>
> If the latency issue ever comes up, I think the right thing to do is
> just disable the lazy optimization, i.e. just chain tasks from the
> beginning.

Yeah agreed.

2012-02-22 01:19:41

by Paul E. McKenney

Subject: Re: [PATCH 2/2] cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list

On Wed, Feb 22, 2012 at 01:55:28AM +0100, Frederic Weisbecker wrote:
> On Tue, Feb 21, 2012 at 02:23:43PM -0800, Mandeep Singh Baines wrote:
> > Frederic Weisbecker ([email protected]) wrote:
> > > Walking through the tasklist in cgroup_enable_task_cg_lists() inside
> > > an RCU read-side critical section is not enough because:
> > >
> > > - RCU is not (yet) safe against while_each_thread()
> > >
> > > - If we use only RCU, a forking task that has passed cgroup_post_fork()
> > >   without seeing use_task_css_set_links == 1 is not guaranteed to have
> > >   its child immediately visible in the tasklist if we walk through it
> > >   remotely with RCU. In this case the child will be missing from its
> > >   css_set's task list.
> > >
> > > Thus we need to traverse the list (unfortunately) under the
> > > tasklist_lock. This makes us safe against while_each_thread() and also
> > > makes sure we see all forked tasks that have been added to the tasklist.
> > >
> > > As a secondary effect, reads and writes of use_task_css_set_links are
> > > now well ordered against tasklist traversal and modification. The new
> > > layout is:
> > >
> > > CPU 0                                    CPU 1
> > >
> > > use_task_css_set_links = 1               write_lock(tasklist_lock)
> > > read_lock(tasklist_lock)                 add task to tasklist
> > > do_each_thread() {                       write_unlock(tasklist_lock)
> > >         add thread to css set links      if (use_task_css_set_links)
> > > } while_each_thread()                            add thread to css set links
> > > read_unlock(tasklist_lock)
> > >
> > > If CPU 0 traverses the list after the task has been added to the
> > > tasklist, the task is correctly added to the css set links. OTOH, if
> > > CPU 0 traverses the tasklist before the new task had the opportunity to
> > > be added to it because it was too early in the fork process, then CPU 1
> > > catches up and adds the task to the css set links right after adding it
> > > to the tasklist. The right value of use_task_css_set_links is guaranteed
> > > to be visible from CPU 1 due to the implicit barrier properties of
> > > LOCK/UNLOCK: the read_unlock() on CPU 0 commits the write to
> > > use_task_css_set_links, and the write_lock() on CPU 1 ensures the read
> > > of use_task_css_set_links that comes afterward returns the correct value.
> > >
> > > Signed-off-by: Frederic Weisbecker <[email protected]>
> > > Cc: Tejun Heo <[email protected]>
> > > Cc: Li Zefan <[email protected]>
> > > Cc: Mandeep Singh Baines <[email protected]>
> >
> > Reviewed-by: Mandeep Singh Baines <[email protected]>
> >
> > Sorry for being late. My feedback is really just comments.
> >
> > > Cc: Oleg Nesterov <[email protected]>
> > > Cc: Andrew Morton <[email protected]>
> > > Cc: Paul E. McKenney <[email protected]>
> > > ---
> > > kernel/cgroup.c | 20 ++++++++++++++++++++
> > > 1 files changed, 20 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > > index 6e4eb43..c6877fe 100644
> > > --- a/kernel/cgroup.c
> > > +++ b/kernel/cgroup.c
> > > @@ -2707,6 +2707,14 @@ static void cgroup_enable_task_cg_lists(void)
> > > struct task_struct *p, *g;
> > > write_lock(&css_set_lock);
> >
> > You might want to re-test use_task_css_set_links once you have the lock
> > in order to avoid an unnecessary do_each_thread()/while_each_thread() in
> > case you race between reading the value and entering the loop. This is
> > a potential optimization in a rare case so maybe not worth the LOC.
>
> Makes sense. I'll do that in a separate patch.
>
> >
> > > use_task_css_set_links = 1;
> > > + /*
> > > + * We need tasklist_lock because RCU is not safe against
> > > + * while_each_thread(). Besides, a forking task that has passed
> > > + * cgroup_post_fork() without seeing use_task_css_set_links = 1
> > > + * is not guaranteed to have its child immediately visible in the
> > > + * tasklist if we walk through it with RCU.
> > > + */
> >
> > > Maybe add a TODO to remove the lock once do_each_thread()/while_each_thread()
> > > is made RCU-safe. On a large system, it could take a while to iterate
> > > over every thread in the system. That's a long time to hold a spinlock.
> > > But it only happens once, so probably not that big a deal.
>
> I think that even if while_each_thread() were RCU-safe, that wouldn't
> work here.
>
> Unless I'm mistaken, we have no guarantee that a remote list_add_rcu()
> is immediately visible to the local CPU if it walks the list under
> rcu_read_lock() only.

Indeed, the guarantee is instead that -if- a reader encounters a newly
added list element, then that reader will see any initialization of that
list element carried out prior to the list_add_rcu().

Memory barriers are about ordering, not about making memory writes
visible faster.

Thanx, Paul

> Consider this ordering scenario:
>
> CPU 0                                    CPU 1
> ---------------                          --------------
>
> fork() {
>         write_lock(tasklist_lock);
>         add child to tasklist
>         write_unlock(tasklist_lock);
>         cgroup_post_fork()
> }
>                                          cgroup_enable_task_cg_lists() {
>                                                  rcu_read_lock();
>                                                  do_each_thread() {
>                                                          ..... <-- find child ?
>                                                  } while_each_thread()
>                                                  rcu_read_unlock()
>
> We have no guarantee here that the write on CPU 0 will be visible
> in time to CPU 1.
>
> But maybe I misunderstood the ordering and committing guarantees of RCU.
> Perhaps Paul can confirm or correct me.
>
> Paul?
>

2012-02-22 01:33:48

by Frederic Weisbecker

Subject: Re: [PATCH 2/2] cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list

On Tue, Feb 21, 2012 at 05:19:34PM -0800, Paul E. McKenney wrote:
> On Wed, Feb 22, 2012 at 01:55:28AM +0100, Frederic Weisbecker wrote:
> > On Tue, Feb 21, 2012 at 02:23:43PM -0800, Mandeep Singh Baines wrote:
> > > Frederic Weisbecker ([email protected]) wrote:
> > > > Walking through the tasklist in cgroup_enable_task_cg_lists() inside
> > > > an RCU read-side critical section is not enough because:
> > > >
> > > > - RCU is not (yet) safe against while_each_thread()
> > > >
> > > > - If we use only RCU, a forking task that has passed cgroup_post_fork()
> > > >   without seeing use_task_css_set_links == 1 is not guaranteed to have
> > > >   its child immediately visible in the tasklist if we walk through it
> > > >   remotely with RCU. In this case the child will be missing from its
> > > >   css_set's task list.
> > > >
> > > > Thus we need to traverse the list (unfortunately) under the
> > > > tasklist_lock. This makes us safe against while_each_thread() and also
> > > > makes sure we see all forked tasks that have been added to the tasklist.
> > > >
> > > > As a secondary effect, reads and writes of use_task_css_set_links are
> > > > now well ordered against tasklist traversal and modification. The new
> > > > layout is:
> > > >
> > > > CPU 0                                    CPU 1
> > > >
> > > > use_task_css_set_links = 1               write_lock(tasklist_lock)
> > > > read_lock(tasklist_lock)                 add task to tasklist
> > > > do_each_thread() {                       write_unlock(tasklist_lock)
> > > >         add thread to css set links      if (use_task_css_set_links)
> > > > } while_each_thread()                            add thread to css set links
> > > > read_unlock(tasklist_lock)
> > > >
> > > > If CPU 0 traverses the list after the task has been added to the
> > > > tasklist, the task is correctly added to the css set links. OTOH, if
> > > > CPU 0 traverses the tasklist before the new task had the opportunity to
> > > > be added to it because it was too early in the fork process, then CPU 1
> > > > catches up and adds the task to the css set links right after adding it
> > > > to the tasklist. The right value of use_task_css_set_links is guaranteed
> > > > to be visible from CPU 1 due to the implicit barrier properties of
> > > > LOCK/UNLOCK: the read_unlock() on CPU 0 commits the write to
> > > > use_task_css_set_links, and the write_lock() on CPU 1 ensures the read
> > > > of use_task_css_set_links that comes afterward returns the correct value.
> > > >
> > > > Signed-off-by: Frederic Weisbecker <[email protected]>
> > > > Cc: Tejun Heo <[email protected]>
> > > > Cc: Li Zefan <[email protected]>
> > > > Cc: Mandeep Singh Baines <[email protected]>
> > >
> > > Reviewed-by: Mandeep Singh Baines <[email protected]>
> > >
> > > Sorry for being late. My feedback is really just comments.
> > >
> > > > Cc: Oleg Nesterov <[email protected]>
> > > > Cc: Andrew Morton <[email protected]>
> > > > Cc: Paul E. McKenney <[email protected]>
> > > > ---
> > > > kernel/cgroup.c | 20 ++++++++++++++++++++
> > > > 1 files changed, 20 insertions(+), 0 deletions(-)
> > > >
> > > > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > > > index 6e4eb43..c6877fe 100644
> > > > --- a/kernel/cgroup.c
> > > > +++ b/kernel/cgroup.c
> > > > @@ -2707,6 +2707,14 @@ static void cgroup_enable_task_cg_lists(void)
> > > > struct task_struct *p, *g;
> > > > write_lock(&css_set_lock);
> > >
> > > You might want to re-test use_task_css_set_links once you have the lock
> > > in order to avoid an unnecessary do_each_thread()/while_each_thread() in
> > > case you race between reading the value and entering the loop. This is
> > > a potential optimization in a rare case so maybe not worth the LOC.
> >
> > Makes sense. I'll do that in a separate patch.
> >
> > >
> > > > use_task_css_set_links = 1;
> > > > + /*
> > > > + * We need tasklist_lock because RCU is not safe against
> > > > + * while_each_thread(). Besides, a forking task that has passed
> > > > + * cgroup_post_fork() without seeing use_task_css_set_links = 1
> > > > + * is not guaranteed to have its child immediately visible in the
> > > > + * tasklist if we walk through it with RCU.
> > > > + */
> > >
> > > Maybe add a TODO to remove the lock once do_each_thread()/while_each_thread()
> > > is made RCU-safe. On a large system, it could take a while to iterate
> > > over every thread in the system. That's a long time to hold a spinlock.
> > > But it only happens once, so probably not that big a deal.
> >
> > I think that even if while_each_thread() were RCU-safe, that wouldn't
> > work here.
> >
> > Unless I'm mistaken, we have no guarantee that a remote list_add_rcu()
> > is immediately visible to the local CPU if it walks the list under
> > rcu_read_lock() only.
>
> Indeed, the guarantee is instead that -if- a reader encounters a newly
> added list element, then that reader will see any initialization of that
> list element carried out prior to the list_add_rcu().
>
> Memory barriers are about ordering, not about making memory writes
> visible faster.
>
> Thanx, Paul

Cool, that confirms what I was thinking. Thanks for the clarification!
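
A minimal sketch of the distinction Paul draws, with a made-up struct
and list but the standard RCU list API:

        struct foo {
                struct list_head list;
                int data;
        };
        static LIST_HEAD(head);

        /* Writer */
        struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);
        p->data = 42;                   /* initialization ...             */
        list_add_rcu(&p->list, &head);  /* ... ordered before publication */

        /* Reader, possibly on another CPU */
        rcu_read_lock();
        list_for_each_entry_rcu(p, &head, list) {
                /*
                 * If we see p at all, p->data is guaranteed to be 42.
                 * Whether we see p yet is a timing question that RCU
                 * does not answer.
                 */
                pr_info("%d\n", p->data);
        }
        rcu_read_unlock();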

2012-02-22 18:10:23

by Oleg Nesterov

Subject: Re: [PATCH 2/2] cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list

Hi.

Sorry, currently I do not have the time to even read this discussion,
just one note,

On 02/22, Frederic Weisbecker wrote:
>
> On Tue, Feb 21, 2012 at 05:19:34PM -0800, Paul E. McKenney wrote:
> > On Wed, Feb 22, 2012 at 01:55:28AM +0100, Frederic Weisbecker wrote:
> > > On Tue, Feb 21, 2012 at 02:23:43PM -0800, Mandeep Singh Baines wrote:
> > > > Frederic Weisbecker ([email protected]) wrote:
> > > > > Walking through the tasklist in cgroup_enable_task_cg_lists() inside
> > > > > an RCU read-side critical section is not enough because:
> > > > >
> > > > > - RCU is not (yet) safe against while_each_thread()

Yes. Except I'd say it is while_each_thread() who is not safe ;)

Please do not take this into account. This should be fixed, I hope
to send the fix "soon".

Oleg.

2012-02-27 18:57:31

by Frederic Weisbecker

Subject: Re: [PATCH 2/2] cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list

On Wed, Feb 22, 2012 at 07:03:27PM +0100, Oleg Nesterov wrote:
> Hi.
>
> Sorry, currently I do not have the time to even read this discussion,
> just one note,
>
> On 02/22, Frederic Weisbecker wrote:
> >
> > On Tue, Feb 21, 2012 at 05:19:34PM -0800, Paul E. McKenney wrote:
> > > On Wed, Feb 22, 2012 at 01:55:28AM +0100, Frederic Weisbecker wrote:
> > > > On Tue, Feb 21, 2012 at 02:23:43PM -0800, Mandeep Singh Baines wrote:
> > > > > Frederic Weisbecker ([email protected]) wrote:
> > > > > > Walking through the tasklist in cgroup_enable_task_cg_lists() inside
> > > > > > an RCU read-side critical section is not enough because:
> > > > > >
> > > > > > - RCU is not (yet) safe against while_each_thread()
>
> Yes. Except I'd say it is while_each_thread() who is not safe ;)

Right.

>
> Please do not take this into account. This should be fixed, I hope
> to send the fix "soon".

Sure, but anyway we need the read_lock here for ordering/committing reasons.