2011-02-02 11:28:25

by Peter Zijlstra

Subject: Re: [PATCH 1/2] perf_events: add cgroup support (v8)

On Thu, 2011-01-20 at 15:39 +0100, Peter Zijlstra wrote:
> On Thu, 2011-01-20 at 15:30 +0200, Stephane Eranian wrote:
> > @@ -4259,8 +4261,20 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
> >
> > /* Reassign the task to the init_css_set. */
> > task_lock(tsk);
> > + /*
> > + * we mask interrupts to prevent:
> > + * - timer tick to cause event rotation which
> > + * could schedule back in cgroup events after
> > + * they were switched out by perf_cgroup_sched_out()
> > + *
> > + * - preemption which could schedule back in cgroup events
> > + */
> > + local_irq_save(flags);
> > + perf_cgroup_sched_out(tsk);
> > cg = tsk->cgroups;
> > tsk->cgroups = &init_css_set;
> > + perf_cgroup_sched_in(tsk);
> > + local_irq_restore(flags);
> > task_unlock(tsk);
> > if (cg)
> > put_css_set_taskexit(cg);
>
> So you too need a callback on cgroup change there.. Li, Paul, any chance
> we can fix this cgroup_subsys::exit callback? The scheduler code needs
> to do funny thing because its in the wrong place as well.

cgroup guys? Shall I just fix this exit thing, since the only users seem
to be the scheduler and now perf, for both of which it's unfortunate at
best?

Balbir, memcontrol.c uses pre_destroy(). I posit that using this method
is broken by definition, since it makes the cgroup-empty notification
void.



2011-02-02 11:50:24

by Balbir Singh

Subject: Re: [PATCH 1/2] perf_events: add cgroup support (v8)

* Peter Zijlstra <[email protected]> [2011-02-02 12:29:20]:

> On Thu, 2011-01-20 at 15:39 +0100, Peter Zijlstra wrote:
> > On Thu, 2011-01-20 at 15:30 +0200, Stephane Eranian wrote:
> > > @@ -4259,8 +4261,20 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
> > >
> > > /* Reassign the task to the init_css_set. */
> > > task_lock(tsk);
> > > + /*
> > > + * we mask interrupts to prevent:
> > > + * - timer tick to cause event rotation which
> > > + * could schedule back in cgroup events after
> > > + * they were switched out by perf_cgroup_sched_out()
> > > + *
> > > + * - preemption which could schedule back in cgroup events
> > > + */
> > > + local_irq_save(flags);
> > > + perf_cgroup_sched_out(tsk);
> > > cg = tsk->cgroups;
> > > tsk->cgroups = &init_css_set;
> > > + perf_cgroup_sched_in(tsk);
> > > + local_irq_restore(flags);
> > > task_unlock(tsk);
> > > if (cg)
> > > put_css_set_taskexit(cg);
> >
> > So you too need a callback on cgroup change there.. Li, Paul, any chance
> > we can fix this cgroup_subsys::exit callback? The scheduler code needs
> > to do funny thing because its in the wrong place as well.
>
> cgroup guys? Shall I just fix this exit thing since the only user seems
> to be the scheduler and now perf for both of which its unfortunate at
> > best?

Are you suggesting that the cgroup_exit on task_exit notification should be
pulled out?

>
> Balbir, memcontrol.c uses pre_destroy(), I pose that using this method
> is broken per definition since it makes the cgroup empty notification
> void.
>

We use pre_destroy() to reclaim, so that delete/rmdir() will be able
to clean up the node/group. I am not sure what you mean by it making
the empty notification void, or why pre_destroy() is broken.

--
Three Cheers,
Balbir

2011-02-02 12:45:38

by Peter Zijlstra

Subject: Re: [PATCH 1/2] perf_events: add cgroup support (v8)

On Wed, 2011-02-02 at 17:20 +0530, Balbir Singh wrote:
> * Peter Zijlstra <[email protected]> [2011-02-02 12:29:20]:
>
> > On Thu, 2011-01-20 at 15:39 +0100, Peter Zijlstra wrote:
> > > On Thu, 2011-01-20 at 15:30 +0200, Stephane Eranian wrote:
> > > > @@ -4259,8 +4261,20 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
> > > >
> > > > /* Reassign the task to the init_css_set. */
> > > > task_lock(tsk);
> > > > + /*
> > > > + * we mask interrupts to prevent:
> > > > + * - timer tick to cause event rotation which
> > > > + * could schedule back in cgroup events after
> > > > + * they were switched out by perf_cgroup_sched_out()
> > > > + *
> > > > + * - preemption which could schedule back in cgroup events
> > > > + */
> > > > + local_irq_save(flags);
> > > > + perf_cgroup_sched_out(tsk);
> > > > cg = tsk->cgroups;
> > > > tsk->cgroups = &init_css_set;
> > > > + perf_cgroup_sched_in(tsk);
> > > > + local_irq_restore(flags);
> > > > task_unlock(tsk);
> > > > if (cg)
> > > > put_css_set_taskexit(cg);
> > >
> > > So you too need a callback on cgroup change there.. Li, Paul, any chance
> > > we can fix this cgroup_subsys::exit callback? The scheduler code needs
> > > to do funny thing because its in the wrong place as well.
> >
> > cgroup guys? Shall I just fix this exit thing since the only user seems
> > to be the scheduler and now perf for both of which its unfortunate at
> > best?
>
> Are you suggesting that the cgroup_exit on task_exit notification should be
> pulled out?


No, just fixed. The callback as it exists isn't useful and leads to
hacks like the above.


> > Balbir, memcontrol.c uses pre_destroy(), I pose that using this method
> > is broken per definition since it makes the cgroup empty notification
> > void.
> >
>
> We use pre_destroy() to reclaim, so that delete/rmdir() will be able
> to clean up the node/group. I am not sure what you mean by it makes
> the empty notification void and why pre_destroy() is broken?

A quick look at the code suggests it can return -EBUSY (and other
errors); in that case the rmdir of the empty cgroup will fail.

Therefore it can happen that after the last task is removed and we get
the notification that the cgroup is empty, our attempt at the rmdir
will still fail.

This again means that all such notification handlers must poll state,
which is ridiculous.
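
To make the polling concrete, such a release handler ends up looking
roughly like the sketch below (plain userspace C with a made-up cgroup
path, not kernel code), because rmdir() of the supposedly empty group
can still come back with EBUSY:

/*
 * Minimal sketch of a cgroup release handler that has to poll:
 * rmdir() of an "empty" cgroup may still fail with EBUSY while a
 * controller (e.g. memcg) holds references.  The path is made up.
 */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *cgrp = argc > 1 ? argv[1] : "/cgroups/memory/doomed";
	int tries;

	for (tries = 0; tries < 100; tries++) {
		if (rmdir(cgrp) == 0)
			return 0;		/* really gone */
		if (errno != EBUSY) {
			perror("rmdir");
			return 1;
		}
		/* the empty notification already fired, yet we must poll */
		usleep(100000);
	}
	fprintf(stderr, "%s: still busy after %d tries\n", cgrp, tries);
	return 1;
}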


2011-02-02 19:03:08

by Balbir Singh

Subject: Re: [PATCH 1/2] perf_events: add cgroup support (v8)

* Peter Zijlstra <[email protected]> [2011-02-02 13:46:32]:

> On Wed, 2011-02-02 at 17:20 +0530, Balbir Singh wrote:
> > * Peter Zijlstra <[email protected]> [2011-02-02 12:29:20]:
> >
> > > On Thu, 2011-01-20 at 15:39 +0100, Peter Zijlstra wrote:
> > > > On Thu, 2011-01-20 at 15:30 +0200, Stephane Eranian wrote:
> > > > > @@ -4259,8 +4261,20 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
> > > > >
> > > > > /* Reassign the task to the init_css_set. */
> > > > > task_lock(tsk);
> > > > > + /*
> > > > > + * we mask interrupts to prevent:
> > > > > + * - timer tick to cause event rotation which
> > > > > + * could schedule back in cgroup events after
> > > > > + * they were switched out by perf_cgroup_sched_out()
> > > > > + *
> > > > > + * - preemption which could schedule back in cgroup events
> > > > > + */
> > > > > + local_irq_save(flags);
> > > > > + perf_cgroup_sched_out(tsk);
> > > > > cg = tsk->cgroups;
> > > > > tsk->cgroups = &init_css_set;
> > > > > + perf_cgroup_sched_in(tsk);
> > > > > + local_irq_restore(flags);
> > > > > task_unlock(tsk);
> > > > > if (cg)
> > > > > put_css_set_taskexit(cg);
> > > >
> > > > So you too need a callback on cgroup change there.. Li, Paul, any chance
> > > > we can fix this cgroup_subsys::exit callback? The scheduler code needs
> > > > to do funny thing because its in the wrong place as well.
> > >
> > > cgroup guys? Shall I just fix this exit thing since the only user seems
> > > to be the scheduler and now perf for both of which its unfortunate at
> > > best?
> >
> > Are you suggesting that the cgroup_exit on task_exit notification should be
> > pulled out?
>
>
> No, just fixed. The callback as it exists isn't useful and leads to
> hacks like the above.
>

OK

>
> > > Balbir, memcontrol.c uses pre_destroy(), I pose that using this method
> > > is broken per definition since it makes the cgroup empty notification
> > > void.
> > >
> >
> > We use pre_destroy() to reclaim, so that delete/rmdir() will be able
> > to clean up the node/group. I am not sure what you mean by it makes
> > the empty notification void and why pre_destroy() is broken?
>
> A quick look at the code looked like it could return -EBUSY (and other
> errors), in that case the rmdir of the empty cgroup will fail.
>
> Therefore it can happen that after the last task is removed, and we get
> the notification that the cgroup is empty, and we attempt the rmdir we
> will fail.
>
> This again means that all such notification handlers must poll state,
> which is ridiculous.

The failure occurs because someone has an active reference to the
cgroup structure. In the case of memory, it used to be every
page_cgroup. The only reason why a notification handler would have to
poll state is if

1. a notification is sent that there are no references and the group
   can be cleaned up, and
2. a new reference is acquired before the cleanup

1 followed by 2 is unlikely


--
Three Cheers,
Balbir

2011-02-07 16:09:35

by Peter Zijlstra

Subject: [RFC][PATCH] cgroup: Fix cgroup_subsys::exit callback

On Thu, 2011-02-03 at 00:32 +0530, Balbir Singh wrote:
> > No, just fixed. The callback as it exists isn't useful and leads to
> > hacks like the above.

---
Subject: cgroup: Fix cgroup_subsys::exit callback
From: Peter Zijlstra <[email protected]>
Date: Mon Feb 07 17:02:20 CET 2011

Make the ::exit method act like ::attach, it is after all very nearly
the same thing.

Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <new-submission>
---
Index: linux-2.6/include/linux/cgroup.h
===================================================================
--- linux-2.6.orig/include/linux/cgroup.h
+++ linux-2.6/include/linux/cgroup.h
@@ -474,7 +474,8 @@ struct cgroup_subsys {
struct cgroup *old_cgrp, struct task_struct *tsk,
bool threadgroup);
void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
- void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
+ void (*exit)(struct cgroup_subsys *ss, struct cgroup *cgrp,
+ struct cgroup *old_cgrp, struct task_struct *task);
int (*populate)(struct cgroup_subsys *ss,
struct cgroup *cgrp);
void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);
Index: linux-2.6/kernel/cgroup.c
===================================================================
--- linux-2.6.orig/kernel/cgroup.c
+++ linux-2.6/kernel/cgroup.c
@@ -4230,20 +4230,10 @@ void cgroup_post_fork(struct task_struct
*/
void cgroup_exit(struct task_struct *tsk, int run_callbacks)
{
- int i;
struct css_set *cg;
+ int i;

- if (run_callbacks && need_forkexit_callback) {
- /*
- * modular subsystems can't use callbacks, so no need to lock
- * the subsys array
- */
- for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
- struct cgroup_subsys *ss = subsys[i];
- if (ss->exit)
- ss->exit(ss, tsk);
- }
- }
+ mutex_lock(&cgroup_mutex);

/*
* Unlink from the css_set task list if necessary.
@@ -4262,6 +4252,25 @@ void cgroup_exit(struct task_struct *tsk
cg = tsk->cgroups;
tsk->cgroups = &init_css_set;
task_unlock(tsk);
+
+ if (run_callbacks && need_forkexit_callback) {
+ /*
+ * modular subsystems can't use callbacks, so no need to lock
+ * the subsys array
+ */
+ for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
+ struct cgroup_subsys *ss = subsys[i];
+ if (ss->exit) {
+ struct cgroup *old_cgrp =
+ rcu_dereference_raw(cg->subsys[i])->cgroup;
+ struct cgroup *cgrp = task_cgroup(tsk, i);
+ ss->exit(ss, cgrp, old_cgrp, tsk);
+ }
+ }
+ }
+
+ mutex_unlock(&cgroup_mutex);
+
if (cg)
put_css_set_taskexit(cg);
}
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -606,9 +606,6 @@ static inline struct task_group *task_gr
struct task_group *tg;
struct cgroup_subsys_state *css;

- if (p->flags & PF_EXITING)
- return &root_task_group;
-
css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
lockdep_is_held(&task_rq(p)->lock));
tg = container_of(css, struct task_group, css);
@@ -9081,7 +9078,8 @@ cpu_cgroup_attach(struct cgroup_subsys *
}

static void
-cpu_cgroup_exit(struct cgroup_subsys *ss, struct task_struct *task)
+cpu_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
+ struct cgroup *old_cgrp, struct task_struct *task)
{
/*
* cgroup_exit() is called in the copy_process() failure path.

2011-02-07 19:29:34

by Paul Menage

Subject: Re: [PATCH 1/2] perf_events: add cgroup support (v8)

On Wed, Feb 2, 2011 at 4:46 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2011-02-02 at 17:20 +0530, Balbir Singh wrote:
>> * Peter Zijlstra <[email protected]> [2011-02-02 12:29:20]:
>>
>> > On Thu, 2011-01-20 at 15:39 +0100, Peter Zijlstra wrote:
>> > > On Thu, 2011-01-20 at 15:30 +0200, Stephane Eranian wrote:
>> > > > @@ -4259,8 +4261,20 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
>> > > >
>> > > >         /* Reassign the task to the init_css_set. */
>> > > >         task_lock(tsk);
>> > > > +       /*
>> > > > +        * we mask interrupts to prevent:
>> > > > +        * - timer tick to cause event rotation which
>> > > > +        *   could schedule back in cgroup events after
>> > > > +        *   they were switched out by perf_cgroup_sched_out()
>> > > > +        *
>> > > > +        * - preemption which could schedule back in cgroup events
>> > > > +        */
>> > > > +       local_irq_save(flags);
>> > > > +       perf_cgroup_sched_out(tsk);
>> > > >         cg = tsk->cgroups;
>> > > >         tsk->cgroups = &init_css_set;
>> > > > +       perf_cgroup_sched_in(tsk);
>> > > > +       local_irq_restore(flags);
>> > > >         task_unlock(tsk);
>> > > >         if (cg)
>> > > >                 put_css_set_taskexit(cg);
>> > >
>> > > So you too need a callback on cgroup change there.. Li, Paul, any chance
>> > > we can fix this cgroup_subsys::exit callback? The scheduler code needs
>> > > to do funny thing because its in the wrong place as well.
>> >
>> > cgroup guys? Shall I just fix this exit thing since the only user seems
>> > to be the scheduler and now perf for both of which its unfortunate at
>> > best?
>>
>> Are you suggesting that the cgroup_exit on task_exit notification should be
>> pulled out?
>
>
> No, just fixed. The callback as it exists isn't useful and leads to
> hacks like the above.
>
>
>> > Balbir, memcontrol.c uses pre_destroy(), I pose that using this method
>> > is broken per definition since it makes the cgroup empty notification
>> > void.
>> >
>>
>> We use pre_destroy() to reclaim, so that delete/rmdir() will be able
>> to clean up the node/group. I am not sure what you mean by it makes
>> the empty notification void and why pre_destroy() is broken?
>
> A quick look at the code looked like it could return -EBUSY (and other
> errors), in that case the rmdir of the empty cgroup will fail.
>
> Therefore it can happen that after the last task is removed, and we get
> the notification that the cgroup is empty, and we attempt the rmdir we
> will fail.
>
> This again means that all such notification handlers must poll state,
> which is ridiculous.
>

Not necessarily - we could make it so that a failed rmdir() sets a bit
that causes a notification again once the final refcount on the cgroup
is dropped.
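
As a toy model of that idea (plain C with hypothetical names, not the
actual cgroup code): a failed rmdir marks the group, and the final put
re-fires the release notification instead of making userspace poll.

/*
 * Toy model of "failed rmdir sets a bit, final put re-notifies".
 * All names are hypothetical; this only illustrates the protocol.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_cgroup {
	int  refcnt;		/* references held by subsystems */
	bool rmdir_waiting;	/* set when rmdir failed with -EBUSY */
};

static void notify_release(struct toy_cgroup *cg)
{
	printf("notify userspace: group %p is releasable\n", (void *)cg);
}

static int toy_rmdir(struct toy_cgroup *cg)
{
	if (cg->refcnt) {
		cg->rmdir_waiting = true;	/* remember the failed attempt */
		return -1;			/* -EBUSY */
	}
	printf("rmdir succeeded\n");
	return 0;
}

static void toy_put(struct toy_cgroup *cg)
{
	if (--cg->refcnt == 0 && cg->rmdir_waiting) {
		cg->rmdir_waiting = false;
		notify_release(cg);	/* userspace retries rmdir, no polling */
	}
}

int main(void)
{
	struct toy_cgroup cg = { .refcnt = 1 };

	if (toy_rmdir(&cg))		/* fails: a subsystem still holds a ref */
		printf("rmdir: busy, flag set\n");
	toy_put(&cg);			/* last ref dropped -> re-notify */
	return toy_rmdir(&cg);		/* now succeeds */
}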

Paul

2011-02-07 19:29:26

by Paul Menage

Subject: Re: [RFC][PATCH] cgroup: Fix cgroup_subsys::exit callback

On Mon, Feb 7, 2011 at 8:10 AM, Peter Zijlstra <[email protected]> wrote:
>
> Make the ::exit method act like ::attach, it is after all very nearly
> the same thing.

The major difference between attach and exit is that the former is
only triggered in response to user cgroup-management action, whereas
the latter is triggered whenever a task exits, even if cgroups aren't
set up.


>  void cgroup_exit(struct task_struct *tsk, int run_callbacks)
>  {
> -       int i;
>        struct css_set *cg;
> +       int i;
>
> -       if (run_callbacks && need_forkexit_callback) {
> -               /*
> -                * modular subsystems can't use callbacks, so no need to lock
> -                * the subsys array
> -                */
> -               for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
> -                       struct cgroup_subsys *ss = subsys[i];
> -                       if (ss->exit)
> -                               ss->exit(ss, tsk);
> -               }
> -       }
> +       mutex_lock(&cgroup_mutex);

NACK - cgroup_mutex is way too heavy to take in the task exit path.
We'll need to find some other way to fix this if it's really needed.
task->alloc_lock is also normally a valid thing to synchronize against
cgroup moves, but I'd have to look at the exit path to see if it's
still valid there.

Paul

2011-02-07 20:01:58

by Peter Zijlstra

Subject: Re: [RFC][PATCH] cgroup: Fix cgroup_subsys::exit callback

On Mon, 2011-02-07 at 11:28 -0800, Paul Menage wrote:
> On Mon, Feb 7, 2011 at 8:10 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > Make the ::exit method act like ::attach, it is after all very nearly
> > the same thing.
>
> The major difference between attach and exit is that the former is
> only triggered in response to user cgroup-management action, whereas
> the latter is triggered whenever a task exits, even if cgroups aren't
> set up.

And the major likeness is that they both migrate a task from one cgroup
to another. You cannot simply ignore that.

> > void cgroup_exit(struct task_struct *tsk, int run_callbacks)
> > {
> > - int i;
> > struct css_set *cg;
> > + int i;
> >
> > - if (run_callbacks && need_forkexit_callback) {
> > - /*
> > - * modular subsystems can't use callbacks, so no need to lock
> > - * the subsys array
> > - */
> > - for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
> > - struct cgroup_subsys *ss = subsys[i];
> > - if (ss->exit)
> > - ss->exit(ss, tsk);
> > - }
> > - }
> > + mutex_lock(&cgroup_mutex);
>
> NACK - cgroup_mutex is way too heavy to take in the task exit path.
> We'll need to find some other way to fix this if it's really needed.
> task->alloc_lock is also normally a valid thing to synchronize against
> cgroup moves, but I'd have to look at the exit path to see if it's
> still valid there.

If maybe you'd, like, respond more often than about once every two
months I might actually care what you think.

2011-02-07 20:08:19

by Peter Zijlstra

Subject: Re: [PATCH 1/2] perf_events: add cgroup support (v8)

On Mon, 2011-02-07 at 11:29 -0800, Paul Menage wrote:
> > This again means that all such notification handlers must poll state,
> > which is ridiculous.
> >
>
> Not necessarily - we could make it that a failed rmdir() sets a bit
> that causes a notification again once the final refcount is dropped
> again on the cgroup.

But that doesn't work if you fail for any reason other than a
refcount, something you yourself proposed a while back.

Also, the core cgroup code should deal with the cgroup refcount, not
individual controllers.

2011-02-07 21:22:01

by Paul Menage

Subject: Re: [RFC][PATCH] cgroup: Fix cgroup_subsys::exit callback

On Mon, Feb 7, 2011 at 12:02 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, 2011-02-07 at 11:28 -0800, Paul Menage wrote:
>> On Mon, Feb 7, 2011 at 8:10 AM, Peter Zijlstra <[email protected]> wrote:
>> >
>> > Make the ::exit method act like ::attach, it is after all very nearly
>> > the same thing.
>>
>> The major difference between attach and exit is that the former is
>> only triggered in response to user cgroup-management action, whereas
>> the latter is triggered whenever a task exits, even if cgroups aren't
>> set up.
>
> And the major likeness is that they both migrate a task from one cgroup
> to another. You cannot simply ignore that.

True, and the exit path for cgroups has always been a bit fuzzy - that
was kind of inherited from cpusets, where this worked out OK, but the
CPU subsystem has more interesting requirements.

An important semantic difference between attach and exit is that in
the exit case the destination is always the root, and the task in
question is going to be exiting before doing anything else
interesting. So it should be possible to optimize that path a lot
compared to the regular attach (many things related to resource usage
can be ignored, since the task won't have time to actually use any
non-trivial amount of resources).

>
> If maybe you're like respond more often than about once every two months
> I might actually care what you think.

Yes, sadly since switching groups at Google, cgroups has become pretty
much just a spare-time activity for me - but that in itself doesn't
automatically invalidate my opinion when I do have time to respond.
It's still the case that cgroup_mutex is an incredibly heavyweight
mutex that has no business being in the task exit path. Or do you just
believe that ad hominem is a valid style of argument?

How about if the exit callback was moved before the preceding
task_unlock()? Since I think the scheduler is still the only user of
the exit callback, redefining the locking semantics should be fine.

Paul

2011-02-07 21:34:00

by Paul Menage

Subject: Re: [PATCH 1/2] perf_events: add cgroup support (v8)

On Mon, Feb 7, 2011 at 12:09 PM, Peter Zijlstra <[email protected]> wrote:
>
> Also, the core cgroup code should deal with the cgroup refcount, not
> individual controllers.

Right, that's what I meant - make it so that an rmdir that fails
because a subsystem is still holding a refcount puts/leaves the cgroup
in a state where userspace will still get a notification when the last
refcount is dropped. Not something that the subsystem would have to
worry about.

Paul

2011-02-08 10:23:17

by Peter Zijlstra

Subject: Re: [RFC][PATCH] cgroup: Fix cgroup_subsys::exit callback

On Mon, 2011-02-07 at 13:21 -0800, Paul Menage wrote:
> On Mon, Feb 7, 2011 at 12:02 PM, Peter Zijlstra <[email protected]> wrote:
> > On Mon, 2011-02-07 at 11:28 -0800, Paul Menage wrote:
> >> On Mon, Feb 7, 2011 at 8:10 AM, Peter Zijlstra <[email protected]> wrote:
> >> >
> >> > Make the ::exit method act like ::attach, it is after all very nearly
> >> > the same thing.
> >>
> >> The major difference between attach and exit is that the former is
> >> only triggered in response to user cgroup-management action, whereas
> >> the latter is triggered whenever a task exits, even if cgroups aren't
> >> set up.
> >
> > And the major likeness is that they both migrate a task from one cgroup
> > to another. You cannot simply ignore that.
>
> True, and the exit path for cgroups has always been a bit fuzzy - that
> was kind of inherited from cpusets, where this worked out OK, but the
> CPU subsystem has more interesting requirements.

Yeah, from that pov we're actually still a running task that can and
will still schedule.

> An important semantic difference between attach and exit is that in
> the exit case the destination is always the root, and the task in
> question is going to be exiting before doing anything else
> interesting. So it should be possible to optimize that path a lot
> compared to the regular attach (many things related to resource usage
> can be ignored, since the task won't have time to actually use any
> non-trivial amount of resources).

One could also argue that the accumulated time (/other resources) of all
tasks exiting might end up being a significant amount, but yeah
whatever :-)

I'm mostly concerned with not wrecking state (and crashing/leaking etc).

> >
> > If maybe you're like respond more often than about once every two months
> > I might actually care what you think.
>
> Yes, sadly since switching groups at Google, cgroups has become pretty
> much just a spare-time activity for me

And here I thought Google was starting to understand what community
participation meant.. is there anybody you know who can play a more
active role in the whole cgroup thing?

> - but that in itself doesn't
> automatically invalidate my opinion when I do have time to respond.
> It's still the case that cgroup_mutex is an incredibly heavyweight
> mutex that has no business being in the task exit path. Or do you just
> believe that ad hominem is a valid style of argument?

No, it was mostly frustration talking.

> How about if the exit callback was moved before the preceding
> task_unlock()? Since I think the scheduler is still the only user of
> the exit callback, redefining the locking semantics should be fine.

Like the below? Both the perf and sched exit callbacks are fine with
being called under task_lock afaict, but I haven't actually run with
lockdep enabled to see if I missed something.

I also pondered doing the cgroup exit from __put_task_struct() but that
had another problem iirc.

---
Index: linux-2.6/include/linux/cgroup.h
===================================================================
--- linux-2.6.orig/include/linux/cgroup.h
+++ linux-2.6/include/linux/cgroup.h
@@ -474,7 +474,8 @@ struct cgroup_subsys {
struct cgroup *old_cgrp, struct task_struct *tsk,
bool threadgroup);
void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
- void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
+ void (*exit)(struct cgroup_subsys *ss, struct cgroup *cgrp,
+ struct cgroup *old_cgrp, struct task_struct *task);
int (*populate)(struct cgroup_subsys *ss,
struct cgroup *cgrp);
void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);
Index: linux-2.6/kernel/cgroup.c
===================================================================
--- linux-2.6.orig/kernel/cgroup.c
+++ linux-2.6/kernel/cgroup.c
@@ -4230,20 +4230,8 @@ void cgroup_post_fork(struct task_struct
*/
void cgroup_exit(struct task_struct *tsk, int run_callbacks)
{
- int i;
struct css_set *cg;
-
- if (run_callbacks && need_forkexit_callback) {
- /*
- * modular subsystems can't use callbacks, so no need to lock
- * the subsys array
- */
- for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
- struct cgroup_subsys *ss = subsys[i];
- if (ss->exit)
- ss->exit(ss, tsk);
- }
- }
+ int i;

/*
* Unlink from the css_set task list if necessary.
@@ -4261,7 +4249,24 @@ void cgroup_exit(struct task_struct *tsk
task_lock(tsk);
cg = tsk->cgroups;
tsk->cgroups = &init_css_set;
+
+ if (run_callbacks && need_forkexit_callback) {
+ /*
+ * modular subsystems can't use callbacks, so no need to lock
+ * the subsys array
+ */
+ for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
+ struct cgroup_subsys *ss = subsys[i];
+ if (ss->exit) {
+ struct cgroup *old_cgrp =
+ rcu_dereference_raw(cg->subsys[i])->cgroup;
+ struct cgroup *cgrp = task_cgroup(tsk, i);
+ ss->exit(ss, cgrp, old_cgrp, tsk);
+ }
+ }
+ }
task_unlock(tsk);
+
if (cg)
put_css_set_taskexit(cg);
}
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -606,9 +606,6 @@ static inline struct task_group *task_gr
struct task_group *tg;
struct cgroup_subsys_state *css;

- if (p->flags & PF_EXITING)
- return &root_task_group;
-
css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
lockdep_is_held(&task_rq(p)->lock));
tg = container_of(css, struct task_group, css);
@@ -9081,7 +9078,8 @@ cpu_cgroup_attach(struct cgroup_subsys *
}

static void
-cpu_cgroup_exit(struct cgroup_subsys *ss, struct task_struct *task)
+cpu_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
+ struct cgroup *old_cgrp, struct task_struct *task)
{
/*
* cgroup_exit() is called in the copy_process() failure path.

2011-02-10 02:03:48

by Li Zefan

Subject: Re: [RFC][PATCH] cgroup: Fix cgroup_subsys::exit callback

(just came back from vacation)

>> How about if the exit callback was moved before the preceding
>> task_unlock()? Since I think the scheduler is still the only user of
>> the exit callback, redefining the locking semantics should be fine.
>
> Like the below? Both the perf and sched exit callback are fine with
> being called under task_lock afaict, but I haven't actually ran with
> lockdep enabled to see if I missed something.
>

This change should be fine.

...

> +
> + if (run_callbacks && need_forkexit_callback) {
> + /*
> + * modular subsystems can't use callbacks, so no need to lock
> + * the subsys array
> + */
> + for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
> + struct cgroup_subsys *ss = subsys[i];
> + if (ss->exit) {
> + struct cgroup *old_cgrp =
> + rcu_dereference_raw(cg->subsys[i])->cgroup;
> + struct cgroup *cgrp = task_cgroup(tsk, i);
> + ss->exit(ss, cgrp, old_cgrp, tsk);

Since neither sched nor perf will use the 2 args @cgrp and @old_cgrp,
why bother changing the ->exit interface?

> + }
> + }
> + }
> task_unlock(tsk);
> +
> if (cg)
> put_css_set_taskexit(cg);
> }

2011-02-11 13:12:49

by Peter Zijlstra

Subject: Re: [RFC][PATCH] cgroup: Fix cgroup_subsys::exit callback

On Thu, 2011-02-10 at 10:04 +0800, Li Zefan wrote:
> Since both sched and perf won't use the 2 args @cgrp and @old_cgrp,
> don't bother to change the ->exit interface?

Symmetry with ->attach: both sched and perf use a common method to
implement attach and exit, so if one needs it the other would too.
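
A subsystem that did want the group pointers could then handle both
paths with one helper, along these lines (a stand-alone sketch with
stand-in types and hypothetical example_* names, not the actual sched
or perf code):

/*
 * Sketch of the ->attach / ->exit symmetry under the proposed
 * signature.  Types and example_* names are stand-ins, not the real
 * kernel definitions.
 */
#include <stdbool.h>

struct cgroup { int id; };
struct cgroup_subsys { int id; };
struct task_struct { int pid; };

/* common migration work: charge the new group, uncharge the old one */
static void example_move_task(struct task_struct *task,
			      struct cgroup *to, struct cgroup *from)
{
	(void)task; (void)to; (void)from;
}

/* user moved the task via the cgroup filesystem */
static void example_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
			   struct cgroup *old_cgrp, struct task_struct *tsk,
			   bool threadgroup)
{
	(void)ss; (void)threadgroup;
	example_move_task(tsk, cgrp, old_cgrp);
}

/* task exits: it is migrated back to the root group */
static void example_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
			 struct cgroup *old_cgrp, struct task_struct *task)
{
	(void)ss;
	example_move_task(task, cgrp, old_cgrp);
}

int main(void)
{
	struct cgroup_subsys ss = { 0 };
	struct cgroup root = { 0 }, grp = { 1 };
	struct task_struct t = { 42 };

	example_attach(&ss, &grp, &root, &t, false);	/* user-driven move */
	example_exit(&ss, &root, &grp, &t);		/* exit: back to root */
	return 0;
}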


2011-02-13 12:52:24

by Balbir Singh

Subject: Re: [RFC][PATCH] cgroup: Fix cgroup_subsys::exit callback

* Peter Zijlstra <[email protected]> [2011-02-07 17:10:33]:

> On Thu, 2011-02-03 at 00:32 +0530, Balbir Singh wrote:
> > > No, just fixed. The callback as it exists isn't useful and leads to
> > > hacks like the above.
>
> ---
> Subject: cgroup: Fix cgroup_subsys::exit callback
> From: Peter Zijlstra <[email protected]>
> Date: Mon Feb 07 17:02:20 CET 2011
>
> Make the ::exit method act like ::attach, it is after all very nearly
> the same thing.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> LKML-Reference: <new-submission>
> ---
> Index: linux-2.6/include/linux/cgroup.h
> ===================================================================
> --- linux-2.6.orig/include/linux/cgroup.h
> +++ linux-2.6/include/linux/cgroup.h
> @@ -474,7 +474,8 @@ struct cgroup_subsys {
> struct cgroup *old_cgrp, struct task_struct *tsk,
> bool threadgroup);
> void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
> - void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
> + void (*exit)(struct cgroup_subsys *ss, struct cgroup *cgrp,
> + struct cgroup *old_cgrp, struct task_struct *task);

The effective changes I see are

1. mutex_lock() being held
2. the old cgroup being passed as part of the notification

Is 1 required? I don't see anything about it in the changelog. For 2, I
don't see it being used; is the use in the scheduler cgroup path/patch?

--
Three Cheers,
Balbir

2011-02-14 04:33:29

by Paul Menage

Subject: Re: [RFC][PATCH] cgroup: Fix cgroup_subsys::exit callback

On Tue, Feb 8, 2011 at 2:24 AM, Peter Zijlstra <[email protected]> wrote:
>
> And here I thought Google was starting to understand what community
> participation meant.. is there anybody you know who can play a more
> active role in the whole cgroup thing?

Google has plenty of people actively working on various aspects of
cgroups (particularly memory and I/O scheduling) in mainline. There's
no one tasked with core cgroups framework maintenance, but there are
folks from Fujitsu and IBM (Li Zefan, Balbir Singh, etc) who have more
experience and state in the core framework than anyone else in Google
anyway.

>
> Like the below? Both the perf and sched exit callback are fine with
> being called under task_lock afaict, but I haven't actually ran with

Sounds good to me.

Acked-by: Paul Menage <[email protected]>

> lockdep enabled to see if I missed something.
>
> I also pondered doing the cgroup exit from __put_task_struct() but that
> had another problem iirc.

My guess is that by that time, so much of the task's context has been
destroyed that it comes too late in the tear-down for some/many
subsystems? Proving that guess either true or false (for all current
and future potential subsystems) would probably be tricky, though.

>
> ---
> Index: linux-2.6/include/linux/cgroup.h
> ===================================================================
> --- linux-2.6.orig/include/linux/cgroup.h
> +++ linux-2.6/include/linux/cgroup.h
> @@ -474,7 +474,8 @@ struct cgroup_subsys {
>                        struct cgroup *old_cgrp, struct task_struct *tsk,
>                        bool threadgroup);
>        void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
> -       void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
> +       void (*exit)(struct cgroup_subsys *ss, struct cgroup *cgrp,
> +                       struct cgroup *old_cgrp, struct task_struct *task);
>        int (*populate)(struct cgroup_subsys *ss,
>                        struct cgroup *cgrp);
>        void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> Index: linux-2.6/kernel/cgroup.c
> ===================================================================

Probably also worth including a note in the docs:

--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -602,6 +602,7 @@ void fork(struct cgroup_subsy *ss, struct task_struct *task)
Called when a task is forked into a cgroup.

void exit(struct cgroup_subsys *ss, struct task_struct *task)
+(task->alloc_lock held by caller)

Called during task exit.


> --- linux-2.6.orig/kernel/cgroup.c
> +++ linux-2.6/kernel/cgroup.c
> @@ -4230,20 +4230,8 @@ void cgroup_post_fork(struct task_struct
>  */
>  void cgroup_exit(struct task_struct *tsk, int run_callbacks)
>  {
> -       int i;
>        struct css_set *cg;
> -
> -       if (run_callbacks && need_forkexit_callback) {
> -               /*
> -                * modular subsystems can't use callbacks, so no need to lock
> -                * the subsys array
> -                */
> -               for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
> -                       struct cgroup_subsys *ss = subsys[i];
> -                       if (ss->exit)
> -                               ss->exit(ss, tsk);
> -               }
> -       }
> +       int i;
>
>        /*
>         * Unlink from the css_set task list if necessary.
> @@ -4261,7 +4249,24 @@ void cgroup_exit(struct task_struct *tsk
>        task_lock(tsk);
>        cg = tsk->cgroups;
>        tsk->cgroups = &init_css_set;
> +
> +       if (run_callbacks && need_forkexit_callback) {
> +               /*
> +                * modular subsystems can't use callbacks, so no need to lock
> +                * the subsys array
> +                */
> +               for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
> +                       struct cgroup_subsys *ss = subsys[i];
> +                       if (ss->exit) {
> +                               struct cgroup *old_cgrp =
> +                                       rcu_dereference_raw(cg->subsys[i])->cgroup;
> +                               struct cgroup *cgrp = task_cgroup(tsk, i);
> +                               ss->exit(ss, cgrp, old_cgrp, tsk);
> +                       }
> +               }
> +       }
>        task_unlock(tsk);
> +
>        if (cg)
>                put_css_set_taskexit(cg);
>  }
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -606,9 +606,6 @@ static inline struct task_group *task_gr
>        struct task_group *tg;
>        struct cgroup_subsys_state *css;
>
> -       if (p->flags & PF_EXITING)
> -               return &root_task_group;
> -
>        css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
>                        lockdep_is_held(&task_rq(p)->lock));
>        tg = container_of(css, struct task_group, css);
> @@ -9081,7 +9078,8 @@ cpu_cgroup_attach(struct cgroup_subsys *
>  }
>
>  static void
> -cpu_cgroup_exit(struct cgroup_subsys *ss, struct task_struct *task)
> +cpu_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
> +               struct cgroup *old_cgrp, struct task_struct *task)
>  {
>        /*
>         * cgroup_exit() is called in the copy_process() failure path.
>
>
>

2011-02-16 13:46:41

by Peter Zijlstra

Subject: [tip:perf/core] cgroup: Fix cgroup_subsys::exit callback

Commit-ID: d41d5a01631af821d3a3447e6613a316f5ee6c25
Gitweb: http://git.kernel.org/tip/d41d5a01631af821d3a3447e6613a316f5ee6c25
Author: Peter Zijlstra <[email protected]>
AuthorDate: Mon, 7 Feb 2011 17:02:20 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 16 Feb 2011 13:30:47 +0100

cgroup: Fix cgroup_subsys::exit callback

Make the ::exit method act like ::attach, it is after all very nearly
the same thing.

The bug had no effect on correctness - fixing it is an optimization for
the scheduler. Also, later perf-cgroups patches rely on it.

Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Paul Menage <[email protected]>
LKML-Reference: <1297160655.13327.92.camel@laptop>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/cgroup.h | 3 ++-
kernel/cgroup.c | 31 ++++++++++++++++++-------------
kernel/sched.c | 6 ++----
3 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ce104e3..38117d9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -474,7 +474,8 @@ struct cgroup_subsys {
struct cgroup *old_cgrp, struct task_struct *tsk,
bool threadgroup);
void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
- void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
+ void (*exit)(struct cgroup_subsys *ss, struct cgroup *cgrp,
+ struct cgroup *old_cgrp, struct task_struct *task);
int (*populate)(struct cgroup_subsys *ss,
struct cgroup *cgrp);
void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b24d702..f6495f3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4230,20 +4230,8 @@ void cgroup_post_fork(struct task_struct *child)
*/
void cgroup_exit(struct task_struct *tsk, int run_callbacks)
{
- int i;
struct css_set *cg;
-
- if (run_callbacks && need_forkexit_callback) {
- /*
- * modular subsystems can't use callbacks, so no need to lock
- * the subsys array
- */
- for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
- struct cgroup_subsys *ss = subsys[i];
- if (ss->exit)
- ss->exit(ss, tsk);
- }
- }
+ int i;

/*
* Unlink from the css_set task list if necessary.
@@ -4261,7 +4249,24 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
task_lock(tsk);
cg = tsk->cgroups;
tsk->cgroups = &init_css_set;
+
+ if (run_callbacks && need_forkexit_callback) {
+ /*
+ * modular subsystems can't use callbacks, so no need to lock
+ * the subsys array
+ */
+ for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
+ struct cgroup_subsys *ss = subsys[i];
+ if (ss->exit) {
+ struct cgroup *old_cgrp =
+ rcu_dereference_raw(cg->subsys[i])->cgroup;
+ struct cgroup *cgrp = task_cgroup(tsk, i);
+ ss->exit(ss, cgrp, old_cgrp, tsk);
+ }
+ }
+ }
task_unlock(tsk);
+
if (cg)
put_css_set_taskexit(cg);
}
diff --git a/kernel/sched.c b/kernel/sched.c
index e142e92..79e611c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -606,9 +606,6 @@ static inline struct task_group *task_group(struct task_struct *p)
struct task_group *tg;
struct cgroup_subsys_state *css;

- if (p->flags & PF_EXITING)
- return &root_task_group;
-
css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
lockdep_is_held(&task_rq(p)->lock));
tg = container_of(css, struct task_group, css);
@@ -8863,7 +8860,8 @@ cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
}

static void
-cpu_cgroup_exit(struct cgroup_subsys *ss, struct task_struct *task)
+cpu_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
+ struct cgroup *old_cgrp, struct task_struct *task)
{
/*
* cgroup_exit() is called in the copy_process() failure path.