2011-04-07 09:56:04

by Mike Galbraith

[permalink] [raw]
Subject: query: [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task

Greetings,

Wrt these patches:

https://lkml.org/lkml/2010/11/24/14 [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup
https://lkml.org/lkml/2010/11/24/15 [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task

I received a query regarding 2/2 because a large database company is
apparently moving tasks between cgroups frequently enough that their
database initialization time dropped from ~11 hours to ~4 hours when
they applied this patch.

Curious why these got no traction.

Regardless of the wisdom of frequent cgroup moves, Joe User launching
light tasks via cgexec would gain quite a bit as well it seems.

Q/D measurement of cgexec -g cpu:test sh -c exit.

Unpatched:

marge:~ # time perf stat --repeat=1000 cgexec -g cpu:test sh -c exit

Performance counter stats for 'cgexec -g cpu:test sh -c exit' (1000 runs):

2.686149 task-clock-msecs # 0.119 CPUs ( +- 0.067% )
2 context-switches # 0.001 M/sec ( +- 0.100% )
0 CPU-migrations # 0.000 M/sec ( +- 35.231% )
633 page-faults # 0.236 M/sec ( +- 0.003% )
6292755 cycles # 2342.668 M/sec ( +- 0.062% )
4844702 instructions # 0.770 IPC ( +- 0.004% )
974011 branches # 362.605 M/sec ( +- 0.005% )
35187 branch-misses # 3.613 % ( +- 0.027% )
<not counted> cache-references
<not counted> cache-misses

0.022597372 seconds time elapsed ( +- 0.981% )


real 0m23.054s
user 0m0.380s
sys 0m0.000s

Patched:

marge:~ # time perf stat --repeat=1000 cgexec -g cpu:test sh -c exit

Performance counter stats for 'cgexec -g cpu:test sh -c exit' (1000 runs):

2.639303 task-clock-msecs # 0.915 CPUs ( +- 0.051% )
1 context-switches # 0.000 M/sec ( +- 1.048% )
0 CPU-migrations # 0.000 M/sec ( +- 21.602% )
633 page-faults # 0.240 M/sec ( +- 0.003% )
3883572 cycles # 1471.439 M/sec ( +- 1.767% )
4830809 instructions # 1.244 IPC ( +- 0.191% )
975204 branches # 369.493 M/sec ( +- 0.009% )
35314 branch-misses # 3.621 % ( +- 0.060% )
<not counted> cache-references
<not counted> cache-misses

0.002884424 seconds time elapsed ( +- 0.055% )


real 0m3.329s
user 0m0.012s
sys 0m0.352s



2011-04-13 01:59:16

by Li Zefan

[permalink] [raw]
Subject: Re: query: [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task

Mike Galbraith wrote:
> Greetings,
>
> Wrt these patches:
>
> https://lkml.org/lkml/2010/11/24/14 [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup
> https://lkml.org/lkml/2010/11/24/15 [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task
>
> I received a query regarding 2/2 because a large database company is
> apparently moving tasks between cgroups frequently enough that their
> database initialization time dropped from ~11 hours to ~4 hours when
> they applied this patch.
>
> Curious why these got no traction.

I thought Paul was following the this. I'll spend some time on patch
review.

>
> Regardless of the wisdom of frequent cgroup moves, Joe User launching
> light tasks via cgexec would gain quite a bit as well it seems.
>

2011-04-13 03:12:00

by Mike Galbraith

[permalink] [raw]
Subject: Re: query: [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task

On Wed, 2011-04-13 at 10:02 +0800, Li Zefan wrote:
> Mike Galbraith wrote:
> > Greetings,
> >
> > Wrt these patches:
> >
> > https://lkml.org/lkml/2010/11/24/14 [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup
> > https://lkml.org/lkml/2010/11/24/15 [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task
> >
> > I received a query regarding 2/2 because a large database company is
> > apparently moving tasks between cgroups frequently enough that their
> > database initialization time dropped from ~11 hours to ~4 hours when
> > they applied this patch.
> >
> > Curious why these got no traction.
>
> I thought Paul was following the this. I'll spend some time on patch
> review.

Great!

Three orders of magnitude latency improvements are a terrible thing to
waste ;-) I tried doing it a bit differently, but would have ended up
about the same due to the need for rmdir to succeed after the attach
(detach of last task) returns.

However...

If the user _does_ that rmdir(), it's more or less back to square one.
RCU grace periods should not impact userland, but if you try to do
create/attach/detach/destroy, you run into the same bottleneck, as does
any asynchronous GC, though that's not such a poke in the eye. I tried
a straight forward move to schedule_work(), and it seems to work just
fine. rmdir() no longer takes ~30ms on my box, but closer to 20us.

cgroups: Remove call to synchronize_rcu() in cgroup_diput()

Instead of synchronously waiting via synchronize_rcu(), then initiating cgroup
destruction, schedule asynchronous destruction via call_rcu()->schedule_work()
and move along smartly.

Some numbers:
1000 x simple loop - create/attach self/detatch self/destroy, zero work.

Virgin source
real 1m39.713s 1.000000
user 0m0.000s
sys 0m0.076s

+ Android commits 60cdbd1f and 05946a1
real 0m33.627s .337237
user 0m0.056s
sys 0m0.000s

+ Android commits + below
real 0m0.046s .000461
user 0m0.000s
sys 0m0.044s

Not-signed-off-by: Mike Galbraith <[email protected]>

---
include/linux/cgroup.h | 1
kernel/cgroup.c | 59 +++++++++++++++++++++++++++----------------------
2 files changed, 34 insertions(+), 26 deletions(-)

Index: linux-2.6.39.git/include/linux/cgroup.h
===================================================================
--- linux-2.6.39.git.orig/include/linux/cgroup.h
+++ linux-2.6.39.git/include/linux/cgroup.h
@@ -231,6 +231,7 @@ struct cgroup {

/* For RCU-protected deletion */
struct rcu_head rcu_head;
+ struct work_struct work;

/* List of events which userspace want to recieve */
struct list_head event_list;
Index: linux-2.6.39.git/kernel/cgroup.c
===================================================================
--- linux-2.6.39.git.orig/kernel/cgroup.c
+++ linux-2.6.39.git/kernel/cgroup.c
@@ -836,11 +836,42 @@ static int cgroup_call_pre_destroy(struc
return ret;
}

+static void free_cgroup_work(struct work_struct *work)
+{
+ struct cgroup *cgrp = container_of(work, struct cgroup, work);
+ struct cgroup_subsys *ss;
+
+ mutex_lock(&cgroup_mutex);
+ /*
+ * Release the subsystem state objects.
+ */
+ for_each_subsys(cgrp->root, ss)
+ ss->destroy(ss, cgrp);
+
+ cgrp->root->number_of_cgroups--;
+ mutex_unlock(&cgroup_mutex);
+
+ /*
+ * Drop the active superblock reference that we took when we
+ * created the cgroup
+ */
+ deactivate_super(cgrp->root->sb);
+
+ /*
+ * if we're getting rid of the cgroup, refcount should ensure
+ * that there are no pidlists left.
+ */
+ BUG_ON(!list_empty(&cgrp->pidlists));
+
+ kfree(cgrp);
+}
+
static void free_cgroup_rcu(struct rcu_head *obj)
{
struct cgroup *cgrp = container_of(obj, struct cgroup, rcu_head);

- kfree(cgrp);
+ INIT_WORK(&cgrp->work, free_cgroup_work);
+ schedule_work(&cgrp->work);
}

static void cgroup_diput(struct dentry *dentry, struct inode *inode)
@@ -848,7 +879,7 @@ static void cgroup_diput(struct dentry *
/* is dentry a directory ? if so, kfree() associated cgroup */
if (S_ISDIR(inode->i_mode)) {
struct cgroup *cgrp = dentry->d_fsdata;
- struct cgroup_subsys *ss;
+
BUG_ON(!(cgroup_is_removed(cgrp)));
/* It's possible for external users to be holding css
* reference counts on a cgroup; css_put() needs to
@@ -856,30 +887,6 @@ static void cgroup_diput(struct dentry *
* the reference count in order to know if it needs to
* queue the cgroup to be handled by the release
* agent */
- synchronize_rcu();
-
- mutex_lock(&cgroup_mutex);
- /*
- * Release the subsystem state objects.
- */
- for_each_subsys(cgrp->root, ss)
- ss->destroy(ss, cgrp);
-
- cgrp->root->number_of_cgroups--;
- mutex_unlock(&cgroup_mutex);
-
- /*
- * Drop the active superblock reference that we took when we
- * created the cgroup
- */
- deactivate_super(cgrp->root->sb);
-
- /*
- * if we're getting rid of the cgroup, refcount should ensure
- * that there are no pidlists left.
- */
- BUG_ON(!list_empty(&cgrp->pidlists));
-
call_rcu(&cgrp->rcu_head, free_cgroup_rcu);
}
iput(inode);

2011-04-13 13:11:23

by Paul Menage

[permalink] [raw]
Subject: Re: query: [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task

On Thu, Apr 7, 2011 at 11:55 AM, Mike Galbraith <[email protected]> wrote:
> Greetings,
>
> Wrt these patches:
>
> https://lkml.org/lkml/2010/11/24/14 [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup
> https://lkml.org/lkml/2010/11/24/15 [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task
>
> I received a query regarding 2/2 because a large database company is
> apparently moving tasks between cgroups frequently enough that their
> database initialization time dropped from ~11 hours to ~4 hours when
> they applied this patch.

That sounds like a problem in their user-space code too, although I
agree that making cgroup moves faster is a good thing.

>
> Curious why these got no traction.
>

Apart from just my chronic lack of time to work on cgroups, there were
a couple of issues:

1) we had trouble getting the semantics right for the release_agent
notifications. Not that this is something that I suspect many people
care about, but it has been part of the API since the cpuset days. I
spent a while trying to juggle the way that release notifications were
done (via an event counter rather than a simple flag) but never got
them finished.

2) I have this nagging feeling that the synchronize_rcu() call in
cgroup_attach_task() was protecting more than is obvious. Certainly
when cgroups first went in, that synchronize_rcu() call meant that
cgroup_rmdir() could guarantee that if the cgroup was empty, there
were no threads in RCU-read sections accessing their old cgroup via
their RCU-proected current->cgroups pointers, so objects could just be
deleted at that point. A year or two ago we RCU-ified most/all of the
cgroup deletion path, so this shouldn't be an issue now, but I'm still
a bit worried that we missed something. I'm probably being
over-paranoid though.

We're looking at testing these patches at Google, which will give a
little more confidence.

There's a conflicting patchset (allowing moving entire processes by
writing to cgroup.procs) that Ben Blum has been trying to get in for
ages, and which has just gone in to -mm - the RCU change patches will
likely need a bit of merge love.

Paul

2011-04-13 13:17:19

by Paul Menage

[permalink] [raw]
Subject: Re: query: [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task

On Wed, Apr 13, 2011 at 5:11 AM, Mike Galbraith <[email protected]> wrote:
> If the user _does_ that rmdir(), it's more or less back to square one.
> RCU grace periods should not impact userland, but if you try to do
> create/attach/detach/destroy, you run into the same bottleneck, as does
> any asynchronous GC, though that's not such a poke in the eye. ?I tried
> a straight forward move to schedule_work(), and it seems to work just
> fine. ?rmdir() no longer takes ~30ms on my box, but closer to 20us.

> + ? ? ? /*
> + ? ? ? ?* Release the subsystem state objects.
> + ? ? ? ?*/
> + ? ? ? for_each_subsys(cgrp->root, ss)
> + ? ? ? ? ? ? ? ss->destroy(ss, cgrp);
> +
> + ? ? ? cgrp->root->number_of_cgroups--;
> + ? ? ? mutex_unlock(&cgroup_mutex);
> +
> + ? ? ? /*
> + ? ? ? ?* Drop the active superblock reference that we took when we
> + ? ? ? ?* created the cgroup
> + ? ? ? ?*/
> + ? ? ? deactivate_super(cgrp->root->sb);
> +
> + ? ? ? /*
> + ? ? ? ?* if we're getting rid of the cgroup, refcount should ensure
> + ? ? ? ?* that there are no pidlists left.
> + ? ? ? ?*/
> + ? ? ? BUG_ON(!list_empty(&cgrp->pidlists));
> +
> + ? ? ? kfree(cgrp);

We might want to punt this through RCU again, in case the subsystem
destroy() callbacks left anything around that was previously depending
on the RCU barrier.

Also, I'd be concerned that subsystems might get confused by the fact
that a new group called 'foo' could be created before the old 'foo'
has been cleaned up? (And do any subsystems rely on being able to
access the cgroup dentry up until the point when destroy() is called?

Paul

2011-04-13 16:52:14

by Mike Galbraith

[permalink] [raw]
Subject: Re: query: [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task

On Wed, 2011-04-13 at 15:10 +0200, Paul Menage wrote:
> On Thu, Apr 7, 2011 at 11:55 AM, Mike Galbraith <[email protected]> wrote:
> > Greetings,
> >
> > Wrt these patches:
> >
> > https://lkml.org/lkml/2010/11/24/14 [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup
> > https://lkml.org/lkml/2010/11/24/15 [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task
> >
> > I received a query regarding 2/2 because a large database company is
> > apparently moving tasks between cgroups frequently enough that their
> > database initialization time dropped from ~11 hours to ~4 hours when
> > they applied this patch.
>
> That sounds like a problem in their user-space code too, although I
> agree that making cgroup moves faster is a good thing.

I suspect they could avoid the issue, but don't have details.

> > Curious why these got no traction.
> >
>
> Apart from just my chronic lack of time to work on cgroups, there were
> a couple of issues:
>
> 1) we had trouble getting the semantics right for the release_agent
> notifications. Not that this is something that I suspect many people
> care about, but it has been part of the API since the cpuset days. I
> spent a while trying to juggle the way that release notifications were
> done (via an event counter rather than a simple flag) but never got
> them finished.
>
> 2) I have this nagging feeling that the synchronize_rcu() call in
> cgroup_attach_task() was protecting more than is obvious. Certainly
> when cgroups first went in, that synchronize_rcu() call meant that
> cgroup_rmdir() could guarantee that if the cgroup was empty, there
> were no threads in RCU-read sections accessing their old cgroup via
> their RCU-proected current->cgroups pointers, so objects could just be
> deleted at that point. A year or two ago we RCU-ified most/all of the
> cgroup deletion path, so this shouldn't be an issue now, but I'm still
> a bit worried that we missed something. I'm probably being
> over-paranoid though.

Thanks for the info.

> We're looking at testing these patches at Google, which will give a
> little more confidence.

Goody. I'll be doing some small scale beating too fwiw.

> There's a conflicting patchset (allowing moving entire processes by
> writing to cgroup.procs) that Ben Blum has been trying to get in for
> ages, and which has just gone in to -mm - the RCU change patches will
> likely need a bit of merge love.

Yeah, saw those.

-Mike

2011-04-13 16:57:06

by Mike Galbraith

[permalink] [raw]
Subject: Re: query: [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task

On Wed, 2011-04-13 at 15:16 +0200, Paul Menage wrote:
> On Wed, Apr 13, 2011 at 5:11 AM, Mike Galbraith <[email protected]> wrote:
> > If the user _does_ that rmdir(), it's more or less back to square one.
> > RCU grace periods should not impact userland, but if you try to do
> > create/attach/detach/destroy, you run into the same bottleneck, as does
> > any asynchronous GC, though that's not such a poke in the eye. I tried
> > a straight forward move to schedule_work(), and it seems to work just
> > fine. rmdir() no longer takes ~30ms on my box, but closer to 20us.
>
> > + /*
> > + * Release the subsystem state objects.
> > + */
> > + for_each_subsys(cgrp->root, ss)
> > + ss->destroy(ss, cgrp);
> > +
> > + cgrp->root->number_of_cgroups--;
> > + mutex_unlock(&cgroup_mutex);
> > +
> > + /*
> > + * Drop the active superblock reference that we took when we
> > + * created the cgroup
> > + */
> > + deactivate_super(cgrp->root->sb);
> > +
> > + /*
> > + * if we're getting rid of the cgroup, refcount should ensure
> > + * that there are no pidlists left.
> > + */
> > + BUG_ON(!list_empty(&cgrp->pidlists));
> > +
> > + kfree(cgrp);
>
> We might want to punt this through RCU again, in case the subsystem
> destroy() callbacks left anything around that was previously depending
> on the RCU barrier.
>
> Also, I'd be concerned that subsystems might get confused by the fact
> that a new group called 'foo' could be created before the old 'foo'
> has been cleaned up? (And do any subsystems rely on being able to
> access the cgroup dentry up until the point when destroy() is called?

Yeah, I already have head scratching sessions planned for these, why I
used 'seems' to work fine, and Not-signed-off-by: :)

-Mike

2011-04-14 07:26:46

by Mike Galbraith

[permalink] [raw]
Subject: Re: query: [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task

On Wed, 2011-04-13 at 18:56 +0200, Mike Galbraith wrote:
> On Wed, 2011-04-13 at 15:16 +0200, Paul Menage wrote:
> > On Wed, Apr 13, 2011 at 5:11 AM, Mike Galbraith <[email protected]> wrote:
> > > If the user _does_ that rmdir(), it's more or less back to square one.
> > > RCU grace periods should not impact userland, but if you try to do
> > > create/attach/detach/destroy, you run into the same bottleneck, as does
> > > any asynchronous GC, though that's not such a poke in the eye. I tried
> > > a straight forward move to schedule_work(), and it seems to work just
> > > fine. rmdir() no longer takes ~30ms on my box, but closer to 20us.
> >
> > > + /*
> > > + * Release the subsystem state objects.
> > > + */
> > > + for_each_subsys(cgrp->root, ss)
> > > + ss->destroy(ss, cgrp);
> > > +
> > > + cgrp->root->number_of_cgroups--;
> > > + mutex_unlock(&cgroup_mutex);
> > > +
> > > + /*
> > > + * Drop the active superblock reference that we took when we
> > > + * created the cgroup
> > > + */
> > > + deactivate_super(cgrp->root->sb);
> > > +
> > > + /*
> > > + * if we're getting rid of the cgroup, refcount should ensure
> > > + * that there are no pidlists left.
> > > + */
> > > + BUG_ON(!list_empty(&cgrp->pidlists));
> > > +
> > > + kfree(cgrp);
> >
> > We might want to punt this through RCU again, in case the subsystem
> > destroy() callbacks left anything around that was previously depending
> > on the RCU barrier.
> >
> > Also, I'd be concerned that subsystems might get confused by the fact
> > that a new group called 'foo' could be created before the old 'foo'
> > has been cleaned up? (And do any subsystems rely on being able to
> > access the cgroup dentry up until the point when destroy() is called?
>
> Yeah, I already have head scratching sessions planned for these, why I
> used 'seems' to work fine, and Not-signed-off-by: :)

Definitely-not-signed-off-by: /me

Multiple threads...

[ 155.009282] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 155.009286] IP: [<ffffffff810511fb>] process_one_work+0x3b/0x370
[ 155.009293] PGD 22c5f7067 PUD 22980c067 PMD 0
[ 155.009296] Oops: 0000 [#1] SMP
[ 155.009298] last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
[ 155.009301] CPU 0
[ 155.009302] Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd nfsd lockd nfs_acl auth_rpcgss sunrpc parport_pc parport bridge stp cpufreq_conservative microcode cpufreq_ondemand cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf nls_iso8859_1 nls_cp437 vfat fat fuse ext3 jbd dm_mod snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm sr_mod usbmouse usbhid usb_storage sg hid firewire_ohci cdrom e1000e snd_timer usb_libusual firewire_core i2c_i801 snd soundcore snd_page_alloc crc_itu_t button ext4 mbcache jbd2 crc16 uhci_hcd ehci_hcd sd_mod rtc_cmos usbcore rtc_core rtc_lib ahci libahci libata scsi_mod fan processor thermal
[ 155.009331]
[ 155.009333] Pid: 7924, comm: kworker/0:14 Not tainted 2.6.39-smpxx #1905 MEDIONPC MS-7502/MS-7502
[ 155.009336] RIP: 0010:[<ffffffff810511fb>] [<ffffffff810511fb>] process_one_work+0x3b/0x370
[ 155.009340] RSP: 0018:ffff880228ae5e00 EFLAGS: 00010003
[ 155.009341] RAX: ffff8802295ece48 RBX: ffff880229881500 RCX: 0000000000000000
[ 155.009343] RDX: 07fffc40114af672 RSI: ffff8802295ece48 RDI: 001ffff100452bd9
[ 155.009345] RBP: ffff880228ae5e50 R08: ffff88022a0b9e50 R09: ffff88022a0baab0
[ 155.009346] R10: dead000000200200 R11: 0000000000000001 R12: ffff88022fc0d980
[ 155.009348] R13: 0000000000000000 R14: ffffffff8107bc40 R15: 0000000000011d00
[ 155.009350] FS: 0000000000000000(0000) GS:ffff88022fc00000(0000) knlGS:0000000000000000
[ 155.009352] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 155.009354] CR2: 0000000000000000 CR3: 0000000228f61000 CR4: 00000000000006f0
[ 155.009355] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 155.009357] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 155.009359] Process kworker/0:14 (pid: 7924, threadinfo ffff880228ae4000, task ffff8801e6895240)
[ 155.009360] Stack:
[ 155.009361] ffff880228ae5e50 ffffffff81051758 ffffffff00000000 0000000100000082
[ 155.009364] 0000000000000000 ffff880229881500 ffff88022fc0d980 ffff880229881520
[ 155.009366] ffff88022fc0d988 0000000000011d00 ffff880228ae5ee0 ffffffff81051911
[ 155.009371] Call Trace:
[ 155.009374] [<ffffffff81051758>] ? manage_workers+0x1e8/0x240
[ 155.009377] [<ffffffff81051911>] worker_thread+0x161/0x330
[ 155.009380] [<ffffffff8102a129>] ? __wake_up_common+0x59/0x90
[ 155.009384] [<ffffffff8105c8cf>] ? switch_task_namespaces+0x1f/0x60
[ 155.009386] [<ffffffff810517b0>] ? manage_workers+0x240/0x240
[ 155.009389] [<ffffffff81057a26>] kthread+0x96/0xa0
[ 155.009392] [<ffffffff8134e894>] kernel_thread_helper+0x4/0x10
[ 155.009394] [<ffffffff81057990>] ? kthread_worker_fn+0x190/0x190
[ 155.009397] [<ffffffff8134e890>] ? gs_change+0xb/0xb
[ 155.009399] Code: 41 54 53 48 89 fb 48 83 ec 28 48 89 f7 48 8b 16 4c 8b 76 18 48 89 d1 30 c9 83 e2 04 48 89 f2 4c 0f 45 e9 48 c1 ea 05 48 c1 ef 0b <4d> 8b 65 00 48 01 d7 49 8b 55 08 83 e7 3f 8b 12 4d 8b 44 fc 38
[ 155.009416] RIP [<ffffffff810511fb>] process_one_work+0x3b/0x370
[ 155.009418] RSP <ffff880228ae5e00>
[ 155.009419] CR2: 0000000000000000
[ 155.009422] ---[ end trace ee5a315197a1a60e ]---

....dead box

2011-04-14 08:34:53

by Mike Galbraith

[permalink] [raw]
Subject: Re: query: [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task

> > On Wed, 2011-04-13 at 15:16 +0200, Paul Menage wrote:

> > > > + kfree(cgrp);
> > >
> > > We might want to punt this through RCU again, in case the subsystem
> > > destroy() callbacks left anything around that was previously depending
> > > on the RCU barrier.

Yup, great idea. Two stage teardown/free quenched the fireworks

> > > Also, I'd be concerned that subsystems might get confused by the fact
> > > that a new group called 'foo' could be created before the old 'foo'
> > > has been cleaned up? (And do any subsystems rely on being able to
> > > access the cgroup dentry up until the point when destroy() is called?

Re-creating a group doesn't appear to be a problem for 'cpu', either
with previous version running single a single beater proggy, or now
running multiple instances of create/attach/detach/destroy.

All other subsystems need testing.. and 'cpu' wants heftier testing.

cgroups: Remove call to synchronize_rcu() in cgroup_diput()

Instead of synchronously waiting via synchronize_rcu(), then initiating cgroup
destruction, schedule asynchronous destruction via call_rcu()->schedule_work()->
destroy_cgroup_rcu()->call_rcu()->free_cgroup_rcu()->kfree() two stage teardown.

Some numbers:
1000 x simple loop - create/attach self/detatch self/destroy, zero work.

Virgin source
real 1m39.713s 1.000000
user 0m0.000s
sys 0m0.076s

+ Android commits 60cdbd1f and 05946a1
real 0m33.627s .337237
user 0m0.056s
sys 0m0.000s

+ Android commits + below
real 0m0.046s .000461
user 0m0.000s
sys 0m0.044s

Not-signed-off-by: Mike Galbraith <[email protected]>
---
include/linux/cgroup.h | 1
kernel/cgroup.c | 67 +++++++++++++++++++++++++++++--------------------
2 files changed, 42 insertions(+), 26 deletions(-)

Index: linux-2.6.39.git/include/linux/cgroup.h
===================================================================
--- linux-2.6.39.git.orig/include/linux/cgroup.h
+++ linux-2.6.39.git/include/linux/cgroup.h
@@ -231,6 +231,7 @@ struct cgroup {

/* For RCU-protected deletion */
struct rcu_head rcu_head;
+ struct work_struct work;

/* List of events which userspace want to recieve */
struct list_head event_list;
Index: linux-2.6.39.git/kernel/cgroup.c
===================================================================
--- linux-2.6.39.git.orig/kernel/cgroup.c
+++ linux-2.6.39.git/kernel/cgroup.c
@@ -836,6 +836,7 @@ static int cgroup_call_pre_destroy(struc
return ret;
}

+
static void free_cgroup_rcu(struct rcu_head *obj)
{
struct cgroup *cgrp = container_of(obj, struct cgroup, rcu_head);
@@ -843,12 +844,50 @@ static void free_cgroup_rcu(struct rcu_h
kfree(cgrp);
}

+static void free_cgroup_work(struct work_struct *work)
+{
+ struct cgroup *cgrp = container_of(work, struct cgroup, work);
+ struct cgroup_subsys *ss;
+
+ mutex_lock(&cgroup_mutex);
+ /*
+ * Release the subsystem state objects.
+ */
+ for_each_subsys(cgrp->root, ss)
+ ss->destroy(ss, cgrp);
+
+ cgrp->root->number_of_cgroups--;
+ mutex_unlock(&cgroup_mutex);
+
+ /*
+ * Drop the active superblock reference that we took when we
+ * created the cgroup
+ */
+ deactivate_super(cgrp->root->sb);
+
+ /*
+ * if we're getting rid of the cgroup, refcount should ensure
+ * that there are no pidlists left.
+ */
+ BUG_ON(!list_empty(&cgrp->pidlists));
+
+ call_rcu(&cgrp->rcu_head, free_cgroup_rcu);
+}
+
+static void destroy_cgroup_rcu(struct rcu_head *obj)
+{
+ struct cgroup *cgrp = container_of(obj, struct cgroup, rcu_head);
+
+ INIT_WORK(&cgrp->work, free_cgroup_work);
+ schedule_work(&cgrp->work);
+}
+
static void cgroup_diput(struct dentry *dentry, struct inode *inode)
{
/* is dentry a directory ? if so, kfree() associated cgroup */
if (S_ISDIR(inode->i_mode)) {
struct cgroup *cgrp = dentry->d_fsdata;
- struct cgroup_subsys *ss;
+
BUG_ON(!(cgroup_is_removed(cgrp)));
/* It's possible for external users to be holding css
* reference counts on a cgroup; css_put() needs to
@@ -856,31 +895,7 @@ static void cgroup_diput(struct dentry *
* the reference count in order to know if it needs to
* queue the cgroup to be handled by the release
* agent */
- synchronize_rcu();
-
- mutex_lock(&cgroup_mutex);
- /*
- * Release the subsystem state objects.
- */
- for_each_subsys(cgrp->root, ss)
- ss->destroy(ss, cgrp);
-
- cgrp->root->number_of_cgroups--;
- mutex_unlock(&cgroup_mutex);
-
- /*
- * Drop the active superblock reference that we took when we
- * created the cgroup
- */
- deactivate_super(cgrp->root->sb);
-
- /*
- * if we're getting rid of the cgroup, refcount should ensure
- * that there are no pidlists left.
- */
- BUG_ON(!list_empty(&cgrp->pidlists));
-
- call_rcu(&cgrp->rcu_head, free_cgroup_rcu);
+ call_rcu(&cgrp->rcu_head, destroy_cgroup_rcu);
}
iput(inode);
}



2011-04-14 08:44:41

by Mike Galbraith

[permalink] [raw]
Subject: Re: query: [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task

On Thu, 2011-04-14 at 10:34 +0200, Mike Galbraith wrote:
> > > On Wed, 2011-04-13 at 15:16 +0200, Paul Menage wrote:
>
> > > > > + kfree(cgrp);
> > > >
> > > > We might want to punt this through RCU again, in case the subsystem
> > > > destroy() callbacks left anything around that was previously depending
> > > > on the RCU barrier.
>
> Yup, great idea. Two stage teardown/free quenched the fireworks

Correction: made fireworks take longer to ignite. Roughly a minute
after I poked xmit, box exploded. Hohum. BTTODB

-Mike

2011-04-18 14:21:46

by Mike Galbraith

[permalink] [raw]
Subject: Re: query: [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task

On Wed, 2011-04-13 at 15:16 +0200, Paul Menage wrote:

> Also, I'd be concerned that subsystems might get confused by the fact
> that a new group called 'foo' could be created before the old 'foo'
> has been cleaned up?

Decided to try beating on that without my hack. Seems these patches...
patches/cgroup-Set-CGRP_RELEASABLE-when-adding-to-a-group
patches/cgroup-Remove-call-to-synchronize_rcu-in-cgroup_attach_task
...either introduced a bug, or made an existing one easier to hit. With
only those two applied, and several tasks doing - create, attach,
detach, fork off a rmdir - all for the same cgroup, box griped.

With my hack on top, I'd either see rcu workers explode, or list
corruption in cgroup_attach_task(), depending on phase-of-moon.

(hohum, maybe it'll turn out to be a defenseless _baby_ dragon..)

v2.6.39-rc2

[ 5009.808292] general protection fault: 0000 [#1] SMP
[ 5009.812007] last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
[ 5009.812007] CPU 2
[ 5009.812007] Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd nfsd lockd nfs_acl auth_rpcgss sunrpc parport_pc parport bridge stp cpufreq_conservative cpufreq_ondemand microcode cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf nls_iso8859_1 nls_cp437 vfat fat fuse ext3 jbd dm_mod snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm sr_mod sg usb_storage usbmouse cdrom snd_timer usbhid hid firewire_ohci firewire_core snd usb_libusual i2c_i801 soundcore snd_page_alloc e1000e crc_itu_t button ext4 mbcache jbd2 crc16 uhci_hcd ehci_hcd sd_mod rtc_cmos usbcore rtc_core rtc_lib ahci libahci libata scsi_mod fan processor thermal
[ 5009.812007]
[ 5009.812007] Pid: 21021, comm: rmdir Not tainted 2.6.39-smpx #1906 MEDIONPC MS-7502/MS-7502
[ 5009.812007] RIP: 0010:[<ffffffff81079bb5>] [<ffffffff81079bb5>] cgroup_css_sets_empty+0x25/0x50
[ 5009.812007] RSP: 0018:ffff88017dfe5dd8 EFLAGS: 00010246
[ 5009.812007] RAX: dead000000100100 RBX: 0000000000000000 RCX: ffff880228c8ce50
[ 5009.812007] RDX: dead000000100100 RSI: ffff88022db9b900 RDI: ffff880228c8cc00
[ 5009.812007] RBP: ffff88017dfe5dd8 R08: 0000000000000005 R09: ffff88022db9b93c
[ 5009.812007] R10: ffff88017dfe5d18 R11: ffff88017dfe5c84 R12: ffff880228c8cc00
[ 5009.812007] R13: ffff88017dfe5df8 R14: ffff880228c8cc20 R15: ffff8802161cd240
[ 5009.812007] FS: 00007fefea0a1700(0000) GS:ffff88022fd00000(0000) knlGS:0000000000000000
[ 5009.812007] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5009.955702] CR2: 00007fefe9c1acb0 CR3: 000000019107c000 CR4: 00000000000006e0
[ 5009.955702] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5009.955702] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 5009.955702] Process rmdir (pid: 21021, threadinfo ffff88017dfe4000, task ffff8802161cd240)
[ 5009.955702] Stack:
[ 5009.955702] ffff88017dfe5e58 ffffffff8107a514 ffff88022db9b9a0 ffff88022db9badc
[ 5009.955702] 0000000000000000 ffff8802161cd240 ffffffff81057ee0 ffff88017dfe5e10
[ 5009.955702] ffff88017dfe5e10 0000000000000000 0000000000000001 ffff88022db9b900
[ 5009.955702] Call Trace:
[ 5009.955702] [<ffffffff8107a514>] cgroup_rmdir+0x64/0x510
[ 5009.955702] [<ffffffff81057ee0>] ? wake_up_bit+0x40/0x40
[ 5009.955702] [<ffffffff810d46c6>] vfs_rmdir+0x96/0xe0
[ 5009.955702] [<ffffffff810d6313>] do_rmdir+0x103/0x120
[ 5009.955702] [<ffffffff810d6e21>] sys_rmdir+0x11/0x20
[ 5009.955702] [<ffffffff8134db7b>] system_call_fastpath+0x16/0x1b
[ 5009.955702] Code: 84 00 00 00 00 00 55 48 8b 87 50 02 00 00 48 89 e5 48 8d 8f 50 02 00 00 eb 11 0f 1f 40 00 48 8b 40 28 8b 00 85 c0 7f 1e 48 89 d0
[ 5009.955702] 8b 10 48 39 c8 0f 18 0a 75 e8 b8 01 00 00 00 c9 c3 66 0f 1f
[ 5009.955702] RIP [<ffffffff81079bb5>] cgroup_css_sets_empty+0x25/0x50
[ 5009.955702] RSP <ffff88017dfe5dd8>
[ 5010.075058] hpet1: lost 12 rtc interrupts
[ 5010.079153] ---[ end trace df95d1af02295a25 ]---

(gdb) list *cgroup_css_sets_empty+0x25
0xffffffff81079df5 is in cgroup_css_sets_empty (kernel/cgroup.c:3589).
3584 * Must be called with css_set_lock held */
3585 static int cgroup_css_sets_empty(struct cgroup *cgrp)
3586 {
3587 struct cg_cgroup_link *link;
3588
3589 list_for_each_entry(link, &cgrp->css_sets, cgrp_link_list) {
3590 struct css_set *cg = link->cg;
3591 if (atomic_read(&cg->refcount) > 0)
3592 return 0;
3593 }
(gdb)

2011-04-28 09:38:45

by Mike Galbraith

[permalink] [raw]
Subject: Re: query: [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task

On Mon, 2011-04-18 at 16:21 +0200, Mike Galbraith wrote:
> On Wed, 2011-04-13 at 15:16 +0200, Paul Menage wrote:
>
> > Also, I'd be concerned that subsystems might get confused by the fact
> > that a new group called 'foo' could be created before the old 'foo'
> > has been cleaned up?
>
> Decided to try beating on that without my hack. Seems these patches...
> patches/cgroup-Set-CGRP_RELEASABLE-when-adding-to-a-group
> patches/cgroup-Remove-call-to-synchronize_rcu-in-cgroup_attach_task
> ...either introduced a bug, or made an existing one easier to hit. With
> only those two applied, and several tasks doing - create, attach,
> detach, fork off a rmdir - all for the same cgroup, box griped.
>
> With my hack on top, I'd either see rcu workers explode, or list
> corruption in cgroup_attach_task(), depending on phase-of-moon.
>
> (hohum, maybe it'll turn out to be a defenseless _baby_ dragon..)

The explosions are because the logic to snag rmdir() should anyone grab
a reference will let us zip right through and free a cgroup while
there's a destruction in flight. Adding a cgrp->count check before
trying to cgroup_clear_css_refs() prevents the explosions, but that
leaves RCU grace periods still embedded in userspace.

So...

I bent up put_css_set() a bit to try to destroy synchronously on final
put if possible, so rmdir() will only be snagged if that fails. The
thing seems to be working, so I'll show it. Readers (beware) may notice
some gratuitous looking chicken scratches. Just ignore those, and move
along smartly to the suggesting a much better way part, for surely one
must exist.

marge:~ # time sh -c "mkdir /cgroups/foo;echo $$ > /cgroups/foo/tasks;echo $$ > /cgroups/tasks;rmdir /cgroups/foo"

virgin tip

real 0m0.086s
user 0m0.004s
sys 0m0.000s

tip + rocks 'n sticks

real 0m0.006s
user 0m0.000s
sys 0m0.004s

(on top of Android patches)

---
include/linux/cgroup.h | 1
kernel/cgroup.c | 206 ++++++++++++++++++++++++++++++++++---------------
2 files changed, 146 insertions(+), 61 deletions(-)

Index: linux-2.6/kernel/cgroup.c
===================================================================
--- linux-2.6.orig/kernel/cgroup.c
+++ linux-2.6/kernel/cgroup.c
@@ -134,7 +134,7 @@ struct css_id {
* The css to which this ID points. This pointer is set to valid value
* after cgroup is populated. If cgroup is removed, this will be NULL.
* This pointer is expected to be RCU-safe because destroy()
- * is called after synchronize_rcu(). But for safe use, css_is_removed()
+ * is called after an RCU grace period. But for safe use, css_is_removed()
* css_tryget() should be used for avoiding race.
*/
struct cgroup_subsys_state __rcu *css;
@@ -353,7 +353,14 @@ static struct hlist_head *css_set_hash(s
return &css_set_table[index];
}

-static void free_css_set_work(struct work_struct *work)
+static void free_css_set_rcu(struct rcu_head *obj)
+{
+ struct css_set *cg = container_of(obj, struct css_set, rcu_head);
+
+ kfree(cg);
+}
+
+static void destroy_css_set_work(struct work_struct *work)
{
struct css_set *cg = container_of(work, struct css_set, work);
struct cg_cgroup_link *link;
@@ -373,14 +380,14 @@ static void free_css_set_work(struct wor
}
write_unlock(&css_set_lock);

- kfree(cg);
+ call_rcu(&cg->rcu_head, free_css_set_rcu);
}

-static void free_css_set_rcu(struct rcu_head *obj)
+static void destroy_css_set_rcu(struct rcu_head *obj)
{
struct css_set *cg = container_of(obj, struct css_set, rcu_head);

- INIT_WORK(&cg->work, free_css_set_work);
+ INIT_WORK(&cg->work, destroy_css_set_work);
schedule_work(&cg->work);
}

@@ -398,8 +405,13 @@ static inline void get_css_set(struct cs
atomic_inc(&cg->refcount);
}

-static void put_css_set(struct css_set *cg)
+static int cgroup_clear_css_refs(struct cgroup *cgrp);
+
+static void put_css_set(struct css_set *cg, int locked)
{
+ struct cg_cgroup_link *link, *tmp;
+ int sync = locked, owner = 0;
+
/*
* Ensure that the refcount doesn't hit zero while any readers
* can see it. Similar to atomic_dec_and_lock(), but for an
@@ -407,17 +419,53 @@ static void put_css_set(struct css_set *
*/
if (atomic_add_unless(&cg->refcount, -1, 1))
return;
+ if (!locked)
+ sync = owner = mutex_trylock(&cgroup_mutex);
write_lock(&css_set_lock);
if (!atomic_dec_and_test(&cg->refcount)) {
write_unlock(&css_set_lock);
+ if (owner)
+ mutex_unlock(&cgroup_mutex);
return;
}

hlist_del(&cg->hlist);
css_set_count--;

+ /* synchronous destruction requires cgroup mutex. */
+ if (!sync)
+ goto out;
+
+ list_for_each_entry_safe(link, tmp, &cg->cg_links, cg_link_list) {
+ struct cgroup *cgrp = link->cgrp;
+
+ if (!cgroup_clear_css_refs(cgrp)) {
+ sync = 0;
+ goto out;
+ }
+ }
+
+ list_for_each_entry_safe(link, tmp, &cg->cg_links, cg_link_list) {
+ struct cgroup *cgrp = link->cgrp;
+
+ list_del(&link->cg_link_list);
+ list_del(&link->cgrp_link_list);
+ if (atomic_dec_and_test(&cgrp->count)) {
+ check_for_release(cgrp);
+ cgroup_wakeup_rmdir_waiter(cgrp);
+ }
+ kfree(link);
+ }
+
+out:
write_unlock(&css_set_lock);
- call_rcu(&cg->rcu_head, free_css_set_rcu);
+ if (owner)
+ mutex_unlock(&cgroup_mutex);
+
+ if (sync)
+ call_rcu(&cg->rcu_head, free_css_set_rcu);
+ else
+ call_rcu(&cg->rcu_head, destroy_css_set_rcu);
}

/*
@@ -620,7 +668,7 @@ static struct css_set *find_css_set(
struct list_head tmp_cg_links;

struct hlist_head *hhead;
- struct cg_cgroup_link *link;
+ struct cg_cgroup_link *link, *tmp;

/* First see if we already have a cgroup group that matches
* the desired set */
@@ -654,7 +702,7 @@ static struct css_set *find_css_set(

write_lock(&css_set_lock);
/* Add reference counts and links from the new css_set. */
- list_for_each_entry(link, &oldcg->cg_links, cg_link_list) {
+ list_for_each_entry_safe(link, tmp, &oldcg->cg_links, cg_link_list) {
struct cgroup *c = link->cgrp;
if (c->root == cgrp->root)
c = cgrp;
@@ -695,8 +743,8 @@ static struct cgroup *task_cgroup_from_r
if (css == &init_css_set) {
res = &root->top_cgroup;
} else {
- struct cg_cgroup_link *link;
- list_for_each_entry(link, &css->cg_links, cg_link_list) {
+ struct cg_cgroup_link *link, *tmp;
+ list_for_each_entry_safe(link, tmp, &css->cg_links, cg_link_list) {
struct cgroup *c = link->cgrp;
if (c->root == root) {
res = c;
@@ -843,12 +891,50 @@ static void free_cgroup_rcu(struct rcu_h
kfree(cgrp);
}

+static void destroy_cgroup_work(struct work_struct *work)
+{
+ struct cgroup *cgrp = container_of(work, struct cgroup, work);
+ struct cgroup_subsys *ss;
+
+ mutex_lock(&cgroup_mutex);
+ /*
+ * Release the subsystem state objects.
+ */
+ for_each_subsys(cgrp->root, ss)
+ ss->destroy(ss, cgrp);
+
+ cgrp->root->number_of_cgroups--;
+ mutex_unlock(&cgroup_mutex);
+
+ /*
+ * Drop the active superblock reference that we took when we
+ * created the cgroup
+ */
+ deactivate_super(cgrp->root->sb);
+
+ /*
+ * if we're getting rid of the cgroup, refcount should ensure
+ * that there are no pidlists left.
+ */
+ BUG_ON(!list_empty(&cgrp->pidlists));
+
+ call_rcu(&cgrp->rcu_head, free_cgroup_rcu);
+}
+
+static void destroy_cgroup_rcu(struct rcu_head *obj)
+{
+ struct cgroup *cgrp = container_of(obj, struct cgroup, rcu_head);
+
+ INIT_WORK(&cgrp->work, destroy_cgroup_work);
+ schedule_work(&cgrp->work);
+}
+
static void cgroup_diput(struct dentry *dentry, struct inode *inode)
{
+ struct cgroup *cgrp = dentry->d_fsdata;
+
/* is dentry a directory ? if so, kfree() associated cgroup */
if (S_ISDIR(inode->i_mode)) {
- struct cgroup *cgrp = dentry->d_fsdata;
- struct cgroup_subsys *ss;
BUG_ON(!(cgroup_is_removed(cgrp)));
/* It's possible for external users to be holding css
* reference counts on a cgroup; css_put() needs to
@@ -856,31 +942,7 @@ static void cgroup_diput(struct dentry *
* the reference count in order to know if it needs to
* queue the cgroup to be handled by the release
* agent */
- synchronize_rcu();
-
- mutex_lock(&cgroup_mutex);
- /*
- * Release the subsystem state objects.
- */
- for_each_subsys(cgrp->root, ss)
- ss->destroy(ss, cgrp);
-
- cgrp->root->number_of_cgroups--;
- mutex_unlock(&cgroup_mutex);
-
- /*
- * Drop the active superblock reference that we took when we
- * created the cgroup
- */
- deactivate_super(cgrp->root->sb);
-
- /*
- * if we're getting rid of the cgroup, refcount should ensure
- * that there are no pidlists left.
- */
- BUG_ON(!list_empty(&cgrp->pidlists));
-
- call_rcu(&cgrp->rcu_head, free_cgroup_rcu);
+ call_rcu(&cgrp->rcu_head, destroy_cgroup_rcu);
}
iput(inode);
}
@@ -1792,7 +1854,7 @@ int cgroup_attach_task(struct cgroup *cg
* based on its final set of cgroups
*/
newcg = find_css_set(cg, cgrp);
- put_css_set(cg);
+ put_css_set(cg, 1);
if (!newcg) {
retval = -ENOMEM;
goto out;
@@ -1801,7 +1863,7 @@ int cgroup_attach_task(struct cgroup *cg
task_lock(tsk);
if (tsk->flags & PF_EXITING) {
task_unlock(tsk);
- put_css_set(newcg);
+ put_css_set(newcg, 1);
retval = -ESRCH;
goto out;
}
@@ -1820,7 +1882,7 @@ int cgroup_attach_task(struct cgroup *cg
}
set_bit(CGRP_RELEASABLE, &cgrp->flags);
/* put_css_set will not destroy cg until after an RCU grace period */
- put_css_set(cg);
+ put_css_set(cg, 1);

/*
* wake up rmdir() waiter. the rmdir should fail since the cgroup
@@ -2361,12 +2423,14 @@ EXPORT_SYMBOL_GPL(cgroup_add_files);
int cgroup_task_count(const struct cgroup *cgrp)
{
int count = 0;
- struct cg_cgroup_link *link;
+ struct cg_cgroup_link *link, *tmp;

read_lock(&css_set_lock);
- list_for_each_entry(link, &cgrp->css_sets, cgrp_link_list) {
+ rcu_read_lock();
+ list_for_each_entry_safe(link, tmp, &cgrp->css_sets, cgrp_link_list) {
count += atomic_read(&link->cg->refcount);
}
+ rcu_read_unlock();
read_unlock(&css_set_lock);
return count;
}
@@ -3538,10 +3602,16 @@ static int cgroup_clear_css_refs(struct
struct cgroup_subsys *ss;
unsigned long flags;
bool failed = false;
+
local_irq_save(flags);
for_each_subsys(cgrp->root, ss) {
struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
int refcnt;
+
+ /* Someone beat us to it. */
+ if (test_bit(CSS_REMOVED, &css->flags))
+ continue;
+
while (1) {
/* We can only remove a CSS with a refcnt==1 */
refcnt = atomic_read(&css->refcnt);
@@ -3580,17 +3650,20 @@ static int cgroup_clear_css_refs(struct
return !failed;
}

-/* checks if all of the css_sets attached to a cgroup have a refcount of 0.
- * Must be called with css_set_lock held */
+/* checks if all of the css_sets attached to a cgroup have a refcount of 0. */
static int cgroup_css_sets_empty(struct cgroup *cgrp)
{
- struct cg_cgroup_link *link;
+ struct cg_cgroup_link *link, *tmp;

- list_for_each_entry(link, &cgrp->css_sets, cgrp_link_list) {
+ read_lock(&css_set_lock);
+ list_for_each_entry_safe(link, tmp, &cgrp->css_sets, cgrp_link_list) {
struct css_set *cg = link->cg;
- if (atomic_read(&cg->refcount) > 0)
+ if (atomic_read(&cg->refcount) > 0) {
+ read_unlock(&css_set_lock);
return 0;
+ }
}
+ read_unlock(&css_set_lock);

return 1;
}
@@ -3606,7 +3679,8 @@ static int cgroup_rmdir(struct inode *un

/* the vfs holds both inode->i_mutex already */
again:
- mutex_lock(&cgroup_mutex);
+ if (!cgroup_lock_live_group(cgrp))
+ return -ENODEV;
if (!cgroup_css_sets_empty(cgrp)) {
mutex_unlock(&cgroup_mutex);
return -EBUSY;
@@ -3638,15 +3712,22 @@ again:
return ret;
}

- mutex_lock(&cgroup_mutex);
- parent = cgrp->parent;
+ if (!cgroup_lock_live_group(cgrp))
+ return -ENODEV;
if (!cgroup_css_sets_empty(cgrp) || !list_empty(&cgrp->children)) {
clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
mutex_unlock(&cgroup_mutex);
return -EBUSY;
}
+
prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
- if (!cgroup_clear_css_refs(cgrp)) {
+ parent = cgrp->parent;
+
+ /*
+ * Oops, someone grabbed a reference, or there is an asynchronous
+ * final put_css_set() in flight, so we have to wait.
+ */
+ if (atomic_read(&cgrp->count) || !cgroup_clear_css_refs(cgrp)) {
mutex_unlock(&cgroup_mutex);
/*
* Because someone may call cgroup_wakeup_rmdir_waiter() before
@@ -3660,12 +3741,13 @@ again:
return -EINTR;
goto again;
}
- /* NO css_tryget() can success after here. */
finish_wait(&cgroup_rmdir_waitq, &wait);
clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);

- spin_lock(&release_list_lock);
+ /* NO css_tryget() can success after here. */
set_bit(CGRP_REMOVED, &cgrp->flags);
+
+ spin_lock(&release_list_lock);
if (!list_empty(&cgrp->release_list))
list_del_init(&cgrp->release_list);
spin_unlock(&release_list_lock);
@@ -4279,7 +4361,7 @@ void cgroup_exit(struct task_struct *tsk
task_unlock(tsk);

if (cg)
- put_css_set(cg);
+ put_css_set(cg, 0);
}

/**
@@ -4366,7 +4448,7 @@ int cgroup_clone(struct task_struct *tsk
(parent != task_cgroup(tsk, subsys->subsys_id))) {
/* Aargh, we raced ... */
mutex_unlock(&inode->i_mutex);
- put_css_set(cg);
+ put_css_set(cg, 1);

deactivate_super(root->sb);
/* The cgroup is still accessible in the VFS, but
@@ -4392,7 +4474,7 @@ int cgroup_clone(struct task_struct *tsk
mutex_unlock(&inode->i_mutex);

mutex_lock(&cgroup_mutex);
- put_css_set(cg);
+ put_css_set(cg, 1);
mutex_unlock(&cgroup_mutex);
deactivate_super(root->sb);
return ret;
@@ -4904,13 +4986,13 @@ static int current_css_set_cg_links_read
struct cftype *cft,
struct seq_file *seq)
{
- struct cg_cgroup_link *link;
+ struct cg_cgroup_link *link, *tmp;
struct css_set *cg;

read_lock(&css_set_lock);
rcu_read_lock();
cg = rcu_dereference(current->cgroups);
- list_for_each_entry(link, &cg->cg_links, cg_link_list) {
+ list_for_each_entry_safe(link, tmp, &cg->cg_links, cg_link_list) {
struct cgroup *c = link->cgrp;
const char *name;

@@ -4931,10 +5013,11 @@ static int cgroup_css_links_read(struct
struct cftype *cft,
struct seq_file *seq)
{
- struct cg_cgroup_link *link;
+ struct cg_cgroup_link *link, *tmp;

read_lock(&css_set_lock);
- list_for_each_entry(link, &cont->css_sets, cgrp_link_list) {
+ rcu_read_lock();
+ list_for_each_entry_safe(link, tmp, &cont->css_sets, cgrp_link_list) {
struct css_set *cg = link->cg;
struct task_struct *task;
int count = 0;
@@ -4949,6 +5032,7 @@ static int cgroup_css_links_read(struct
}
}
}
+ rcu_read_unlock();
read_unlock(&css_set_lock);
return 0;
}
Index: linux-2.6/include/linux/cgroup.h
===================================================================
--- linux-2.6.orig/include/linux/cgroup.h
+++ linux-2.6/include/linux/cgroup.h
@@ -231,6 +231,7 @@ struct cgroup {

/* For RCU-protected deletion */
struct rcu_head rcu_head;
+ struct work_struct work;

/* List of events which userspace want to receive */
struct list_head event_list;



2011-04-29 12:34:54

by Mike Galbraith

[permalink] [raw]
Subject: Re: query: [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task

(ok crickets, keep the noise down please)

On Thu, 2011-04-28 at 11:38 +0200, Mike Galbraith wrote:

> The explosions are because the logic to snag rmdir() should anyone grab
> a reference will let us zip right through and free a cgroup while
> there's a destruction in flight. Adding a cgrp->count check before
> trying to cgroup_clear_css_refs() prevents the explosions, but that
> leaves RCU grace periods still embedded in userspace.
>
> So...
>
> I bent up put_css_set() a bit to try to destroy synchronously on final
> put if possible, so rmdir() will only be snagged if that fails. The
> thing seems to be working, so I'll show it. Readers (beware) may notice
> some gratuitous looking chicken scratches. Just ignore those, and move
> along smartly to the suggesting a much better way part, for surely one
> must exist.

Hi Self, (*howdy*)

You might try the below. No weird gyrations to accomplish the same
thing, and I see no slub debug gripes, no list debug gripes, nada.

Makes one wonder what these things do for a living.

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 25c7eb5..b8c64bf 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -826,13 +826,6 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
struct cgroup *cgrp = dentry->d_fsdata;
struct cgroup_subsys *ss;
BUG_ON(!(cgroup_is_removed(cgrp)));
- /* It's possible for external users to be holding css
- * reference counts on a cgroup; css_put() needs to
- * be able to access the cgroup after decrementing
- * the reference count in order to know if it needs to
- * queue the cgroup to be handled by the release
- * agent */
- synchronize_rcu();

mutex_lock(&cgroup_mutex);
/*
@@ -1822,7 +1815,6 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
ss->attach(ss, cgrp, oldcgrp, tsk, false);
}
set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
- synchronize_rcu();
put_css_set(cg);

/*

2011-05-02 13:46:44

by Paul E. McKenney

[permalink] [raw]
Subject: Re: query: [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task

On Fri, Apr 29, 2011 at 02:34:47PM +0200, Mike Galbraith wrote:
> (ok crickets, keep the noise down please)
>
> On Thu, 2011-04-28 at 11:38 +0200, Mike Galbraith wrote:
>
> > The explosions are because the logic to snag rmdir() should anyone grab
> > a reference will let us zip right through and free a cgroup while
> > there's a destruction in flight. Adding a cgrp->count check before
> > trying to cgroup_clear_css_refs() prevents the explosions, but that
> > leaves RCU grace periods still embedded in userspace.
> >
> > So...
> >
> > I bent up put_css_set() a bit to try to destroy synchronously on final
> > put if possible, so rmdir() will only be snagged if that fails. The
> > thing seems to be working, so I'll show it. Readers (beware) may notice
> > some gratuitous looking chicken scratches. Just ignore those, and move
> > along smartly to the suggesting a much better way part, for surely one
> > must exist.
>
> Hi Self, (*howdy*)
>
> You might try the below. No weird gyrations to accomplish the same
> thing, and I see no slub debug gripes, no list debug gripes, nada.
>
> Makes one wonder what these things do for a living.

If you are adding something to an RCU-protected data structure, then you do
not need synchronize_rcu(). But if you are removing something from
an RCU-protected data structure, then you really do need them. If you
leave them out, you can see the following type of failure:

1. CPU 0, running in an RCU read-side critical section, obtains
a pointer to data item A.

2. CPU 1 removes data item A from the structure.

3. CPU 1 does not do a synchronize_rcu(). If CPU 1 had done a
synchronize_rcu(), then it would have waited until CPU 0 had
left its RCU read-side critical section, and thus until
CPU 0 stopped using its pointer to data item A. But there was
no synchronize_rcu(), so CPU 0 is still looking at data item A.

4. CPU 1 frees data item A.

This is very bad. CPU 0 has a pointer into the freelist. Worse yet,
some other CPU might allocate memory and get a pointer to data item A.
That CPU and CPU 0 would then have an interesting but short lived
disagreement about that memory's type. Crash goes the kernel.

So please do not remove synchronize_rcu() calls unless you can prove
that it is safe to do so!

Thanx, Paul

> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 25c7eb5..b8c64bf 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -826,13 +826,6 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
> struct cgroup *cgrp = dentry->d_fsdata;
> struct cgroup_subsys *ss;
> BUG_ON(!(cgroup_is_removed(cgrp)));
> - /* It's possible for external users to be holding css
> - * reference counts on a cgroup; css_put() needs to
> - * be able to access the cgroup after decrementing
> - * the reference count in order to know if it needs to
> - * queue the cgroup to be handled by the release
> - * agent */
> - synchronize_rcu();
>
> mutex_lock(&cgroup_mutex);
> /*
> @@ -1822,7 +1815,6 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> ss->attach(ss, cgrp, oldcgrp, tsk, false);
> }
> set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
> - synchronize_rcu();
> put_css_set(cg);
>
> /*
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2011-05-02 14:29:22

by Mike Galbraith

[permalink] [raw]
Subject: Re: query: [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task

On Mon, 2011-05-02 at 06:46 -0700, Paul E. McKenney wrote:
> On Fri, Apr 29, 2011 at 02:34:47PM +0200, Mike Galbraith wrote:

> > Makes one wonder what these things do for a living.
>
> If you are adding something to an RCU-protected data structure, then you do
> not need synchronize_rcu(). But if you are removing something from
> an RCU-protected data structure, then you really do need them. If you
> leave them out, you can see the following type of failure:
>
> 1. CPU 0, running in an RCU read-side critical section, obtains
> a pointer to data item A.
>
> 2. CPU 1 removes data item A from the structure.
>
> 3. CPU 1 does not do a synchronize_rcu(). If CPU 1 had done a
> synchronize_rcu(), then it would have waited until CPU 0 had
> left its RCU read-side critical section, and thus until
> CPU 0 stopped using its pointer to data item A. But there was
> no synchronize_rcu(), so CPU 0 is still looking at data item A.
>
> 4. CPU 1 frees data item A.
>
> This is very bad. CPU 0 has a pointer into the freelist. Worse yet,
> some other CPU might allocate memory and get a pointer to data item A.
> That CPU and CPU 0 would then have an interesting but short lived
> disagreement about that memory's type. Crash goes the kernel.
>
> So please do not remove synchronize_rcu() calls unless you can prove
> that it is safe to do so!

In these instances are a little different.

We have..
start teardown
synchronize_rcu()
finish teardown
call_rcu(kfree_it)
..so removal wouldn't trigger the standard "let's rummage around in
freed memory" kind of excitement.

But yeah, removing them without proof is out.

My box was telling me that they _are_ safe to remove, by not exploding
with list/slub debug enabled while I beat the snot out of it.. which is
evidence, but not proof :)

-Mike

2011-05-02 15:04:15

by Mike Galbraith

[permalink] [raw]
Subject: Re: query: [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task

On Mon, 2011-05-02 at 16:29 +0200, Mike Galbraith wrote:
> On Mon, 2011-05-02 at 06:46 -0700, Paul E. McKenney wrote:
> > On Fri, Apr 29, 2011 at 02:34:47PM +0200, Mike Galbraith wrote:
>
> > > Makes one wonder what these things do for a living.
> >
> > If you are adding something to an RCU-protected data structure, then you do
> > not need synchronize_rcu(). But if you are removing something from
> > an RCU-protected data structure, then you really do need them. If you
> > leave them out, you can see the following type of failure:
> >
> > 1. CPU 0, running in an RCU read-side critical section, obtains
> > a pointer to data item A.
> >
> > 2. CPU 1 removes data item A from the structure.
> >
> > 3. CPU 1 does not do a synchronize_rcu(). If CPU 1 had done a
> > synchronize_rcu(), then it would have waited until CPU 0 had
> > left its RCU read-side critical section, and thus until
> > CPU 0 stopped using its pointer to data item A. But there was
> > no synchronize_rcu(), so CPU 0 is still looking at data item A.
> >
> > 4. CPU 1 frees data item A.
> >
> > This is very bad. CPU 0 has a pointer into the freelist. Worse yet,
> > some other CPU might allocate memory and get a pointer to data item A.
> > That CPU and CPU 0 would then have an interesting but short lived
> > disagreement about that memory's type. Crash goes the kernel.
> >
> > So please do not remove synchronize_rcu() calls unless you can prove
> > that it is safe to do so!
>
> In these instances are a little different.
>
> We have..
> start teardown
> synchronize_rcu()
> finish teardown
> call_rcu(kfree_it)
> ..so removal wouldn't trigger the standard "let's rummage around in
> freed memory" kind of excitement.
>
> But yeah, removing them without proof is out.
>
> My box was telling me that they _are_ safe to remove, by not exploding
> with list/slub debug enabled while I beat the snot out of it.. which is
> evidence, but not proof :)

P.S. the explosions I was looking into were caused by that finish
teardown being in flight via schedule_work() when android removed
synchronize_rcu() _and synchronization on in-flight teardown_. I became
curious wrt the need for synchronize_rcu() at all when I fixed these
explosions by ensuring that teardown was _not_ in flight before
shredding the cgroup via rmdir, by doing synchronous teardown if
possible, and only synchronizing if it wasn't possible. Only removing
synchronize_rcu() does essentially the same, since teardown is then done
synchronously under the big mutex. Freeing is still done via rcu.

So it wasn't "these things make userspace sleepy, let's remove them".

2011-05-02 23:36:46

by Paul E. McKenney

[permalink] [raw]
Subject: Re: query: [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task

On Mon, May 02, 2011 at 05:04:00PM +0200, Mike Galbraith wrote:
> On Mon, 2011-05-02 at 16:29 +0200, Mike Galbraith wrote:
> > On Mon, 2011-05-02 at 06:46 -0700, Paul E. McKenney wrote:
> > > On Fri, Apr 29, 2011 at 02:34:47PM +0200, Mike Galbraith wrote:
> >
> > > > Makes one wonder what these things do for a living.
> > >
> > > If you are adding something to an RCU-protected data structure, then you do
> > > not need synchronize_rcu(). But if you are removing something from
> > > an RCU-protected data structure, then you really do need them. If you
> > > leave them out, you can see the following type of failure:
> > >
> > > 1. CPU 0, running in an RCU read-side critical section, obtains
> > > a pointer to data item A.
> > >
> > > 2. CPU 1 removes data item A from the structure.
> > >
> > > 3. CPU 1 does not do a synchronize_rcu(). If CPU 1 had done a
> > > synchronize_rcu(), then it would have waited until CPU 0 had
> > > left its RCU read-side critical section, and thus until
> > > CPU 0 stopped using its pointer to data item A. But there was
> > > no synchronize_rcu(), so CPU 0 is still looking at data item A.
> > >
> > > 4. CPU 1 frees data item A.
> > >
> > > This is very bad. CPU 0 has a pointer into the freelist. Worse yet,
> > > some other CPU might allocate memory and get a pointer to data item A.
> > > That CPU and CPU 0 would then have an interesting but short lived
> > > disagreement about that memory's type. Crash goes the kernel.
> > >
> > > So please do not remove synchronize_rcu() calls unless you can prove
> > > that it is safe to do so!
> >
> > In these instances are a little different.
> >
> > We have..
> > start teardown
> > synchronize_rcu()
> > finish teardown
> > call_rcu(kfree_it)
> > ..so removal wouldn't trigger the standard "let's rummage around in
> > freed memory" kind of excitement.
> >
> > But yeah, removing them without proof is out.
> >
> > My box was telling me that they _are_ safe to remove, by not exploding
> > with list/slub debug enabled while I beat the snot out of it.. which is
> > evidence, but not proof :)
>
> P.S. the explosions I was looking into were caused by that finish
> teardown being in flight via schedule_work() when android removed
> synchronize_rcu() _and synchronization on in-flight teardown_. I became
> curious wrt the need for synchronize_rcu() at all when I fixed these
> explosions by ensuring that teardown was _not_ in flight before
> shredding the cgroup via rmdir, by doing synchronous teardown if
> possible, and only synchronizing if it wasn't possible. Only removing
> synchronize_rcu() does essentially the same, since teardown is then done
> synchronously under the big mutex. Freeing is still done via rcu.
>
> So it wasn't "these things make userspace sleepy, let's remove them".

OK, but you did have me going for a bit there! ;-)

Thanx, Paul