LinuxLists.cc - cgroup: rmdir() does not complete

2010-08-26 16:34:41

Subject: cgroup: rmdir() does not complete

I am experiencing hung tasks when trying to rmdir() on a cgroup. One task
spins, others queue up behind it with the following:

INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
soaked-cgrou D ffff8800058157c0 0 27257 29411 0x00000000
ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
Call Trace:
[<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
[<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
[<ffffffff81108a7c>] ? path_put+0x1d/0x22
[<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
[<ffffffff81427c4f>] mutex_lock+0x31/0x4b
[<ffffffff8110bdf8>] do_rmdir+0x74/0x102
[<ffffffff8110bebd>] sys_rmdir+0x11/0x13
[<ffffffff81009b02>] system_call_fastpath+0x16/0x1b

Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no
tasks.

Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to
the rmdir. It looks like what I am seeing here and indicates that some
cgroup subsystem is busy, indefinitely.

I have not worked out how to reproduce it quickly. My only method is to
run a 'dd' command to completion in the cgroup, but the problem is so rare
that progress is slow.

Documentation/cgroup.memory.txt describes how force_empty can be required
in some cases. Does this mean that with the patch above, these cases will
now spin on rmdir(), instead of returning -EBUSY? How can I produce a
reliable test case requiring memory.force_empty to be used, to test this?

Or is it likely to be some other cause, and how best to find it?

Thanks

--
Mark

2010-08-27 01:04:33

by Daisuke Nishimura

Subject: Re: cgroup: rmdir() does not complete

Hi.

On Thu, 26 Aug 2010 16:51:55 +0100 (BST)
Mark Hills <[email protected]> wrote:

> I am experiencing hung tasks when trying to rmdir() on a cgroup. One task
> spins, others queue up behind it with the following:
>
> INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> soaked-cgrou D ffff8800058157c0 0 27257 29411 0x00000000
> ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
> 0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
> ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
> Call Trace:
> [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
> [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
> [<ffffffff81108a7c>] ? path_put+0x1d/0x22
> [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
> [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
> [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
> [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
> [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
>
> Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no
> tasks.
>
> Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to
> the rmdir. It looks like what I am seeing here and indicates that some
> cgroup subsystem is busy, indefinitely.
>
That commit caused a bug in rmdir, but it was fixed by commit 88703267.
The fix was merged in 2.6.31, so it seems that you hit a new one...

> I have not worked out how to reproduce it quickly. My only method is to
> run a 'dd' command to completion in the cgroup, but the problem is so rare
> that progress is slow.
>
> Documentation/cgroup.memory.txt describes how force_empty can be required
> in some cases. Does this mean that with the patch above, these cases will
> now spin on rmdir(), instead of returning -EBUSY? How can I produce a
> reliable test case requiring memory.force_empty to be used, to test this?
>
You don't need to touch "force_empty". rmdir() does what "force_empty" does.

> Or is it likely to be some other cause, and how best to find it?
>
Which cgroup subsystems were mounted on the hierarchy containing the
directory you first tried to rmdir()?
If you mounted several subsystems on the same hierarchy, can you mount them
separately to narrow down the cause?
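To see which subsystems currently share a hierarchy before remounting them, a sketch like the following could help. It assumes cgroup v1 mounts appear in /proc/mounts with filesystem type `cgroup`; the function name `cgroup_hierarchies` is hypothetical. Controllers are recognised by cross-checking each mount option against the subsystem names listed in /proc/cgroups.

```shell
#!/bin/sh
# Hypothetical helper: print each cgroup (v1) mount point followed by the
# controllers bound to it, e.g. "/cgroups memory,blkio".
# Arguments default to the live /proc files; fixture files can be passed.
cgroup_hierarchies() {
    mounts=${1:-/proc/mounts}
    cgroups=${2:-/proc/cgroups}
    awk '
        # First file (/proc/cgroups): remember every subsystem name,
        # skipping the "#subsys_name ..." header line.
        FNR == NR { if ($1 !~ /^#/) known[$1] = 1; next }
        # Second file (/proc/mounts): field 3 is the filesystem type,
        # field 4 the comma-separated mount options.
        $3 == "cgroup" {
            n = split($4, opts, ",")
            ctrls = ""
            for (i = 1; i <= n; i++)
                if (opts[i] in known)
                    ctrls = ctrls (ctrls == "" ? "" : ",") opts[i]
            print $2, ctrls
        }
    ' "$cgroups" "$mounts"
}
```

If several controllers show up on one line, splitting them means unmounting that hierarchy first; each controller could then be mounted on its own, e.g. `mount -t cgroup -o memory none /cgroups/memory` (mount point illustrative).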

Thanks,
Daisuke Nishimura.

2010-08-27 01:20:50

by Balbir Singh

Subject: Re: cgroup: rmdir() does not complete

On Fri, Aug 27, 2010 at 6:26 AM, Daisuke Nishimura
<[email protected]> wrote:
> Hi.
>
> On Thu, 26 Aug 2010 16:51:55 +0100 (BST)
> Mark Hills <[email protected]> wrote:
>
>> I am experiencing hung tasks when trying to rmdir() on a cgroup. One task
>> spins, others queue up behind it with the following:
>>
>> INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> soaked-cgrou D ffff8800058157c0 0 27257 29411 0x00000000
>> ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
>> 0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
>> ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
>> Call Trace:
>> [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
>> [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
>> [<ffffffff81108a7c>] ? path_put+0x1d/0x22
>> [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
>> [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
>> [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
>> [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
>> [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
>>
>> Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no
>> tasks.
>>
>> Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to
>> the rmdir. It looks like what I am seeing here and indicates that some
>> cgroup subsystem is busy, indefinitely.
>>
> That commit caused a bug in rmdir, but it was fixed by commit 88703267.
> The fix was merged in 2.6.31, so it seems that you hit a new one...
>
>> I have not worked out how to reproduce it quickly. My only method is to
>> run a 'dd' command to completion in the cgroup, but the problem is so rare
>> that progress is slow.
>>
>> Documentation/cgroup.memory.txt describes how force_empty can be required
>> in some cases. Does this mean that with the patch above, these cases will
>> now spin on rmdir(), instead of returning -EBUSY? How can I produce a
>> reliable test case requiring memory.force_empty to be used, to test this?
>>
> You don't need to touch "force_empty". rmdir() does what "force_empty" does.
>
>> Or is it likely to be some other cause, and how best to find it?
>>
> Which cgroup subsystems were mounted on the hierarchy containing the
> directory you first tried to rmdir()?
> If you mounted several subsystems on the same hierarchy, can you mount them
> separately to narrow down the cause?
>

It would also be nice to see what your mounted cgroup (filesystem
perspective) looks like and what /proc/cgroups looks like when the
problem occurs.
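Alongside `cat /proc/cgroups`, the filesystem view could be captured by walking a hierarchy and counting the tasks in each cgroup, which would also show whether the cgroup being removed is really empty. This is a sketch assuming cgroup v1, where every cgroup directory contains a `tasks` file; `cgroup_task_counts` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: list every cgroup under a hierarchy root together
# with the number of task IDs in its "tasks" file, one cgroup per line.
cgroup_task_counts() {
    root=${1:?usage: cgroup_task_counts HIERARCHY_ROOT}
    # Byte-order (C locale) sorting puts each parent before its children.
    find "$root" -type d | LC_ALL=C sort | while IFS= read -r dir; do
        # Directories without a tasks file are not cgroups; skip them.
        [ -f "$dir/tasks" ] || continue
        printf '%s %s\n' "$dir" "$(wc -l < "$dir/tasks" | tr -d ' ')"
    done
}
```

For example: `cat /proc/cgroups; cgroup_task_counts /cgroups` (the path /cgroups is only illustrative).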

Balbir

2010-08-27 01:30:28

by Kamezawa Hiroyuki

Subject: Re: cgroup: rmdir() does not complete

On Thu, 26 Aug 2010 16:51:55 +0100 (BST)
Mark Hills <[email protected]> wrote:

> I am experiencing hung tasks when trying to rmdir() on a cgroup. One task
> spins, others queue up behind it with the following:
>
> INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> soaked-cgrou D ffff8800058157c0 0 27257 29411 0x00000000
> ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
> 0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
> ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
> Call Trace:
> [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
> [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
> [<ffffffff81108a7c>] ? path_put+0x1d/0x22
> [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
> [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
> [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
> [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
> [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
>
> Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no
> tasks.
>
> Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to
> the rmdir. It looks like what I am seeing here and indicates that some
> cgroup subsystem is busy, indefinitely.
>

Hmm. Does it really spin, or is it sleeping forever without waking up?

> I have not worked out how to reproduce it quickly. My only method is to
> run a 'dd' command to completion in the cgroup, but the problem is so rare
> that progress is slow.
>
Please show how you reproduce it.
And which cgroups are mounted? Only the memory cgroup?

> Documentation/cgroup.memory.txt describes how force_empty can be required
> in some cases.

Ah, maybe that text is wrong. rmdir() calls force_empty automatically.

> Does this mean that with the patch above, these cases will
> now spin on rmdir(), instead of returning -EBUSY? How can produce a
> reliable test case requiring memory.force_empty to be used, to test this?
>

Hmm. I'm not sure whether the Fedora kernel has its own features beyond the stock kernel.
I'd be glad if you could check whether it can happen on the stock 2.6.35 kernel.

> Or is it likely to be some other cause, and how best to find it?
>

At first glance, the mutex above is the one in do_rmdir(), not in kernel/cgroup.c,
so rmdir doesn't seem to reach the cgroup code at all...
Are you doing another operation on the directory while rmdir is called?
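To see which tasks are blocked, a sketch like the following lists every task currently in uninterruptible sleep (state D). It assumes the Linux /proc/PID/stat layout, where the state letter follows the parenthesised command name; `blocked_tasks` is a hypothetical helper, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: print "PID COMM" for every task whose state is D
# (uninterruptible sleep). The proc root can be overridden for testing.
blocked_tasks() {
    proc=${1:-/proc}
    for f in "$proc"/[0-9]*/stat; do
        # Tasks may exit between the glob and the read; ignore those.
        line=$(cat "$f" 2>/dev/null) || continue
        # COMM may contain spaces or ")", so cut after the last ") ".
        rest=${line##*") "}
        [ "${rest%% *}" = D ] || continue
        pid=${f#"$proc"/}
        pid=${pid%/stat}
        comm=${line#*"("}
        comm=${comm%")"*}
        printf '%s %s\n' "$pid" "$comm"
    done
}
```

As root, `cat /proc/PID/stack` (on kernels built with CONFIG_STACKTRACE) then shows where each listed task is waiting, which would distinguish the mutex in do_rmdir() from code in kernel/cgroup.c; `echo w > /proc/sysrq-trigger` dumps all blocked tasks to the kernel log in one go.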

Thanks,
-Kame

2010-08-27 02:40:05

by Kamezawa Hiroyuki

Subject: Re: cgroup: rmdir() does not complete

On Fri, 27 Aug 2010 09:56:39 +0900
Daisuke Nishimura <[email protected]> wrote:

> > Or is it likely to be some other cause, and how best to find it?
> >
> Which cgroup subsystems were mounted on the hierarchy containing the
> directory you first tried to rmdir()?
> If you mounted several subsystems on the same hierarchy, can you mount them
> separately to narrow down the cause?
>

It seems I can reproduce the issue on mmotm-0811, too.

Try this.

Here, memory cgroup is mounted at /cgroups.
==
#!/bin/bash -x

while sleep 1; do
date
mkdir /cgroups/test
echo 0 > /cgroups/test/tasks
echo 300M > /cgroups/test/memory.limit_in_bytes
cat /proc/self/cgroup
dd if=/dev/zero of=./tmpfile bs=4096 count=100000
echo 0 > /cgroups/tasks
cat /proc/self/cgroup
rmdir /cgroups/test
rm ./tmpfile
done
==

It hangs at rmdir. I'm now investigating force_empty.

Thanks,
-Kame

2010-08-27 03:41:46

by Daisuke Nishimura

Subject: Re: cgroup: rmdir() does not complete

On Fri, 27 Aug 2010 11:35:06 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Fri, 27 Aug 2010 09:56:39 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > > Or is it likely to be some other cause, and how best to find it?
> > >
> > Which cgroup subsystems were mounted on the hierarchy containing the
> > directory you first tried to rmdir()?
> > If you mounted several subsystems on the same hierarchy, can you mount them
> > separately to narrow down the cause?
> >
>
> It seems I can reproduce the issue on mmotm-0811, too.
>
> Try this.
>
> Here, memory cgroup is mounted at /cgroups.
> ==
> #!/bin/bash -x
>
> while sleep 1; do
> date
> mkdir /cgroups/test
> echo 0 > /cgroups/test/tasks
> echo 300M > /cgroups/test/memory.limit_in_bytes
> cat /proc/self/cgroup
> dd if=/dev/zero of=./tmpfile bs=4096 count=100000
> echo 0 > /cgroups/tasks
> cat /proc/self/cgroup
> rmdir /cgroups/test
> rm ./tmpfile
> done
> ==
>
> It hangs at rmdir. I'm now investigating force_empty.
>
Thank you very much for your information.

Some questions.

Is "tmpfile" created on a normal filesystem (e.g. ext3) or on tmpfs?
And how long does it usually take to trigger this problem?
I've run it on a RHEL6-based kernel with ext3 for about an hour, but
I haven't been able to reproduce it yet.

Thanks,
Daisuke Nishimura.

2010-08-27 05:47:41

by Kamezawa Hiroyuki

Subject: Re: cgroup: rmdir() does not complete

On Fri, 27 Aug 2010 12:39:48 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Fri, 27 Aug 2010 11:35:06 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Fri, 27 Aug 2010 09:56:39 +0900
> > Daisuke Nishimura <[email protected]> wrote:
> >
> > > > Or is it likely to be some other cause, and how best to find it?
> > > >
> > > Which cgroup subsystems were mounted on the hierarchy containing the
> > > directory you first tried to rmdir()?
> > > If you mounted several subsystems on the same hierarchy, can you mount them
> > > separately to narrow down the cause?
> > >
> >
> > It seems I can reproduce the issue on mmotm-0811, too.
> >
> > Try this.
> >
> > Here, memory cgroup is mounted at /cgroups.
> > ==
> > #!/bin/bash -x
> >
> > while sleep 1; do
> > date
> > mkdir /cgroups/test
> > echo 0 > /cgroups/test/tasks
> > echo 300M > /cgroups/test/memory.limit_in_bytes
> > cat /proc/self/cgroup
> > dd if=/dev/zero of=./tmpfile bs=4096 count=100000
> > echo 0 > /cgroups/tasks
> > cat /proc/self/cgroup
> > rmdir /cgroups/test
> > rm ./tmpfile
> > done
> > ==
> >
> > It hangs at rmdir. I'm now investigating force_empty.
> >
> Thank you very much for your information.
>
> Some questions.
>
> Is "tmpfile" created on a normal filesystem (e.g. ext3) or on tmpfs?
On ext4.

> And how long does it usually take to trigger this problem?

Very soon: within 10-20 loops.

> I've run it on a RHEL6-based kernel with ext3 for about an hour, but
> I haven't been able to reproduce it yet.
>

Hmm... I'll dig more. Maybe I need to use the stock kernel rather than -mm...

Thanks,
-Kame

2010-08-27 06:35:00

by Kamezawa Hiroyuki

Subject: Re: cgroup: rmdir() does not complete

On Fri, 27 Aug 2010 14:42:25 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> > I've run it on a RHEL6-based kernel with ext3 for about an hour, but
> > I haven't been able to reproduce it yet.
> >
>
> Hmm... I'll dig more. Maybe I need to use the stock kernel rather than -mm...
>
>
Sorry, my test only hangs on -mm + (other patches);
there are no problems on 2.6.34 or 2.6.36-rc1.

Where can I find the 2.6.33.6 (Fedora) kernel?

Thanks,
-Kame

2010-08-30 07:32:11

by Balbir Singh

Subject: Re: cgroup: rmdir() does not complete

* KAMEZAWA Hiroyuki <[email protected]> [2010-08-27 15:29:58]:

> On Fri, 27 Aug 2010 14:42:25 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > > I've run it on a RHEL6-based kernel with ext3 for about an hour, but
> > > I haven't been able to reproduce it yet.
> > >
> >
> > Hmm... I'll dig more. Maybe I need to use the stock kernel rather than -mm...
> >
> >
> Sorry, my test only hangs on -mm + (other patches);
> there are no problems on 2.6.34 or 2.6.36-rc1.
>
> Where can I find the 2.6.33.6 (Fedora) kernel?
>

You can get the SRPM from the mirrors, one place to find it would be

http://download.fedora.redhat.com/pub/fedora/linux/updates/13/SRPMS/

--
Three Cheers,
Balbir

2010-08-30 09:13:21

by Mark Hills

Subject: Re: cgroup: rmdir() does not complete

On Fri, 27 Aug 2010, KAMEZAWA Hiroyuki wrote:

> On Fri, 27 Aug 2010 12:39:48 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > On Fri, 27 Aug 2010 11:35:06 +0900
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> >
> > > On Fri, 27 Aug 2010 09:56:39 +0900
> > > Daisuke Nishimura <[email protected]> wrote:
> > >
> > > > > Or is it likely to be some other cause, and how best to find it?
> > > > >
> > > > Which cgroup subsystems were mounted on the hierarchy containing the
> > > > directory you first tried to rmdir()?
> > > > If you mounted several subsystems on the same hierarchy, can you mount them
> > > > separately to narrow down the cause?
> > > >
> > >
> > > It seems I can reproduce the issue on mmotm-0811, too.
> > >
> > > Try this.
> > >
> > > Here, memory cgroup is mounted at /cgroups.
> > > ==
> > > #!/bin/bash -x
> > >
> > > while sleep 1; do
> > > date
> > > mkdir /cgroups/test
> > > echo 0 > /cgroups/test/tasks
> > > echo 300M > /cgroups/test/memory.limit_in_bytes
> > > cat /proc/self/cgroup
> > > dd if=/dev/zero of=./tmpfile bs=4096 count=100000
> > > echo 0 > /cgroups/tasks
> > > cat /proc/self/cgroup
> > > rmdir /cgroups/test
> > > rm ./tmpfile
> > > done
> > > ==
> > >
> > > It hangs at rmdir. I'm now investigating force_empty.
> > >
> > Thank you very much for your information.
> >
> > Some questions.
> >
> > Is "tmpfile" created on a normal filesystem (e.g. ext3) or on tmpfs?
> On ext4.
>
> > And how long does it usually take to trigger this problem?
>
> Very soon: within 10-20 loops.

The test case I was running is similar to the above. With the Lustre
filesystem the problem takes 4 hours or more to show itself. Recently I
ran 4 threads for over 24 hours without it being seen -- I suspect some
external factor is involved.

I also tried NFS, and did not see a problem after 8 hours or so, but this
is inconclusive.

Using the Fedora kernel and the Lustre filesystem is not a satisfactory
way to trace the bug. Until I have a test case that is more readily
reproducible, I'm not able to reasonably think about changing variables.

It is interesting you see the problem so readily on ext4; I will test that
soon (it is currently a holiday weekend in the UK). I hope it will give me
the test case I am looking for.

Thanks

--
Mark

2010-08-30 09:25:57

by Mark Hills

Subject: Re: cgroup: rmdir() does not complete

On Fri, 27 Aug 2010, KAMEZAWA Hiroyuki wrote:

> On Thu, 26 Aug 2010 16:51:55 +0100 (BST)
> Mark Hills <[email protected]> wrote:
>
> > I am experiencing hung tasks when trying to rmdir() on a cgroup. One task
> > spins, others queue up behind it with the following:
> >
> > INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > soaked-cgrou D ffff8800058157c0 0 27257 29411 0x00000000
> > ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
> > 0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
> > ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
> > Call Trace:
> > [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
> > [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
> > [<ffffffff81108a7c>] ? path_put+0x1d/0x22
> > [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
> > [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
> > [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
> > [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
> > [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
> >
> > Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no
> > tasks.
> >
> > Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to
> > the rmdir. It looks like what I am seeing here and indicates that some
> > cgroup subsystem is busy, indefinitely.
> >
>
> Hmm. Does it really spin, or is it sleeping forever without waking up?

It sleeps in D state but periodically enters the interruptible state, which
is why my attention was drawn to that loop.
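Whether a task keeps waking up or stays asleep could be checked by sampling its state letter once a second. A sketch, assuming the Linux /proc/PID/stat layout (state letter right after the parenthesised command name); `task_state` and `sample_states` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: print the state letter (R, S, D, ...) of one task.
task_state() {
    line=$(cat "${2:-/proc}/$1/stat") || return 1
    # The state follows the parenthesised COMM field.
    rest=${line##*") "}
    printf '%s\n' "${rest%% *}"
}

# Hypothetical helper: print a task's state COUNT times, one second apart.
# A mix of D and S would point to a wait loop that keeps waking up.
sample_states() {
    pid=$1 count=${2:-10} proc=${3:-/proc}
    while [ "$count" -gt 0 ]; do
        task_state "$pid" "$proc" || return 1
        count=$((count - 1))
        if [ "$count" -gt 0 ]; then sleep 1; fi
    done
}
```

For example, `sample_states 27257 30` would take 30 samples of the PID from the hung-task report. A steady stream of D is consistent with a task that is never woken, although wake-ups shorter than the sampling interval would be missed.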

> > I have not worked out how to reproduce it quickly. My only method is to
> > run a 'dd' command to completion in the cgroup, but the problem is so rare
> > that progress is slow.
> >
> Please show how you reproduce it.

I use a C program which creates a container and places itself in the
container, then forks a dd process.

But it seems you found an easier test case; I hope to test that soon.

> And which cgroups are mounted? Only the memory cgroup?

Quite a few: memory, blkio, cpuacct, cpuset.

Until I can get a more reproducible test case (see my previous mail), I
can't really reduce this.

> > Documentation/cgroup.memory.txt describes how force_empty can be required
> > in some cases.
>
> Ah, maybe that text is wrong. rmdir() calls force_empty automatically.
>
> > Does this mean that with the patch above, these cases will
> > now spin on rmdir(), instead of returning -EBUSY? How can produce a
> > reliable test case requiring memory.force_empty to be used, to test this?
> >
>
> Hmm. I'm not sure whether the Fedora kernel has its own features beyond the stock kernel.
> I'd be glad if you could check whether it can happen on the stock 2.6.35 kernel.
>
> > Or is it likely to be some other cause, and how best to find it?
> >
>
> At first glance, the mutex above is the one in do_rmdir(), not in kernel/cgroup.c,
> so rmdir doesn't seem to reach the cgroup code at all...

Interesting. I checked for that, but I'm not sure how I missed it. There is
clearly a mutex lock in do_rmdir() in fs/namei.c.

> Are you doing another operation on the directory while rmdir is called?

In one case I did an 'ls -l' on the filesystem, which coincided with a
lockup, but I was not able to reproduce this.

--
Mark