2014-07-30 14:01:12

by Richard Weinberger

[permalink] [raw]
Subject: [PATCH] vfs: Fix RCU usage in __propagate_umount()

If we use the plain list_empty() we might not see the
hlist_del_init_rcu() and therefore miss one member of the
list.

It fixes the following issue:
$ unshare -m /usr/bin/sleep 10000 &
$ mkdir -p foo/proc
$ mount -t proc none foo/proc
$ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
$ umount -l foo/proc
$ rmdir foo/proc
rmdir: failed to remove ‘foo/proc’: Device or resource busy

rmdir fails because the last entry in the RCU list, "proc", was
not propagated as list_empty() still returned false instead of true.

Signed-off-by: Richard Weinberger <[email protected]>
---
Hi!

Please review this patch with care, the comments in rculist.h
confused me like hell:

First it says:
/*
* Why is there no list_empty_rcu()? Because list_empty() serves this
* purpose. The list_empty() function fetches the RCU-protected pointer
* and compares it to the address of the list head, but neither dereferences
* this pointer itself nor provides this pointer to the caller. Therefore,
* it is not necessary to use rcu_dereference(), so that list_empty() can
* be used anywhere you would want to use a list_empty_rcu().
*/

And later:
/**
* Where are list_empty_rcu() and list_first_entry_rcu()?
*
* Implementing those functions following their counterparts list_empty() and
* list_first_entry() is not advisable because they lead to subtle race
* conditions as the following snippet shows:
*
* if (!list_empty_rcu(mylist)) {
* struct foo *bar = list_first_entry_rcu(mylist, struct foo, list_member);
* do_something(bar);
* }
*
* The list may not be empty when list_empty_rcu checks it, but it may be when
* list_first_entry_rcu rereads the ->next pointer.
*
* Rereading the ->next pointer is not a problem for list_empty() and
* list_first_entry() because they would be protected by a lock that blocks
* writers.
*
* See list_first_or_null_rcu for an alternative.
*/

To my understanding we cannot use list_empty() and have to use list_first_or_null_rcu(),
or am I missing something?

Thanks,
//richard

fs/pnode.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/pnode.c b/fs/pnode.c
index 302bf22..883901c 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -380,7 +380,8 @@ static void __propagate_umount(struct mount *mnt)
* umount the child only if the child has no
* other children
*/
- if (child && list_empty(&child->mnt_mounts)) {
+ if (child && list_first_or_null_rcu(&child->mnt_mounts,
+ struct mount, mnt_mounts)) {
hlist_del_init_rcu(&child->mnt_hash);
hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
}
--
1.8.4.5


2014-07-30 14:20:18

by Richard Weinberger

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix RCU usage in __propagate_umount()

Am 30.07.2014 15:59, schrieb Richard Weinberger:
> If we use the plain list_empty() we might not see the
> hlist_del_init_rcu() and therefore miss one member of the
> list.
>
> It fixes the following issue:
> $ unshare -m /usr/bin/sleep 10000 &
> $ mkdir -p foo/proc
> $ mount -t proc none foo/proc
> $ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
> $ umount -l foo/proc
> $ rmdir foo/proc
> rmdir: failed to remove ‘foo/proc’: Device or resource busy
>
> rmdir fails because the last entry in the RCU list, "proc", was
> not propagated as list_empty() still returned false instead of true.
>
> Signed-off-by: Richard Weinberger <[email protected]>

Please drop this patch, it is wrong. :-\

Thanks,
//richard

> ---
> Hi!
>
> Please review this patch with care, the comments in rculist.h
> confused me like hell:
>
> First it says:
> /*
> * Why is there no list_empty_rcu()? Because list_empty() serves this
> * purpose. The list_empty() function fetches the RCU-protected pointer
> * and compares it to the address of the list head, but neither dereferences
> * this pointer itself nor provides this pointer to the caller. Therefore,
> * it is not necessary to use rcu_dereference(), so that list_empty() can
> * be used anywhere you would want to use a list_empty_rcu().
> */
>
> And later:
> /**
> * Where are list_empty_rcu() and list_first_entry_rcu()?
> *
> * Implementing those functions following their counterparts list_empty() and
> * list_first_entry() is not advisable because they lead to subtle race
> * conditions as the following snippet shows:
> *
> * if (!list_empty_rcu(mylist)) {
> * struct foo *bar = list_first_entry_rcu(mylist, struct foo, list_member);
> * do_something(bar);
> * }
> *
> * The list may not be empty when list_empty_rcu checks it, but it may be when
> * list_first_entry_rcu rereads the ->next pointer.
> *
> * Rereading the ->next pointer is not a problem for list_empty() and
> * list_first_entry() because they would be protected by a lock that blocks
> * writers.
> *
> * See list_first_or_null_rcu for an alternative.
> */
>
> To my understanding we cannot use list_empty() and have to use list_first_or_null_rcu(),
> or am I missing something?
>
> Thanks,
> //richard
>
> fs/pnode.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/pnode.c b/fs/pnode.c
> index 302bf22..883901c 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -380,7 +380,8 @@ static void __propagate_umount(struct mount *mnt)
> * umount the child only if the child has no
> * other children
> */
> - if (child && list_empty(&child->mnt_mounts)) {
> + if (child && list_first_or_null_rcu(&child->mnt_mounts,
> + struct mount, mnt_mounts)) {
> hlist_del_init_rcu(&child->mnt_hash);
> hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
> }
>

--
sigma star gmbh - Bundesstrasse 3 - 6111 Volders - Austria
ATU66964118 - FN 374287y


Attachments:
signature.asc (901.00 B)
OpenPGP digital signature

2014-07-30 18:39:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix RCU usage in __propagate_umount()

On Wed, Jul 30, 2014 at 03:59:16PM +0200, Richard Weinberger wrote:
> If we use the plain list_empty() we might not see the
> hlist_del_init_rcu() and therefore miss one member of the
> list.
>
> It fixes the following issue:
> $ unshare -m /usr/bin/sleep 10000 &
> $ mkdir -p foo/proc
> $ mount -t proc none foo/proc
> $ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
> $ umount -l foo/proc
> $ rmdir foo/proc
> rmdir: failed to remove ‘foo/proc’: Device or resource busy
>
> rmdir fails because the last entry in the RCU list, "proc", was
> not propagated as list_empty() still returned false instead of true.
>
> Signed-off-by: Richard Weinberger <[email protected]>
> ---
> Hi!
>
> Please review this patch with care, the comments in rculist.h
> confused me like hell:

So there are some special cases where list_empty() can be applied to
an RCU-protected list. One such case is where only one task is
permitted to remove from the list (or, equivalently, where removal
is protected by a lock). Then if that task sees !list_empty(), it
knows that the list will remain non-empty because no one else is
removing things.

There are other special cases, but this one gives the general flavor.

Thanx, Paul

> First it says:
> /*
> * Why is there no list_empty_rcu()? Because list_empty() serves this
> * purpose. The list_empty() function fetches the RCU-protected pointer
> * and compares it to the address of the list head, but neither dereferences
> * this pointer itself nor provides this pointer to the caller. Therefore,
> * it is not necessary to use rcu_dereference(), so that list_empty() can
> * be used anywhere you would want to use a list_empty_rcu().
> */
>
> And later:
> /**
> * Where are list_empty_rcu() and list_first_entry_rcu()?
> *
> * Implementing those functions following their counterparts list_empty() and
> * list_first_entry() is not advisable because they lead to subtle race
> * conditions as the following snippet shows:
> *
> * if (!list_empty_rcu(mylist)) {
> * struct foo *bar = list_first_entry_rcu(mylist, struct foo, list_member);
> * do_something(bar);
> * }
> *
> * The list may not be empty when list_empty_rcu checks it, but it may be when
> * list_first_entry_rcu rereads the ->next pointer.
> *
> * Rereading the ->next pointer is not a problem for list_empty() and
> * list_first_entry() because they would be protected by a lock that blocks
> * writers.
> *
> * See list_first_or_null_rcu for an alternative.
> */
>
> To my understanding we cannot use list_empty() and have to use list_first_or_null_rcu(),
> or am I missing something?
>
> Thanks,
> //richard
>
> fs/pnode.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/pnode.c b/fs/pnode.c
> index 302bf22..883901c 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -380,7 +380,8 @@ static void __propagate_umount(struct mount *mnt)
> * umount the child only if the child has no
> * other children
> */
> - if (child && list_empty(&child->mnt_mounts)) {
> + if (child && list_first_or_null_rcu(&child->mnt_mounts,
> + struct mount, mnt_mounts)) {
> hlist_del_init_rcu(&child->mnt_hash);
> hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
> }
> --
> 1.8.4.5
>

2014-07-30 20:46:37

by Richard Weinberger

[permalink] [raw]
Subject: MNT_DETACH and mount namespace issue (was: Re: [PATCH] vfs: Fix RCU usage in __propagate_umount())

Am 30.07.2014 15:59, schrieb Richard Weinberger:
> If we use the plain list_empty() we might not see the
> hlist_del_init_rcu() and therefore miss one member of the
> list.
>
> It fixes the following issue:
> $ unshare -m /usr/bin/sleep 10000 &
> $ mkdir -p foo/proc
> $ mount -t proc none foo/proc
> $ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
> $ umount -l foo/proc
> $ rmdir foo/proc
> rmdir: failed to remove ‘foo/proc’: Device or resource busy

Although my fix was wrong, the issue is real, it seems to exist for a very long
time. Just was able to reproduce it on 2.6.32.
Please note that you need a shared root subtree to trigger the issue.
i.e. mount --shared /
Maybe this is why nobody noticed it so far as only systemd distros
have the root subtree shared by default.

I hit the issue on openSUSE 13.1 where an application creates a chroot environment
and then lazy umounts /proc.
It happened on very few machines. An analysis showed that only boxes with an OpenVPN tunnel
were affected. This did not make any sense until I discovered that the OpenVPN systemd
service file has set "PrivateTmp=true". This setting creates
a mount namespace for the said service...

In __propagate_umount() the following piece of code is interesting:

/*
* umount the child only if the child has no
* other children
*/
if (child && list_empty(&child->mnt_mounts)) {
hlist_del_init_rcu(&child->mnt_hash);
hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
}

child->mnt_mounts is non-empty for the "proc" although the "binfmt_misc"
subtree was removed.
I'm not sure whether this is only one more symptom or the main culprit.

Any ideas?

Thanks,
//richard

2014-07-31 22:17:20

by Richard Weinberger

[permalink] [raw]
Subject: Re: MNT_DETACH and mount namespace issue

Am 30.07.2014 22:46, schrieb Richard Weinberger:
> Am 30.07.2014 15:59, schrieb Richard Weinberger:
>> If we use the plain list_empty() we might not see the
>> hlist_del_init_rcu() and therefore miss one member of the
>> list.
>>
>> It fixes the following issue:
>> $ unshare -m /usr/bin/sleep 10000 &
>> $ mkdir -p foo/proc
>> $ mount -t proc none foo/proc
>> $ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
>> $ umount -l foo/proc
>> $ rmdir foo/proc
>> rmdir: failed to remove ‘foo/proc’: Device or resource busy
>
> Although my fix was wrong, the issue is real, it seems to exist for a very long
> time. Just was able to reproduce it on 2.6.32.
> Please note that you need a shared root subtree to trigger the issue.
> i.e. mount --shared /
> Maybe this is why nobody noticed it so far as only systemd distros
> have the root subtree shared by default.
>
> I hit the issue on openSUSE 13.1 where an application creates a chroot environment
> and then lazy umounts /proc.
> It happened on very few machines. An analysis showed that only boxes with an OpenVPN tunnel
> were affected. This did not make any sense until I discovered that the OpenVPN systemd
> service file has set "PrivateTmp=true". This setting creates
> a mount namespace for the said service...
>
> In __propagate_umount() the following piece of code is interesting:
>
> /*
> * umount the child only if the child has no
> * other children
> */
> if (child && list_empty(&child->mnt_mounts)) {
> hlist_del_init_rcu(&child->mnt_hash);
> hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
> }
>
> child->mnt_mounts is non-empty for the "proc" although the "binfmt_misc"
> subtree was removed.
> I'm not sure whether this is only one more symptom or the main culprit.

CC'ing Ram Pai.

Ram, you are the author of the said code. Can you please explain why we need that
list_empty() check?
To my (limited) understanding of VFS, the following change should be fine to fix the issue:

diff --git a/fs/pnode.c b/fs/pnode.c
index 302bf22..627b35f 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -376,11 +376,7 @@ static void __propagate_umount(struct mount *mnt)

struct mount *child = __lookup_mnt_last(&m->mnt,
mnt->mnt_mountpoint);
- /*
- * umount the child only if the child has no
- * other children
- */
- if (child && list_empty(&child->mnt_mounts)) {
+ if (child) {
hlist_del_init_rcu(&child->mnt_hash);
hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
}

Thanks,
//richard