2014-08-29 16:27:54

by Trond Myklebust

[permalink] [raw]
Subject: VFS regression: commit aba809cf0944 breaks MNT_SHRINKABLE automounted partitions

Hi Al,

Our internal testing shows that commit aba809cf0944 (namespace.c: get
rid of mnt_ghosts) appears to break the NFS automounter functionality.
When unmounting, the user now gets a permanent -EBUSY, presumably due
to a refcounting issue with the patch.

The reproducer is as follows:

On the server, set up an export of the form:

% showmount -e 192.168.214.128
Export list for 192.168.214.128:
/export *

On the client:

[fedora19@~]$sudo mount 192.168.214.128:/ /mnt/
[fedora19@~]$df
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 38G 15G 24G 38% /
devtmpfs 229M 0 229M 0% /dev
tmpfs 242M 80K 242M 1% /dev/shm
tmpfs 242M 1.5M 241M 1% /run
tmpfs 242M 0 242M 0% /sys/fs/cgroup
tmpfs 242M 4.0K 242M 1% /tmp
/dev/sda1 1.5G 248M 1.3G 17% /boot
192.168.214.128:/ 28G 25G 1.1G 96% /mnt
[fedora19@~]$ls /mnt/export/
[fedora19@~]$df
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 38G 15G 24G 38% /
devtmpfs 229M 0 229M 0% /dev
tmpfs 242M 80K 242M 1% /dev/shm
tmpfs 242M 1.5M 241M 1% /run
tmpfs 242M 0 242M 0% /sys/fs/cgroup
tmpfs 242M 4.0K 242M 1% /tmp
/dev/sda1 1.5G 248M 1.3G 17% /boot
192.168.214.128:/ 28G 25G 1.1G 96% /mnt
192.168.214.128:/export 28G 25G 1.1G 96% /mnt/export
[fedora19@~]$sudo umount /mnt
umount.nfs4: /mnt: device is busy
[fedora19@~]$df
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 38G 15G 24G 38% /
devtmpfs 229M 0 229M 0% /dev
tmpfs 242M 80K 242M 1% /dev/shm
tmpfs 242M 1.5M 241M 1% /run
tmpfs 242M 0 242M 0% /sys/fs/cgroup
tmpfs 242M 4.0K 242M 1% /tmp
/dev/sda1 1.5G 248M 1.3G 17% /boot
192.168.214.128:/ 28G 25G 1.1G 96% /mnt

Reverting commit aba809cf0944 fixes the problem for us on a 3.14.15
client kernel.
(Thanks to Tao Peng for providing the above reproducer and for bisecting.)

Cheers
Trond
--
Trond Myklebust

Linux NFS client maintainer, PrimaryData

[email protected]


2014-08-29 20:58:19

by Al Viro

[permalink] [raw]
Subject: Re: VFS regression: commit aba809cf0944 breaks MNT_SHRINKABLE automounted partitions

On Fri, Aug 29, 2014 at 12:27:53PM -0400, Trond Myklebust wrote:
> Hi Al,
>
> Our internal testing shows that commit aba809cf0944 (namespace.c: get
> rid of mnt_ghosts) appears to break the NFS automounter functionality.
> When unmounting, the user now gets a permanent -EBUSY, presumably due
> to a refcounting issue with the patch.
>
> The reproducer is as follows:
>
> On the server, set up an export of the form:
>
> % showmount -e 192.168.214.128
> Export list for 192.168.214.128:
> /export *
>
> On the client:
>
> [fedora19@~]$sudo mount 192.168.214.128:/ /mnt/
> [fedora19@~]$df
> Filesystem Size Used Avail Use% Mounted on
> /dev/mapper/fedora-root 38G 15G 24G 38% /
> devtmpfs 229M 0 229M 0% /dev
> tmpfs 242M 80K 242M 1% /dev/shm
> tmpfs 242M 1.5M 241M 1% /run
> tmpfs 242M 0 242M 0% /sys/fs/cgroup
> tmpfs 242M 4.0K 242M 1% /tmp
> /dev/sda1 1.5G 248M 1.3G 17% /boot
> 192.168.214.128:/ 28G 25G 1.1G 96% /mnt
> [fedora19@~]$ls /mnt/export/
> [fedora19@~]$df
> Filesystem Size Used Avail Use% Mounted on
> /dev/mapper/fedora-root 38G 15G 24G 38% /
> devtmpfs 229M 0 229M 0% /dev
> tmpfs 242M 80K 242M 1% /dev/shm
> tmpfs 242M 1.5M 241M 1% /run
> tmpfs 242M 0 242M 0% /sys/fs/cgroup
> tmpfs 242M 4.0K 242M 1% /tmp
> /dev/sda1 1.5G 248M 1.3G 17% /boot
> 192.168.214.128:/ 28G 25G 1.1G 96% /mnt
> 192.168.214.128:/export 28G 25G 1.1G 96% /mnt/export
> [fedora19@~]$sudo umount /mnt
> umount.nfs4: /mnt: device is busy
> [fedora19@~]$df
> Filesystem Size Used Avail Use% Mounted on
> /dev/mapper/fedora-root 38G 15G 24G 38% /
> devtmpfs 229M 0 229M 0% /dev
> tmpfs 242M 80K 242M 1% /dev/shm
> tmpfs 242M 1.5M 241M 1% /run
> tmpfs 242M 0 242M 0% /sys/fs/cgroup
> tmpfs 242M 4.0K 242M 1% /tmp
> /dev/sda1 1.5G 248M 1.3G 17% /boot
> 192.168.214.128:/ 28G 25G 1.1G 96% /mnt
>
> Reverting commit aba809cf0944 fixes the problem for us on a 3.14.15
> client kernel.

Can't reproduce on a debian client with 3.12 kernel (which contains the
commit in question); it'll take a bit to get Fedora setup - I have it as KVM
image, but it's sandboxed to hell and back and certainly not allowed to play
with NFS traffic...

In the meanwhile, if you have a reproducer setup... Could you verify that
fuser -m /mnt does come empty? Another thing: does mount --make-rprivate /
done before that umount /mnt change behaviour? IOW, I wonder if it could
be affected by shared-subtree setup; that might be the relevant difference
between the working box and F19 one in this case. Could you at least post
the contents of /proc/*/mountinfo from all namespaces? Output of
grep . `md5sum /proc/*/mountinfo|sort|uniq -w32|awk '{print $2}'`
would do...

2014-08-31 00:32:09

by Peng Tao

[permalink] [raw]
Subject: Re: VFS regression: commit aba809cf0944 breaks MNT_SHRINKABLE automounted partitions

On Sun, Aug 31, 2014 at 7:27 AM, Al Viro <[email protected]> wrote:
> Could you check if vfs.git#for-linus fixes that crap?
The related commits are the top two of vfs.git#for-linus, right? I
cherry-picked them to 3.14.15 and they fixed the bug instantly. Thank
you for the quick fix!

Cheers,
Tao

2014-08-30 17:36:40

by Al Viro

[permalink] [raw]
Subject: Re: VFS regression: commit aba809cf0944 breaks MNT_SHRINKABLE automounted partitions

On Sat, Aug 30, 2014 at 08:38:56AM +0800, Peng Tao wrote:
> On Sat, Aug 30, 2014 at 7:01 AM, Al Viro <[email protected]> wrote:
> > On Fri, Aug 29, 2014 at 05:47:58PM -0400, Trond Myklebust wrote:
> >
> >> Note that the issue happens with NFSv4 or NFSv4.1 (not NFSv3).
> >
> > That's what I'd been using for testing.
> >
> >> Note that on my system, if I call 'umount' a second time after getting
> >> the 'device is busy' error, then it succeeds. It looks as if the first
> >> call to 'umount /mnt' causes the directory /mnt/export to clear,
> >> causing the second 'umount /mnt' to succeeed (unless I try to access
> >> /mnt/export again):
> Just to be clear, it is what I saw as well. I did not get permanent
> -EBUSY and I could umount /mnt if I run `umount /mnt` twice.
>
> And I did not do git bisect. I had doubts in the code so I found the
> specific commit to revert it and it worked. But if it is some other
> commit in the middle, I could have missed it.
>
> My theory to blame aba809cf0944 is that in
> do_umount()->shrink_submounts()->umount_tree(), it transfers parent
> mnt refcount (held by child mnt) to mnt_ex_mountpoint but that is put
> _after_ do_umount() calls propagate_mount_busy(mnt, 2). So
> propagate_mount_busy() would fail do_refcount_check(). Does it make
> sense to you?

Of course; the question is what to do with that. The root of the problem
is in ordering constraints on fs shutdown and dput() of mountpoints. What
we have right now solves that by holding "ghost" reference to parent vfsmount
until namespace_unlock(), where we drop them after having done that dput().

The constrants are
* dput() of mountpoint should happen only after we'd dropped namespace
lock - we really don't want any IO under it.
* it should happen before fs containing that mountpoint gets shut down.
* non-lazy umount(2) should not return until all filesystems it
will shut down are, indeed, shut down

FWIW, I suspect that the simplest solution is this:
* decrement parent's refcount when detaching a child from it in
umount_tree()
* in namespace_unlock(), _before_ dropping the rwsem, go through the
list of victims and bump refcounts on their parents. Note that those could
not have been shut down, let alone freed yet - either we are the ones to
take the parent out, in which case its refcount won't get to zero until
namespace_unlock() regardless of the children, or we are not. In which case
its ->mnt_ns is non-NULL and will stay that way until rwsem is dropped (and
its refcount is also guaranteed to stay positive, not that mntput() would
bother checking that with non-NULL ->mnt_ns).

It does give a window when parent looks busy - from dropping rwsem in
namespace_unlock() to the moment when we get to the (ex-)child in the
loop there. So what? It doesn't affect the busy checks for that syscall -
we'd already done them all. It doesn't affect the busy checks for anything
we do after returning from that syscall - namespace_unlock() completes before
we return to userland and all is good. And if another process calls umount()
before ours return... Well, it gets its EBUSY - same as it would if it got
there just as the first one has finished finding the mountpoint of child.
Two processes playing with umount(2) on the same area in the tree without
serialization should be ready to see the damn things transiently busy, simply
from another process looking at its victims...

IOW, how about this (completely untested)? That's on top of that commit
from -next, but those two should be easy to backport to older trees
(->mnt_mp_list won't be there for those, but that's about it).

diff --git a/fs/namespace.c b/fs/namespace.c
index 3733cfb..00e5b1e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1299,6 +1299,11 @@ static void namespace_unlock(void)
head.first->pprev = &head.first;
INIT_HLIST_HEAD(&unmounted);

+ /* undo decrements we'd done in umount_tree() */
+ hlist_for_each_entry(mnt, &head, mnt_hash)
+ if (mnt->mnt_ex_mountpoint.mnt)
+ mntget(mnt->mnt_ex_mountpoint.mnt);
+
up_write(&namespace_sem);

synchronize_rcu();
@@ -1351,6 +1356,7 @@ void umount_tree(struct mount *mnt, int how)
if (mnt_has_parent(p)) {
hlist_del_init(&p->mnt_mp_list);
put_mountpoint(p->mnt_mp);
+ mnt_add_count(p->mnt_parent, -1);
/* move the reference to mountpoint into ->mnt_ex_mountpoint */
p->mnt_ex_mountpoint.dentry = p->mnt_mountpoint;
p->mnt_ex_mountpoint.mnt = &p->mnt_parent->mnt;

2014-08-29 21:48:00

by Trond Myklebust

[permalink] [raw]
Subject: Re: VFS regression: commit aba809cf0944 breaks MNT_SHRINKABLE automounted partitions

On Fri, Aug 29, 2014 at 4:58 PM, Al Viro <[email protected]> wrote:
> On Fri, Aug 29, 2014 at 12:27:53PM -0400, Trond Myklebust wrote:
>> Hi Al,
>>
>> Our internal testing shows that commit aba809cf0944 (namespace.c: get
>> rid of mnt_ghosts) appears to break the NFS automounter functionality.
>> When unmounting, the user now gets a permanent -EBUSY, presumably due
>> to a refcounting issue with the patch.
>>
>> The reproducer is as follows:
>>
>> On the server, set up an export of the form:
>>
>> % showmount -e 192.168.214.128
>> Export list for 192.168.214.128:
>> /export *
>>
>> On the client:
>>
>> [fedora19@~]$sudo mount 192.168.214.128:/ /mnt/
>> [fedora19@~]$df
>> Filesystem Size Used Avail Use% Mounted on
>> /dev/mapper/fedora-root 38G 15G 24G 38% /
>> devtmpfs 229M 0 229M 0% /dev
>> tmpfs 242M 80K 242M 1% /dev/shm
>> tmpfs 242M 1.5M 241M 1% /run
>> tmpfs 242M 0 242M 0% /sys/fs/cgroup
>> tmpfs 242M 4.0K 242M 1% /tmp
>> /dev/sda1 1.5G 248M 1.3G 17% /boot
>> 192.168.214.128:/ 28G 25G 1.1G 96% /mnt
>> [fedora19@~]$ls /mnt/export/
>> [fedora19@~]$df
>> Filesystem Size Used Avail Use% Mounted on
>> /dev/mapper/fedora-root 38G 15G 24G 38% /
>> devtmpfs 229M 0 229M 0% /dev
>> tmpfs 242M 80K 242M 1% /dev/shm
>> tmpfs 242M 1.5M 241M 1% /run
>> tmpfs 242M 0 242M 0% /sys/fs/cgroup
>> tmpfs 242M 4.0K 242M 1% /tmp
>> /dev/sda1 1.5G 248M 1.3G 17% /boot
>> 192.168.214.128:/ 28G 25G 1.1G 96% /mnt
>> 192.168.214.128:/export 28G 25G 1.1G 96% /mnt/export
>> [fedora19@~]$sudo umount /mnt
>> umount.nfs4: /mnt: device is busy
>> [fedora19@~]$df
>> Filesystem Size Used Avail Use% Mounted on
>> /dev/mapper/fedora-root 38G 15G 24G 38% /
>> devtmpfs 229M 0 229M 0% /dev
>> tmpfs 242M 80K 242M 1% /dev/shm
>> tmpfs 242M 1.5M 241M 1% /run
>> tmpfs 242M 0 242M 0% /sys/fs/cgroup
>> tmpfs 242M 4.0K 242M 1% /tmp
>> /dev/sda1 1.5G 248M 1.3G 17% /boot
>> 192.168.214.128:/ 28G 25G 1.1G 96% /mnt
>>
>> Reverting commit aba809cf0944 fixes the problem for us on a 3.14.15
>> client kernel.
>
> Can't reproduce on a debian client with 3.12 kernel (which contains the
> commit in question); it'll take a bit to get Fedora setup - I have it as KVM
> image, but it's sandboxed to hell and back and certainly not allowed to play
> with NFS traffic...
>
> In the meanwhile, if you have a reproducer setup... Could you verify that
> fuser -m /mnt does come empty? Another thing: does mount --make-rprivate /
> done before that umount /mnt change behaviour? IOW, I wonder if it could
> be affected by shared-subtree setup; that might be the relevant difference
> between the working box and F19 one in this case. Could you at least post
> the contents of /proc/*/mountinfo from all namespaces? Output of
> grep . `md5sum /proc/*/mountinfo|sort|uniq -w32|awk '{print $2}'`
> would do...

Note that the issue happens with NFSv4 or NFSv4.1 (not NFSv3). The
following was done on a Fedora 20 kernel (3.15.10) but the results are
more or less the same:

[root@vmw-test-1 ~]# showmount -e vmw-test-2
Export list for vmw-test-2:
/export 172.16.74.0/24

[root@vmw-test-1 ~]# mount -t nfs -overs=4 vmw-test-2:/ /mnt
[root@vmw-test-1 ~]# ls /mnt/export/
trondmy
[root@vmw-test-1 ~]# fuser -m /mnt
[root@vmw-test-1 ~]# umount /mnt
umount.nfs4: /mnt: device is busy
[root@vmw-test-1 ~]# fuser -m /mnt
[root@vmw-test-1 ~]# umount /mnt

I've attached the mountinfo output. It was created as follows:

[root@vmw-test-1 ~]# ls /mnt/export/
trondmy
[root@vmw-test-1 ~]# grep . `md5sum /proc/*/mountinfo|sort|uniq
-w32|awk '{print $2}'`>/tmp/mountdigest.txt
[root@vmw-test-1 ~]# umount /mnt
umount.nfs4: /mnt: device is busy

(so the mount was still busy).

Note that on my system, if I call 'umount' a second time after getting
the 'device is busy' error, then it succeeds. It looks as if the first
call to 'umount /mnt' causes the directory /mnt/export to clear,
causing the second 'umount /mnt' to succeeed (unless I try to access
/mnt/export again):

[root@vmw-test-1 ~]# mount -t nfs -overs=4 vmw-test-2:/ /mnt
[root@vmw-test-1 ~]# ls /mnt/export/
trondmy
[root@vmw-test-1 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/fedora-root 35740620 18968904 16771716 54% /
devtmpfs 1947224 0 1947224 0% /dev
tmpfs 1955120 156 1954964 1% /dev/shm
tmpfs 1955120 1004 1954116 1% /run
tmpfs 1955120 0 1955120 0% /sys/fs/cgroup
tmpfs 1955120 1888 1953232 1% /tmp
/dev/sda1 1998672 88396 1789036 5% /boot
vmw-test-2:/ 35789824 8589824 27200000 25% /mnt
vmw-test-2:/export 35789824 8589824 27200000 25% /mnt/export
[root@vmw-test-1 ~]# umount /mnt
umount.nfs4: /mnt: device is busy
[root@vmw-test-1 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/fedora-root 35740620 18968904 16771716 54% /
devtmpfs 1947224 0 1947224 0% /dev
tmpfs 1955120 156 1954964 1% /dev/shm
tmpfs 1955120 1004 1954116 1% /run
tmpfs 1955120 0 1955120 0% /sys/fs/cgroup
tmpfs 1955120 1892 1953228 1% /tmp
/dev/sda1 1998672 88396 1789036 5% /boot
vmw-test-2:/ 35789824 8590336 27199488 25% /mnt
[root@vmw-test-1 ~]# umount /mnt


--
Trond Myklebust

Linux NFS client maintainer, PrimaryData

[email protected]


Attachments:
mountdigest.txt (28.84 kB)

2014-08-30 23:27:56

by Al Viro

[permalink] [raw]
Subject: Re: VFS regression: commit aba809cf0944 breaks MNT_SHRINKABLE automounted partitions

Could you check if vfs.git#for-linus fixes that crap?

2014-08-29 23:01:50

by Al Viro

[permalink] [raw]
Subject: Re: VFS regression: commit aba809cf0944 breaks MNT_SHRINKABLE automounted partitions

On Fri, Aug 29, 2014 at 05:47:58PM -0400, Trond Myklebust wrote:

> Note that the issue happens with NFSv4 or NFSv4.1 (not NFSv3).

That's what I'd been using for testing.

> Note that on my system, if I call 'umount' a second time after getting
> the 'device is busy' error, then it succeeds. It looks as if the first
> call to 'umount /mnt' causes the directory /mnt/export to clear,
> causing the second 'umount /mnt' to succeeed (unless I try to access
> /mnt/export again):

Now, that smells like a different bug - see if commit ab8f2c from -next
helps with that one.

Anyway, I think I see what's going on; moreover, I have a kinda-sorta
solution for the current tree, but it's not something that'll be fun
to backport and I'd prefer to look for alternative solutions before going
for that one. Hopefully I'll have something postable by tomorrow morning...

2014-08-31 04:09:13

by Al Viro

[permalink] [raw]
Subject: Re: VFS regression: commit aba809cf0944 breaks MNT_SHRINKABLE automounted partitions

On Sun, Aug 31, 2014 at 08:31:47AM +0800, Peng Tao wrote:
> On Sun, Aug 31, 2014 at 7:27 AM, Al Viro <[email protected]> wrote:
> > Could you check if vfs.git#for-linus fixes that crap?
> The related commits are the top two of vfs.git#for-linus, right? I
> cherry-picked them to 3.14.15 and they fixed the bug instantly. Thank
> you for the quick fix!

Folks, could somebody put together an xfstests (or LTP, for that matter)
patch adding that testcase? Which regression suite is normally used
for NFS work, BTW? It's not, strictly speaking, an NFS-specific bug
(you can trigger the same thing on AFS and CIFS as well), but NFS is
by far the most common of those three, so...

2014-08-30 00:39:17

by Peng Tao

[permalink] [raw]
Subject: Re: VFS regression: commit aba809cf0944 breaks MNT_SHRINKABLE automounted partitions

On Sat, Aug 30, 2014 at 7:01 AM, Al Viro <[email protected]> wrote:
> On Fri, Aug 29, 2014 at 05:47:58PM -0400, Trond Myklebust wrote:
>
>> Note that the issue happens with NFSv4 or NFSv4.1 (not NFSv3).
>
> That's what I'd been using for testing.
>
>> Note that on my system, if I call 'umount' a second time after getting
>> the 'device is busy' error, then it succeeds. It looks as if the first
>> call to 'umount /mnt' causes the directory /mnt/export to clear,
>> causing the second 'umount /mnt' to succeeed (unless I try to access
>> /mnt/export again):
Just to be clear, it is what I saw as well. I did not get permanent
-EBUSY and I could umount /mnt if I run `umount /mnt` twice.

And I did not do git bisect. I had doubts in the code so I found the
specific commit to revert it and it worked. But if it is some other
commit in the middle, I could have missed it.

My theory to blame aba809cf0944 is that in
do_umount()->shrink_submounts()->umount_tree(), it transfers parent
mnt refcount (held by child mnt) to mnt_ex_mountpoint but that is put
_after_ do_umount() calls propagate_mount_busy(mnt, 2). So
propagate_mount_busy() would fail do_refcount_check(). Does it make
sense to you?

>
> Now, that smells like a different bug - see if commit ab8f2c from -next
> helps with that one.
>
I cherri-picked it and it still failed with -EBUSY.

Cheers,
Tao

2014-09-01 23:52:29

by Dave Chinner

[permalink] [raw]
Subject: Re: VFS regression: commit aba809cf0944 breaks MNT_SHRINKABLE automounted partitions

On Sun, Aug 31, 2014 at 05:09:10AM +0100, Al Viro wrote:
> On Sun, Aug 31, 2014 at 08:31:47AM +0800, Peng Tao wrote:
> > On Sun, Aug 31, 2014 at 7:27 AM, Al Viro <[email protected]> wrote:
> > > Could you check if vfs.git#for-linus fixes that crap?
> > The related commits are the top two of vfs.git#for-linus, right? I
> > cherry-picked them to 3.14.15 and they fixed the bug instantly. Thank
> > you for the quick fix!
>
> Folks, could somebody put together an xfstests (or LTP, for that matter)
> patch adding that testcase? Which regression suite is normally used
> for NFS work, BTW? It's not, strictly speaking, an NFS-specific bug
> (you can trigger the same thing on AFS and CIFS as well), but NFS is
> by far the most common of those three, so...

You can use xfstests to test NFS from the client side, and there are
patches pending to add CIFS client support to xfstests as well.

Cheers,

Dave.
--
Dave Chinner
[email protected]