2012-11-29 19:17:01

by Patrick McLean

[permalink] [raw]
Subject: Regression with initramfs and nfsroot (appears to be in the dcache)

With 3.6-rc1 and up, when using a (dracut) initramfs with a read-only
nfs root, all accesses to /proc. /sys and /dev return EBUSY.

Bisecting finds this commit as where this was introduced:

> ee3efa91e240f513898050ef305a49a653c8ed90 is the first bad commit
> commit ee3efa91e240f513898050ef305a49a653c8ed90
> Author: Al Viro <[email protected]>
> Date: Fri Jun 8 15:59:33 2012 -0400
>
> __d_unalias() should refuse to move mountpoints
>
> Signed-off-by: Al Viro <[email protected]>
>
> :040000 040000 1d6ecde959d3f8252b33f4adff3c4bf1e67f2b92 992ec34563b90fb349957418f76d4673c1af4ab6 M fs


2012-11-29 21:33:18

by Al Viro

[permalink] [raw]
Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On Thu, Nov 29, 2012 at 11:16:59AM -0800, Patrick McLean wrote:
> With 3.6-rc1 and up, when using a (dracut) initramfs with a read-only
> nfs root, all accesses to /proc. /sys and /dev return EBUSY.

See "[PATCH] Revert "__d_unalias() should refuse to move mountpoints"
thread. If you have a convenient reproducer, could you check if
the fixes the breakage? If so, we'll need to look into false negatives
from nfs_same_file() in there...

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index ce8cb92..55436f5 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -450,7 +450,10 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
nfs_refresh_inode(dentry->d_inode, entry->fattr);
goto out;
} else {
- d_drop(dentry);
+ if (d_invalidate(dentry) != 0) {
+ WARN_ON(1);
+ goto out;
+ }
dput(dentry);
}
}

2012-11-29 22:06:25

by Patrick McLean

[permalink] [raw]
Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On Thu, Nov 29, 2012 at 1:33 PM, Al Viro <[email protected]> wrote:
> On Thu, Nov 29, 2012 at 11:16:59AM -0800, Patrick McLean wrote:
>> With 3.6-rc1 and up, when using a (dracut) initramfs with a read-only
>> nfs root, all accesses to /proc. /sys and /dev return EBUSY.
>
> See "[PATCH] Revert "__d_unalias() should refuse to move mountpoints"
> thread. If you have a convenient reproducer, could you check if
> the fixes the breakage? If so, we'll need to look into false negatives
> from nfs_same_file() in there...
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index ce8cb92..55436f5 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -450,7 +450,10 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
> nfs_refresh_inode(dentry->d_inode, entry->fattr);
> goto out;
> } else {
> - d_drop(dentry);
> + if (d_invalidate(dentry) != 0) {
> + WARN_ON(1);
> + goto out;
> + }
> dput(dentry);
> }
> }

I have a trivial reproducer and am happy to help debug in any way that
I can. That patch seems to fix the problem, and produces these
warnings in dmesg:

[ 3.306483] dracut: Switching root
[ 4.324378] systemd-udevd[552]: starting version 195
[ 9.254972] ------------[ cut here ]------------
[ 9.254981] WARNING: at fs/nfs/dir.c:454
nfs_readdir_page_filler+0x1cc/0x3a2()
[ 9.254983] Hardware name: Bochs
[ 9.254984] Modules linked in:
[ 9.254989] Pid: 676, comm: ls Not tainted 3.7.0-rc7+ #35
[ 9.254990] Call Trace:
[ 9.254999] [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
[ 9.255002] [<ffffffff8117de91>] ? nfs_readdir_page_filler+0x1cc/0x3a2
[ 9.255005] [<ffffffff8117e683>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
[ 9.255009] [<ffffffff8117e70c>] ? nfs_readdir_filler+0x1c/0x6b
[ 9.255014] [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
[ 9.255017] [<ffffffff8117e6f0>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
[ 9.255020] [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
[ 9.255025] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.255028] [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
[ 9.255031] [<ffffffff8117e888>] ? nfs_readdir+0x12d/0x435
[ 9.255036] [<ffffffff8118e653>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
[ 9.255039] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.255042] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.255045] [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
[ 9.255049] [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
[ 9.255053] [<ffffffff814ac769>] ? system_call_fastpath+0x16/0x1b
[ 9.255055] ---[ end trace 5e8b5f37fe752ab1 ]---
[ 9.255062] ------------[ cut here ]------------
[ 9.255065] WARNING: at fs/nfs/dir.c:454
nfs_readdir_page_filler+0x1cc/0x3a2()
[ 9.255066] Hardware name: Bochs
[ 9.255067] Modules linked in:
[ 9.255070] Pid: 676, comm: ls Tainted: G W 3.7.0-rc7+ #35
[ 9.255071] Call Trace:
[ 9.255075] [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
[ 9.255077] [<ffffffff8117de91>] ? nfs_readdir_page_filler+0x1cc/0x3a2
[ 9.255080] [<ffffffff8117e683>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
[ 9.255083] [<ffffffff8117e70c>] ? nfs_readdir_filler+0x1c/0x6b
[ 9.255087] [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
[ 9.255089] [<ffffffff8117e6f0>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
[ 9.255093] [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
[ 9.255096] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.255099] [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
[ 9.255102] [<ffffffff8117e888>] ? nfs_readdir+0x12d/0x435
[ 9.255105] [<ffffffff8118e653>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
[ 9.255109] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.255112] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.255115] [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
[ 9.255118] [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
[ 9.255121] [<ffffffff814ac769>] ? system_call_fastpath+0x16/0x1b
[ 9.255122] ---[ end trace 5e8b5f37fe752ab2 ]---
[ 9.255133] ------------[ cut here ]------------
[ 9.255135] WARNING: at fs/nfs/dir.c:454
nfs_readdir_page_filler+0x1cc/0x3a2()
[ 9.255136] Hardware name: Bochs
[ 9.255137] Modules linked in:
[ 9.255140] Pid: 676, comm: ls Tainted: G W 3.7.0-rc7+ #35
[ 9.255141] Call Trace:
[ 9.255144] [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
[ 9.255147] [<ffffffff8117de91>] ? nfs_readdir_page_filler+0x1cc/0x3a2
[ 9.255150] [<ffffffff8117e683>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
[ 9.255153] [<ffffffff8117e70c>] ? nfs_readdir_filler+0x1c/0x6b
[ 9.255157] [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
[ 9.255159] [<ffffffff8117e6f0>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
[ 9.255162] [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
[ 9.255166] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.255169] [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
[ 9.255171] [<ffffffff8117e888>] ? nfs_readdir+0x12d/0x435
[ 9.255175] [<ffffffff8118e653>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
[ 9.255178] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.255181] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.255184] [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
[ 9.255188] [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
[ 9.255190] [<ffffffff814ac769>] ? system_call_fastpath+0x16/0x1b
[ 9.255192] ---[ end trace 5e8b5f37fe752ab3 ]---

2012-11-29 22:21:12

by Al Viro

[permalink] [raw]
Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On Thu, Nov 29, 2012 at 02:06:22PM -0800, Patrick McLean wrote:

> I have a trivial reproducer and am happy to help debug in any way that
> I can. That patch seems to fix the problem, and produces these
> warnings in dmesg:
>
> [ 3.306483] dracut: Switching root
> [ 4.324378] systemd-udevd[552]: starting version 195
> [ 9.254972] ------------[ cut here ]------------
> [ 9.254981] WARNING: at fs/nfs/dir.c:454
> nfs_readdir_page_filler+0x1cc/0x3a2()
> [ 9.254983] Hardware name: Bochs
> [ 9.254984] Modules linked in:
> [ 9.254989] Pid: 676, comm: ls Not tainted 3.7.0-rc7+ #35
> [ 9.254990] Call Trace:
> [ 9.254999] [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
> [ 9.255002] [<ffffffff8117de91>] ? nfs_readdir_page_filler+0x1cc/0x3a2
> [ 9.255005] [<ffffffff8117e683>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
> [ 9.255009] [<ffffffff8117e70c>] ? nfs_readdir_filler+0x1c/0x6b
> [ 9.255014] [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
> [ 9.255017] [<ffffffff8117e6f0>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
> [ 9.255020] [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
> [ 9.255025] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
> [ 9.255028] [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
> [ 9.255031] [<ffffffff8117e888>] ? nfs_readdir+0x12d/0x435
> [ 9.255036] [<ffffffff8118e653>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
> [ 9.255039] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
> [ 9.255042] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
> [ 9.255045] [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
> [ 9.255049] [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
> [ 9.255053] [<ffffffff814ac769>] ? system_call_fastpath+0x16/0x1b
> [ 9.255055] ---[ end trace 5e8b5f37fe752ab1 ]---

OK... So we have differing entry->fh and NFS_FH(dentry->d_inode). Something
like
static void dump_fh(const struct nfs_fh *fh)
{
int i;
printk(KERN_INFO "FH(%d)", fh->size);
for (i = 0; i < fh->size; i++)
printk(KERN_CONT "%c%02x", i ? ' ' : '[', fh->data[i]);
printk(KERN_CONT "]\n");
}
with dump_fh(entry->fh); dump_fh(NFS_FH(dentry->d_inode)); added next to
that WARN_ON(1) would probably be interesting. And probably would make
sense to print filename->name as well, to see which files it is about.

2012-11-29 22:53:58

by Patrick McLean

[permalink] [raw]
Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On 29/11/12 02:21 PM, Al Viro wrote:
> On Thu, Nov 29, 2012 at 02:06:22PM -0800, Patrick McLean wrote:
>
>> I have a trivial reproducer and am happy to help debug in any way that
>> I can. That patch seems to fix the problem, and produces these
>> warnings in dmesg:
>
> OK... So we have differing entry->fh and NFS_FH(dentry->d_inode). Something
> like
> static void dump_fh(const struct nfs_fh *fh)
> {
> int i;
> printk(KERN_INFO "FH(%d)", fh->size);
> for (i = 0; i < fh->size; i++)
> printk(KERN_CONT "%c%02x", i ? ' ' : '[', fh->data[i]);
> printk(KERN_CONT "]\n");
> }
> with dump_fh(entry->fh); dump_fh(NFS_FH(dentry->d_inode)); added next to
> that WARN_ON(1) would probably be interesting. And probably would make
> sense to print filename->name as well, to see which files it is about.
>

Here is the output of the first of the 3 times it hits the WARN_ON (I can include all 3 if desired), with the filename.name at the end:

[ 8.821503] ------------[ cut here ]------------
[ 8.821512] WARNING: at fs/nfs/dir.c:463 nfs_readdir_page_filler+0x1d0/0x3d2()
[ 8.821513] Hardware name: Bochs
[ 8.821515] Modules linked in:
[ 8.821519] Pid: 630, comm: bash Not tainted 3.7.0-rc7+ #36
[ 8.821520] Call Trace:
[ 8.821528] [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
[ 8.821531] [<ffffffff8117de95>] ? nfs_readdir_page_filler+0x1d0/0x3d2
[ 8.821535] [<ffffffff8117e6b3>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
[ 8.821538] [<ffffffff8117e73c>] ? nfs_readdir_filler+0x1c/0x6b
[ 8.821543] [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
[ 8.821546] [<ffffffff8117e720>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
[ 8.821549] [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
[ 8.821554] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 8.821557] [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
[ 8.821560] [<ffffffff8117e8b8>] ? nfs_readdir+0x12d/0x435
[ 8.821564] [<ffffffff8118e683>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
[ 8.821568] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 8.821571] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 8.821574] [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
[ 8.821577] [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
[ 8.821581] [<ffffffff814ac7e9>] ? system_call_fastpath+0x16/0x1b
[ 8.821583] ---[ end trace 89263124889205c1 ]---
[ 8.821584] FH(0)]
[ 8.821586] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a7 3c 1b 80 33 38 e3 62]
[ 8.821601] filename: proc

2012-11-29 23:43:32

by Al Viro

[permalink] [raw]
Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On Thu, Nov 29, 2012 at 02:53:13PM -0800, Patrick McLean wrote:
> On 29/11/12 02:21 PM, Al Viro wrote:
> > On Thu, Nov 29, 2012 at 02:06:22PM -0800, Patrick McLean wrote:
> >
> >> I have a trivial reproducer and am happy to help debug in any way that
> >> I can. That patch seems to fix the problem, and produces these
> >> warnings in dmesg:
> >
> > OK... So we have differing entry->fh and NFS_FH(dentry->d_inode). Something
> > like
> > static void dump_fh(const struct nfs_fh *fh)
> > {
> > int i;
> > printk(KERN_INFO "FH(%d)", fh->size);
> > for (i = 0; i < fh->size; i++)
> > printk(KERN_CONT "%c%02x", i ? ' ' : '[', fh->data[i]);
> > printk(KERN_CONT "]\n");
> > }
> > with dump_fh(entry->fh); dump_fh(NFS_FH(dentry->d_inode)); added next to
> > that WARN_ON(1) would probably be interesting. And probably would make
> > sense to print filename->name as well, to see which files it is about.

> [ 8.821584] FH(0)]
> [ 8.821586] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a7 3c 1b 80 33 38 e3 62]
> [ 8.821601] filename: proc

*whoa*

So we have zero entry->fh->size? No wonder it doesn't match... Which NFS
version it is? entry->fh->size is set by nfs[34]_decode_dirent().

NFS folks: any ideas on best way to debug it? The brute-force way would be
to capture all NFS traffic with tcpdump and see what's going on, but that
would be a lot of work...

Looks like we have READDIRPLUS attempted and succeeded, but fhandle was not
given. Result: nfs_prime_dcache() is doing blind d_drop() on perfectly
valid dentries, no matter how busy.

2012-11-30 00:19:56

by Patrick McLean

[permalink] [raw]
Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On 29/11/12 03:43 PM, Al Viro wrote:
> On Thu, Nov 29, 2012 at 02:53:13PM -0800, Patrick McLean wrote:
>> On 29/11/12 02:21 PM, Al Viro wrote:
>>> On Thu, Nov 29, 2012 at 02:06:22PM -0800, Patrick McLean wrote:
>>>
>>>> I have a trivial reproducer and am happy to help debug in any way that
>>>> I can. That patch seems to fix the problem, and produces these
>>>> warnings in dmesg:
>>>
>>> OK... So we have differing entry->fh and NFS_FH(dentry->d_inode). Something
>>> like
>>> static void dump_fh(const struct nfs_fh *fh)
>>> {
>>> int i;
>>> printk(KERN_INFO "FH(%d)", fh->size);
>>> for (i = 0; i < fh->size; i++)
>>> printk(KERN_CONT "%c%02x", i ? ' ' : '[', fh->data[i]);
>>> printk(KERN_CONT "]\n");
>>> }
>>> with dump_fh(entry->fh); dump_fh(NFS_FH(dentry->d_inode)); added next to
>>> that WARN_ON(1) would probably be interesting. And probably would make
>>> sense to print filename->name as well, to see which files it is about.
>
>> [ 8.821584] FH(0)]
>> [ 8.821586] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a7 3c 1b 80 33 38 e3 62]
>> [ 8.821601] filename: proc
>
> *whoa*
>
> So we have zero entry->fh->size? No wonder it doesn't match... Which NFS
> version it is? entry->fh->size is set by nfs[34]_decode_dirent().

This is nfs v3 over TCP on Linus git at commit e9296e89b85604862bd9ec2d54dc43edad775c0d with nfs-utils-1.2.6 userspace.

> NFS folks: any ideas on best way to debug it? The brute-force way would be
> to capture all NFS traffic with tcpdump and see what's going on, but that
> would be a lot of work...
>
> Looks like we have READDIRPLUS attempted and succeeded, but fhandle was not
> given. Result: nfs_prime_dcache() is doing blind d_drop() on perfectly
> valid dentries, no matter how busy.
>

2012-11-30 00:35:06

by Al Viro

[permalink] [raw]
Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On Thu, Nov 29, 2012 at 04:19:51PM -0800, Patrick McLean wrote:
> >> [ 8.821584] FH(0)]
> >> [ 8.821586] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a7 3c 1b 80 33 38 e3 62]
> >> [ 8.821601] filename: proc
> >
> > *whoa*
> >
> > So we have zero entry->fh->size? No wonder it doesn't match... Which NFS
> > version it is? entry->fh->size is set by nfs[34]_decode_dirent().
>
> This is nfs v3 over TCP on Linus git at commit e9296e89b85604862bd9ec2d54dc43edad775c0d with nfs-utils-1.2.6 userspace.

So we have nfs3_decode_dirent(), stepping into
/* In fact, a post_op_fh3: */
p = xdr_inline_decode(xdr, 4);
if (unlikely(p == NULL))
goto out_overflow;
if (*p != xdr_zero) {
error = decode_nfs_fh3(xdr, entry->fh);
if (unlikely(error)) {
if (error == -E2BIG)
goto out_truncated;
return error;
}
} else
zero_nfs_fh3(entry->fh);
Interesting... Server-side that should've been produced by
encode_entryplus_baggage(), which looks like failing compose_entry_fh()...
which has explicit
if (d_mountpoint(dchild))
goto out;
resulting in ENOENT on everything that's overmounted on server.

Do you, by any chance, have the server really exporting its own root
filesystem? Another thing to check: have nfs_prime_dcache() print
filename.name of everything that fails nfs_same_entry() and has
zero entry->fh->size, regardless of d_invalidate() results.

2012-11-30 01:04:54

by Patrick McLean

[permalink] [raw]
Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On 29/11/12 04:35 PM, Al Viro wrote:
> On Thu, Nov 29, 2012 at 04:19:51PM -0800, Patrick McLean wrote:
>>>> [ 8.821584] FH(0)]
>>>> [ 8.821586] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a7 3c 1b 80 33 38 e3 62]
>>>> [ 8.821601] filename: proc
>>>
>>> *whoa*
>>>
>>> So we have zero entry->fh->size? No wonder it doesn't match... Which NFS
>>> version it is? entry->fh->size is set by nfs[34]_decode_dirent().
>>
>> This is nfs v3 over TCP on Linus git at commit e9296e89b85604862bd9ec2d54dc43edad775c0d with nfs-utils-1.2.6 userspace.
>
> So we have nfs3_decode_dirent(), stepping into
> /* In fact, a post_op_fh3: */
> p = xdr_inline_decode(xdr, 4);
> if (unlikely(p == NULL))
> goto out_overflow;
> if (*p != xdr_zero) {
> error = decode_nfs_fh3(xdr, entry->fh);
> if (unlikely(error)) {
> if (error == -E2BIG)
> goto out_truncated;
> return error;
> }
> } else
> zero_nfs_fh3(entry->fh);
> Interesting... Server-side that should've been produced by
> encode_entryplus_baggage(), which looks like failing compose_entry_fh()...
> which has explicit
> if (d_mountpoint(dchild))
> goto out;
> resulting in ENOENT on everything that's overmounted on server.
>
> Do you, by any chance, have the server really exporting its own root
> filesystem? Another thing to check: have nfs_prime_dcache() print
> filename.name of everything that fails nfs_same_entry() and has
> zero entry->fh->size, regardless of d_invalidate() results.

The server is running 3.6.6 and is just exporting a subdir of an xfs filesystem (which does not happen to be the root filesystem).

The client is running as a KVM guest on the machine that is serving the NFS. I am reproducing this by booting the guest via an initramfs, and doing
"ls /" at in single user mode.

I added a check that prints the filename.name of everything that fails nfs_same_file, and it appears to just be triggered by the same filesystems that
are triggering the WARN_ON, the relevant dmesg is below.

[ 9.495217] entry->fh->size is 0 on: proc
[ 9.495222] ------------[ cut here ]------------
[ 9.495230] WARNING: at fs/nfs/dir.c:464 nfs_readdir_page_filler+0x1ef/0x3eb()
[ 9.495232] Hardware name: Bochs
[ 9.495233] Modules linked in:
[ 9.495237] Pid: 655, comm: ls Not tainted 3.7.0-rc7+ #40
[ 9.495239] Call Trace:
[ 9.495247] [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
[ 9.495250] [<ffffffff8117deb4>] ? nfs_readdir_page_filler+0x1ef/0x3eb
[ 9.495254] [<ffffffff8117e6cc>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
[ 9.495257] [<ffffffff8117e755>] ? nfs_readdir_filler+0x1c/0x6b
[ 9.495263] [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
[ 9.495266] [<ffffffff8117e739>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
[ 9.495269] [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
[ 9.495274] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.495277] [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
[ 9.495280] [<ffffffff8117e8d1>] ? nfs_readdir+0x12d/0x435
[ 9.495285] [<ffffffff8118e69b>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
[ 9.495288] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.495291] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.495294] [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
[ 9.495298] [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
[ 9.495302] [<ffffffff814ac829>] ? system_call_fastpath+0x16/0x1b
[ 9.495304] ---[ end trace e502c5d56c594e85 ]---
[ 9.495306] FH(0)]
[ 9.495308] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a7 3c 1b 80 33 38 e3 62]
[ 9.495323] filename: proc
[ 9.495330] entry->fh->size is 0 on: dev
[ 9.495332] ------------[ cut here ]------------
[ 9.495335] WARNING: at fs/nfs/dir.c:464 nfs_readdir_page_filler+0x1ef/0x3eb()
[ 9.495336] Hardware name: Bochs
[ 9.495337] Modules linked in:
[ 9.495340] Pid: 655, comm: ls Tainted: G W 3.7.0-rc7+ #40
[ 9.495341] Call Trace:
[ 9.495345] [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
[ 9.495348] [<ffffffff8117deb4>] ? nfs_readdir_page_filler+0x1ef/0x3eb
[ 9.495351] [<ffffffff8117e6cc>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
[ 9.495354] [<ffffffff8117e755>] ? nfs_readdir_filler+0x1c/0x6b
[ 9.495358] [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
[ 9.495361] [<ffffffff8117e739>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
[ 9.495364] [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
[ 9.495368] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.495371] [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
[ 9.495373] [<ffffffff8117e8d1>] ? nfs_readdir+0x12d/0x435
[ 9.495377] [<ffffffff8118e69b>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
[ 9.495380] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.495383] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.495387] [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
[ 9.495390] [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
[ 9.495393] [<ffffffff814ac829>] ? system_call_fastpath+0x16/0x1b
[ 9.495395] ---[ end trace e502c5d56c594e86 ]---
[ 9.495396] FH(0)]
[ 9.495397] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 85 00 00 40 d6 39 e0 7d]
[ 9.495412] filename: dev
[ 9.495422] entry->fh->size is 0 on: sys
[ 9.495423] ------------[ cut here ]------------
[ 9.495426] WARNING: at fs/nfs/dir.c:464 nfs_readdir_page_filler+0x1ef/0x3eb()
[ 9.495427] Hardware name: Bochs
[ 9.495428] Modules linked in:
[ 9.495430] Pid: 655, comm: ls Tainted: G W 3.7.0-rc7+ #40
[ 9.495431] Call Trace:
[ 9.495435] [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
[ 9.495438] [<ffffffff8117deb4>] ? nfs_readdir_page_filler+0x1ef/0x3eb
[ 9.495441] [<ffffffff8117e6cc>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
[ 9.495444] [<ffffffff8117e755>] ? nfs_readdir_filler+0x1c/0x6b
[ 9.495448] [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
[ 9.495450] [<ffffffff8117e739>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
[ 9.495454] [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
[ 9.495457] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.495460] [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
[ 9.495463] [<ffffffff8117e8d1>] ? nfs_readdir+0x12d/0x435
[ 9.495466] [<ffffffff8118e69b>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
[ 9.495470] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.495473] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.495476] [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
[ 9.495479] [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
[ 9.495482] [<ffffffff814ac829>] ? system_call_fastpath+0x16/0x1b
[ 9.495484] ---[ end trace e502c5d56c594e87 ]---
[ 9.495485] FH(0)]
[ 9.495486] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a3 0e 00 40 42 57 d3 90]
[ 9.495511] filename: sys

2012-11-30 01:36:33

by Al Viro

[permalink] [raw]
Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On Thu, Nov 29, 2012 at 04:57:19PM -0800, Patrick McLean wrote:
> > Interesting... Server-side that should've been produced by
> > encode_entryplus_baggage(), which looks like failing compose_entry_fh()...
> > which has explicit
> > if (d_mountpoint(dchild))
> > goto out;
> > resulting in ENOENT on everything that's overmounted on server.
> >
> > Do you, by any chance, have the server really exporting its own root
> > filesystem? Another thing to check: have nfs_prime_dcache() print
> > filename.name of everything that fails nfs_same_entry() and has
> > zero entry->fh->size, regardless of d_invalidate() results.
>
> The server is running 3.6.6 and is just exporting a subdir of an xfs filesystem (which does not happen to be the root filesystem).
>
> The client is running as a KVM guest on the machine that is serving the NFS. I am reproducing this by booting the guest via an initramfs, and doing
> "ls /" at in single user mode.
>
> I added a check that prints the filename.name of everything that fails nfs_same_file, and it appears to just be triggered by the same filesystems that
> are triggering the WARN_ON, the relevant dmesg is below.

[the same /dev, /proc and /sys]

Very interesting. Do you have anything mounted on the corresponding
directories on server? The picture looks like you are getting empty
fhandles in readdir+ respons for exactly the same directories that happen
to be mountpoints on client. In any case, we shouldn't do that blind
d_drop() - empty fhandles can happen. The only remaining question is
why do they happen on that set of entries. From my reading of
encode_entryplus_baggage() it looks like we have compose_entry_fh()
failing for those entries and those entries alone. One possible cause
would be d_mountpoint(dchild) being true on server. If it is true, we
can declare the case closed; if not, I really wonder what's going on.

Note that if the same fs is mounted elsewhere, d_mountpoint() would mean
that something is mounted on top of that directory in _some_ instance;
not necessary the exported one. Can you slap printks on fs/nfsd/nfs3xdr.c
compose_entry_fh() failure exits and see which one triggers server-side?

2012-11-30 01:54:09

by Patrick McLean

[permalink] [raw]
Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On 29/11/12 05:36 PM, Al Viro wrote:
> On Thu, Nov 29, 2012 at 04:57:19PM -0800, Patrick McLean wrote:
>>> Interesting... Server-side that should've been produced by
>>> encode_entryplus_baggage(), which looks like failing compose_entry_fh()...
>>> which has explicit
>>> if (d_mountpoint(dchild))
>>> goto out;
>>> resulting in ENOENT on everything that's overmounted on server.
>>>
>>> Do you, by any chance, have the server really exporting its own root
>>> filesystem? Another thing to check: have nfs_prime_dcache() print
>>> filename.name of everything that fails nfs_same_entry() and has
>>> zero entry->fh->size, regardless of d_invalidate() results.
>>
>> The server is running 3.6.6 and is just exporting a subdir of an xfs filesystem (which does not happen to be the root filesystem).
>>
>> The client is running as a KVM guest on the machine that is serving the NFS. I am reproducing this by booting the guest via an initramfs, and doing
>> "ls /" at in single user mode.
>>
>> I added a check that prints the filename.name of everything that fails nfs_same_file, and it appears to just be triggered by the same filesystems that
>> are triggering the WARN_ON, the relevant dmesg is below.
>
> [the same /dev, /proc and /sys]
>
> Very interesting. Do you have anything mounted on the corresponding
> directories on server? The picture looks like you are getting empty
> fhandles in readdir+ respons for exactly the same directories that happen
> to be mountpoints on client. In any case, we shouldn't do that blind
> d_drop() - empty fhandles can happen. The only remaining question is
> why do they happen on that set of entries. From my reading of
> encode_entryplus_baggage() it looks like we have compose_entry_fh()
> failing for those entries and those entries alone. One possible cause
> would be d_mountpoint(dchild) being true on server. If it is true, we
> can declare the case closed; if not, I really wonder what's going on.

Those directories do have the server's own copies of the said directories bind mounted at the moment in a separate mount namespace.

Unmounting those directories on the server does appear to stop the WARN_ON from triggering.

> Note that if the same fs is mounted elsewhere, d_mountpoint() would mean
> that something is mounted on top of that directory in _some_ instance;
> not necessary the exported one. Can you slap printks on fs/nfsd/nfs3xdr.c
> compose_entry_fh() failure exits and see which one triggers server-side?

2012-11-30 02:00:51

by Al Viro

[permalink] [raw]
Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On Thu, Nov 29, 2012 at 05:54:02PM -0800, Patrick McLean wrote:
> > Very interesting. Do you have anything mounted on the corresponding
> > directories on server? The picture looks like you are getting empty
> > fhandles in readdir+ respons for exactly the same directories that happen
> > to be mountpoints on client. In any case, we shouldn't do that blind
> > d_drop() - empty fhandles can happen. The only remaining question is
> > why do they happen on that set of entries. From my reading of
> > encode_entryplus_baggage() it looks like we have compose_entry_fh()
> > failing for those entries and those entries alone. One possible cause
> > would be d_mountpoint(dchild) being true on server. If it is true, we
> > can declare the case closed; if not, I really wonder what's going on.
>
> Those directories do have the server's own copies of the said directories bind mounted at the moment in a separate mount namespace.
>
> Unmounting those directories on the server does appear to stop the WARN_ON from triggering.

OK, that settles it. WARN_ON() and printks in the area can be dropped;
the right fix is below. However, there's a similar place in cifs that
also needs to be dealt with and I really, really wonder why the hell do
we do d_drop() in nfs_revalidate_lookup(). It's not relevant in this
bug, but I would like to understand what's wrong with simply returning
0 from ->d_revalidate() and letting the caller (in fs/namei.c) take care
of unhashing, etc. itself. Would make have_submounts() in there pointless
as well - we could just return 0 and let d_invalidate() take care of the
checks... Trond?

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -450,7 +450,8 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
nfs_refresh_inode(dentry->d_inode, entry->fattr);
goto out;
} else {
- d_drop(dentry);
+ if (d_invalidate(dentry) != 0)
+ goto out;
dput(dentry);
}
}

2012-11-30 02:34:05

by Patrick McLean

[permalink] [raw]
Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On 29/11/12 06:00 PM, Al Viro wrote:
> On Thu, Nov 29, 2012 at 05:54:02PM -0800, Patrick McLean wrote:
>>> Very interesting. Do you have anything mounted on the corresponding
>>> directories on server? The picture looks like you are getting empty
>>> fhandles in readdir+ respons for exactly the same directories that happen
>>> to be mountpoints on client. In any case, we shouldn't do that blind
>>> d_drop() - empty fhandles can happen. The only remaining question is
>>> why do they happen on that set of entries. From my reading of
>>> encode_entryplus_baggage() it looks like we have compose_entry_fh()
>>> failing for those entries and those entries alone. One possible cause
>>> would be d_mountpoint(dchild) being true on server. If it is true, we
>>> can declare the case closed; if not, I really wonder what's going on.
>>
>> Those directories do have the server's own copies of the said directories bind mounted at the moment in a separate mount namespace.
>>
>> Unmounting those directories on the server does appear to stop the WARN_ON from triggering.
>
> OK, that settles it. WARN_ON() and printks in the area can be dropped;
> the right fix is below. However, there's a similar place in cifs that
> also needs to be dealt with and I really, really wonder why the hell do
> we do d_drop() in nfs_revalidate_lookup(). It's not relevant in this
> bug, but I would like to understand what's wrong with simply returning
> 0 from ->d_revalidate() and letting the caller (in fs/namei.c) take care
> of unhashing, etc. itself. Would make have_submounts() in there pointless
> as well - we could just return 0 and let d_invalidate() take care of the
> checks... Trond?
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -450,7 +450,8 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
> nfs_refresh_inode(dentry->d_inode, entry->fattr);
> goto out;
> } else {
> - d_drop(dentry);
> + if (d_invalidate(dentry) != 0)
> + goto out;
> dput(dentry);
> }
> }

Excellent, thanks. Is there any chance this will make it to 3.7? Also we might want to cc stable@ on this as well since it is a regression in 3.6.

2012-11-30 04:11:09

by Al Viro

[permalink] [raw]
Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On Thu, Nov 29, 2012 at 06:33:53PM -0800, Patrick McLean wrote:

> Excellent, thanks. Is there any chance this will make it to 3.7? Also we might want to cc stable@ on this as well since it is a regression in 3.6.

Definitely. I've dropped that into vfs.git#for-linus and vfs.git#for-next
and tomorrow to Linus it goes...

2012-11-30 13:58:22

by Myklebust, Trond

[permalink] [raw]
Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On Fri, 2012-11-30 at 02:00 +0000, Al Viro wrote:
> On Thu, Nov 29, 2012 at 05:54:02PM -0800, Patrick McLean wrote:
> > > Very interesting. Do you have anything mounted on the corresponding
> > > directories on server? The picture looks like you are getting empty
> > > fhandles in readdir+ respons for exactly the same directories that happen
> > > to be mountpoints on client. In any case, we shouldn't do that blind
> > > d_drop() - empty fhandles can happen. The only remaining question is
> > > why do they happen on that set of entries. From my reading of
> > > encode_entryplus_baggage() it looks like we have compose_entry_fh()
> > > failing for those entries and those entries alone. One possible cause
> > > would be d_mountpoint(dchild) being true on server. If it is true, we
> > > can declare the case closed; if not, I really wonder what's going on.
> >
> > Those directories do have the server's own copies of the said directories bind mounted at the moment in a separate mount namespace.
> >
> > Unmounting those directories on the server does appear to stop the WARN_ON from triggering.
>
> OK, that settles it. WARN_ON() and printks in the area can be dropped;
> the right fix is below. However, there's a similar place in cifs that
> also needs to be dealt with and I really, really wonder why the hell do
> we do d_drop() in nfs_revalidate_lookup(). It's not relevant in this
> bug, but I would like to understand what's wrong with simply returning
> 0 from ->d_revalidate() and letting the caller (in fs/namei.c) take care
> of unhashing, etc. itself. Would make have_submounts() in there pointless
> as well - we could just return 0 and let d_invalidate() take care of the
> checks... Trond?

The reason for the choice of d_drop over d_invalidate() is the d_count
checks. It really doesn't matter whether or not the client thinks it has
users for a directory if the server is telling you that it is ESTALE. So
we force a d_drop to prevent further lookups from finding it.

IOW: It is there in order to fix the case where the user does
'rmdir("foo"); mkdir("foo")' on the server.


--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2012-12-01 02:39:18

by Simon Kirby

[permalink] [raw]
Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On Fri, Nov 30, 2012 at 02:00:48AM +0000, Al Viro wrote:

> OK, that settles it. WARN_ON() and printks in the area can be dropped;
> the right fix is below. However, there's a similar place in cifs that
> also needs to be dealt with and I really, really wonder why the hell do
> we do d_drop() in nfs_revalidate_lookup(). It's not relevant in this
> bug, but I would like to understand what's wrong with simply returning
> 0 from ->d_revalidate() and letting the caller (in fs/namei.c) take care
> of unhashing, etc. itself. Would make have_submounts() in there pointless
> as well - we could just return 0 and let d_invalidate() take care of the
> checks... Trond?
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -450,7 +450,8 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
> nfs_refresh_inode(dentry->d_inode, entry->fattr);
> goto out;
> } else {
> - d_drop(dentry);
> + if (d_invalidate(dentry) != 0)
> + goto out;
> dput(dentry);
> }
> }

Hello,

With your previous patch (with the WARN_ON), I hit the WARN_ON() in the
test case described here: https://patchwork.kernel.org/patch/1446851/ .
The __d_move()ing mountpoint case no longer hits, and there is no longer
an EBUSY, so this seems to work for me (in 3.6, where it broke).

Simon-