2012-11-29 23:43:28

by Al Viro

Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On Thu, Nov 29, 2012 at 02:53:13PM -0800, Patrick McLean wrote:
> On 29/11/12 02:21 PM, Al Viro wrote:
> > On Thu, Nov 29, 2012 at 02:06:22PM -0800, Patrick McLean wrote:
> >
> >> I have a trivial reproducer and am happy to help debug in any way that
> >> I can. That patch seems to fix the problem, and produces these
> >> warnings in dmesg:
> >
> > OK... So we have differing entry->fh and NFS_FH(dentry->d_inode). Something
> > like
> > static void dump_fh(const struct nfs_fh *fh)
> > {
> > 	int i;
> > 	printk(KERN_INFO "FH(%d)", fh->size);
> > 	for (i = 0; i < fh->size; i++)
> > 		printk(KERN_CONT "%c%02x", i ? ' ' : '[', fh->data[i]);
> > 	printk(KERN_CONT "]\n");
> > }
> > with dump_fh(entry->fh); dump_fh(NFS_FH(dentry->d_inode)); added next to
> > that WARN_ON(1) would probably be interesting. And probably would make
> > sense to print filename->name as well, to see which files it is about.

> [ 8.821584] FH(0)]
> [ 8.821586] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a7 3c 1b 80 33 38 e3 62]
> [ 8.821601] filename: proc

*whoa*

So we have zero entry->fh->size? No wonder it doesn't match... Which NFS
version is it? entry->fh->size is set by nfs[34]_decode_dirent().

NFS folks: any ideas on best way to debug it? The brute-force way would be
to capture all NFS traffic with tcpdump and see what's going on, but that
would be a lot of work...

Looks like we have READDIRPLUS attempted and succeeded, but fhandle was not
given. Result: nfs_prime_dcache() is doing blind d_drop() on perfectly
valid dentries, no matter how busy.
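
For reference, here is a minimal sketch of the comparison being discussed, reconstructed from this thread rather than copied verbatim from the 3.7-rc fs/nfs/dir.c: nfs_same_file() compares the filehandle carried in the READDIRPLUS entry against the one cached in the dentry's inode, so an empty (size 0) entry->fh can never match, and nfs_prime_dcache() falls through to the unconditional d_drop().

	/* Sketch, reconstructed from this thread; not verbatim 3.7-rc code. */
	static int nfs_same_file(struct dentry *dentry, struct nfs_entry *entry)
	{
		if (dentry->d_inode == NULL)
			return 0;
		/* checks both fh->size and fh->data, so the zero-size handle
		 * seen above can never match the cached 36-byte one */
		if (nfs_compare_fh(entry->fh, NFS_FH(dentry->d_inode)) != 0)
			return 0;
		return 1;
	}
	/* On mismatch, nfs_prime_dcache() currently does d_drop(dentry) +
	 * dput(dentry), unhashing even a busy mountpoint dentry. */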


2012-11-30 00:35:05

by Al Viro

Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On Thu, Nov 29, 2012 at 04:19:51PM -0800, Patrick McLean wrote:
> >> [ 8.821584] FH(0)]
> >> [ 8.821586] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a7 3c 1b 80 33 38 e3 62]
> >> [ 8.821601] filename: proc
> >
> > *whoa*
> >
> > So we have zero entry->fh->size? No wonder it doesn't match... Which NFS
> > version is it? entry->fh->size is set by nfs[34]_decode_dirent().
>
> This is nfs v3 over TCP on Linus git at commit e9296e89b85604862bd9ec2d54dc43edad775c0d with nfs-utils-1.2.6 userspace.

So we have nfs3_decode_dirent(), stepping into
	/* In fact, a post_op_fh3: */
	p = xdr_inline_decode(xdr, 4);
	if (unlikely(p == NULL))
		goto out_overflow;
	if (*p != xdr_zero) {
		error = decode_nfs_fh3(xdr, entry->fh);
		if (unlikely(error)) {
			if (error == -E2BIG)
				goto out_truncated;
			return error;
		}
	} else
		zero_nfs_fh3(entry->fh);
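
(For reference, zero_nfs_fh3() simply clears the handle, which is where the zero size seen in the dumps above comes from. A sketch of what it presumably looks like in fs/nfs/nfs3xdr.c:)

	static void zero_nfs_fh3(struct nfs_fh *fh)
	{
		/* leaves fh->size == 0: the entry carries no filehandle */
		memset(fh, 0, sizeof(*fh));
	}
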
Interesting... Server-side that should've been produced by
encode_entryplus_baggage(), which looks like failing compose_entry_fh()...
which has explicit
	if (d_mountpoint(dchild))
		goto out;
resulting in ENOENT on everything that's overmounted on server.

Do you, by any chance, have the server really exporting its own root
filesystem? Another thing to check: have nfs_prime_dcache() print
filename.name of everything that fails nfs_same_entry() and has
zero entry->fh->size, regardless of d_invalidate() results.
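
A debug check along these lines (a sketch only; the exact patch Patrick applied is not shown in the thread) is what produces the "entry->fh->size is 0 on: ..." lines quoted further down:

	/* sketch: in nfs_prime_dcache(), next to the existing WARN_ON(1) */
	if (!nfs_same_file(dentry, entry) && entry->fh->size == 0)
		printk(KERN_INFO "entry->fh->size is 0 on: %s\n",
		       filename.name);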

2012-11-30 00:19:54

by Patrick McLean

Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On 29/11/12 03:43 PM, Al Viro wrote:
> On Thu, Nov 29, 2012 at 02:53:13PM -0800, Patrick McLean wrote:
>> On 29/11/12 02:21 PM, Al Viro wrote:
>>> On Thu, Nov 29, 2012 at 02:06:22PM -0800, Patrick McLean wrote:
>>>
>>>> I have a trivial reproducer and am happy to help debug in any way that
>>>> I can. That patch seems to fix the problem, and produces these
>>>> warnings in dmesg:
>>>
>>> OK... So we have differing entry->fh and NFS_FH(dentry->d_inode). Something
>>> like
>>> static void dump_fh(const struct nfs_fh *fh)
>>> {
>>> 	int i;
>>> 	printk(KERN_INFO "FH(%d)", fh->size);
>>> 	for (i = 0; i < fh->size; i++)
>>> 		printk(KERN_CONT "%c%02x", i ? ' ' : '[', fh->data[i]);
>>> 	printk(KERN_CONT "]\n");
>>> }
>>> with dump_fh(entry->fh); dump_fh(NFS_FH(dentry->d_inode)); added next to
>>> that WARN_ON(1) would probably be interesting. And probably would make
>>> sense to print filename->name as well, to see which files it is about.
>
>> [ 8.821584] FH(0)]
>> [ 8.821586] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a7 3c 1b 80 33 38 e3 62]
>> [ 8.821601] filename: proc
>
> *whoa*
>
> So we have zero entry->fh->size? No wonder it doesn't match... Which NFS
> version is it? entry->fh->size is set by nfs[34]_decode_dirent().

This is nfs v3 over TCP on Linus git at commit e9296e89b85604862bd9ec2d54dc43edad775c0d with nfs-utils-1.2.6 userspace.

> NFS folks: any ideas on best way to debug it? The brute-force way would be
> to capture all NFS traffic with tcpdump and see what's going on, but that
> would be a lot of work...
>
> Looks like we have READDIRPLUS attempted and succeeded, but fhandle was not
> given. Result: nfs_prime_dcache() is doing blind d_drop() on perfectly
> valid dentries, no matter how busy.
>

2012-11-30 04:11:06

by Al Viro

Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On Thu, Nov 29, 2012 at 06:33:53PM -0800, Patrick McLean wrote:

> Excellent, thanks. Is there any chance this will make it to 3.7? Also we might want to cc stable@ on this since it is a regression in 3.6.

Definitely. I've dropped that into vfs.git#for-linus and vfs.git#for-next
and tomorrow to Linus it goes...

2012-11-30 02:00:50

by Al Viro

Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On Thu, Nov 29, 2012 at 05:54:02PM -0800, Patrick McLean wrote:
> > Very interesting. Do you have anything mounted on the corresponding
> > directories on server? The picture looks like you are getting empty
> > > fhandles in readdir+ responses for exactly the same directories that happen
> > to be mountpoints on client. In any case, we shouldn't do that blind
> > d_drop() - empty fhandles can happen. The only remaining question is
> > why do they happen on that set of entries. From my reading of
> > encode_entryplus_baggage() it looks like we have compose_entry_fh()
> > failing for those entries and those entries alone. One possible cause
> > would be d_mountpoint(dchild) being true on server. If it is true, we
> > can declare the case closed; if not, I really wonder what's going on.
>
> Those directories do have the server's own copies of the said directories bind mounted at the moment in a separate mount namespace.
>
> Unmounting those directories on the server does appear to stop the WARN_ON from triggering.

OK, that settles it. WARN_ON() and printks in the area can be dropped;
the right fix is below. However, there's a similar place in cifs that
also needs to be dealt with and I really, really wonder why the hell do
we do d_drop() in nfs_revalidate_lookup(). It's not relevant in this
bug, but I would like to understand what's wrong with simply returning
0 from ->d_revalidate() and letting the caller (in fs/namei.c) take care
of unhashing, etc. itself. Would make have_submounts() in there pointless
as well - we could just return 0 and let d_invalidate() take care of the
checks... Trond?

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -450,7 +450,8 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
 			nfs_refresh_inode(dentry->d_inode, entry->fattr);
 			goto out;
 		} else {
-			d_drop(dentry);
+			if (d_invalidate(dentry) != 0)
+				goto out;
 			dput(dentry);
 		}
 	}
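
The reason this helps: unlike the bare d_drop(), d_invalidate() refuses to unhash a directory dentry that still has users, which is exactly the situation for the client-side mountpoints above. A rough illustration of its behaviour in kernels of this era (not an exact copy of fs/dcache.c):

	spin_lock(&dentry->d_lock);
	if (dentry->d_count > 1 && dentry->d_inode &&
	    S_ISDIR(dentry->d_inode->i_mode)) {
		spin_unlock(&dentry->d_lock);
		return -EBUSY;		/* busy directory: leave it hashed */
	}
	__d_drop(dentry);
	spin_unlock(&dentry->d_lock);
	return 0;

So with the patch, a busy /proc, /dev or /sys dentry is simply left alone instead of being dropped out of the dcache.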

2012-11-30 01:04:52

by Patrick McLean

Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On 29/11/12 04:35 PM, Al Viro wrote:
> On Thu, Nov 29, 2012 at 04:19:51PM -0800, Patrick McLean wrote:
>>>> [ 8.821584] FH(0)]
>>>> [ 8.821586] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a7 3c 1b 80 33 38 e3 62]
>>>> [ 8.821601] filename: proc
>>>
>>> *whoa*
>>>
>>> So we have zero entry->fh->size? No wonder it doesn't match... Which NFS
>>> version is it? entry->fh->size is set by nfs[34]_decode_dirent().
>>
>> This is nfs v3 over TCP on Linus git at commit e9296e89b85604862bd9ec2d54dc43edad775c0d with nfs-utils-1.2.6 userspace.
>
> So we have nfs3_decode_dirent(), stepping into
> 	/* In fact, a post_op_fh3: */
> 	p = xdr_inline_decode(xdr, 4);
> 	if (unlikely(p == NULL))
> 		goto out_overflow;
> 	if (*p != xdr_zero) {
> 		error = decode_nfs_fh3(xdr, entry->fh);
> 		if (unlikely(error)) {
> 			if (error == -E2BIG)
> 				goto out_truncated;
> 			return error;
> 		}
> 	} else
> 		zero_nfs_fh3(entry->fh);
> Interesting... Server-side that should've been produced by
> encode_entryplus_baggage(), which looks like failing compose_entry_fh()...
> which has explicit
> 	if (d_mountpoint(dchild))
> 		goto out;
> resulting in ENOENT on everything that's overmounted on server.
>
> Do you, by any chance, have the server really exporting its own root
> filesystem? Another thing to check: have nfs_prime_dcache() print
> filename.name of everything that fails nfs_same_entry() and has
> zero entry->fh->size, regardless of d_invalidate() results.

The server is running 3.6.6 and is just exporting a subdir of an xfs filesystem (which does not happen to be the root filesystem).

The client is running as a KVM guest on the machine that is serving the NFS. I am reproducing this by booting the guest via an initramfs, and doing
"ls /" at in single user mode.

I added a check that prints the filename.name of everything that fails nfs_same_file, and it appears to just be triggered by the same filesystems that
are triggering the WARN_ON; the relevant dmesg is below.

[ 9.495217] entry->fh->size is 0 on: proc
[ 9.495222] ------------[ cut here ]------------
[ 9.495230] WARNING: at fs/nfs/dir.c:464 nfs_readdir_page_filler+0x1ef/0x3eb()
[ 9.495232] Hardware name: Bochs
[ 9.495233] Modules linked in:
[ 9.495237] Pid: 655, comm: ls Not tainted 3.7.0-rc7+ #40
[ 9.495239] Call Trace:
[ 9.495247] [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
[ 9.495250] [<ffffffff8117deb4>] ? nfs_readdir_page_filler+0x1ef/0x3eb
[ 9.495254] [<ffffffff8117e6cc>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
[ 9.495257] [<ffffffff8117e755>] ? nfs_readdir_filler+0x1c/0x6b
[ 9.495263] [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
[ 9.495266] [<ffffffff8117e739>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
[ 9.495269] [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
[ 9.495274] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.495277] [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
[ 9.495280] [<ffffffff8117e8d1>] ? nfs_readdir+0x12d/0x435
[ 9.495285] [<ffffffff8118e69b>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
[ 9.495288] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.495291] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.495294] [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
[ 9.495298] [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
[ 9.495302] [<ffffffff814ac829>] ? system_call_fastpath+0x16/0x1b
[ 9.495304] ---[ end trace e502c5d56c594e85 ]---
[ 9.495306] FH(0)]
[ 9.495308] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a7 3c 1b 80 33 38 e3 62]
[ 9.495323] filename: proc
[ 9.495330] entry->fh->size is 0 on: dev
[ 9.495332] ------------[ cut here ]------------
[ 9.495335] WARNING: at fs/nfs/dir.c:464 nfs_readdir_page_filler+0x1ef/0x3eb()
[ 9.495336] Hardware name: Bochs
[ 9.495337] Modules linked in:
[ 9.495340] Pid: 655, comm: ls Tainted: G W 3.7.0-rc7+ #40
[ 9.495341] Call Trace:
[ 9.495345] [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
[ 9.495348] [<ffffffff8117deb4>] ? nfs_readdir_page_filler+0x1ef/0x3eb
[ 9.495351] [<ffffffff8117e6cc>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
[ 9.495354] [<ffffffff8117e755>] ? nfs_readdir_filler+0x1c/0x6b
[ 9.495358] [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
[ 9.495361] [<ffffffff8117e739>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
[ 9.495364] [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
[ 9.495368] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.495371] [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
[ 9.495373] [<ffffffff8117e8d1>] ? nfs_readdir+0x12d/0x435
[ 9.495377] [<ffffffff8118e69b>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
[ 9.495380] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.495383] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.495387] [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
[ 9.495390] [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
[ 9.495393] [<ffffffff814ac829>] ? system_call_fastpath+0x16/0x1b
[ 9.495395] ---[ end trace e502c5d56c594e86 ]---
[ 9.495396] FH(0)]
[ 9.495397] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 85 00 00 40 d6 39 e0 7d]
[ 9.495412] filename: dev
[ 9.495422] entry->fh->size is 0 on: sys
[ 9.495423] ------------[ cut here ]------------
[ 9.495426] WARNING: at fs/nfs/dir.c:464 nfs_readdir_page_filler+0x1ef/0x3eb()
[ 9.495427] Hardware name: Bochs
[ 9.495428] Modules linked in:
[ 9.495430] Pid: 655, comm: ls Tainted: G W 3.7.0-rc7+ #40
[ 9.495431] Call Trace:
[ 9.495435] [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
[ 9.495438] [<ffffffff8117deb4>] ? nfs_readdir_page_filler+0x1ef/0x3eb
[ 9.495441] [<ffffffff8117e6cc>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
[ 9.495444] [<ffffffff8117e755>] ? nfs_readdir_filler+0x1c/0x6b
[ 9.495448] [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
[ 9.495450] [<ffffffff8117e739>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
[ 9.495454] [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
[ 9.495457] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.495460] [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
[ 9.495463] [<ffffffff8117e8d1>] ? nfs_readdir+0x12d/0x435
[ 9.495466] [<ffffffff8118e69b>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
[ 9.495470] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.495473] [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[ 9.495476] [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
[ 9.495479] [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
[ 9.495482] [<ffffffff814ac829>] ? system_call_fastpath+0x16/0x1b
[ 9.495484] ---[ end trace e502c5d56c594e87 ]---
[ 9.495485] FH(0)]
[ 9.495486] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a3 0e 00 40 42 57 d3 90]
[ 9.495511] filename: sys

2012-11-30 02:34:02

by Patrick McLean

Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On 29/11/12 06:00 PM, Al Viro wrote:
> On Thu, Nov 29, 2012 at 05:54:02PM -0800, Patrick McLean wrote:
>>> Very interesting. Do you have anything mounted on the corresponding
>>> directories on server? The picture looks like you are getting empty
>>> fhandles in readdir+ responses for exactly the same directories that happen
>>> to be mountpoints on client. In any case, we shouldn't do that blind
>>> d_drop() - empty fhandles can happen. The only remaining question is
>>> why do they happen on that set of entries. From my reading of
>>> encode_entryplus_baggage() it looks like we have compose_entry_fh()
>>> failing for those entries and those entries alone. One possible cause
>>> would be d_mountpoint(dchild) being true on server. If it is true, we
>>> can declare the case closed; if not, I really wonder what's going on.
>>
>> Those directories do have the server's own copies of the said directories bind mounted at the moment in a separate mount namespace.
>>
>> Unmounting those directories on the server does appear to stop the WARN_ON from triggering.
>
> OK, that settles it. WARN_ON() and printks in the area can be dropped;
> the right fix is below. However, there's a similar place in cifs that
> also needs to be dealt with and I really, really wonder why the hell do
> we do d_drop() in nfs_revalidate_lookup(). It's not relevant in this
> bug, but I would like to understand what's wrong with simply returning
> 0 from ->d_revalidate() and letting the caller (in fs/namei.c) take care
> of unhashing, etc. itself. Would make have_submounts() in there pointless
> as well - we could just return 0 and let d_invalidate() take care of the
> checks... Trond?
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -450,7 +450,8 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
>  			nfs_refresh_inode(dentry->d_inode, entry->fattr);
>  			goto out;
>  		} else {
> -			d_drop(dentry);
> +			if (d_invalidate(dentry) != 0)
> +				goto out;
>  			dput(dentry);
>  		}
>  	}

Excellent, thanks. Is there any chance this will make it to 3.7? Also we might want to cc stable@ on this since it is a regression in 3.6.

2012-11-30 01:36:31

by Al Viro

Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On Thu, Nov 29, 2012 at 04:57:19PM -0800, Patrick McLean wrote:
> > Interesting... Server-side that should've been produced by
> > encode_entryplus_baggage(), which looks like failing compose_entry_fh()...
> > which has explicit
> > 	if (d_mountpoint(dchild))
> > 		goto out;
> > resulting in ENOENT on everything that's overmounted on server.
> >
> > Do you, by any chance, have the server really exporting its own root
> > filesystem? Another thing to check: have nfs_prime_dcache() print
> > filename.name of everything that fails nfs_same_entry() and has
> > zero entry->fh->size, regardless of d_invalidate() results.
>
> The server is running 3.6.6 and is just exporting a subdir of an xfs filesystem (which does not happen to be the root filesystem).
>
> The client is running as a KVM guest on the machine that is serving the NFS. I am reproducing this by booting the guest via an initramfs, and doing
> "ls /" at in single user mode.
>
> I added a check that prints the filename.name of everything that fails nfs_same_file, and it appears to just be triggered by the same filesystems that
> are triggering the WARN_ON, the relevant dmesg is below.

[the same /dev, /proc and /sys]

Very interesting. Do you have anything mounted on the corresponding
directories on server? The picture looks like you are getting empty
fhandles in readdir+ responses for exactly the same directories that happen
to be mountpoints on client. In any case, we shouldn't do that blind
d_drop() - empty fhandles can happen. The only remaining question is
why do they happen on that set of entries. From my reading of
encode_entryplus_baggage() it looks like we have compose_entry_fh()
failing for those entries and those entries alone. One possible cause
would be d_mountpoint(dchild) being true on server. If it is true, we
can declare the case closed; if not, I really wonder what's going on.

Note that if the same fs is mounted elsewhere, d_mountpoint() would mean
that something is mounted on top of that directory in _some_ instance;
not necessarily the exported one. Can you slap printks on fs/nfsd/nfs3xdr.c
compose_entry_fh() failure exits and see which one triggers server-side?
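
Something along these lines would show which exit fires (a hypothetical sketch: the placement and the name/namlen arguments are assumed from the general shape of compose_entry_fh(), not quoted from it):

	/* sketch: fs/nfsd/nfs3xdr.c, compose_entry_fh() failure exits */
	if (d_mountpoint(dchild)) {
		printk(KERN_INFO "compose_entry_fh: %.*s is a mountpoint on the server\n",
		       namlen, name);
		goto out;
	}
	if (dchild->d_inode == NULL) {
		printk(KERN_INFO "compose_entry_fh: %.*s is a negative dentry\n",
		       namlen, name);
		goto out;
	}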

2012-11-30 13:58:20

by Myklebust, Trond

Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On Fri, 2012-11-30 at 02:00 +0000, Al Viro wrote:
> On Thu, Nov 29, 2012 at 05:54:02PM -0800, Patrick McLean wrote:
> > > 	Very interesting.  Do you have anything mounted on the corresponding
> > > directories on server?  The picture looks like you are getting empty
> > > fhandles in readdir+ responses for exactly the same directories that happen
> > > to be mountpoints on client.  In any case, we shouldn't do that blind
> > > d_drop() - empty fhandles can happen.  The only remaining question is
> > > why do they happen on that set of entries.  From my reading of
> > > encode_entryplus_baggage() it looks like we have compose_entry_fh()
> > > failing for those entries and those entries alone.  One possible cause
> > > would be d_mountpoint(dchild) being true on server.  If it is true, we
> > > can declare the case closed; if not, I really wonder what's going on.
> >
> > Those directories do have the server's own copies of the said directories bind mounted at the moment in a separate mount namespace.
> >
> > Unmounting those directories on the server does appear to stop the WARN_ON from triggering.
>
> OK, that settles it.  WARN_ON() and printks in the area can be dropped;
> the right fix is below.  However, there's a similar place in cifs that
> also needs to be dealt with and I really, really wonder why the hell do
> we do d_drop() in nfs_revalidate_lookup().  It's not relevant in this
> bug, but I would like to understand what's wrong with simply returning
> 0 from ->d_revalidate() and letting the caller (in fs/namei.c) take care
> of unhashing, etc. itself.  Would make have_submounts() in there pointless
> as well - we could just return 0 and let d_invalidate() take care of the
> checks...  Trond?

The reason for the choice of d_drop over d_invalidate() is the d_count
checks. It really doesn't matter whether or not the client thinks it has
users for a directory if the server is telling you that it is ESTALE. So
we force a d_drop to prevent further lookups from finding it.

IOW: It is there in order to fix the case where the user does
'rmdir("foo"); mkdir("foo")' on the server.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

2012-12-01 02:39:17

by Simon Kirby

Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On Fri, Nov 30, 2012 at 02:00:48AM +0000, Al Viro wrote:

> OK, that settles it. WARN_ON() and printks in the area can be dropped;
> the right fix is below. However, there's a similar place in cifs that
> also needs to be dealt with and I really, really wonder why the hell do
> we do d_drop() in nfs_revalidate_lookup(). It's not relevant in this
> bug, but I would like to understand what's wrong with simply returning
> 0 from ->d_revalidate() and letting the caller (in fs/namei.c) take care
> of unhashing, etc. itself. Would make have_submounts() in there pointless
> as well - we could just return 0 and let d_invalidate() take care of the
> checks... Trond?
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -450,7 +450,8 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
>  			nfs_refresh_inode(dentry->d_inode, entry->fattr);
>  			goto out;
>  		} else {
> -			d_drop(dentry);
> +			if (d_invalidate(dentry) != 0)
> +				goto out;
>  			dput(dentry);
>  		}
>  	}

Hello,

With your previous patch (with the WARN_ON), I hit the WARN_ON() in the
test case described here: https://patchwork.kernel.org/patch/1446851/.
The __d_move()ing mountpoint case no longer hits, and there is no longer
an EBUSY, so this seems to work for me (in 3.6, where it broke).

Simon-

2012-11-30 01:54:05

by Patrick McLean

Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On 29/11/12 05:36 PM, Al Viro wrote:
> On Thu, Nov 29, 2012 at 04:57:19PM -0800, Patrick McLean wrote:
>>> Interesting... Server-side that should've been produced by
>>> encode_entryplus_baggage(), which looks like failing compose_entry_fh()...
>>> which has explicit
>>> 	if (d_mountpoint(dchild))
>>> 		goto out;
>>> resulting in ENOENT on everything that's overmounted on server.
>>>
>>> Do you, by any chance, have the server really exporting its own root
>>> filesystem? Another thing to check: have nfs_prime_dcache() print
>>> filename.name of everything that fails nfs_same_entry() and has
>>> zero entry->fh->size, regardless of d_invalidate() results.
>>
>> The server is running 3.6.6 and is just exporting a subdir of an xfs filesystem (which does not happen to be the root filesystem).
>>
>> The client is running as a KVM guest on the machine that is serving the NFS. I am reproducing this by booting the guest via an initramfs, and doing
>> "ls /" at in single user mode.
>>
>> I added a check that prints the filename.name of everything that fails nfs_same_file, and it appears to just be triggered by the same filesystems that
>> are triggering the WARN_ON, the relevant dmesg is below.
>
> [the same /dev, /proc and /sys]
>
> Very interesting. Do you have anything mounted on the corresponding
> directories on server? The picture looks like you are getting empty
> fhandles in readdir+ responses for exactly the same directories that happen
> to be mountpoints on client. In any case, we shouldn't do that blind
> d_drop() - empty fhandles can happen. The only remaining question is
> why do they happen on that set of entries. From my reading of
> encode_entryplus_baggage() it looks like we have compose_entry_fh()
> failing for those entries and those entries alone. One possible cause
> would be d_mountpoint(dchild) being true on server. If it is true, we
> can declare the case closed; if not, I really wonder what's going on.

Those directories do have the server's own copies of the said directories bind mounted at the moment in a separate mount namespace.

Unmounting those directories on the server does appear to stop the WARN_ON from triggering.

> Note that if the same fs is mounted elsewhere, d_mountpoint() would mean
> that something is mounted on top of that directory in _some_ instance;
> not necessarily the exported one. Can you slap printks on fs/nfsd/nfs3xdr.c
> compose_entry_fh() failure exits and see which one triggers server-side?

2012-12-01 21:40:06

by Al Viro

Subject: Re: Regression with initramfs and nfsroot (appears to be in the dcache)

On Fri, Nov 30, 2012 at 01:58:18PM +0000, Myklebust, Trond wrote:

> The reason for the choice of d_drop over d_invalidate() is the d_count
> checks. It really doesn't matter whether or not the client thinks it has
> users for a directory if the server is telling you that it is ESTALE. So
> we force a d_drop to prevent further lookups from finding it.
>
> IOW: It is there in order to fix the case where the user does
> 'rmdir("foo"); mkdir("foo")' on the server.

You do realize that your have_submounts() check in there is inherently
racy, right?
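
(For readers following the thread: the check in question is, roughly, the out_bad path of the NFS ->d_revalidate() implementation in fs/nfs/dir.c of that era. A simplified sketch, not a verbatim copy:)

out_bad:
	nfs_mark_for_revalidate(dir);
	if (inode && S_ISDIR(inode->i_mode)) {
		/* don't unhash a directory that has submounts... */
		if (have_submounts(dentry))
			goto out_valid;
		shrink_dcache_parent(dentry);
	}
	d_drop(dentry);	/* ...but a mount can appear between the
			 * have_submounts() check and this d_drop() */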