2007-07-13 02:44:21

by J. Bruce Fields

[permalink] [raw]
Subject: [PATCH] nfsd: fix possible read-ahead cache and export table corruption

From: J. Bruce Fields <[email protected]>

The value of nperbucket calculated here is too small--we should be
rounding up instead of down--with the result that the index j in the
following loop can overflow the raparm_hash array. At least in my case,
the next thing in memory turns out to be export_table, so the symptoms I
see are crashes in cache_put caused by the appearance of four zeroed-out
export entries in the first bucket of the hash table of exports (which
were actually entries in the readahead cache, a pointer to which had
been written to the export table in this initialization code).

It looks like the bug was probably introduced with commit
fce1456a19f5c08b688c29f00ef90fdfa074c79b ("knfsd: make the readahead
params cache SMP-friendly").

Cc: Greg Banks <[email protected]>
Signed-off-by: "J. Bruce Fields" <[email protected]>
---
fs/nfsd/vfs.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

OK, that was painful. I need a less dumb debugging method than "test,
stare at the code for a long time, insert 20 more printk's, test
again...".

Anyway, it might explain a few other reports of export table corruption
that I hadn't previously been able to reproduce.

The code could be a little more straightforward here. Could we just
round up cache_size to a multiple of RAPARM_HASH_SIZE, make all the hash
chains the same length, and just use them as arrays? The p_next
pointers seem like an unnecessary complication.

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 7e50da0..dd3604e 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1885,7 +1885,7 @@ nfsd_racache_init(int cache_size)
raparm_hash[i].pb_head = NULL;
spin_lock_init(&raparm_hash[i].pb_lock);
}
- nperbucket = cache_size >> RAPARM_HASH_BITS;
+ nperbucket = DIV_ROUND_UP(cache_size, RAPARM_HASH_SIZE);
for (i = 0; i < cache_size - 1; i++) {
if (i % nperbucket == 0)
raparm_hash[j++].pb_head = raparml + i;
--
1.5.3.rc0.63.gc956


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2007-07-24 14:40:43

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] nfsd: fix possible read-ahead cache and export table corruption

On Fri, Jul 13, 2007 at 01:46:25PM -0400, J. Bruce Fields wrote:
> On Fri, Jul 13, 2007 at 02:10:19PM +1000, Greg Banks wrote:
> > There's a couple of other problems with that code, for which I'll send
> > patches shortly. But I think the raparams cache needs a more thorough
> > renovation than that. I have a mostly-working patch which rips it out
> > entirely and replaces it with a cache of open struct files, an idea
> > of Neil's. This approach fixes a whole bunch of separate problems
> > at once, and will be really useful...
>
> OK! What are the other problems you were thinking of?

By the way, one other problem with keeping open files that I was
reminded of recently--currently you should be able to do something like

killall rpc.mountd
# flush export table:
exportfs -f
umount /exports
... do stuff
mount /exports
rpc.mountd

Also useful if you want to unmount a filesystem on one machine so you
can move it to another and mount it there. But that's only possible
because no-one holds a file open for longer than the time required to
service a single request.

Well, no-one except lockd--hence Wendy's patches to drop all locks on a
filesystem. Or the nfsv4 server, which is holding files open as long as
clients have them open. Oops, my fault. I'm not sure quite what to do
about that.

--b.


2007-07-13 04:10:20

by Greg Banks

[permalink] [raw]
Subject: Re: [PATCH] nfsd: fix possible read-ahead cache and export table corruption

On Thu, Jul 12, 2007 at 10:44:20PM -0400, J. Bruce Fields wrote:
> From: J. Bruce Fields <[email protected]>
>
> The value of nperbucket calculated here is too small--we should be
> [...]
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 7e50da0..dd3604e 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1885,7 +1885,7 @@ nfsd_racache_init(int cache_size)
> raparm_hash[i].pb_head = NULL;
> spin_lock_init(&raparm_hash[i].pb_lock);
> }
> - nperbucket = cache_size >> RAPARM_HASH_BITS;
> + nperbucket = DIV_ROUND_UP(cache_size, RAPARM_HASH_SIZE);
> for (i = 0; i < cache_size - 1; i++) {
> if (i % nperbucket == 0)
> raparm_hash[j++].pb_head = raparml + i;
> --
> 1.5.3.rc0.63.gc956

My bad. The code I added really only works properly if the number
of nfsd threads at startup is a multiple of 8, so that cache_size
is a multiple of 16 and the shift doesn't lose any nonzero bits.
As we use 128 threads in our product, I'd never noticed the bug :-(
Sorry for the trouble, Bruce.

There's a couple of other problems with that code, for which I'll send
patches shortly. But I think the raparams cache needs a more thorough
renovation than that. I have a mostly-working patch which rips it out
entirely and replaces it with a cache of open struct files, an idea
of Neil's. This approach fixes a whole bunch of separate problems
at once, and will be really useful...if I can ever find enough time
between super-urgent dramas to work on it ;-)

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere. Which MPHG character are you?
I don't speak for SGI.


2007-07-13 17:46:27

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] nfsd: fix possible read-ahead cache and export table corruption

On Fri, Jul 13, 2007 at 02:10:19PM +1000, Greg Banks wrote:
> My bad. The code I added really only works properly if the number
> of nfsd threads at startup is a multiple of 8, so that cache_size
> is a multiple of 16 and the shift doesn't lose any nonzero bits.
> As we use 128 threads in our product, I'd never noticed the bug :-(

It seems likely the distros all default to powers of 2, so that may make
it a little less likely to explain other bug reports I've seen. Oh
well. I'm not sure why I had this one test machine set to start 50
threads.

> There's a couple of other problems with that code, for which I'll send
> patches shortly. But I think the raparams cache needs a more thorough
> renovation than that. I have a mostly-working patch which rips it out
> entirely and replaces it with a cache of open struct files, an idea
> of Neil's. This approach fixes a whole bunch of separate problems
> at once, and will be really useful...

OK! What are the other problems you were thinking of?

--b.
