2011-03-16 04:55:38

by NeilBrown

Subject: Use of READDIRPLUS on large directories


Hi Trond / Bryan et al.

Now that openSUSE 11.4 is out I have started getting a few reports
of regressions that can be traced to

commit 0715dc632a271fc0fedf3ef4779fe28ac1e53ef4
Author: Bryan Schumaker <[email protected]>
Date: Fri Sep 24 18:50:01 2010 -0400

NFS: remove readdir plus limit

We will now use readdir plus even on directories that are very large.

Signed-off-by: Bryan Schumaker <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>


This particularly affects users with their home directory over
NFS, and with largish maildir mail folders.

Where it used to take a smallish number of seconds for (e.g.)
xbiff to start up and read through the various directories, it now
takes multiple minutes.

I can confirm that the slowdown is due to readdirplus by mounting the
filesystem with nordirplus.


While I can understand that there are sometimes benefits in using
readdirplus for very large directories, there are also obviously real
costs. So I think we have to see this patch as a regression that should
be reverted.


It would quite possibly make sense to create a tunable (mount option or
sysctl I guess) to set the max size for directories to use readdirplus,
but I think it really should be an opt-in situation.

[[ It would also be really nice if the change-log for such a significant
change contained a little more justification.... :-( ]]

Thoughts?

Thanks,
NeilBrown


2011-03-16 14:14:32

by Anna Schumaker

Subject: Re: Use of READDIRPLUS on large directories

I guess I misunderstood which test results you wanted published? I know I included numbers on one of the patches (commit 82f2e5472e2304e531c2fa85e457f4a71070044e, copied below)... I'll find the numbers you're asking about and post them.

-Bryan

commit 82f2e5472e2304e531c2fa85e457f4a71070044e
Author: Bryan Schumaker <[email protected]>
Date: Thu Oct 21 16:33:18 2010 -0400

NFS: Readdir plus in v4

By requesting more attributes during a readdir, we can mimic the readdir plus
operation that was in NFSv3.

To test, I ran the command `ls -lU --color=none` on directories with various
numbers of files. Without readdir plus, I see this:

n files | 100 | 1,000 | 10,000 | 100,000 | 1,000,000
--------+-----------+-----------+-----------+-----------+----------
real | 0m00.153s | 0m00.589s | 0m05.601s | 0m56.691s | 9m59.128s
user | 0m00.007s | 0m00.007s | 0m00.077s | 0m00.703s | 0m06.800s
sys | 0m00.010s | 0m00.070s | 0m00.633s | 0m06.423s | 1m10.005s
access | 3 | 1 | 1 | 4 | 31
getattr | 2 | 1 | 1 | 1 | 1
lookup | 104 | 1,003 | 10,003 | 100,003 | 1,000,003
readdir | 2 | 16 | 158 | 1,575 | 15,749
total | 111 | 1,021 | 10,163 | 101,583 | 1,015,784

With readdir plus enabled, I see this:

n files | 100 | 1,000 | 10,000 | 100,000 | 1,000,000
--------+-----------+-----------+-----------+-----------+----------
real | 0m00.115s | 0m00.206s | 0m01.079s | 0m12.521s | 2m07.528s
user | 0m00.003s | 0m00.003s | 0m00.040s | 0m00.290s | 0m03.296s
sys | 0m00.007s | 0m00.020s | 0m00.120s | 0m01.357s | 0m17.556s
access | 3 | 1 | 1 | 1 | 7
getattr | 2 | 1 | 1 | 1 | 1
lookup | 4 | 3 | 3 | 3 | 3
readdir | 6 | 62 | 630 | 6,300 | 62,993
total | 15 | 67 | 635 | 6,305 | 63,004

Readdir plus disabled has about a 16x increase in the number of RPC calls and
is 4-5 times slower on large directories.

Signed-off-by: Bryan Schumaker <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>



On 03/16/2011 09:43 AM, Chuck Lever wrote:
>
> On Mar 16, 2011, at 12:55 AM, NeilBrown wrote:
>
>> Hi Trond / Bryan et al.
>>
>> Now that openSUSE 11.4 is out I have started getting a few reports
>> of regressions that can be traced to
>>
>> commit 0715dc632a271fc0fedf3ef4779fe28ac1e53ef4
>> Author: Bryan Schumaker <[email protected]>
>> Date: Fri Sep 24 18:50:01 2010 -0400
>>
>> NFS: remove readdir plus limit
>>
>> We will now use readdir plus even on directories that are very large.
>>
>> Signed-off-by: Bryan Schumaker <[email protected]>
>> Signed-off-by: Trond Myklebust <[email protected]>
>>
>>
>> This particularly affects users with their home directory over
>> NFS, and with largish maildir mail folders.
>>
>> Where it used to take a smallish number of seconds for (e.g.)
>> xbiff to start up and read through the various directories, it now
>> takes multiple minutes.
>>
>> I can confirm that the slow down is due to readdirplus by mounting the
>> filesystem with nordirplus.
>
> Back in the dark ages, I discovered that this kind of slowdown was often the result of server slowness. The problem is that a simple readdir is often a sequential read from physical media. When you include attribute information, the server has to pick up the inodes, which is a series of small random reads. It could cause each readdir request to become slower by a factor of 10. This is a problem on NFS servers where the inode cache is turning over often (small home directory servers, for instance).
>
> In addition, as more information per file is delivered by READDIRPLUS, each request can hold fewer entries, so more requests and more packets are needed to read a directory. We hold the request count down now by allowing multi-page directory reads, if the server supports it.
>
> In any event, applications will see this slow down immediately, but it can also be a significant scalability problem for servers.
>
>> While I can understand that there are sometimes benefits in using
>> readdirplus for very large directories, there are also obviously real
>> costs. So I think we have to see this patch as a regression that should
>> be reverted.
>
> It would be useful to understand what it is about these workloads that is causing slow downs. Is it simply the size of the directory? Or is there a bug on the server or client that is causing the issue? Is it a problem only on certain servers or with certain configurations?
>
>> It would quite possibly make sense to create a tunable (mount option or
>> sysctl I guess) to set the max size for directories to use readdirplus,
>> but I think it really should be an opt-in situation.
>
> Giving users another knob usually results in higher support costs and confused users. ;-)
>
>> [[ It would also be really nice if the change-log for such a significant
>> change contained a little more justification.... :-( ]]
>
> I had asked, before this series was included in upstream, for some tests to discover where the knee of the performance curve between readdir and readdirplus was. Bryan, can you publish the results of those tests? I had hoped the test results would appear in the patch description to help justify this change.
>


2011-03-16 13:50:06

by Myklebust, Trond

Subject: RE: Use of READDIRPLUS on large directories

On Wed, 2011-03-16 at 08:30 -0400, [email protected] wrote:
> Perhaps the use of a heuristic that enables readdirplus only after the application has shown that it is interested in the attributes for each entry in the directory? Thus, if the application does readdir()/stat()/stat()/stat()/readdir()/... then the NFS client could use readdirplus to fill the caches. If the application is just reading the directory and looking at the names, then the client could just use readdir.
>
> ps

Yes, possibly.

The thing that convinced me that we should get rid of the limit was when
Bryan was testing directories with 10^6 entries, and was seeing an order
of magnitude improvement when comparing readdirplus vs. readdir on 'ls
-l' workloads. I wish he had published the actual numbers in the
changelog.

As I recall, the slowdown when comparing readdirplus vs readdir on 'ls'
workloads was far less.

You can easily test that yourself, using the "-o nordirplus" mount option
to turn off readdirplus (which, btw, remains a workaround for people who
don't care about 'ls -l' workloads).

Cheers
Trond

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of NeilBrown
> Sent: Wednesday, March 16, 2011 12:55 AM
> To: Trond Myklebust; Bryan Schumaker
> Cc: [email protected]
> Subject: Use of READDIRPLUS on large directories
>
>
> Hi Trond / Bryan et al.
>
> Now that openSUSE 11.4 is out I have started getting a few reports
> of regressions that can be traced to
>
> commit 0715dc632a271fc0fedf3ef4779fe28ac1e53ef4
> Author: Bryan Schumaker <[email protected]>
> Date: Fri Sep 24 18:50:01 2010 -0400
>
> NFS: remove readdir plus limit
>
> We will now use readdir plus even on directories that are very large.
>
> Signed-off-by: Bryan Schumaker <[email protected]>
> Signed-off-by: Trond Myklebust <[email protected]>
>
>
> This particularly affects users with their home directory over
> NFS, and with largish maildir mail folders.
>
> Where it used to take a smallish number of seconds for (e.g.)
> xbiff to start up and read through the various directories, it now
> takes multiple minutes.
>
> I can confirm that the slow down is due to readdirplus by mounting the
> filesystem with nordirplus.
>
>
> While I can understand that there are sometimes benefits in using
> readdirplus for very large directories, there are also obviously real
> costs. So I think we have to see this patch as a regression that should
> be reverted.
>
>
> It would quite possibly make sense to create a tunable (mount option or
> sysctl I guess) to set the max size for directories to use readdirplus,
> but I think it really should be an opt-in situation.
>
> [[ It would also be really nice if the change-log for such a significant
> change contained a little more justification.... :-( ]]
>
> Thoughts?
>
> Thanks,
> NeilBrown
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-03-16 21:40:47

by NeilBrown

Subject: Re: Use of READDIRPLUS on large directories

On Wed, 16 Mar 2011 08:30:20 -0400 <[email protected]> wrote:

> Perhaps the use of a heuristic that enables readdirplus only after the application has shown that it is interested in the attributes for each entry in the directory? Thus, if the application does readdir()/stat()/stat()/stat()/readdir()/... then the NFS client could use readdirplus to fill the caches. If the application is just reading the directory and looking at the names, then the client could just use readdir.

I think this could work very well.
"ls -l" certainly calls 'stat' on each file after each 'getdents' call.

So we could arrange that the first readdir call on a directory
always uses the 'plus' version, and clears a "seen any getattr calls"
flag on the directory.

nfs_getattr then sets that flag on the parent.

Subsequent readdir calls only use 'plus' if the flag was set, and
clear the flag again.


There might be odd issues with multiple processes reading and stating
in the same directory, but they probably aren't very serious.

I might give this idea a try ... but I still think the original
switch to always use readdirplus is a regression and should be reverted.

Thanks,
NeilBrown


2011-03-16 12:30:45

by peter.staubach

Subject: RE: Use of READDIRPLUS on large directories

Perhaps the use of a heuristic that enables readdirplus only after the application has shown that it is interested in the attributes for each entry in the directory? Thus, if the application does readdir()/stat()/stat()/stat()/readdir()/... then the NFS client could use readdirplus to fill the caches. If the application is just reading the directory and looking at the names, then the client could just use readdir.

ps


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of NeilBrown
Sent: Wednesday, March 16, 2011 12:55 AM
To: Trond Myklebust; Bryan Schumaker
Cc: [email protected]
Subject: Use of READDIRPLUS on large directories


Hi Trond / Bryan et al.

Now that openSUSE 11.4 is out I have started getting a few reports
of regressions that can be traced to

commit 0715dc632a271fc0fedf3ef4779fe28ac1e53ef4
Author: Bryan Schumaker <[email protected]>
Date: Fri Sep 24 18:50:01 2010 -0400

NFS: remove readdir plus limit

We will now use readdir plus even on directories that are very large.

Signed-off-by: Bryan Schumaker <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>


This particularly affects users with their home directory over
NFS, and with largish maildir mail folders.

Where it used to take a smallish number of seconds for (e.g.)
xbiff to start up and read through the various directories, it now
takes multiple minutes.

I can confirm that the slow down is due to readdirplus by mounting the
filesystem with nordirplus.


While I can understand that there are sometimes benefits in using
readdirplus for very large directories, there are also obviously real
costs. So I think we have to see this patch as a regression that should
be reverted.


It would quite possibly make sense to create a tunable (mount option or
sysctl I guess) to set the max size for directories to use readdirplus,
but I think it really should be an opt-in situation.

[[ It would also be really nice if the change-log for such a significant
change contained a little more justification.... :-( ]]

Thoughts?

Thanks,
NeilBrown

2011-03-16 21:42:36

by Myklebust, Trond

Subject: Re: Use of READDIRPLUS on large directories

On Thu, 2011-03-17 at 08:30 +1100, NeilBrown wrote:
> On Wed, 16 Mar 2011 10:20:03 -0400 Trond Myklebust
> <[email protected]> wrote:
>
> > On Wed, 2011-03-16 at 10:14 -0400, Bryan Schumaker wrote:
> > > I guess I misunderstood what to publish test results for? I know I included numbers on one of the patches (commit 82f2e5472e2304e531c2fa85e457f4a71070044e, copied below)... I'll find the numbers you're asking about and post them.
> > >
> > > -Bryan
> > >
> > > commit 82f2e5472e2304e531c2fa85e457f4a71070044e
> > > Author: Bryan Schumaker <[email protected]>
> > > Date: Thu Oct 21 16:33:18 2010 -0400
> > >
> > > NFS: Readdir plus in v4
> > >
> > > By requesting more attributes during a readdir, we can mimic the readdir plus
> > > operation that was in NFSv3.
> > >
> > > To test, I ran the command `ls -lU --color=none` on directories with various
> > > numbers of files. Without readdir plus, I see this:
> > >
> > > n files | 100 | 1,000 | 10,000 | 100,000 | 1,000,000
> > > --------+-----------+-----------+-----------+-----------+----------
> > > real | 0m00.153s | 0m00.589s | 0m05.601s | 0m56.691s | 9m59.128s
> > > user | 0m00.007s | 0m00.007s | 0m00.077s | 0m00.703s | 0m06.800s
> > > sys | 0m00.010s | 0m00.070s | 0m00.633s | 0m06.423s | 1m10.005s
> > > access | 3 | 1 | 1 | 4 | 31
> > > getattr | 2 | 1 | 1 | 1 | 1
> > > lookup | 104 | 1,003 | 10,003 | 100,003 | 1,000,003
> > > readdir | 2 | 16 | 158 | 1,575 | 15,749
> > > total | 111 | 1,021 | 10,163 | 101,583 | 1,015,784
> > >
> > > With readdir plus enabled, I see this:
> > >
> > > n files | 100 | 1,000 | 10,000 | 100,000 | 1,000,000
> > > --------+-----------+-----------+-----------+-----------+----------
> > > real | 0m00.115s | 0m00.206s | 0m01.079s | 0m12.521s | 2m07.528s
> > > user | 0m00.003s | 0m00.003s | 0m00.040s | 0m00.290s | 0m03.296s
> > > sys | 0m00.007s | 0m00.020s | 0m00.120s | 0m01.357s | 0m17.556s
> > > access | 3 | 1 | 1 | 1 | 7
> > > getattr | 2 | 1 | 1 | 1 | 1
> > > lookup | 4 | 3 | 3 | 3 | 3
> > > readdir | 6 | 62 | 630 | 6,300 | 62,993
> > > total | 15 | 67 | 635 | 6,305 | 63,004
> > >
> > > Readdir plus disabled has about a 16x increase in the number of RPC calls and
> > > is 4-5 times slower on large directories.
> >
> > Right. Those are the numbers that convinced me...
> >
> >
>
> Lies, Damn Lies, and ......
>
>
> While these are impressive numbers, they only tell half the story.
>
> If a change makes one common operation 4 times faster, and another common
> operation 10 times slower, is it a good change? Or even an acceptable change?
>
> (The "10 times" is not a definite statistic - it is a guess based on
> a low-detail report)
>
> So it is obvious that there is sometimes value in using readdirplus,
> it is equally obvious that there is sometimes a cost.
>
> Switching the default from "not paying the cost when it is big" to
> "always paying the cost" is wrong.

That's what the nordirplus mount flag is for. Keeping an arbitrary limit
in the face of evidence that it is hurting is equally wrong.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-03-18 04:27:19

by NeilBrown

Subject: Re: Use of READDIRPLUS on large directories

On Thu, 17 Mar 2011 13:44:53 -0400 "J. Bruce Fields" <[email protected]>
wrote:

> On Thu, Mar 17, 2011 at 11:55:22AM +1100, NeilBrown wrote:
> > Strangely, when I try NFSv4 I don't get what I would expect.
> >
> > "ls" on an unpatched 2.6.38 takes over 5 seconds rather than around 4.
> > With the patch it does back down to about 2. (still NFSv3 at 1.5).
> > Why would NFSv4 be slower?
> > On v3 we make 44 READDIRPLUS calls and 284 READDIR calls - total of 328
> > READDIRPLUS have about 30 names, READDIR have about 100
> > On v4 we make 633 READDIR calls - nearly double.
> > Early packets contain about 19 names, later ones about 70
> >
> > Is nfsd (2.6.32) just not packing enough answers in the reply?
> > Client asks for a dircount of 16384 and a maxcount of 32768, and gets
> > packets which are about 4K long - I guess that is PAGE_SIZE ??
>
> From nfsd4_encode_readdir():
>
> 	maxcount = PAGE_SIZE;
> 	if (maxcount > readdir->rd_maxcount)
> 		maxcount = readdir->rd_maxcount;
>
> Unfortunately, I don't think the xdr encoding is equipped to deal with
> page boundaries. It should be.

Bah humbug.
NFSv3 gets it right - it just encodes into the next page and then copies back.

Sounds like a simple afternoon's project .... now if only we could find
someone with a simple afternoon :-)
Getting a realistic upper limit on the size of the reply (which is more
variable for v4 than for v3) would be the only tricky bit.
Though nfsd4_encode_fattr looks fairly idempotent, so you could just
try to encode and if it doesn't fit:
allocate next page
encode into there
copy some into previous page
copy rest down.

NeilBrown

2011-03-16 22:40:28

by NeilBrown

Subject: Re: Use of READDIRPLUS on large directories

On Wed, 16 Mar 2011 17:42:35 -0400 Trond Myklebust
<[email protected]> wrote:


> > So it is obvious that there is sometimes value in using readdirplus,
> > it is equally obvious that there is sometimes a cost.
> >
> > Switching the default from "not paying the cost when it is big" to
> > "always paying the cost" is wrong.
>
> That's what the nordirplus mount flag is for. Keeping an arbitrary limit
> in the face of evidence that it is hurting is equally wrong.
>

If people didn't need 'nordirplus' previously to get acceptable
performance, and do need it now, then that is a regression.

NeilBrown

2011-03-17 00:55:33

by NeilBrown

Subject: Re: Use of READDIRPLUS on large directories

On Thu, 17 Mar 2011 08:40:38 +1100 NeilBrown <[email protected]> wrote:

> On Wed, 16 Mar 2011 08:30:20 -0400 <[email protected]> wrote:
>
> > Perhaps the use of a heuristic that enables readdirplus only after the application has shown that it is interested in the attributes for each entry in the directory? Thus, if the application does readdir()/stat()/stat()/stat()/readdir()/... then the NFS client could use readdirplus to fill the caches. If the application is just reading the directory and looking at the names, then the client could just use readdir.
>
> I think this could work very well.
> "ls -l" certainly calls 'stat' on each file after each 'getdents' call.
>
> So we could arrange that the first readdir call on a directory
> always uses the 'plus' version, and clears a "seen any getattr calls"
> flag on the directory.
>
> nfs_getattr then sets that flag on the parent
>
> subsequent readdir calls only use 'plus' if the flag was set, and
> clear the flag again.
>
>
> There might be odd issues with multiple processes reading and stating
> in the same directory, but they probably aren't very serious.
>
> I might give this idea a try ... but I still think the original
> switch to always use readdirplus is a regression and should be reverted.

I've been experimenting with this some more.
I've been using an otherwise unloaded NFS server (a 4-year-old consumer
Linux box) with ordinary drives, networking, etc.
Mounting with NFSv3 and default options (unless specified).

I created a directory with 30,000 small files.

echo 3 > /proc/sys/vm/drop_caches

on both server and client before running a test.
All timing runs on client (of course).

% time ls --color=never > /dev/null

This takes about 4 seconds on 2.6.38, using READDIRPLUS.

With the patch below applied it takes about 1.5 seconds.
The first 44 requests are READDIRPLUS, which provide the 1024
entries requested by the getdents64 call. The remaining
requests are READDIR. So on a big directory I get a factor-of-2 speed-up.
On a more loaded NFS server the real numbers might be bigger(??)

% time ls -l --color=never > /dev/null

This takes about 25 seconds when using READDIRPLUS, either with or
without the patch, as the same sequence of requests is sent.
With READDIR (using -o nordirplus) it takes about 40 seconds.

Much of the 25 seconds is due to GETACL requests.


So while this only provides a 2 second speed-up for me, it is a real
speed up.

The only cost I can find is that the sequence:
ls
ls -l
becomes slower. The "ls -l" doesn't perform any READDIR as the
directory listing is in cache. So that means we need 30,000 GETATTR
calls and 30,000 GETACL calls, which all take a while.

What do people think?


Strangely, when I try NFSv4 I don't get what I would expect.

"ls" on an unpatched 2.6.38 takes over 5 seconds rather than around 4.
With the patch it does back down to about 2. (still NFSv3 at 1.5).
Why would NFSv4 be slower?
On v3 we make 44 READDIRPLUS calls and 284 READDIR calls - total of 328
READDIRPLUS have about 30 names, READDIR have about 100
On v4 we make 633 READDIR calls - nearly double.
Early packets contain about 19 names, later ones about 70.

Is nfsd (2.6.32) just not packing enough answers in the reply?
Client asks for a dircount of 16384 and a maxcount of 32768, and gets
packets which are about 4K long - I guess that is PAGE_SIZE ??

"ls -l" still takes around 25 seconds - even though READDIR is asking
for and receiving all the 'plus' attributes, I see 30,000 "GETATTR" requests
for exactly the same set of attributes. Something is wrong there.

NeilBrown


From: NeilBrown <[email protected]>
Subject: Make selection of 'readdir-plus' adapt to usage patterns.

While the use of READDIRPLUS is significantly more efficient than
READDIR followed by many GETATTR calls, it is still less efficient
than just READDIR if the attributes are not required.

We can get a hint as to whether the application requires attr information
by looking at whether any ->getattr calls are made between
->readdir calls.
If there are any, then getting the attributes seems to be worth while.

This patch tracks whether there have been recent getattr calls on
children of a directory and uses that information to selectively
disable READDIRPLUS on that directory.

The first 'readdir' call is always served using READDIRPLUS.
Subsequent calls only use READDIRPLUS if there was a getattr on a child
in the meantime.

The locking of ->d_parent access needs to be reviewed.
As the bit is simply a hint, it isn't critical that it is set
on the "correct" parent if a rename is happening, but it is
critical that the 'set' doesn't set a bit in something that
isn't even an inode any more.

Signed-off-by: NeilBrown <[email protected]>

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 2c3eb33..6882e14 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -804,6 +804,9 @@ static int nfs_readdir(struct file *filp, void *dirent, filldir_t filldir)
desc->dir_cookie = &nfs_file_open_context(filp)->dir_cookie;
desc->decode = NFS_PROTO(inode)->decode_dirent;
desc->plus = NFS_USE_READDIRPLUS(inode);
+ if (filp->f_pos > 0 && !test_bit(NFS_INO_SEEN_GETATTR, &NFS_I(inode)->flags))
+ desc->plus = 0;
+ clear_bit(NFS_INO_SEEN_GETATTR, &NFS_I(inode)->flags);

nfs_block_sillyrename(dentry);
res = nfs_revalidate_mapping(inode, filp->f_mapping);
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 2f8e618..4cb17df 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -505,6 +505,15 @@ int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
struct inode *inode = dentry->d_inode;
int need_atime = NFS_I(inode)->cache_validity & NFS_INO_INVALID_ATIME;
int err;
+ struct dentry *p;
+ struct inode *pi;
+
+ rcu_read_lock();
+ p = dentry->d_parent;
+ pi = rcu_dereference(p)->d_inode;
+ if (pi && !test_bit(NFS_INO_SEEN_GETATTR, &NFS_I(pi)->flags))
+ set_bit(NFS_INO_SEEN_GETATTR, &NFS_I(pi)->flags);
+ rcu_read_unlock();

/* Flush out writes to the server in order to update c/mtime. */
if (S_ISREG(inode->i_mode)) {
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 6023efa..2a04ed5 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -219,6 +219,10 @@ struct nfs_inode {
#define NFS_INO_FSCACHE (5) /* inode can be cached by FS-Cache */
#define NFS_INO_FSCACHE_LOCK (6) /* FS-Cache cookie management lock */
#define NFS_INO_COMMIT (7) /* inode is committing unstable writes */
+#define NFS_INO_SEEN_GETATTR (8) /* flag to track if app is calling
+ * getattr in a directory during
+ * readdir
+ */

static inline struct nfs_inode *NFS_I(const struct inode *inode)
{

2011-03-17 17:44:58

by J. Bruce Fields

[permalink] [raw]
Subject: Re: Use of READDIRPLUS on large directories

On Thu, Mar 17, 2011 at 11:55:22AM +1100, NeilBrown wrote:
> Strangely, when I try NFSv4 I don't get what I would expect.
>
> "ls" on an unpatched 2.6.38 takes over 5 seconds rather than around 4.
> With the patch it does back down to about 2. (still NFSv3 at 1.5).
> Why would NFSv4 be slower?
> On v3 we make 44 READDIRPLUS calls and 284 READDIR calls - total of 328
> READDIRPLUS have about 30 names, READDIR have about 100
> On v4 we make 633 READDIR calls - nearly double.
> Early packets contain about 19 names, later ones about 70
>
> Is nfsd (2.6.32) just not packing enough answers in the reply?
> Client asks for a dircount of 16384 and a maxcount of 32768, and gets
> packets which are about 4K long - I guess that is PAGE_SIZE ??

From nfsd4_encode_readdir():

	maxcount = PAGE_SIZE;
	if (maxcount > readdir->rd_maxcount)
		maxcount = readdir->rd_maxcount;

Unfortunately, I don't think the xdr encoding is equipped to deal with
page boundaries. It should be.

--b.

2011-03-16 21:30:44

by NeilBrown

Subject: Re: Use of READDIRPLUS on large directories

On Wed, 16 Mar 2011 10:20:03 -0400 Trond Myklebust
<[email protected]> wrote:

> On Wed, 2011-03-16 at 10:14 -0400, Bryan Schumaker wrote:
> > I guess I misunderstood what to publish test results for? I know I included numbers on one of the patches (commit 82f2e5472e2304e531c2fa85e457f4a71070044e, copied below)... I'll find the numbers you're asking about and post them.
> >
> > -Bryan
> >
> > commit 82f2e5472e2304e531c2fa85e457f4a71070044e
> > Author: Bryan Schumaker <[email protected]>
> > Date: Thu Oct 21 16:33:18 2010 -0400
> >
> > NFS: Readdir plus in v4
> >
> > > By requesting more attributes during a readdir, we can mimic the readdir plus
> > operation that was in NFSv3.
> >
> > To test, I ran the command `ls -lU --color=none` on directories with various
> > numbers of files. Without readdir plus, I see this:
> >
> > n files | 100 | 1,000 | 10,000 | 100,000 | 1,000,000
> > --------+-----------+-----------+-----------+-----------+----------
> > real | 0m00.153s | 0m00.589s | 0m05.601s | 0m56.691s | 9m59.128s
> > user | 0m00.007s | 0m00.007s | 0m00.077s | 0m00.703s | 0m06.800s
> > sys | 0m00.010s | 0m00.070s | 0m00.633s | 0m06.423s | 1m10.005s
> > access | 3 | 1 | 1 | 4 | 31
> > getattr | 2 | 1 | 1 | 1 | 1
> > lookup | 104 | 1,003 | 10,003 | 100,003 | 1,000,003
> > readdir | 2 | 16 | 158 | 1,575 | 15,749
> > total | 111 | 1,021 | 10,163 | 101,583 | 1,015,784
> >
> > With readdir plus enabled, I see this:
> >
> > n files | 100 | 1,000 | 10,000 | 100,000 | 1,000,000
> > --------+-----------+-----------+-----------+-----------+----------
> > real | 0m00.115s | 0m00.206s | 0m01.079s | 0m12.521s | 2m07.528s
> > user | 0m00.003s | 0m00.003s | 0m00.040s | 0m00.290s | 0m03.296s
> > sys | 0m00.007s | 0m00.020s | 0m00.120s | 0m01.357s | 0m17.556s
> > access | 3 | 1 | 1 | 1 | 7
> > getattr | 2 | 1 | 1 | 1 | 1
> > lookup | 4 | 3 | 3 | 3 | 3
> > readdir | 6 | 62 | 630 | 6,300 | 62,993
> > total | 15 | 67 | 635 | 6,305 | 63,004
> >
> > Readdir plus disabled has about a 16x increase in the number of RPC calls and
> > is 4-5 times slower on large directories.
>
> Right. Those are the numbers that convinced me...
>
>

Lies, Damn Lies, and ......


While these are impressive numbers, they only tell half the story.

If a change makes one common operation 4 times faster, and another common
operation 10 times slower, is it a good change? Or even an acceptable change?

(The "10 times" is not a definite statistic - it is a guess based on
a low-detail report)

So it is obvious that there is sometimes value in using readdirplus,
it is equally obvious that there is sometimes a cost.

Switching the default from "not paying the cost when it is big" to
"always paying the cost" is wrong.


NeilBrown

2011-03-17 17:18:34

by J. Bruce Fields

Subject: Re: Use of READDIRPLUS on large directories

On Thu, Mar 17, 2011 at 09:40:19AM +1100, NeilBrown wrote:
> On Wed, 16 Mar 2011 17:42:35 -0400 Trond Myklebust
> <[email protected]> wrote:
>
>
> > > So it is obvious that there is sometimes value in using readdirplus,
> > > it is equally obvious that there is sometimes a cost.
> > >
> > > Switching the default from "not paying the cost when it is big" to
> > > "always paying the cost" is wrong.
> >
> > That's what the nordirplus mount flag is for. Keeping an arbitrary limit
> > in the face of evidence that it is hurting is equally wrong.
> >
>
> If people didn't need 'nordirplus' previously to get acceptable
> performance, and do need it now, then that is a regression.

Agreed.

Unfortunately, reversion at this point would also be a regression for a
different group of folks. A smaller one, since *their* problem was
fixed only more recently, but still there's probably no sensible way out
of this but forwards....

--b.

2011-03-16 13:43:28

by Chuck Lever III

Subject: Re: Use of READDIRPLUS on large directories


On Mar 16, 2011, at 12:55 AM, NeilBrown wrote:

> Hi Trond / Bryan et al.
>
> Now that openSUSE 11.4 is out I have started getting a few reports
> of regressions that can be traced to
>
> commit 0715dc632a271fc0fedf3ef4779fe28ac1e53ef4
> Author: Bryan Schumaker <[email protected]>
> Date: Fri Sep 24 18:50:01 2010 -0400
>
> NFS: remove readdir plus limit
>
> We will now use readdir plus even on directories that are very large.
>
> Signed-off-by: Bryan Schumaker <[email protected]>
> Signed-off-by: Trond Myklebust <[email protected]>
>
>
> This particularly affects users with their home directory over
> NFS, and with largish maildir mail folders.
>
> Where it used to take a smallish number of seconds for (e.g.)
> xbiff to start up and read through the various directories, it now
> takes multiple minutes.
>
> I can confirm that the slow down is due to readdirplus by mounting the
> filesystem with nordirplus.

Back in the dark ages, I discovered that this kind of slowdown was often the result of server slowness. The problem is that a simple readdir is often a sequential read from physical media. When you include attribute information, the server has to pick up the inodes, which is a series of small random reads. It could cause each readdir request to become slower by a factor of 10. This is a problem on NFS servers where the inode cache is turning over often (small home directory servers, for instance).

In addition, as more information per file is delivered by READDIRPLUS, each request can hold fewer entries, so more requests and more packets are needed to read a directory. We hold the request count down now by allowing multi-page directory reads, if the server supports it.
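A back-of-the-envelope sketch of the "fewer entries per reply" point, using the READDIR call counts from Bryan's 1,000,000-file test quoted later in this thread (the arithmetic is mine, the counts are his):

```python
# Estimate how many directory entries fit in one READDIR reply,
# from total files divided by total READDIR calls observed.
n_files = 1_000_000
readdir_calls_plain = 15_749  # readdir plus disabled
readdir_calls_plus = 62_993   # readdir plus enabled

entries_per_reply_plain = n_files / readdir_calls_plain
entries_per_reply_plus = n_files / readdir_calls_plus

print(f"plain readdir : ~{entries_per_reply_plain:.0f} entries per reply")
print(f"readdir plus  : ~{entries_per_reply_plus:.0f} entries per reply")
# The per-entry attribute payload cuts each reply's capacity roughly
# fourfold, so roughly 4x as many READDIR round trips are needed.
```

So even before any server-side inode-cache effects, the wire cost per directory entry is several times higher with readdirplus.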

In any event, applications will see this slowdown immediately, but it can also be a significant scalability problem for servers.

> While I can understand that there are sometimes benefits in using
> readdirplus for very large directories, there are also obviously real
> costs. So I think we have to see this patch as a regression that should
> be reverted.

It would be useful to understand what it is about these workloads that is causing slow downs. Is it simply the size of the directory? Or is there a bug on the server or client that is causing the issue? Is it a problem only on certain servers or with certain configurations?

> It would quite possibly make sense to create a tunable (mount option or
> sysctl I guess) to set the max size for directories to use readdirplus,
> but I think it really should be an opt-in situation.

Giving users another knob usually results in higher support costs and confused users. ;-)

> [[ It would also be really nice if the change-log for such a significant
> change contained a little more justification.... :-( ]]

I had asked, before this series was included in upstream, for some tests to discover where the knee of the performance curve between readdir and readdirplus was. Bryan, can you publish the results of those tests? I had hoped the test results would appear in the patch description to help justify this change.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2011-03-16 14:20:05

by Myklebust, Trond

[permalink] [raw]
Subject: Re: Use of READDIRPLUS on large directories

On Wed, 2011-03-16 at 10:14 -0400, Bryan Schumaker wrote:
> I guess I misunderstood what to publish test results for? I know I included numbers on one of the patches (commit 82f2e5472e2304e531c2fa85e457f4a71070044e, copied below)... I'll find the numbers you're asking about and post them.
>
> -Bryan
>
> commit 82f2e5472e2304e531c2fa85e457f4a71070044e
> Author: Bryan Schumaker <[email protected]>
> Date: Thu Oct 21 16:33:18 2010 -0400
>
> NFS: Readdir plus in v4
>
> By requesting more attributes during a readdir, we can mimic the readdir plus
> operation that was in NFSv3.
>
> To test, I ran the command `ls -lU --color=none` on directories with various
> numbers of files. Without readdir plus, I see this:
>
> n files | 100 | 1,000 | 10,000 | 100,000 | 1,000,000
> --------+-----------+-----------+-----------+-----------+----------
> real | 0m00.153s | 0m00.589s | 0m05.601s | 0m56.691s | 9m59.128s
> user | 0m00.007s | 0m00.007s | 0m00.077s | 0m00.703s | 0m06.800s
> sys | 0m00.010s | 0m00.070s | 0m00.633s | 0m06.423s | 1m10.005s
> access | 3 | 1 | 1 | 4 | 31
> getattr | 2 | 1 | 1 | 1 | 1
> lookup | 104 | 1,003 | 10,003 | 100,003 | 1,000,003
> readdir | 2 | 16 | 158 | 1,575 | 15,749
> total | 111 | 1,021 | 10,163 | 101,583 | 1,015,784
>
> With readdir plus enabled, I see this:
>
> n files | 100 | 1,000 | 10,000 | 100,000 | 1,000,000
> --------+-----------+-----------+-----------+-----------+----------
> real | 0m00.115s | 0m00.206s | 0m01.079s | 0m12.521s | 2m07.528s
> user | 0m00.003s | 0m00.003s | 0m00.040s | 0m00.290s | 0m03.296s
> sys | 0m00.007s | 0m00.020s | 0m00.120s | 0m01.357s | 0m17.556s
> access | 3 | 1 | 1 | 1 | 7
> getattr | 2 | 1 | 1 | 1 | 1
> lookup | 4 | 3 | 3 | 3 | 3
> readdir | 6 | 62 | 630 | 6,300 | 62,993
> total | 15 | 67 | 635 | 6,305 | 63,004
>
> Readdir plus disabled results in about a 16x increase in the number of RPC calls and
> is 4-5 times slower on large directories.
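As a quick sanity check, the summary ratios in the quoted commit message can be recomputed from its two tables (a standalone sketch; all numbers are copied from the tables above):

```python
# Total RPC calls and wall-clock ("real") time for the two largest
# directory sizes, with readdir plus disabled (off) and enabled (on).
totals_off = {100_000: 101_583, 1_000_000: 1_015_784}
totals_on = {100_000: 6_305, 1_000_000: 63_004}
real_off = {100_000: 56.691, 1_000_000: 9 * 60 + 59.128}
real_on = {100_000: 12.521, 1_000_000: 2 * 60 + 7.528}

for n in (100_000, 1_000_000):
    rpc_ratio = totals_off[n] / totals_on[n]
    time_ratio = real_off[n] / real_on[n]
    print(f"{n} files: {rpc_ratio:.1f}x RPCs, {time_ratio:.1f}x real time")
# → 100000 files: 16.1x RPCs, 4.5x real time
# → 1000000 files: 16.1x RPCs, 4.7x real time
```

The recomputed ratios match the commit message's "about 16x" RPC-count increase and "4-5 times slower" claim.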

Right. Those are the numbers that convinced me...



--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-04-04 20:14:54

by Anna Schumaker

[permalink] [raw]
Subject: Re: Use of READDIRPLUS on large directories

I've done some more testing and posted my initial results here: https://wiki.linux-nfs.org/wiki/index.php/Readdir_performance_results. If anybody has suggestions for better ways to organize the data, please let me know. I'll also try to post some graphs in the next couple of days.

- Bryan

On 03/17/2011 01:18 PM, J. Bruce Fields wrote:
> On Thu, Mar 17, 2011 at 09:40:19AM +1100, NeilBrown wrote:
>> On Wed, 16 Mar 2011 17:42:35 -0400 Trond Myklebust
>> <[email protected]> wrote:
>>
>>
>>>> So it is obvious that there is sometimes value in using readdirplus,
>>>> it is equally obvious that there is sometimes a cost.
>>>>
>>>> Switching the default from "not paying the cost when it is big" to
>>>> "always paying the cost" is wrong.
>>>
>>> That's what the nordirplus mount flag is for. Keeping an arbitrary limit
>>> in the face of evidence that it is hurting is equally wrong.
>>>
>>
>> If people didn't need 'nordirplus' previously to get acceptable
>> performance, and do need it now, then that is a regression.
>
> Agreed.
>
> Unfortunately, reversion at this point would also be a regression for a
> different group of folks. A smaller one, since *their* problem was
> fixed only more recently, but still there's probably no sensible way out
> of this but forwards....
>
> --b.


2011-04-07 14:28:58

by Anna Schumaker

[permalink] [raw]
Subject: Re: Use of READDIRPLUS on large directories

On 04/05/2011 08:20 AM, NeilBrown wrote:
> On Mon, 04 Apr 2011 16:14:48 -0400 Bryan Schumaker <[email protected]>
> wrote:
>
>> I've done some more testing and posted my initial results here: https://wiki.linux-nfs.org/wiki/index.php/Readdir_performance_results. If anybody has suggestions for better ways to organize the data, please let me know. I'll also try to post some graphs in the next couple of days.
>
> I think graphs would certainly help.
> Also it might be good to be explicit about the server hardware/config as that
> can make a real performance difference.

I've added this to the readdir performance page. Is there anything else I should put up about the server?

> No bright ideas about how to organise the graphs...
> I'd probably try just graphing the 'real' time against kernel version
> with one line for each different directory size.
>
> Then you get 16 graphs, 4 different configs (v3/v4 x rdirplus/nordirplus)
> and 4 different tests (ls -f, ls -lU, ls -U, rm -r... though I can't see how
> "ls -U" is different from "ls -f").

I've added graphs showing real time. I'll be putting up graphs showing sys time and total number of RPC calls throughout the day.

> NeilBrown
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2011-04-05 12:20:30

by NeilBrown

[permalink] [raw]
Subject: Re: Use of READDIRPLUS on large directories

On Mon, 04 Apr 2011 16:14:48 -0400 Bryan Schumaker <[email protected]>
wrote:

> I've done some more testing and posted my initial results here: https://wiki.linux-nfs.org/wiki/index.php/Readdir_performance_results. If anybody has suggestions for better ways to organize the data, please let me know. I'll also try to post some graphs in the next couple of days.

I think graphs would certainly help.
Also it might be good to be explicit about the server hardware/config as that
can make a real performance difference.

No bright ideas about how to organise the graphs...
I'd probably try just graphing the 'real' time against kernel version
with one line for each different directory size.

Then you get 16 graphs, 4 different configs (v3/v4 x rdirplus/nordirplus)
and 4 different tests (ls -f, ls -lU, ls -U, rm -r... though I can't see how
"ls -U" is different from "ls -f").

NeilBrown