I was investigating some performance concerns from a customer and came across
something interesting. Whenever the directory that we're reading is changed on
the server, we start re-reading the directory from the beginning. On a LAN
connection or in a directory with a small number of entries, the impact isn't
too noticeable... but reading a directory with a large number of entries over a
WAN connection gets pretty bad.
For NFS v3, what happens is that after each on-the-wire READDIR we call
nfs_refresh_inode() and from there we get to nfs_update_inode(), where we wind
up setting NFS_INO_INVALID_DATA in the directory's cache_validity flags. Then
on a subsequent call to nfs_readdir() we call nfs_revalidate_mapping(), and
seeing that NFS_INO_INVALID_DATA is set we call nfs_invalidate_mapping(),
flushing all our cached data for the directory.
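To spell that out in code, the check that bites us boils down to something like
this (a simplified paraphrase of the logic described above, not the actual
source):

        /* simplified paraphrase, not the actual source */
        int nfs_revalidate_mapping(struct inode *inode, struct address_space *mapping)
        {
                struct nfs_inode *nfsi = NFS_I(inode);

                /* NFS_INO_INVALID_DATA was set by nfs_update_inode() because
                 * the post-op attributes from the last READDIR showed that
                 * the directory changed on the server... */
                if (nfsi->cache_validity & NFS_INO_INVALID_DATA)
                        /* ...so we drop every cached page for the directory */
                        return nfs_invalidate_mapping(inode, mapping);
                return 0;
        }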
So for each nfs_readdir() call, we wind up redoing all of the on-the-wire
readdir operations just to get back where we were, and then we're able to get
just one more operation's worth of entries on top of that. If the directory on
the NFS server is constantly being modified then this winds up being a lot of
extra READDIR ops. I had an idea that maybe we could call nfs_refresh_inode()
only when we've reached the end of the directory. I talked to Jeff about it and
he suggested that maybe we could only revalidate if we're at the beginning of
the directory or if nfs_attribute_cache_expired() reports that the directory's
cached attributes have timed out. The attached patches take that approach.
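Concretely, the gate in nfs_readdir() ends up looking like this (just the shape
of the change; the actual diff is in the second patch):

        /* only revalidate (and possibly dump the page cache) when we're
         * starting a fresh pass through the directory or the cached
         * attributes have actually timed out */
        if (filp->f_pos == 0 || nfs_attribute_cache_expired(inode))
                res = nfs_revalidate_mapping(inode, filp->f_mapping);
        if (res < 0)
                goto out;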
For example, in a test environment of two VMs, I have a directory of 100,000
entries that takes 981 READDIR operations to read if no modifications are being made
to the directory at the same time. If I add a 35ms delay between the client and
the server and start a script on the server that repeatedly creates and removes
a file in the directory being listed I get the following results:
[root@localhost ~]# mount -t nfs -o nfsvers=3,nordirplus server:/export /mnt
[root@localhost ~]# time /bin/ls /mnt/bigdir >/dev/null
real 29m52.594s
user 0m0.376s
sys 0m2.191s
[root@localhost ~]# mountstats --rpc /mnt | grep -A3 READDIR
READDIR:
49729 ops (99%) 0 retrans (0%) 0 major timeouts
avg bytes sent per op: 144 avg bytes received per op: 4196
backlog wait: 0.003620 RTT: 35.889501 total execute time: 35.925858 (milliseconds)
[root@localhost ~]#
With the patched kernel, that same test yields these results:
[root@localhost ~]# time /bin/ls /mnt/bigdir >/dev/null
real 0m35.952s
user 0m0.460s
sys 0m0.100s
[root@localhost ~]# mountstats --rpc /mnt | grep -A3 READDIR
READDIR:
981 ops (98%) 0 retrans (0%) 0 major timeouts
avg bytes sent per op: 144 avg bytes received per op: 4194
backlog wait: 0.004077 RTT: 35.887870 total execute time: 35.926606 (milliseconds)
[root@localhost ~]#
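To get a feel for where those op counts come from, here's a throwaway userspace
model (purely illustrative, not kernel code; the only input is the 981-op clean
pass from above) of the worst case where every nfs_readdir() call finds the
directory's page cache invalidated:

        /* toy model, userspace only: worst case where every nfs_readdir() call
         * has to re-fetch everything read so far before getting new entries */
        #include <stdio.h>

        int main(void)
        {
                const unsigned long chunks = 981;  /* wire ops for one clean pass */
                unsigned long wire_ops = 0;
                unsigned long have, cached;

                for (have = 0; have < chunks; have++)
                        /* cache dropped since the last call: re-fetch chunks
                         * 0..have before we can return anything new */
                        for (cached = 0; cached <= have; cached++)
                                wire_ops++;

                printf("clean pass: %lu ops, worst case: %lu ops\n",
                       chunks, wire_ops);
                return 0;
        }

That worst case works out to 981 * 982 / 2 = 481,671 ops; the 49,729 ops observed
above are well short of that (the cache isn't invalidated on literally every
call), but it shows how quickly the re-reads pile up.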
For NFS v4, the situation is slightly different. We don't get post-op
attributes from each READDIR, so we're not calling
nfs_refresh_inode()/nfs_update_inode() after every operation and therefore not
updating read_cache_jiffies. If we don't manage to read through the whole
directory before the directory attributes from the initial GETATTR expire, then
we're going to wind up calling __nfs_revalidate_inode() from
nfs_revalidate_mapping(), and the attributes that we get from that GETATTR are
ultimately going to lead us into nfs_invalidate_mapping() anyway. If the
attached patches make sense then maybe it would be worthwhile to add a GETATTR
operation to the compound that gets sent for a READDIR so that
read_cache_jiffies stays updated.
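To put rough numbers on that, here's a back-of-the-envelope version of the same
window check that nfs_attribute_timeout() does (userspace, illustrative only;
the 30-second directory attribute timeout and the per-op latency are assumptions
based on the default acdirmin and the delay used in the test above):

        /* toy timing check, userspace only: why a long NFS v4 listing can trip
         * the attribute timeout when read_cache_jiffies is only ever set by
         * the initial GETATTR */
        #include <stdio.h>

        int main(void)
        {
                const unsigned long attrtimeo_ms = 30 * 1000; /* assumed acdirmin */
                const unsigned long per_op_ms = 36;     /* ~35ms RTT per READDIR */
                const unsigned long nr_ops = 981;       /* ops for one full listing */
                const unsigned long read_cache_ms = 0;  /* set once, never refreshed */
                unsigned long op, now;

                for (op = 1; op <= nr_ops; op++) {
                        now = op * per_op_ms;
                        /* same idea as nfs_attribute_timeout(): expired once
                         * 'now' falls outside
                         * [read_cache_ms, read_cache_ms + attrtimeo_ms] */
                        if (now - read_cache_ms > attrtimeo_ms) {
                                printf("attribute cache expired after %lu of %lu ops\n",
                                       op, nr_ops);
                                return 0;
                        }
                }
                printf("finished all %lu ops before the cache expired\n", nr_ops);
                return 0;
        }

With those numbers the timeout trips a bit over 830 ops in, and at that point the
GETATTR (and, if the directory has changed, the cache flush) described above
kicks in.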
Finally there's the question of the dentry cache. Any time we find that the
parent directory has changed, we're still going to wind up doing an on-the-wire
LOOKUP. I don't think there's anything that can be done about that, but at
least these patches prevent us from doing the same LOOKUPs multiple times in the
course of reading through the directory.
-Scott
Scott Mayhew (2):
NFS: Make nfs_attribute_cache_expired() non-static
NFS: Make nfs_readdir revalidate less often
fs/nfs/dir.c | 5 +++--
fs/nfs/inode.c | 2 +-
include/linux/nfs_fs.h | 1 +
3 files changed, 5 insertions(+), 3 deletions(-)
--
1.7.11.7
NFS: Make nfs_attribute_cache_expired() non-static so we can call it from
nfs_readdir().
Signed-off-by: Scott Mayhew <[email protected]>
---
fs/nfs/inode.c | 2 +-
include/linux/nfs_fs.h | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 1f94167..e49b16b 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -845,7 +845,7 @@ int nfs_attribute_timeout(struct inode *inode)
return !time_in_range_open(jiffies, nfsi->read_cache_jiffies, nfsi->read_cache_jiffies + nfsi->attrtimeo);
}
-static int nfs_attribute_cache_expired(struct inode *inode)
+int nfs_attribute_cache_expired(struct inode *inode)
{
if (nfs_have_delegated_attributes(inode))
return 0;
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 1cc2568..086927f 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -339,6 +339,7 @@ extern int nfs_permission(struct inode *, int);
extern int nfs_open(struct inode *, struct file *);
extern int nfs_release(struct inode *, struct file *);
extern int nfs_attribute_timeout(struct inode *inode);
+extern int nfs_attribute_cache_expired(struct inode *inode);
extern int nfs_revalidate_inode(struct nfs_server *server, struct inode *inode);
extern int __nfs_revalidate_inode(struct nfs_server *, struct inode *);
extern int nfs_revalidate_mapping(struct inode *inode, struct address_space *mapping);
--
1.7.11.7
On Wed, 05 Jun 2013, Myklebust, Trond wrote:
> On Wed, 2013-06-05 at 09:51 -0400, Scott Mayhew wrote:
> > [ original patch description and test results snipped ]
>
> Do these patches pass the 'rm -rf' torture test (i.e. creating lots of
> subdirectories containing random files, then calling 'rm -rf')?
Is there already a test that does this? If not, what would be a good
test (maybe a kernel git tree)?
>
> I'd also like to see if the above test (and others you may be using)
> play out well with different filesystems on the server side. The main
> interest is to see how the change is affected by different readdir
> cookie schemes...
>
I actually hadn't done much testing aside from what I mentioned
above, and the server filesystem was ext4. I was mainly trying to gauge
whether this was a road that had already been traveled. I'd be happy to do more
testing with different backend filesystems.
-Scott
On Wed, 2013-06-05 at 09:51 -0400, Scott Mayhew wrote:
> [ original patch description and test results snipped ]
Hi Scott,
Do these patches pass the 'rm -rf' torture test (i.e. creating lots of
subdirectories containing random files, then calling 'rm -rf')?
I'd also like to see if the above test (and others you may be using)
play out well with different filesystems on the server side. The main
interest is to see how the change is affected by different readdir
cookie schemes...
Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
Make nfs_readdir revalidate only when we're at the beginning of the directory or
if the cached attributes have expired.
Signed-off-by: Scott Mayhew <[email protected]>
---
fs/nfs/dir.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index f23f455..8b83fa9 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -807,7 +807,7 @@ static int nfs_readdir(struct file *filp, void *dirent, filldir_t filldir)
nfs_readdir_descriptor_t my_desc,
*desc = &my_desc;
struct nfs_open_dir_context *dir_ctx = filp->private_data;
- int res;
+ int res = 0;
dfprintk(FILE, "NFS: readdir(%s/%s) starting at cookie %llu\n",
dentry->d_parent->d_name.name, dentry->d_name.name,
@@ -828,7 +828,8 @@ static int nfs_readdir(struct file *filp, void *dirent, filldir_t filldir)
desc->plus = nfs_use_readdirplus(inode, filp) ? 1 : 0;
nfs_block_sillyrename(dentry);
- res = nfs_revalidate_mapping(inode, filp->f_mapping);
+ if (filp->f_pos == 0 || nfs_attribute_cache_expired(inode))
+ res = nfs_revalidate_mapping(inode, filp->f_mapping);
if (res < 0)
goto out;
--
1.7.11.7
On Wed, 05 Jun 2013, Myklebust, Trond wrote:
> On Wed, 2013-06-05 at 09:51 -0400, Scott Mayhew wrote:
> > [ original patch description and test results snipped ]
>
> Do these patches pass the 'rm -rf' torture test (i.e. creating lots of
> subdirectories containing random files, then calling 'rm -rf')?
>
> I'd also like to see if the above test (and others you may be using)
> play out well with different filesystems on the server side. The main
> interest is to see how the change is affected by different readdir
> cookie schemes...
>
Hi Trond,
I did an 'rm -rf' of a kernel git tree over NFS v3 and v4 (both with and without
-o nordirplus), re-ran the ls-over-WAN test described above, and ran various
other tests (cthon, iozone, kernel compile, fsstress, fsx-linux). All of these
were done using ext4, xfs, and btrfs as the backend filesystems.
If there are other tests you'd like me to run, please let me know.
-Scott