2020-11-10 21:48:40

by Trond Myklebust

[permalink] [raw]
Subject: [PATCH v5 00/22] Readdir enhancements

From: Trond Myklebust <[email protected]>

The following patch series performs a number of cleanups on the readdir
code.
It also adds support for 1MB readdir RPC calls on-the-wire, and modifies
the caching code to ensure that we cache the entire contents of that
1MB call (instead of discarding the data that doesn't fit into a single
page).
For filesystems that use ordered readdir cookie schemes (e.g. XFS), it
optimises searching for cookies in the client's page cache by skipping
over pages that contain cookie values that are not in the range we are
searching for.
Finally, it improves scalability when dealing with very large
directories by turning off caching when those directories are changing,
so as to avoid the need for a linear search on the client of the entire
directory when looking for the first entry pointed to by the current
file descriptor offset.

v2: Fix the handling of the NFSv3/v4 directory verifier.
v3: Optimise searching when the readdir cookies are seen to be ordered.
v4: Optimise performance for large directories that are changing.
Add in llseek dependency patches.
v5: Integrate Olga's patch for the READDIR security label handling.
Record more entries in the uncached readdir case. Bump the max
number of pages to 512, but allocate them on demand in case the
readdir RPC call returns fewer entries.

Olga Kornievskaia (1):
NFSv4.2: condition READDIR's mask for security label based on LSM
state

Trond Myklebust (21):
NFS: Remove unnecessary inode locking in nfs_llseek_dir()
NFS: Remove unnecessary inode lock in nfs_fsync_dir()
NFS: Ensure contents of struct nfs_open_dir_context are consistent
NFS: Clean up readdir struct nfs_cache_array
NFS: Clean up nfs_readdir_page_filler()
NFS: Clean up directory array handling
NFS: Don't discard readdir results
NFS: Remove unnecessary kmap in nfs_readdir_xdr_to_array()
NFS: Replace kmap() with kmap_atomic() in nfs_readdir_search_array()
NFS: Simplify struct nfs_cache_array_entry
NFS: Support larger readdir buffers
NFS: More readdir cleanups
NFS: nfs_do_filldir() does not return a value
NFS: Reduce readdir stack usage
NFS: Cleanup to remove nfs_readdir_descriptor_t typedef
NFS: Allow the NFS generic code to pass in a verifier to readdir
NFS: Handle NFS4ERR_NOT_SAME and NFSERR_BADCOOKIE from readdir calls
NFS: Improve handling of directory verifiers
NFS: Optimisations for monotonically increasing readdir cookies
NFS: Reduce number of RPC calls when doing uncached readdir
NFS: Do uncached readdir when we're seeking a cookie in an empty page
cache

fs/nfs/client.c | 4 +-
fs/nfs/dir.c | 734 +++++++++++++++++++++++++---------------
fs/nfs/inode.c | 7 -
fs/nfs/internal.h | 6 -
fs/nfs/nfs3proc.c | 35 +-
fs/nfs/nfs4proc.c | 48 +--
fs/nfs/proc.c | 18 +-
include/linux/nfs_fs.h | 9 +-
include/linux/nfs_xdr.h | 17 +-
9 files changed, 541 insertions(+), 337 deletions(-)

--
2.28.0


2020-11-11 22:16:42

by David Wysochanski

[permalink] [raw]
Subject: Re: [PATCH v5 00/22] Readdir enhancements

On Tue, Nov 10, 2020 at 4:48 PM <[email protected]> wrote:
>
> From: Trond Myklebust <[email protected]>
>
> The following patch series performs a number of cleanups on the readdir
> code.
> It also adds support for 1MB readdir RPC calls on-the-wire, and modifies
> the caching code to ensure that we cache the entire contents of that
> 1MB call (instead of discarding the data that doesn't fit into a single
> page).
> For filesystems that use ordered readdir cookie schemes (e.g. XFS), it
> optimises searching for cookies in the client's page cache by skipping
> over pages that contain cookie values that are not in the range we are
> searching for.
> Finally, it improves scalability when dealing with very large
> directories by turning off caching when those directories are changing,
> so as to avoid the need for a linear search on the client of the entire
> directory when looking for the first entry pointed to by the current
> file descriptor offset.
>
> v2: Fix the handling of the NFSv3/v4 directory verifier.
> v3: Optimise searching when the readdir cookies are seen to be ordered.
> v4: Optimise performance for large directories that are changing.
> Add in llseek dependency patches.
> v5: Integrate Olga's patch for the READDIR security label handling.
> Record more entries in the uncached readdir case. Bump the max
> number of pages to 512, but allocate them on demand in case the
> readdir RPC call returns fewer entries.
>
> Olga Kornievskaia (1):
> NFSv4.2: condition READDIR's mask for security label based on LSM
> state
>
> Trond Myklebust (21):
> NFS: Remove unnecessary inode locking in nfs_llseek_dir()
> NFS: Remove unnecessary inode lock in nfs_fsync_dir()
> NFS: Ensure contents of struct nfs_open_dir_context are consistent
> NFS: Clean up readdir struct nfs_cache_array
> NFS: Clean up nfs_readdir_page_filler()
> NFS: Clean up directory array handling
> NFS: Don't discard readdir results
> NFS: Remove unnecessary kmap in nfs_readdir_xdr_to_array()
> NFS: Replace kmap() with kmap_atomic() in nfs_readdir_search_array()
> NFS: Simplify struct nfs_cache_array_entry
> NFS: Support larger readdir buffers
> NFS: More readdir cleanups
> NFS: nfs_do_filldir() does not return a value
> NFS: Reduce readdir stack usage
> NFS: Cleanup to remove nfs_readdir_descriptor_t typedef
> NFS: Allow the NFS generic code to pass in a verifier to readdir
> NFS: Handle NFS4ERR_NOT_SAME and NFSERR_BADCOOKIE from readdir calls
> NFS: Improve handling of directory verifiers
> NFS: Optimisations for monotonically increasing readdir cookies
> NFS: Reduce number of RPC calls when doing uncached readdir
> NFS: Do uncached readdir when we're seeking a cookie in an empty page
> cache
>
> fs/nfs/client.c | 4 +-
> fs/nfs/dir.c | 734 +++++++++++++++++++++++++---------------
> fs/nfs/inode.c | 7 -
> fs/nfs/internal.h | 6 -
> fs/nfs/nfs3proc.c | 35 +-
> fs/nfs/nfs4proc.c | 48 +--
> fs/nfs/proc.c | 18 +-
> include/linux/nfs_fs.h | 9 +-
> include/linux/nfs_xdr.h | 17 +-
> 9 files changed, 541 insertions(+), 337 deletions(-)
>
> --
> 2.28.0
>

I wrote a test script and ran this patchset against 5.10-rc2 in a
variety of scenarios listing 1 million files and various directory
modification scenarios. The test checked ops, runtime and cookie
resets (via tcpdump).

All runs performed very well on all scenarios I threw at it on all NFS
versions (I tested 3, 4.0, 4.1, and 4.2) with fairly dramatic
improvements vs the same runs on 5.10-rc2. One scenario I found
where a single 'ls -l' could take a long time (about 5 minutes, but
not unbounded time) was when the directory was being modified with
both file adds and removed, but this seemed due to other ops unrelated
to readdir performance. A second scenario that this patchset does
not fix is the scenario where the directory is large enough to exceed
the acdirmax, is being modified (adds and removes), and _multiple_
processes are listing it. In this scenario a lister can still have
unbounded time, but the existing readdir code also has this problem
and fixing that probably requires another patch / approach (such as
reworking the structure of the cache or an approach such as the flag
on per-process dir context).

I think these patches make NFS readdir dramatically better, and they
fix a lot of underlying issues with the existing code, like the cookie
resets when pagecache is dropped, single page rx data problem, etc.
Thank you for doing these patches.

Tested-by: Dave Wysochanski

2020-11-12 11:48:46

by David Wysochanski

[permalink] [raw]
Subject: Re: [PATCH v5 00/22] Readdir enhancements

On Wed, Nov 11, 2020 at 5:15 PM David Wysochanski <[email protected]> wrote:
>
> On Tue, Nov 10, 2020 at 4:48 PM <[email protected]> wrote:
> >
> > From: Trond Myklebust <[email protected]>
> >
> > The following patch series performs a number of cleanups on the readdir
> > code.
> > It also adds support for 1MB readdir RPC calls on-the-wire, and modifies
> > the caching code to ensure that we cache the entire contents of that
> > 1MB call (instead of discarding the data that doesn't fit into a single
> > page).
> > For filesystems that use ordered readdir cookie schemes (e.g. XFS), it
> > optimises searching for cookies in the client's page cache by skipping
> > over pages that contain cookie values that are not in the range we are
> > searching for.
> > Finally, it improves scalability when dealing with very large
> > directories by turning off caching when those directories are changing,
> > so as to avoid the need for a linear search on the client of the entire
> > directory when looking for the first entry pointed to by the current
> > file descriptor offset.
> >
> > v2: Fix the handling of the NFSv3/v4 directory verifier.
> > v3: Optimise searching when the readdir cookies are seen to be ordered.
> > v4: Optimise performance for large directories that are changing.
> > Add in llseek dependency patches.
> > v5: Integrate Olga's patch for the READDIR security label handling.
> > Record more entries in the uncached readdir case. Bump the max
> > number of pages to 512, but allocate them on demand in case the
> > readdir RPC call returns fewer entries.
> >
> > Olga Kornievskaia (1):
> > NFSv4.2: condition READDIR's mask for security label based on LSM
> > state
> >
> > Trond Myklebust (21):
> > NFS: Remove unnecessary inode locking in nfs_llseek_dir()
> > NFS: Remove unnecessary inode lock in nfs_fsync_dir()
> > NFS: Ensure contents of struct nfs_open_dir_context are consistent
> > NFS: Clean up readdir struct nfs_cache_array
> > NFS: Clean up nfs_readdir_page_filler()
> > NFS: Clean up directory array handling
> > NFS: Don't discard readdir results
> > NFS: Remove unnecessary kmap in nfs_readdir_xdr_to_array()
> > NFS: Replace kmap() with kmap_atomic() in nfs_readdir_search_array()
> > NFS: Simplify struct nfs_cache_array_entry
> > NFS: Support larger readdir buffers
> > NFS: More readdir cleanups
> > NFS: nfs_do_filldir() does not return a value
> > NFS: Reduce readdir stack usage
> > NFS: Cleanup to remove nfs_readdir_descriptor_t typedef
> > NFS: Allow the NFS generic code to pass in a verifier to readdir
> > NFS: Handle NFS4ERR_NOT_SAME and NFSERR_BADCOOKIE from readdir calls
> > NFS: Improve handling of directory verifiers
> > NFS: Optimisations for monotonically increasing readdir cookies
> > NFS: Reduce number of RPC calls when doing uncached readdir
> > NFS: Do uncached readdir when we're seeking a cookie in an empty page
> > cache
> >
> > fs/nfs/client.c | 4 +-
> > fs/nfs/dir.c | 734 +++++++++++++++++++++++++---------------
> > fs/nfs/inode.c | 7 -
> > fs/nfs/internal.h | 6 -
> > fs/nfs/nfs3proc.c | 35 +-
> > fs/nfs/nfs4proc.c | 48 +--
> > fs/nfs/proc.c | 18 +-
> > include/linux/nfs_fs.h | 9 +-
> > include/linux/nfs_xdr.h | 17 +-
> > 9 files changed, 541 insertions(+), 337 deletions(-)
> >
> > --
> > 2.28.0
> >
>
> I wrote a test script and ran this patchset against 5.10-rc2 in a
> variety of scenarios listing 1 million files and various directory
> modification scenarios. The test checked ops, runtime and cookie
> resets (via tcpdump).
>
> All runs performed very well on all scenarios I threw at it on all NFS
> versions (I tested 3, 4.0, 4.1, and 4.2) with fairly dramatic
> improvements vs the same runs on 5.10-rc2. One scenario I found
> where a single 'ls -l' could take a long time (about 5 minutes, but
> not unbounded time) was when the directory was being modified with
> both file adds and removed, but this seemed due to other ops unrelated
> to readdir performance. A second scenario that this patchset does
> not fix is the scenario where the directory is large enough to exceed
> the acdirmax, is being modified (adds and removes), and _multiple_
> processes are listing it. In this scenario a lister can still have
> unbounded time, but the existing readdir code also has this problem
> and fixing that probably requires another patch / approach (such as
> reworking the structure of the cache or an approach such as the flag
> on per-process dir context).
>

I think I am mistaken in my previous email that said multiple listers
will not make forward progress. After more testing I could not
reproduce any indefinite wait but it was many minutes, and I think
there is proof in the current code (patch 22) that an indefinite wait
should not occur with any process (multiple processes does not change
this). Note that I did not trace this, but below is an attempt at a
more detailed explanation from the code. The summary of the argument
is that periodically nfs_readdir_dont_search_cache() will be true for
one iteration of nfs_readdir(), and this ensures forward progress,
even if it may be slow.

Some notation for the below explanation.
Tu: Time to list unmodified (idle) directory
Ax: acdirmax
Assume Tu > acdirmax (time to list idle directory is larger than
acdirmax) and i_size_read(dir) > NFS_SERVER(dir)->dtsize
PL1: pid1 that is listing the directory
DC1, DC2 and DC3: values of nfs_open_dir_context.dir_cookie for PL1
PL1_dir_cookie: the current value of PL1's nfs_open_dir_context.dir_cookie

Then consider the following timeline, with "Tn" various points on a
timeline of PL1 listing a very large directory.

T0: PL1 starts listing directory with repeated calls to nfs_readdir()
T1: acdirmax is exceeded, and we drop the pagecache (mapping->nrpages
== 0); at this point, PL1_dir_cookie = DC1 != 0
T2: PL1 calls nfs_readdir and nfs_readdir_dont_search_cache() returns
true (due to mapping->nrpages == 0), so enters uncached_readdir();
thus, PL1_dir_cookie makes forward progress so PL1_dir_cookie is ahead
of DC1 and the exit of nfs_readdir()
T3: PL1 calls nfs_readdir() and nfs_readdir_dont_search_cache() is
false, so it must restart filling the pagecache from cookie == 0 (we
send READDIRs over the wire); at this point PL1_dir_cookie forward
progress is stalled, and PL1's calls to nfs_readdir() will have to
fetch entries and fill pages in the pagecache that does not pertain to
its listing (they've already been returned to userspace in a previous
nfs_readdir call)
T4: acdirmax is exceeded, and we drop the pagecache again, setting
mapping->nrpages == 0; at this point though, PL1_dir_cookie = DC2,
where DC2 is ahead of DC1 (we made forward progress last time)
T5: PL1 calls nfs_readdir and nfs_readdir_dont_search_cache() returns
true, so enters uncached_readdir(); thus, PL1_dir_cookie makes forward
progress again for this one call to nfs_readdir(), and PL1_dir_cookie
= DC3 is ahead of DC2

Thus, it seems PL1 will eventually complete, even if it is delayed a
bit, and even if it has to re-fill pages in the page cache that is
"extra work" that doesn't help PL1. Forward progress is guaranteed
due to the periodic calls to nfs_readdir() when
nfs_readdir_dont_search_cache() returns true due to mapping->nrpages
== 0 portion of the condition inside nfs_readdir_dont_search_cache()
+static bool nfs_readdir_dont_search_cache(struct nfs_readdir_descriptor *desc)
+{
+ struct address_space *mapping = desc->file->f_mapping;
+ struct inode *dir = file_inode(desc->file);
+ unsigned int dtsize = NFS_SERVER(dir)->dtsize;
+ loff_t size = i_size_read(dir);
+
+ /*
+ * Default to uncached readdir if the page cache is empty, and
+ * we're looking for a non-zero cookie in a large directory.
+ */
+ return desc->dir_cookie != 0 && mapping->nrpages == 0 && size > dtsize;
+}




> I think these patches make NFS readdir dramatically better, and they
> fix a lot of underlying issues with the existing code, like the cookie
> resets when pagecache is dropped, single page rx data problem, etc.
> Thank you for doing these patches.
>
> Tested-by: Dave Wysochanski

2020-11-12 15:35:27

by guy keren

[permalink] [raw]
Subject: Re: [PATCH v5 00/22] Readdir enhancements

just a general question: since the cache seems to cause many problems
when dealing with very large directories, and since all solutions
proposed until now don't seem to fully solve those problems, won't an
approach such as "if the directory entries count exceeded X - stop
using the cache completely" - where X is proportional to the size of
the directory entries cache size limit - make the code simpler, and
less prone to bugs of this sort?

i *think* we can understand that for a directory with millions of
files, we'll not have efficient caching on the client side, while
limiting ourselves to reasonable RAM consumption?

--guy keren
Vast data

On Thu, Nov 12, 2020 at 1:46 PM David Wysochanski <[email protected]> wrote:
>
> On Wed, Nov 11, 2020 at 5:15 PM David Wysochanski <[email protected]> wrote:
> >
> > On Tue, Nov 10, 2020 at 4:48 PM <[email protected]> wrote:
> > >
> > > From: Trond Myklebust <[email protected]>
> > >
> > > The following patch series performs a number of cleanups on the readdir
> > > code.
> > > It also adds support for 1MB readdir RPC calls on-the-wire, and modifies
> > > the caching code to ensure that we cache the entire contents of that
> > > 1MB call (instead of discarding the data that doesn't fit into a single
> > > page).
> > > For filesystems that use ordered readdir cookie schemes (e.g. XFS), it
> > > optimises searching for cookies in the client's page cache by skipping
> > > over pages that contain cookie values that are not in the range we are
> > > searching for.
> > > Finally, it improves scalability when dealing with very large
> > > directories by turning off caching when those directories are changing,
> > > so as to avoid the need for a linear search on the client of the entire
> > > directory when looking for the first entry pointed to by the current
> > > file descriptor offset.
> > >
> > > v2: Fix the handling of the NFSv3/v4 directory verifier.
> > > v3: Optimise searching when the readdir cookies are seen to be ordered.
> > > v4: Optimise performance for large directories that are changing.
> > > Add in llseek dependency patches.
> > > v5: Integrate Olga's patch for the READDIR security label handling.
> > > Record more entries in the uncached readdir case. Bump the max
> > > number of pages to 512, but allocate them on demand in case the
> > > readdir RPC call returns fewer entries.
> > >
> > > Olga Kornievskaia (1):
> > > NFSv4.2: condition READDIR's mask for security label based on LSM
> > > state
> > >
> > > Trond Myklebust (21):
> > > NFS: Remove unnecessary inode locking in nfs_llseek_dir()
> > > NFS: Remove unnecessary inode lock in nfs_fsync_dir()
> > > NFS: Ensure contents of struct nfs_open_dir_context are consistent
> > > NFS: Clean up readdir struct nfs_cache_array
> > > NFS: Clean up nfs_readdir_page_filler()
> > > NFS: Clean up directory array handling
> > > NFS: Don't discard readdir results
> > > NFS: Remove unnecessary kmap in nfs_readdir_xdr_to_array()
> > > NFS: Replace kmap() with kmap_atomic() in nfs_readdir_search_array()
> > > NFS: Simplify struct nfs_cache_array_entry
> > > NFS: Support larger readdir buffers
> > > NFS: More readdir cleanups
> > > NFS: nfs_do_filldir() does not return a value
> > > NFS: Reduce readdir stack usage
> > > NFS: Cleanup to remove nfs_readdir_descriptor_t typedef
> > > NFS: Allow the NFS generic code to pass in a verifier to readdir
> > > NFS: Handle NFS4ERR_NOT_SAME and NFSERR_BADCOOKIE from readdir calls
> > > NFS: Improve handling of directory verifiers
> > > NFS: Optimisations for monotonically increasing readdir cookies
> > > NFS: Reduce number of RPC calls when doing uncached readdir
> > > NFS: Do uncached readdir when we're seeking a cookie in an empty page
> > > cache
> > >
> > > fs/nfs/client.c | 4 +-
> > > fs/nfs/dir.c | 734 +++++++++++++++++++++++++---------------
> > > fs/nfs/inode.c | 7 -
> > > fs/nfs/internal.h | 6 -
> > > fs/nfs/nfs3proc.c | 35 +-
> > > fs/nfs/nfs4proc.c | 48 +--
> > > fs/nfs/proc.c | 18 +-
> > > include/linux/nfs_fs.h | 9 +-
> > > include/linux/nfs_xdr.h | 17 +-
> > > 9 files changed, 541 insertions(+), 337 deletions(-)
> > >
> > > --
> > > 2.28.0
> > >
> >
> > I wrote a test script and ran this patchset against 5.10-rc2 in a
> > variety of scenarios listing 1 million files and various directory
> > modification scenarios. The test checked ops, runtime and cookie
> > resets (via tcpdump).
> >
> > All runs performed very well on all scenarios I threw at it on all NFS
> > versions (I tested 3, 4.0, 4.1, and 4.2) with fairly dramatic
> > improvements vs the same runs on 5.10-rc2. One scenario I found
> > where a single 'ls -l' could take a long time (about 5 minutes, but
> > not unbounded time) was when the directory was being modified with
> > both file adds and removed, but this seemed due to other ops unrelated
> > to readdir performance. A second scenario that this patchset does
> > not fix is the scenario where the directory is large enough to exceed
> > the acdirmax, is being modified (adds and removes), and _multiple_
> > processes are listing it. In this scenario a lister can still have
> > unbounded time, but the existing readdir code also has this problem
> > and fixing that probably requires another patch / approach (such as
> > reworking the structure of the cache or an approach such as the flag
> > on per-process dir context).
> >
>
> I think I am mistaken in my previous email that said multiple listers
> will not make forward progress. After more testing I could not
> reproduce any indefinite wait but it was many minutes, and I think
> there is proof in the current code (patch 22) that an indefinite wait
> should not occur with any process (multiple processes does not change
> this). Note that I did not trace this, but below is an attempt at a
> more detailed explanation from the code. The summary of the argument
> is that periodically nfs_readdir_dont_search_cache() will be true for
> one iteration of nfs_readdir(), and this ensures forward progress,
> even if it may be slow.
>
> Some notation for the below explanation.
> Tu: Time to list unmodified (idle) directory
> Ax: acdirmax
> Assume Tu > acdirmax (time to list idle directory is larger than
> acdirmax) and i_size_read(dir) > NFS_SERVER(dir)->dtsize
> PL1: pid1 that is listing the directory
> DC1, DC2 and DC3: values of nfs_open_dir_context.dir_cookie for PL1
> PL1_dir_cookie: the current value of PL1's nfs_open_dir_context.dir_cookie
>
> Then consider the following timeline, with "Tn" various points on a
> timeline of PL1 listing a very large directory.
>
> T0: PL1 starts listing directory with repeated calls to nfs_readdir()
> T1: acdirmax is exceeded, and we drop the pagecache (mapping->nrpages
> == 0); at this point, PL1_dir_cookie = DC1 != 0
> T2: PL1 calls nfs_readdir and nfs_readdir_dont_search_cache() returns
> true (due to mapping->nrpages == 0), so enters uncached_readdir();
> thus, PL1_dir_cookie makes forward progress so PL1_dir_cookie is ahead
> of DC1 and the exit of nfs_readdir()
> T3: PL1 calls nfs_readdir() and nfs_readdir_dont_search_cache() is
> false, so it must restart filling the pagecache from cookie == 0 (we
> send READDIRs over the wire); at this point PL1_dir_cookie forward
> progress is stalled, and PL1's calls to nfs_readdir() will have to
> fetch entries and fill pages in the pagecache that does not pertain to
> its listing (they've already been returned to userspace in a previous
> nfs_readdir call)
> T4: acdirmax is exceeded, and we drop the pagecache again, setting
> mapping->nrpages == 0; at this point though, PL1_dir_cookie = DC2,
> where DC2 is ahead of DC1 (we made forward progress last time)
> T5: PL1 calls nfs_readdir and nfs_readdir_dont_search_cache() returns
> true, so enters uncached_readdir(); thus, PL1_dir_cookie makes forward
> progress again for this one call to nfs_readdir(), and PL1_dir_cookie
> = DC3 is ahead of DC2
>
> Thus, it seems PL1 will eventually complete, even if it is delayed a
> bit, and even if it has to re-fill pages in the page cache that is
> "extra work" that doesn't help PL1. Forward progress is guaranteed
> due to the periodic calls to nfs_readdir() when
> nfs_readdir_dont_search_cache() returns true due to mapping->nrpages
> == 0 portion of the condition inside nfs_readdir_dont_search_cache()
> +static bool nfs_readdir_dont_search_cache(struct nfs_readdir_descriptor *desc)
> +{
> + struct address_space *mapping = desc->file->f_mapping;
> + struct inode *dir = file_inode(desc->file);
> + unsigned int dtsize = NFS_SERVER(dir)->dtsize;
> + loff_t size = i_size_read(dir);
> +
> + /*
> + * Default to uncached readdir if the page cache is empty, and
> + * we're looking for a non-zero cookie in a large directory.
> + */
> + return desc->dir_cookie != 0 && mapping->nrpages == 0 && size > dtsize;
> +}
>
>
>
>
> > I think these patches make NFS readdir dramatically better, and they
> > fix a lot of underlying issues with the existing code, like the cookie
> > resets when pagecache is dropped, single page rx data problem, etc.
> > Thank you for doing these patches.
> >
> > Tested-by: Dave Wysochanski
>

2020-11-12 16:51:57

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH v5 00/22] Readdir enhancements

On Thu, 2020-11-12 at 06:41 -0500, David Wysochanski wrote:
> On Wed, Nov 11, 2020 at 5:15 PM David Wysochanski <
> [email protected]> wrote:
> >
> > On Tue, Nov 10, 2020 at 4:48 PM <[email protected]> wrote:
> > >
> > > From: Trond Myklebust <[email protected]>
> > >
> > > The following patch series performs a number of cleanups on the
> > > readdir
> > > code.
> > > It also adds support for 1MB readdir RPC calls on-the-wire, and
> > > modifies
> > > the caching code to ensure that we cache the entire contents of
> > > that
> > > 1MB call (instead of discarding the data that doesn't fit into a
> > > single
> > > page).
> > > For filesystems that use ordered readdir cookie schemes (e.g.
> > > XFS), it
> > > optimises searching for cookies in the client's page cache by
> > > skipping
> > > over pages that contain cookie values that are not in the range
> > > we are
> > > searching for.
> > > Finally, it improves scalability when dealing with very large
> > > directories by turning off caching when those directories are
> > > changing,
> > > so as to avoid the need for a linear search on the client of the
> > > entire
> > > directory when looking for the first entry pointed to by the
> > > current
> > > file descriptor offset.
> > >
> > > v2: Fix the handling of the NFSv3/v4 directory verifier.
> > > v3: Optimise searching when the readdir cookies are seen to be
> > > ordered.
> > > v4: Optimise performance for large directories that are changing.
> > >     Add in llseek dependency patches.
> > > v5: Integrate Olga's patch for the READDIR security label
> > > handling.
> > >     Record more entries in the uncached readdir case. Bump the
> > > max
> > >     number of pages to 512, but allocate them on demand in case
> > > the
> > >     readdir RPC call returns fewer entries.
> > >
> > > Olga Kornievskaia (1):
> > >   NFSv4.2: condition READDIR's mask for security label based on
> > > LSM
> > >     state
> > >
> > > Trond Myklebust (21):
> > >   NFS: Remove unnecessary inode locking in nfs_llseek_dir()
> > >   NFS: Remove unnecessary inode lock in nfs_fsync_dir()
> > >   NFS: Ensure contents of struct nfs_open_dir_context are
> > > consistent
> > >   NFS: Clean up readdir struct nfs_cache_array
> > >   NFS: Clean up nfs_readdir_page_filler()
> > >   NFS: Clean up directory array handling
> > >   NFS: Don't discard readdir results
> > >   NFS: Remove unnecessary kmap in nfs_readdir_xdr_to_array()
> > >   NFS: Replace kmap() with kmap_atomic() in
> > > nfs_readdir_search_array()
> > >   NFS: Simplify struct nfs_cache_array_entry
> > >   NFS: Support larger readdir buffers
> > >   NFS: More readdir cleanups
> > >   NFS: nfs_do_filldir() does not return a value
> > >   NFS: Reduce readdir stack usage
> > >   NFS: Cleanup to remove nfs_readdir_descriptor_t typedef
> > >   NFS: Allow the NFS generic code to pass in a verifier to
> > > readdir
> > >   NFS: Handle NFS4ERR_NOT_SAME and NFSERR_BADCOOKIE from readdir
> > > calls
> > >   NFS: Improve handling of directory verifiers
> > >   NFS: Optimisations for monotonically increasing readdir cookies
> > >   NFS: Reduce number of RPC calls when doing uncached readdir
> > >   NFS: Do uncached readdir when we're seeking a cookie in an
> > > empty page
> > >     cache
> > >
> > >  fs/nfs/client.c         |   4 +-
> > >  fs/nfs/dir.c            | 734 +++++++++++++++++++++++++---------
> > > ------
> > >  fs/nfs/inode.c          |   7 -
> > >  fs/nfs/internal.h       |   6 -
> > >  fs/nfs/nfs3proc.c       |  35 +-
> > >  fs/nfs/nfs4proc.c       |  48 +--
> > >  fs/nfs/proc.c           |  18 +-
> > >  include/linux/nfs_fs.h  |   9 +-
> > >  include/linux/nfs_xdr.h |  17 +-
> > >  9 files changed, 541 insertions(+), 337 deletions(-)
> > >
> > > --
> > > 2.28.0
> > >
> >
> > I wrote a test script and ran this patchset against 5.10-rc2 in a
> > variety of scenarios listing 1 million files and various directory
> > modification scenarios. The test checked ops, runtime and cookie
> > resets (via tcpdump).
> >
> > All runs performed very well on all scenarios I threw at it on all
> > NFS
> > versions (I tested 3, 4.0, 4.1, and 4.2) with fairly dramatic
> > improvements vs the same runs on 5.10-rc2.   One scenario I found
> > where a single 'ls -l' could take a long time (about 5 minutes, but
> > not unbounded time) was when the directory was being modified with
> > both file adds and removed, but this seemed due to other ops
> > unrelated
> > to readdir performance.   A second scenario that this patchset does
> > not fix is the scenario where the directory is large enough to
> > exceed
> > the acdirmax, is being modified (adds and removes), and _multiple_
> > processes are listing it.  In this scenario a lister can still have
> > unbounded time, but the existing readdir code also has this problem
> > and fixing that probably requires another patch / approach (such as
> > reworking the structure of the cache or an approach such as the
> > flag
> > on per-process dir context).
> >
>
> I think I am mistaken in my previous email that said multiple listers
> will not make forward progress.  After more testing I could not
> reproduce any indefinite wait but it was many minutes, and I think
> there is proof in the current code (patch 22) that an indefinite wait
> should not occur with any process (multiple processes does not change
> this).  Note that I did not trace this, but below is an attempt at a
> more detailed explanation from the code.  The summary of the argument
> is that periodically nfs_readdir_dont_search_cache() will be true for
> one iteration of nfs_readdir(), and this ensures forward progress,
> even if it may be slow.
>
> Some notation for the below explanation.
> Tu: Time to list unmodified (idle) directory
> Ax: acdirmax
> Assume Tu > acdirmax (time to list idle directory is larger than
> acdirmax) and i_size_read(dir) > NFS_SERVER(dir)->dtsize
> PL1: pid1 that is listing the directory
> DC1, DC2 and DC3: values of nfs_open_dir_context.dir_cookie for PL1
> PL1_dir_cookie: the current value of PL1's
> nfs_open_dir_context.dir_cookie
>
> Then consider the following timeline, with "Tn" various points on a
> timeline of PL1 listing a very large directory.
>
> T0: PL1 starts listing directory with repeated calls to nfs_readdir()
> T1: acdirmax is exceeded, and we drop the pagecache (mapping->nrpages
> == 0); at this point, PL1_dir_cookie = DC1 != 0
> T2: PL1 calls nfs_readdir and nfs_readdir_dont_search_cache() returns
> true (due to mapping->nrpages == 0), so enters uncached_readdir();
> thus, PL1_dir_cookie makes forward progress so PL1_dir_cookie is
> ahead
> of DC1 and the exit of nfs_readdir()
> T3: PL1 calls nfs_readdir() and nfs_readdir_dont_search_cache() is
> false, so it must restart filling the pagecache from cookie == 0 (we
> send READDIRs over the wire); at this point PL1_dir_cookie forward
> progress is stalled, and PL1's calls to nfs_readdir() will have to
> fetch entries and fill pages in the pagecache that does not pertain
> to
> its listing (they've already been returned to userspace in a previous
> nfs_readdir call)
> T4: acdirmax is exceeded, and we drop the pagecache again, setting
> mapping->nrpages == 0; at this point though, PL1_dir_cookie = DC2,
> where DC2 is ahead of DC1 (we made forward progress last time)
> T5: PL1 calls nfs_readdir and nfs_readdir_dont_search_cache() returns
> true, so enters uncached_readdir(); thus, PL1_dir_cookie makes
> forward
> progress again for this one call to nfs_readdir(), and PL1_dir_cookie
> = DC3 is ahead of DC2
>
> Thus, it seems PL1 will eventually complete, even if it is delayed a
> bit, and even if it has to re-fill pages in the page cache that is
> "extra work" that doesn't help PL1.  Forward progress is guaranteed
> due to the periodic calls to nfs_readdir() when
> nfs_readdir_dont_search_cache() returns true due to mapping->nrpages
> == 0 portion of the condition inside nfs_readdir_dont_search_cache()
> +static bool nfs_readdir_dont_search_cache(struct
> nfs_readdir_descriptor *desc)
> +{
> +       struct address_space *mapping = desc->file->f_mapping;
> +       struct inode *dir = file_inode(desc->file);
> +       unsigned int dtsize = NFS_SERVER(dir)->dtsize;
> +       loff_t size = i_size_read(dir);
> +
> +       /*
> +        * Default to uncached readdir if the page cache is empty,
> and
> +        * we're looking for a non-zero cookie in a large directory.
> +        */
> +       return desc->dir_cookie != 0 && mapping->nrpages == 0 && size
> > dtsize;
> +}
>

Cool.

I was going to ask you if perhaps reverting Scott's commit 07b5ce8ef2d8
("NFS: Make nfs_readdir revalidate less often") might help here?
My thinking is that will trigger more cache invalidations when the
directory is changing underneath us, and will now trigger uncached
readdir in those situations.

> > I think these patches make NFS readdir dramatically better, and
> > they
> > fix a lot of underlying issues with the existing code, like the
> > cookie
> > resets when pagecache is dropped, single page rx data problem, etc.
> > Thank you for doing these patches.
> >
> > Tested-by: Dave Wysochanski


Thanks very much for the discussion and testing!

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2020-11-12 16:55:45

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH v5 00/22] Readdir enhancements

On Thu, 2020-11-12 at 17:34 +0200, Guy Keren wrote:
> just a general question: since the cache seems to cause many problems
> when dealing with very large directories, and since all solutions
> proposed until now don't seem to fully solve those problems, won't an
> approach such as "if the directory entries count exceeded X - stop
> using the cache completely" - where X is proportional to the size of
> the directory entries cache size limit - make the code simpler, and
> less prone to bugs of this sort?
>
> i *think* we can understand that for a directory with millions of
> files, we'll not have efficient caching on the client side, while
> limiting ourselves to reasonable RAM consumption?
>

Again, I disagree.

If you have a mostly-read directory with millions of files (e.g. data
pool) and lots of processes searching, then caching is both useful and
appropriate.

>

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2020-11-12 18:29:19

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH v5 00/22] Readdir enhancements

On Thu, 2020-11-12 at 16:51 +0000, Trond Myklebust wrote:
>
> I was going to ask you if perhaps reverting Scott's commit
> 07b5ce8ef2d8
> ("NFS: Make nfs_readdir revalidate less often") might help here?
> My thinking is that will trigger more cache invalidations when the
> directory is changing underneath us, and will now trigger uncached
> readdir in those situations.
>
> >

IOW, the suggestion would be to apply something like the following on
top of the existing readdir patchset:

---
fs/nfs/dir.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 3f70697729d8..384a4663f742 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -956,10 +956,10 @@ static int readdir_search_pagecache(struct
nfs_readdir_descriptor *desc)
{
int res;

- if (nfs_readdir_dont_search_cache(desc))
- return -EBADCOOKIE;
-
do {
+ if (nfs_readdir_dont_search_cache(desc))
+ return -EBADCOOKIE;
+
if (desc->page_index == 0) {
desc->current_index = 0;
desc->prev_index = 0;
@@ -1082,11 +1082,9 @@ static int nfs_readdir(struct file *file, struct
dir_context *ctx)
* to either find the entry with the appropriate number or
* revalidate the cookie.
*/
- if (ctx->pos == 0 || nfs_attribute_cache_expired(inode)) {
- res = nfs_revalidate_mapping(inode, file->f_mapping);
- if (res < 0)
- goto out;
- }
+ res = nfs_revalidate_mapping(inode, file->f_mapping);
+ if (res < 0)
+ goto out;

res = -ENOMEM;
desc = kzalloc(sizeof(*desc), GFP_KERNEL);


--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2020-11-12 18:39:37

by Benjamin Coddington

[permalink] [raw]
Subject: Re: [PATCH v5 00/22] Readdir enhancements



On 12 Nov 2020, at 13:26, Trond Myklebust wrote:

> On Thu, 2020-11-12 at 16:51 +0000, Trond Myklebust wrote:
>>
>> I was going to ask you if perhaps reverting Scott's commit
>> 07b5ce8ef2d8
>> ("NFS: Make nfs_readdir revalidate less often") might help here?
>> My thinking is that will trigger more cache invalidations when the
>> directory is changing underneath us, and will now trigger uncached
>> readdir in those situations.
>>
>>>
>
> IOW, the suggestion would be to apply something like the following on
> top of the existing readdir patchset:

I'm all for this approach, but - I'm rarely seeing the mapping->nrpages == 0
since the cache is dropped by a process in nfs_readdir() that immediately
starts filling the cache again.

It would make a lot more sense to me if we could do something like stash
desc->page_index << PAGE_SHIFT in f_pos after each nfs_readdir, then the
hueristic could check f_pos >> PAGE_SHIFT > nrpages.

Yes, f_pos is growing many different meanings in NFS directories, maybe we
can stash it on the directory context.

Ben

>
> ---
> fs/nfs/dir.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 3f70697729d8..384a4663f742 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -956,10 +956,10 @@ static int readdir_search_pagecache(struct
> nfs_readdir_descriptor *desc)
> {
> int res;
>
> - if (nfs_readdir_dont_search_cache(desc))
> - return -EBADCOOKIE;
> -
> do {
> + if (nfs_readdir_dont_search_cache(desc))
> + return -EBADCOOKIE;
> +
> if (desc->page_index == 0) {
> desc->current_index = 0;
> desc->prev_index = 0;
> @@ -1082,11 +1082,9 @@ static int nfs_readdir(struct file *file, struct
> dir_context *ctx)
> * to either find the entry with the appropriate number or
> * revalidate the cookie.
> */
> - if (ctx->pos == 0 || nfs_attribute_cache_expired(inode)) {
> - res = nfs_revalidate_mapping(inode, file->f_mapping);
> - if (res < 0)
> - goto out;
> - }
> + res = nfs_revalidate_mapping(inode, file->f_mapping);
> + if (res < 0)
> + goto out;
>
> res = -ENOMEM;
> desc = kzalloc(sizeof(*desc), GFP_KERNEL);
>
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]

2020-11-12 18:51:32

by Benjamin Coddington

[permalink] [raw]
Subject: Re: [PATCH v5 00/22] Readdir enhancements

On 12 Nov 2020, at 13:39, Benjamin Coddington wrote:

> On 12 Nov 2020, at 13:26, Trond Myklebust wrote:
>
>> On Thu, 2020-11-12 at 16:51 +0000, Trond Myklebust wrote:
>>>
>>> I was going to ask you if perhaps reverting Scott's commit
>>> 07b5ce8ef2d8
>>> ("NFS: Make nfs_readdir revalidate less often") might help here?
>>> My thinking is that will trigger more cache invalidations when the
>>> directory is changing underneath us, and will now trigger uncached
>>> readdir in those situations.
>>>
>>>>
>>
>> IOW, the suggestion would be to apply something like the following on
>> top of the existing readdir patchset:
>
> I'm all for this approach, but - I'm rarely seeing the
> mapping->nrpages == 0
> since the cache is dropped by a process in nfs_readdir() that
> immediately
> starts filling the cache again.
>
> It would make a lot more sense to me if we could do something like
> stash
> desc->page_index << PAGE_SHIFT in f_pos after each nfs_readdir, then
> the
> hueristic could check f_pos >> PAGE_SHIFT > nrpages.
>
> Yes, f_pos is growing many different meanings in NFS directories,
> maybe we
> can stash it on the directory context.

Ignore the f_pos suggestion -- iterate_dir sets it for us.. would have
to use
a private value.

Ben

2020-11-12 19:05:10

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH v5 00/22] Readdir enhancements

On Thu, 2020-11-12 at 13:39 -0500, Benjamin Coddington wrote:
>
>
> On 12 Nov 2020, at 13:26, Trond Myklebust wrote:
>
> > On Thu, 2020-11-12 at 16:51 +0000, Trond Myklebust wrote:
> > >
> > > I was going to ask you if perhaps reverting Scott's commit
> > > 07b5ce8ef2d8
> > > ("NFS: Make nfs_readdir revalidate less often") might help here?
> > > My thinking is that will trigger more cache invalidations when
> > > the
> > > directory is changing underneath us, and will now trigger
> > > uncached
> > > readdir in those situations.
> > >
> > > >
> >
> > IOW, the suggestion would be to apply something like the following
> > on
> > top of the existing readdir patchset:
>
> I'm all for this approach, but - I'm rarely seeing the mapping-
> >nrpages == 0
> since the cache is dropped by a process in nfs_readdir() that
> immediately
> starts filling the cache again.

That's why I moved the check in readdir_search_pagecache. Unless that
process has set desc->dir_cookie == 0, then that should prevent the
refilling.

> It would make a lot more sense to me if we could do something like
> stash
> desc->page_index << PAGE_SHIFT in f_pos after each nfs_readdir, then
> the
> hueristic could check f_pos >> PAGE_SHIFT > nrpages.
>
> Yes, f_pos is growing many different meanings in NFS directories,
> maybe we
> can stash it on the directory context.
>
> Ben
>
> >
> > ---
> >  fs/nfs/dir.c | 14 ++++++--------
> >  1 file changed, 6 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > index 3f70697729d8..384a4663f742 100644
> > --- a/fs/nfs/dir.c
> > +++ b/fs/nfs/dir.c
> > @@ -956,10 +956,10 @@ static int readdir_search_pagecache(struct
> > nfs_readdir_descriptor *desc)
> >  {
> >         int res;
> >
> > -       if (nfs_readdir_dont_search_cache(desc))
> > -               return -EBADCOOKIE;
> > -
> >         do {
> > +               if (nfs_readdir_dont_search_cache(desc))
> > +                       return -EBADCOOKIE;
> > +
> >                 if (desc->page_index == 0) {
> >                         desc->current_index = 0;
> >                         desc->prev_index = 0;
> > @@ -1082,11 +1082,9 @@ static int nfs_readdir(struct file *file,
> > struct
> > dir_context *ctx)
> >          * to either find the entry with the appropriate number or
> >          * revalidate the cookie.
> >          */
> > -       if (ctx->pos == 0 || nfs_attribute_cache_expired(inode)) {
> > -               res = nfs_revalidate_mapping(inode, file-
> > >f_mapping);
> > -               if (res < 0)
> > -                       goto out;
> > -       }
> > +       res = nfs_revalidate_mapping(inode, file->f_mapping);
> > +       if (res < 0)
> > +               goto out;
> >
> >         res = -ENOMEM;
> >         desc = kzalloc(sizeof(*desc), GFP_KERNEL);
> >
> >
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > [email protected]
>

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2020-11-12 19:10:19

by Benjamin Coddington

[permalink] [raw]
Subject: Re: [PATCH v5 00/22] Readdir enhancements

On 12 Nov 2020, at 14:04, Trond Myklebust wrote:

> On Thu, 2020-11-12 at 13:39 -0500, Benjamin Coddington wrote:
>>
>>
>> On 12 Nov 2020, at 13:26, Trond Myklebust wrote:
>>
>>> On Thu, 2020-11-12 at 16:51 +0000, Trond Myklebust wrote:
>>>>
>>>> I was going to ask you if perhaps reverting Scott's commit
>>>> 07b5ce8ef2d8
>>>> ("NFS: Make nfs_readdir revalidate less often") might help here?
>>>> My thinking is that will trigger more cache invalidations when
>>>> the
>>>> directory is changing underneath us, and will now trigger
>>>> uncached
>>>> readdir in those situations.
>>>>
>>>>>
>>>
>>> IOW, the suggestion would be to apply something like the following
>>> on
>>> top of the existing readdir patchset:
>>
>> I'm all for this approach, but - I'm rarely seeing the mapping-
>>> nrpages == 0
>> since the cache is dropped by a process in nfs_readdir() that
>> immediately
>> starts filling the cache again.
>
> That's why I moved the check in readdir_search_pagecache. Unless that
> process has set desc->dir_cookie == 0, then that should prevent the
> refilling.

My pathological benchmarking does send another process in with
desc->dir_cookie == 0.

I'm doing fork(), while(getdents) every second with acdirmin=1 and a listing
that takes longer than 1 second to complete uncached, while touching the
directory every second.

Ben

2020-11-12 20:27:01

by Benjamin Coddington

[permalink] [raw]
Subject: Re: [PATCH v5 00/22] Readdir enhancements

On 12 Nov 2020, at 14:09, Benjamin Coddington wrote:

> On 12 Nov 2020, at 14:04, Trond Myklebust wrote:
>
>> On Thu, 2020-11-12 at 13:39 -0500, Benjamin Coddington wrote:
>>>
>>>
>>> On 12 Nov 2020, at 13:26, Trond Myklebust wrote:
>>>
>>>> On Thu, 2020-11-12 at 16:51 +0000, Trond Myklebust wrote:
>>>>>
>>>>> I was going to ask you if perhaps reverting Scott's commit
>>>>> 07b5ce8ef2d8
>>>>> ("NFS: Make nfs_readdir revalidate less often") might help here?
>>>>> My thinking is that will trigger more cache invalidations when
>>>>> the
>>>>> directory is changing underneath us, and will now trigger
>>>>> uncached
>>>>> readdir in those situations.
>>>>>
>>>>>>
>>>>
>>>> IOW, the suggestion would be to apply something like the following
>>>> on
>>>> top of the existing readdir patchset:
>>>
>>> I'm all for this approach, but - I'm rarely seeing the mapping-
>>>> nrpages == 0
>>> since the cache is dropped by a process in nfs_readdir() that
>>> immediately
>>> starts filling the cache again.
>>
>> That's why I moved the check in readdir_search_pagecache. Unless that
>> process has set desc->dir_cookie == 0, then that should prevent the
>> refilling.
>
> My pathological benchmarking does send another process in with
> desc->dir_cookie == 0.
>
> I'm doing fork(), while(getdents) every second with acdirmin=1 and a
> listing
> that takes longer than 1 second to complete uncached, while touching
> the
> directory every second.


As long as I use a buffer > 16k for getdents(), this hack can keep up
with the testcase:

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index d7a9efd31ecd..ae687662112a 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -78,6 +78,7 @@ static struct nfs_open_dir_context
*alloc_nfs_open_dir_context(struct inode *dir
ctx->attr_gencount = nfsi->attr_gencount;
ctx->dir_cookie = 0;
ctx->dup_cookie = 0;
+ ctx->page_index = 0;
spin_lock(&dir->i_lock);
if (list_empty(&nfsi->open_files) &&
(nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER))
@@ -150,6 +151,7 @@ struct nfs_readdir_descriptor {
struct page *page;
struct dir_context *ctx;
pgoff_t page_index;
+ pgoff_t last_page_index;
u64 dir_cookie;
u64 last_cookie;
u64 dup_cookie;
@@ -928,7 +930,12 @@ static bool nfs_readdir_dont_search_cache(struct
nfs_readdir_descriptor *desc)
* Default to uncached readdir if the page cache is empty, and
* we're looking for a non-zero cookie in a large directory.
*/
- return desc->dir_cookie != 0 && mapping->nrpages == 0 && size >
dtsize;
+ if (desc->dir_cookie == 0)
+ return false;
+ else if (mapping->nrpages < desc->last_page_index && size >
dtsize)
+ return true;
+
+ return false;
}

/* Search for desc->dir_cookie from the beginning of the page cache */
@@ -936,10 +943,10 @@ static int readdir_search_pagecache(struct
nfs_readdir_descriptor *desc)
{
int res;

- if (nfs_readdir_dont_search_cache(desc))
- return -EBADCOOKIE;
-
do {
+ if (nfs_readdir_dont_search_cache(desc))
+ return -EBADCOOKIE;
+
if (desc->page_index == 0) {
desc->current_index = 0;
desc->prev_index = 0;
@@ -1062,11 +1069,9 @@ static int nfs_readdir(struct file *file, struct
dir_context *ctx)
* to either find the entry with the appropriate number or
* revalidate the cookie.
*/
- if (ctx->pos == 0 || nfs_attribute_cache_expired(inode)) {
- res = nfs_revalidate_mapping(inode, file->f_mapping);
- if (res < 0)
- goto out;
- }
+ res = nfs_revalidate_mapping(inode, file->f_mapping);
+ if (res < 0)
+ goto out;

res = -ENOMEM;
desc = kzalloc(sizeof(*desc), GFP_KERNEL);
@@ -1082,6 +1087,7 @@ static int nfs_readdir(struct file *file, struct
dir_context *ctx)
desc->duped = dir_ctx->duped;
desc->attr_gencount = dir_ctx->attr_gencount;
memcpy(desc->verf, dir_ctx->verf, sizeof(desc->verf));
+ desc->last_page_index = dir_ctx->page_index;
spin_unlock(&file->f_lock);

do {
@@ -1121,6 +1127,8 @@ static int nfs_readdir(struct file *file, struct
dir_context *ctx)
dir_ctx->duped = desc->duped;
dir_ctx->attr_gencount = desc->attr_gencount;
memcpy(dir_ctx->verf, desc->verf, sizeof(dir_ctx->verf));
+ if (desc->page_index > dir_ctx->page_index)
+ dir_ctx->page_index = desc->page_index;
spin_unlock(&file->f_lock);

kfree(desc);
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 681ed98e4ba8..ad1f8e9a22e6 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -98,6 +98,7 @@ struct nfs_open_dir_context {
__u64 dir_cookie;
__u64 dup_cookie;
signed char duped;
+ pgoff_t page_index;
};

/*



Here's the testcase I'm using:
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/time.h>

/*
* mkdir /exports/10k_dentries
* for i in {1..10000}; do printf "/exports/10k_dentries/file%.12d\n"
$i; done | xargs touch
*/
#define DIR "/mnt/fedora/10k_dentries"
#define BUF_SIZE 1024 * 16

void evict_pagecache() {
int dir_fd = open(DIR, O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC);
posix_fadvise(dir_fd, 0, 0, POSIX_FADV_DONTNEED);
close(dir_fd);
}

/* call getdents64() until no more, return time taken */
int listdir() {
struct timeval tvs, tve;
char buf[BUF_SIZE];
int dir_fd = open(DIR, O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC);

gettimeofday(&tvs, NULL);
while (syscall(SYS_getdents, dir_fd, buf, BUF_SIZE)) { }
gettimeofday(&tve, NULL);
return tve.tv_sec - tvs.tv_sec;
}

int main(int argc, char **argv)
{
evict_pagecache();

while (1) {
if (!fork()) {
int secs;
printf("pid %d starts..\n", getpid());
fflush(stdout);

secs = listdir();

printf("pid %d done %d seconds\n", getpid(), secs);
fflush(stdout);
return 0;
}
sleep (1);
}
}

Ben

2020-11-13 11:13:00

by Mkrtchyan, Tigran

[permalink] [raw]
Subject: Re: [PATCH v5 00/22] Readdir enhancements



Hi Trond,

I can confirm that with latest patchset we see 3x improvement
on directories listings with 100K and 500K files.

The downside is that our users will start to create directories
with even more files :). This weeks record 2.5M files at
200 files-per-second rate....

Thanks again,
Tigran.

----- Original Message -----
> From: "Benjamin Coddington" <[email protected]>
> To: "trondmy" <[email protected]>
> Cc: "linux-nfs" <[email protected]>, "David Wysochanski" <[email protected]>
> Sent: Thursday, 12 November, 2020 21:23:32
> Subject: Re: [PATCH v5 00/22] Readdir enhancements

> On 12 Nov 2020, at 14:09, Benjamin Coddington wrote:
>
>> On 12 Nov 2020, at 14:04, Trond Myklebust wrote:
>>
>>> On Thu, 2020-11-12 at 13:39 -0500, Benjamin Coddington wrote:
>>>>
>>>>
>>>> On 12 Nov 2020, at 13:26, Trond Myklebust wrote:
>>>>
>>>>> On Thu, 2020-11-12 at 16:51 +0000, Trond Myklebust wrote:
>>>>>>
>>>>>> I was going to ask you if perhaps reverting Scott's commit
>>>>>> 07b5ce8ef2d8
>>>>>> ("NFS: Make nfs_readdir revalidate less often") might help here?
>>>>>> My thinking is that will trigger more cache invalidations when
>>>>>> the
>>>>>> directory is changing underneath us, and will now trigger
>>>>>> uncached
>>>>>> readdir in those situations.
>>>>>>
>>>>>>>
>>>>>
>>>>> IOW, the suggestion would be to apply something like the following
>>>>> on
>>>>> top of the existing readdir patchset:
>>>>
>>>> I'm all for this approach, but - I'm rarely seeing the mapping-
>>>>> nrpages == 0
>>>> since the cache is dropped by a process in nfs_readdir() that
>>>> immediately
>>>> starts filling the cache again.
>>>
>>> That's why I moved the check in readdir_search_pagecache. Unless that
>>> process has set desc->dir_cookie == 0, then that should prevent the
>>> refilling.
>>
>> My pathological benchmarking does send another process in with
>> desc->dir_cookie == 0.
>>
>> I'm doing fork(), while(getdents) every second with acdirmin=1 and a
>> listing
>> that takes longer than 1 second to complete uncached, while touching
>> the
>> directory every second.
>
>
> As long as I use a buffer > 16k for getdents(), this hack can keep up
> with the testcase:
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index d7a9efd31ecd..ae687662112a 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -78,6 +78,7 @@ static struct nfs_open_dir_context
> *alloc_nfs_open_dir_context(struct inode *dir
> ctx->attr_gencount = nfsi->attr_gencount;
> ctx->dir_cookie = 0;
> ctx->dup_cookie = 0;
> + ctx->page_index = 0;
> spin_lock(&dir->i_lock);
> if (list_empty(&nfsi->open_files) &&
> (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER))
> @@ -150,6 +151,7 @@ struct nfs_readdir_descriptor {
> struct page *page;
> struct dir_context *ctx;
> pgoff_t page_index;
> + pgoff_t last_page_index;
> u64 dir_cookie;
> u64 last_cookie;
> u64 dup_cookie;
> @@ -928,7 +930,12 @@ static bool nfs_readdir_dont_search_cache(struct
> nfs_readdir_descriptor *desc)
> * Default to uncached readdir if the page cache is empty, and
> * we're looking for a non-zero cookie in a large directory.
> */
> - return desc->dir_cookie != 0 && mapping->nrpages == 0 && size >
> dtsize;
> + if (desc->dir_cookie == 0)
> + return false;
> + else if (mapping->nrpages < desc->last_page_index && size >
> dtsize)
> + return true;
> +
> + return false;
> }
>
> /* Search for desc->dir_cookie from the beginning of the page cache */
> @@ -936,10 +943,10 @@ static int readdir_search_pagecache(struct
> nfs_readdir_descriptor *desc)
> {
> int res;
>
> - if (nfs_readdir_dont_search_cache(desc))
> - return -EBADCOOKIE;
> -
> do {
> + if (nfs_readdir_dont_search_cache(desc))
> + return -EBADCOOKIE;
> +
> if (desc->page_index == 0) {
> desc->current_index = 0;
> desc->prev_index = 0;
> @@ -1062,11 +1069,9 @@ static int nfs_readdir(struct file *file, struct
> dir_context *ctx)
> * to either find the entry with the appropriate number or
> * revalidate the cookie.
> */
> - if (ctx->pos == 0 || nfs_attribute_cache_expired(inode)) {
> - res = nfs_revalidate_mapping(inode, file->f_mapping);
> - if (res < 0)
> - goto out;
> - }
> + res = nfs_revalidate_mapping(inode, file->f_mapping);
> + if (res < 0)
> + goto out;
>
> res = -ENOMEM;
> desc = kzalloc(sizeof(*desc), GFP_KERNEL);
> @@ -1082,6 +1087,7 @@ static int nfs_readdir(struct file *file, struct
> dir_context *ctx)
> desc->duped = dir_ctx->duped;
> desc->attr_gencount = dir_ctx->attr_gencount;
> memcpy(desc->verf, dir_ctx->verf, sizeof(desc->verf));
> + desc->last_page_index = dir_ctx->page_index;
> spin_unlock(&file->f_lock);
>
> do {
> @@ -1121,6 +1127,8 @@ static int nfs_readdir(struct file *file, struct
> dir_context *ctx)
> dir_ctx->duped = desc->duped;
> dir_ctx->attr_gencount = desc->attr_gencount;
> memcpy(dir_ctx->verf, desc->verf, sizeof(dir_ctx->verf));
> + if (desc->page_index > dir_ctx->page_index)
> + dir_ctx->page_index = desc->page_index;
> spin_unlock(&file->f_lock);
>
> kfree(desc);
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index 681ed98e4ba8..ad1f8e9a22e6 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -98,6 +98,7 @@ struct nfs_open_dir_context {
> __u64 dir_cookie;
> __u64 dup_cookie;
> signed char duped;
> + pgoff_t page_index;
> };
>
> /*
>
>
>
> Here's the testcase I'm using:
> #include <stdio.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/syscall.h>
> #include <sys/time.h>
>
> /*
> * mkdir /exports/10k_dentries
> * for i in {1..10000}; do printf "/exports/10k_dentries/file%.12d\n"
> $i; done | xargs touch
> */
> #define DIR "/mnt/fedora/10k_dentries"
> #define BUF_SIZE 1024 * 16
>
> void evict_pagecache() {
> int dir_fd = open(DIR, O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC);
> posix_fadvise(dir_fd, 0, 0, POSIX_FADV_DONTNEED);
> close(dir_fd);
> }
>
> /* call getdents64() until no more, return time taken */
> int listdir() {
> struct timeval tvs, tve;
> char buf[BUF_SIZE];
> int dir_fd = open(DIR, O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC);
>
> gettimeofday(&tvs, NULL);
> while (syscall(SYS_getdents, dir_fd, buf, BUF_SIZE)) { }
> gettimeofday(&tve, NULL);
> return tve.tv_sec - tvs.tv_sec;
> }
>
> int main(int argc, char **argv)
> {
> evict_pagecache();
>
> while (1) {
> if (!fork()) {
> int secs;
> printf("pid %d starts..\n", getpid());
> fflush(stdout);
>
> secs = listdir();
>
> printf("pid %d done %d seconds\n", getpid(), secs);
> fflush(stdout);
> return 0;
> }
> sleep (1);
> }
> }
>
> Ben

2020-11-14 13:34:44

by David Wysochanski

[permalink] [raw]
Subject: Re: [PATCH v5 00/22] Readdir enhancements

On Thu, Nov 12, 2020 at 1:26 PM Trond Myklebust <[email protected]> wrote:
>
> On Thu, 2020-11-12 at 16:51 +0000, Trond Myklebust wrote:
> >
> > I was going to ask you if perhaps reverting Scott's commit
> > 07b5ce8ef2d8
> > ("NFS: Make nfs_readdir revalidate less often") might help here?
> > My thinking is that will trigger more cache invalidations when the
> > directory is changing underneath us, and will now trigger uncached
> > readdir in those situations.
> >
> > >
>
> IOW, the suggestion would be to apply something like the following on
> top of the existing readdir patchset:
>
> ---
> fs/nfs/dir.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 3f70697729d8..384a4663f742 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -956,10 +956,10 @@ static int readdir_search_pagecache(struct
> nfs_readdir_descriptor *desc)
> {
> int res;
>
> - if (nfs_readdir_dont_search_cache(desc))
> - return -EBADCOOKIE;
> -
> do {
> + if (nfs_readdir_dont_search_cache(desc))
> + return -EBADCOOKIE;
> +
> if (desc->page_index == 0) {
> desc->current_index = 0;
> desc->prev_index = 0;
> @@ -1082,11 +1082,9 @@ static int nfs_readdir(struct file *file, struct
> dir_context *ctx)
> * to either find the entry with the appropriate number or
> * revalidate the cookie.
> */
> - if (ctx->pos == 0 || nfs_attribute_cache_expired(inode)) {
> - res = nfs_revalidate_mapping(inode, file->f_mapping);
> - if (res < 0)
> - goto out;
> - }
> + res = nfs_revalidate_mapping(inode, file->f_mapping);
> + if (res < 0)
> + goto out;
>
> res = -ENOMEM;
> desc = kzalloc(sizeof(*desc), GFP_KERNEL);
>
>

If you want to pursue this let me know and I will run through some
tests with this patch (or some other patch) and all NFS versions.
I've mostly been testing with NFSv4 because that has been the
problematic case, but this would affect NFSv3 too.

I too had been looking at 07b5ce8ef2d8 because it causes NFSv3 to
behave differently than NFSv4 (I saw with NFSv3,
nfs_attribute_cache_expired() was always false while a directory is
being listed). I pondered whether this was the right direction for
NFSv4 as well since that causes us to avoid dropping the pagecache and
kinda "neuters" it for a period of time while a large directory
listing was in progress. However I didn't see a way to achieve the
NFSv3 behavior for NFSv4 and I wasn't convinced NFSv3 behavior was
necessarily the best approach.