2020-11-02 13:53:00

by David Wysochanski

Subject: [PATCH 00/11] Add NFS readdir tracepoints and improve performance of reading directories

Note these patches are also on github at:
https://github.com/DaveWysochanskiRH/kernel/tree/5.9-nfs-readdir

This patchset is mostly tracepoints, but patch 9 fixes a performance
problem with NFSv4.x when listing a directory that is being modified.
After the NFSv4.x patch, for consistency, patch 11 is included, which
makes NFSv3 behave the same as NFSv4 and, given the inclusion of
patch 9, does not re-introduce the regression fixed in commit
79f687a3de9e. As described in patch 11, NFSv3 does not currently have
the NFSv4 problem because it does not drop the pagecache while a
large directory read is in progress.

As background, currently any process that is listing a directory
must search the pagecache starting at cookie 0 on each entry to
nfs_readdir(), regardless of the previous cookie returned by the
server, the NFS version, or whether the directory is being modified
or the cache has expired. This means, for example, that if a large
directory listing requires many getdents64/nfs_readdir calls, the
kernel searches through the pagecache with an increasing penalty as
the size of the directory increases and the process attempts to obtain
the latter entries of the directory. When the directory is being
modified, acdirmax is exceeded, and nfs_attribute_cache_expired()
returns true, NFSv4.x drops the pagecache for the directory entries,
so the process cannot make forward progress.
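The growing per-call penalty can also be observed from userspace; the
following is only a rough sketch (it assumes strace is available and that
$MNT1, as in the reproducer below, is an NFS mount of a large directory):
# record the time spent in each getdents64 syscall during an uncached listing
strace -o /tmp/getdents.trace -T -e trace=getdents64 ls -1f $MNT1 > /dev/null
tail /tmp/getdents.trace
# the per-call times in <...> tend to grow as the listing proceeds, since
# each call re-searches the pagecache starting at cookie 0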

I investigated using an approach similar to the one taken for NFSv3 in
commit 79f687a3de9e, but for NFSv4, READDIR does not return the directory
attributes and thus _nfs4_proc_readdir() does not call nfs_refresh_inode(),
unlike nfs3_proc_readdir(). With NFSv3, the call to nfs_refresh_inode()
is what updates nfsi->read_cache_jiffies, causing nfs_attribute_cache_expired()
to always return false and leaving the pagecache intact despite
NFS_INO_INVALID_DATA being set. Thus it's not clear whether the NFSv3
approach could be applied to NFSv4 to achieve the same behavior as with
NFSv3. In addition, the current NFSv3 approach leaves in place the
aforementioned penalty, which for large directories can be substantial.
So rather than that approach, the approach taken with NFSv4 leaves
in place the dropping of the pagecache when acdirmax expires, and
relies on the fact that a process can trust the last cookie returned
by the server and continue at that point in the pagecache without
forcing the current process to start over from cookie 0 upon the
next entry to nfs_readdir(). As far as I could tell there is no
POSIX requirement to start from cookie 0 again when a directory is
being modified during an in-progress directory read, so the current
behavior is only an implementation limitation.

The NFSv4 performance problem can be seen by exporting a directory
with a large enough number of files that the uncached time to list the
directory exceeds acdirmax. I have an explicit reproducer script
I can provide, but a simple reproducer outline is as follows:

1. First export a directory that contains enough files that the
listing exceeds acdirmax. In my testing, the NFS server and client
were the same system, there were 500,000 files in the directory,
and I set acdirmin=10,acdirmax=20.

2. Mount client1 (reader) and client2 (writer). Note that below I use
the loopback IP address, and a 'wsize' parameter on the writer mount
so we get different superblocks:
mount -o vers=4.1 127.0.0.1:/export/bz1889478 $MNT1
mount -o vers=4.1,wsize=65536 127.0.0.1:/export/bz1889478 $MNT2

3. Start client2 (writer):
echo 3 > /proc/sys/vm/drop_caches
i=500000; while true; do touch $MNT2/file$i.bin; i=$((i + 1)); sleep 1; done > /dev/null 2>&1 &

4. Start client1 (reader):
while true; do time ls -1f $MNT1 | wc -l; done &
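As a rough sketch (this is not the t0_bz1889478.sh script itself, and it
assumes the "READDIR: <ops> ..." per-op line format of /proc/self/mountstats
and a single NFS mount on the box), the READDIR op counts shown in the
output below could be gathered like so:
before=$(awk '$1 == "READDIR:" {print $2; exit}' /proc/self/mountstats)
time ls -1f $MNT1 | wc -l
after=$(awk '$1 == "READDIR:" {print $2; exit}' /proc/self/mountstats)
echo "READDIR ops total: $((after - before)) (before = $before after = $after)"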

The output from my reproducer script is:
./t0_bz1889478.sh 4.1 127.0.0.1 500000
Mon 02 Nov 2020 07:54:22 AM EST: Running with NFS vers=4.1 server 127.0.0.1 and files 500000 directory is idle
Mon 02 Nov 2020 07:54:22 AM EST: mount -o vers=4.1 127.0.0.1:/export/bz1889478 /mnt/nfs/bz1889478-mnt1
Mon 02 Nov 2020 07:54:22 AM EST: dropping caches with: echo 3 > /proc/sys/vm/drop_caches
Mon 02 Nov 2020 07:54:24 AM EST: idle directory: skipping client2 adding files
Mon 02 Nov 2020 07:54:24 AM EST: client1 running command: ls -lf /mnt/nfs/bz1889478-mnt1
Mon 02 Nov 2020 07:54:24 AM EST: waiting for listing to complete
Mon 02 Nov 2020 07:54:46 AM EST: client1 pid took 22 seconds to list the directory of 500000 files
Mon 02 Nov 2020 07:54:46 AM EST: client1 READDIR ops total: 4902 (before = 0 after = 4902)
Mon 02 Nov 2020 07:54:47 AM EST: umounting /mnt/nfs/bz1889478-mnt1 /mnt/nfs/bz1889478-mnt2
Mon 02 Nov 2020 07:54:47 AM EST: Running with NFS vers=4.1 server 127.0.0.1 and files 500000 directory being modified
Mon 02 Nov 2020 07:54:47 AM EST: mount -o vers=4.1 127.0.0.1:/export/bz1889478 /mnt/nfs/bz1889478-mnt1
Mon 02 Nov 2020 07:54:47 AM EST: mount -o vers=4.1,wsize=65536 127.0.0.1:/export/bz1889478 /mnt/nfs/bz1889478-mnt2
Mon 02 Nov 2020 07:54:47 AM EST: dropping caches with: echo 3 > /proc/sys/vm/drop_caches
Mon 02 Nov 2020 07:54:50 AM EST: changing directory: client2 start adding 1 file/sec at /mnt/nfs/bz1889478-mnt2
Mon 02 Nov 2020 07:54:51 AM EST: client1 running command: ls -lf /mnt/nfs/bz1889478-mnt1
Mon 02 Nov 2020 07:54:51 AM EST: waiting for listing to complete
Mon 02 Nov 2020 07:55:12 AM EST: client1 pid took 21 seconds to list the directory of 500000 files
Mon 02 Nov 2020 07:55:12 AM EST: client1 READDIR ops total: 4902 (before = 0 after = 4902)
./t0_bz1889478.sh: line 124: 5973 Killed while true; do
echo $i; touch $NFS_CLIENT_MOUNTPOINT2/file$i.bin; i=$((i + 1)); sleep 1;
done > /dev/null 2>&1
Mon 02 Nov 2020 07:55:20 AM EST: umounting /mnt/nfs/bz1889478-mnt1 /mnt/nfs/bz1889478-mnt2
PASSED TEST ./t0_bz1889478.sh on kernel 5.9.0-nfs-readdir+ with NFS vers=4.1

For diagnostics and verification, a tcpdump can of course be used,
or the READDIR op counts and elapsed time can be compared as in the
reproducer, but the included tracepoints can also be used. For
the tracepoints, before step #2 above use the trace-cmd invocation
below to trace the listing and see whether the problem occurs,
evidenced either by the presence of nfs_invalidate_mapping*
trace events or by multiple nfs_readdir_enter calls with
"cookie=0x00000000":
trace-cmd start -e nfs:nfs_readdir_enter -e nfs4:nfs4_readdir -e nfs:nfs_readdir_exit -e nfs:nfs_invalidate_mapping_*
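After the listing completes, the trace buffer can be inspected and the
tracer torn down; for example (a sketch, assuming the trace-cmd invocation
above was used):
trace-cmd show | grep -c "cookie=0x00000000"    # more than one hit suggests restarts from cookie 0
trace-cmd stop
trace-cmd reset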


Dave Wysochanski (11):
NFSv4: Improve nfs4_readdir tracepoint by adding additional fields
NFS: Replace dfprintk statements with trace events in nfs_readdir
NFS: Move nfs_readdir_descriptor_t into internal header
NFS: Add tracepoints for functions involving nfs_readdir_descriptor_t
NFS: Add tracepoints for opendir, closedir, fsync_dir and llseek_dir
NFS: Add tracepoints for nfs_readdir_xdr_filler enter and exit
NFS: Add tracepoint to entry and exit of nfs_do_filldir
NFS: Replace LOOKUPCACHE dfprintk statements with tracepoints
NFS: Improve performance of listing directories being modified
NFS: Add page_index to nfs_readdir enter and exit tracepoints
NFS: Bring back nfs_dir_mapping_need_revalidate() in nfs_readdir()

fs/nfs/dir.c | 101 +++++++++---------
fs/nfs/internal.h | 18 ++++
fs/nfs/nfs3xdr.c | 2 +-
fs/nfs/nfs4proc.c | 2 +-
fs/nfs/nfs4trace.h | 44 +++++++-
fs/nfs/nfstrace.h | 277 +++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/nfs_fs.h | 1 +
7 files changed, 394 insertions(+), 51 deletions(-)

--
1.8.3.1


2020-11-02 13:53:05

by David Wysochanski

Subject: [PATCH 02/11] NFS: Replace dfprintk statements with trace events in nfs_readdir

Remove a couple dfprintks in nfs_readdir and replace them with
trace events on entry/exit that contain more information.
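Once applied, the new events can also be enabled directly through tracefs;
for example (a sketch, and the tracefs mount point may instead be
/sys/kernel/debug/tracing on some systems):
echo 1 > /sys/kernel/tracing/events/nfs/nfs_readdir_enter/enable
echo 1 > /sys/kernel/tracing/events/nfs/nfs_readdir_exit/enable
cat /sys/kernel/tracing/trace_pipe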

Signed-off-by: Dave Wysochanski <[email protected]>
---
fs/nfs/dir.c | 7 +++--
fs/nfs/nfstrace.h | 84 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 88 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index cb52db9a0cfb..116493e66922 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -900,9 +900,9 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
*desc = &my_desc;
int res = 0;

- dfprintk(FILE, "NFS: readdir(%pD2) starting at cookie %llu\n",
- file, (long long)ctx->pos);
nfs_inc_stats(inode, NFSIOS_VFSGETDENTS);
+ trace_nfs_readdir_enter(inode, ctx->pos, dir_ctx->dir_cookie,
+ NFS_SERVER(inode)->dtsize, my_desc.plus);

/*
* ctx->pos points to the dirent entry number.
@@ -949,7 +949,8 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
out:
if (res > 0)
res = 0;
- dfprintk(FILE, "NFS: readdir(%pD2) returns %d\n", file, res);
+ trace_nfs_readdir_exit(inode, ctx->pos, dir_ctx->dir_cookie,
+ NFS_SERVER(inode)->dtsize, my_desc.plus, res);
return res;
}

diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 5a59dcdce0b2..6bbe0aa221f2 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -668,6 +668,90 @@
DEFINE_NFS_DIRECTORY_EVENT(nfs_symlink_enter);
DEFINE_NFS_DIRECTORY_EVENT_DONE(nfs_symlink_exit);

+TRACE_EVENT(nfs_readdir_enter,
+ TP_PROTO(
+ const struct inode *inode,
+ u64 cookie,
+ u64 dir_cookie,
+ unsigned int count,
+ bool plus
+ ),
+
+ TP_ARGS(inode, cookie, dir_cookie, count, plus),
+
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(u32, fhandle)
+ __field(u64, fileid)
+ __field(u64, cookie)
+ __field(u64, dir_cookie)
+ __field(u64, count)
+ __field(bool, plus)
+ ),
+
+ TP_fast_assign(
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->fileid = NFS_FILEID(inode);
+ __entry->fhandle = nfs_fhandle_hash(NFS_FH(inode));
+ __entry->cookie = cookie;
+ __entry->dir_cookie = dir_cookie;
+ __entry->count = count;
+ __entry->plus = plus;
+ ),
+
+ TP_printk(
+ "fileid=%02x:%02x:%llu fhandle=0x%08x cookie=0x%08llx dir_cookie=0x%08llx count=%llu plus=%s",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long long)__entry->fileid,
+ __entry->fhandle, __entry->cookie, __entry->dir_cookie,
+ __entry->count, __entry->plus ? "true" : "false"
+ )
+);
+
+TRACE_EVENT(nfs_readdir_exit,
+ TP_PROTO(
+ const struct inode *inode,
+ u64 cookie,
+ u64 dir_cookie,
+ unsigned int count,
+ bool plus,
+ int error
+ ),
+
+ TP_ARGS(inode, cookie, dir_cookie, count, plus, error),
+
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(u32, fhandle)
+ __field(u64, fileid)
+ __field(unsigned long, error)
+ __field(u64, cookie)
+ __field(u64, dir_cookie)
+ __field(u64, count)
+ __field(bool, plus)
+ ),
+
+ TP_fast_assign(
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->fileid = NFS_FILEID(inode);
+ __entry->fhandle = nfs_fhandle_hash(NFS_FH(inode));
+ __entry->error = error;
+ __entry->cookie = cookie;
+ __entry->dir_cookie = dir_cookie;
+ __entry->count = count;
+ __entry->plus = plus;
+ ),
+
+ TP_printk(
+ "error=%ld fileid=%02x:%02x:%llu fhandle=0x%08x cookie=0x%08llx dir_cookie=0x%08llx count=%llu plus=%s",
+ __entry->error,
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long long)__entry->fileid,
+ __entry->fhandle, __entry->cookie, __entry->dir_cookie,
+ __entry->count, __entry->plus ? "true" : "false"
+ )
+);
+
TRACE_EVENT(nfs_link_enter,
TP_PROTO(
const struct inode *inode,
--
1.8.3.1

2020-11-02 13:53:19

by David Wysochanski

Subject: [PATCH 05/11] NFS: Add tracepoints for opendir, closedir, fsync_dir and llseek_dir

Add a few more directory operation tracepoints and remove a few
more dfprintks.
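For reference, the directory tracepoints can be enabled together with a
glob via set_event; a sketch (the glob also matches the existing mkdir,
rmdir and readdir directory events, and the tracefs mount point may differ):
echo 'nfs:nfs_*dir*' >> /sys/kernel/tracing/set_event
cat /sys/kernel/tracing/trace_pipe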

Signed-off-by: Dave Wysochanski <[email protected]>
---
fs/nfs/dir.c | 10 +++++---
fs/nfs/nfstrace.h | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 79 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 227cddc12983..e4cd9789ebb5 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -109,7 +109,7 @@ static void put_nfs_open_dir_context(struct inode *dir, struct nfs_open_dir_cont
int res = 0;
struct nfs_open_dir_context *ctx;

- dfprintk(FILE, "NFS: open dir(%pD2)\n", filp);
+ trace_nfs_opendir_enter(file_inode(filp), file_dentry(filp));

nfs_inc_stats(inode, NFSIOS_VFSOPEN);

@@ -120,13 +120,16 @@ static void put_nfs_open_dir_context(struct inode *dir, struct nfs_open_dir_cont
}
filp->private_data = ctx;
out:
+ trace_nfs_opendir_exit(file_inode(filp), file_dentry(filp), res);
return res;
}

static int
nfs_closedir(struct inode *inode, struct file *filp)
{
+ trace_nfs_closedir_enter(file_inode(filp), file_dentry(filp));
put_nfs_open_dir_context(file_inode(filp), filp->private_data);
+ trace_nfs_closedir_exit(file_inode(filp), file_dentry(filp), 0);
return 0;
}

@@ -943,8 +946,7 @@ static loff_t nfs_llseek_dir(struct file *filp, loff_t offset, int whence)
struct inode *inode = file_inode(filp);
struct nfs_open_dir_context *dir_ctx = filp->private_data;

- dfprintk(FILE, "NFS: llseek dir(%pD2, %lld, %d)\n",
- filp, offset, whence);
+ trace_nfs_llseek_dir_enter(inode, offset, whence);

switch (whence) {
default:
@@ -985,7 +987,7 @@ static int nfs_fsync_dir(struct file *filp, loff_t start, loff_t end,
{
struct inode *inode = file_inode(filp);

- dfprintk(FILE, "NFS: fsync dir(%pD2) datasync %d\n", filp, datasync);
+ trace_nfs_fsync_dir_enter(inode, start, end, datasync);

inode_lock(inode);
nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index e6a946b83330..0ed330f323bb 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -667,6 +667,79 @@
DEFINE_NFS_DIRECTORY_EVENT_DONE(nfs_unlink_exit);
DEFINE_NFS_DIRECTORY_EVENT(nfs_symlink_enter);
DEFINE_NFS_DIRECTORY_EVENT_DONE(nfs_symlink_exit);
+DEFINE_NFS_DIRECTORY_EVENT(nfs_opendir_enter);
+DEFINE_NFS_DIRECTORY_EVENT_DONE(nfs_opendir_exit);
+DEFINE_NFS_DIRECTORY_EVENT(nfs_closedir_enter);
+DEFINE_NFS_DIRECTORY_EVENT_DONE(nfs_closedir_exit);
+
+TRACE_EVENT(nfs_llseek_dir_enter,
+ TP_PROTO(
+ const struct inode *inode,
+ u64 offset,
+ int whence
+ ),
+
+ TP_ARGS(inode, offset, whence),
+
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(u32, fhandle)
+ __field(u64, fileid)
+ __field(u64, offset)
+ __field(unsigned long, whence)
+ ),
+
+ TP_fast_assign(
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->fileid = NFS_FILEID(inode);
+ __entry->fhandle = nfs_fhandle_hash(NFS_FH(inode));
+ __entry->offset = offset;
+ __entry->whence = whence;
+ ),
+
+ TP_printk(
+ "fileid=%02x:%02x:%llu handle=0x%08x offset=0x%08llx whence=%lu",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long long)__entry->fileid,
+ __entry->fhandle, __entry->offset, __entry->whence
+ )
+);
+
+TRACE_EVENT(nfs_fsync_dir_enter,
+ TP_PROTO(
+ const struct inode *inode,
+ u64 start,
+ u64 end,
+ int datasync
+ ),
+
+ TP_ARGS(inode, start, end, datasync),
+
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(u32, fhandle)
+ __field(u64, fileid)
+ __field(u64, start)
+ __field(u64, end)
+ __field(unsigned long, datasync)
+ ),
+
+ TP_fast_assign(
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->fileid = NFS_FILEID(inode);
+ __entry->fhandle = nfs_fhandle_hash(NFS_FH(inode));
+ __entry->start = start;
+ __entry->end = end;
+ __entry->datasync = datasync;
+ ),
+
+ TP_printk(
+ "fileid=%02x:%02x:%llu handle=0x%08x start=0x%08llx end=0x%08llx datasync=%lu",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long long)__entry->fileid,
+ __entry->fhandle, __entry->start, __entry->end, __entry->datasync
+ )
+);

TRACE_EVENT(nfs_readdir_enter,
TP_PROTO(
--
1.8.3.1

2020-11-02 13:53:22

by David Wysochanski

Subject: [PATCH 03/11] NFS: Move nfs_readdir_descriptor_t into internal header

In preparation for readdir tracing involving nfs_readdir_descriptor_t,
move the definition into the "internal.h" header file, and ensure
internal.h is included before nfstrace.h to avoid build problems.

Signed-off-by: Dave Wysochanski <[email protected]>
---
fs/nfs/dir.c | 18 ------------------
fs/nfs/internal.h | 18 ++++++++++++++++++
fs/nfs/nfs3xdr.c | 2 +-
3 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 116493e66922..145393188f6a 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -144,24 +144,6 @@ struct nfs_cache_array {
struct nfs_cache_array_entry array[];
};

-typedef struct {
- struct file *file;
- struct page *page;
- struct dir_context *ctx;
- unsigned long page_index;
- u64 *dir_cookie;
- u64 last_cookie;
- loff_t current_index;
- loff_t prev_index;
-
- unsigned long dir_verifier;
- unsigned long timestamp;
- unsigned long gencount;
- unsigned int cache_entry_index;
- bool plus;
- bool eof;
-} nfs_readdir_descriptor_t;
-
static
void nfs_readdir_init_array(struct page *page)
{
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 6673a77884d9..68ade987370d 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -62,6 +62,24 @@ static inline bool nfs_lookup_is_soft_revalidate(const struct dentry *dentry)
*/
#define NFS_MAX_READDIR_PAGES 8

+typedef struct {
+ struct file *file;
+ struct page *page;
+ struct dir_context *ctx;
+ unsigned long page_index;
+ u64 *dir_cookie;
+ u64 last_cookie;
+ loff_t current_index;
+ loff_t prev_index;
+
+ unsigned long dir_verifier;
+ unsigned long timestamp;
+ unsigned long gencount;
+ unsigned int cache_entry_index;
+ bool plus;
+ bool eof;
+} nfs_readdir_descriptor_t;
+
struct nfs_client_initdata {
unsigned long init_flags;
const char *hostname; /* Hostname of the server */
diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c
index 69971f6c840d..972759325de0 100644
--- a/fs/nfs/nfs3xdr.c
+++ b/fs/nfs/nfs3xdr.c
@@ -21,8 +21,8 @@
#include <linux/nfs3.h>
#include <linux/nfs_fs.h>
#include <linux/nfsacl.h>
-#include "nfstrace.h"
#include "internal.h"
+#include "nfstrace.h"

#define NFSDBG_FACILITY NFSDBG_XDR

--
1.8.3.1

2020-11-02 13:54:23

by David Wysochanski

Subject: [PATCH 07/11] NFS: Add tracepoint to entry and exit of nfs_do_filldir

Add tracepoints to nfs_do_filldir and remove one more dfprintk.

Signed-off-by: Dave Wysochanski <[email protected]>
---
fs/nfs/dir.c | 4 ++--
fs/nfs/nfstrace.h | 2 ++
2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index d1e9ba28c4a0..d8c3c3fdea75 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -794,6 +794,7 @@ int nfs_do_filldir(nfs_readdir_descriptor_t *desc)
struct nfs_cache_array *array = NULL;
struct nfs_open_dir_context *ctx = file->private_data;

+ trace_nfs_do_filldir_enter(desc);
array = kmap(desc->page);
for (i = desc->cache_entry_index; i < array->size; i++) {
struct nfs_cache_array_entry *ent;
@@ -819,8 +820,7 @@ int nfs_do_filldir(nfs_readdir_descriptor_t *desc)
desc->eof = true;

kunmap(desc->page);
- dfprintk(DIRCACHE, "NFS: nfs_do_filldir() filling ended @ cookie %Lu; returning = %d\n",
- (unsigned long long)*desc->dir_cookie, res);
+ trace_nfs_do_filldir_exit(desc, res);
return res;
}

diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 5dbadd2718e3..e28551f70eab 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -928,6 +928,8 @@
DEFINE_NFS_READDIR_DESCRIPTOR_EVENT_EXIT(nfs_readdir_search_array_exit);
DEFINE_NFS_READDIR_DESCRIPTOR_EVENT(nfs_readdir_xdr_filler_enter);
DEFINE_NFS_READDIR_DESCRIPTOR_EVENT_EXIT(nfs_readdir_xdr_filler_exit);
+DEFINE_NFS_READDIR_DESCRIPTOR_EVENT(nfs_do_filldir_enter);
+DEFINE_NFS_READDIR_DESCRIPTOR_EVENT_EXIT(nfs_do_filldir_exit);

TRACE_EVENT(nfs_link_enter,
TP_PROTO(
--
1.8.3.1

2020-11-02 14:29:57

by Chuck Lever

Subject: Re: [PATCH 00/11] Add NFS readdir tracepoints and improve performance of reading directories



> On Nov 2, 2020, at 8:50 AM, Dave Wysochanski <[email protected]> wrote:
>
> Note these patches are also on github at:
> https://github.com/DaveWysochanskiRH/kernel/tree/5.9-nfs-readdir
>
> This patchset is mostly tracepoints but patch 9 fixes a performance
> problem with NFSv4.x when listing a directory that is being modified.
> After the NFSv4.x patch, for consistency, patch 11 is included which
> makes NFSv3 behave the same as NFSv4, and given the inclusion of
> patch 9 does not re-introduce the regression fixed in previous commit
> 79f687a3de9e. As described in patch 11, NFSv3 does not currently have
> the NFSv4 problem because it does not drop the pagecache when a
> large directory read is in progress.

Hi Dave-

Replacing our community's deep dependence on wire capture is a good
thing.

My initial concern with the tracepoint patches is that there are
so many new tracepoints. In cases where we might want to enable all
of them on an intensive workload, that generates an enormous amount
of tracepoint traffic, which might overwhelm local file storage or
analysis tools.

Also, much of this infrastructure is useful for developers, but
how much will be used in the field? The point of adding permanent
tracepoints should be to capture just enough to enable sustaining
engineers to diagnose problems remotely. Developers can add very
specific tracepoints as they need to, but also they can use tools
like systemtap or eBPF.

Boiled down, I'm wondering if there's a way to prune some of this.
Are the dprintk call sites you're replacing all that useful, for
instance?

And: Will it be easy to backport the fixes at the end? Maybe those
should go earlier in the patch set.



--
Chuck Lever



2020-11-02 15:09:06

by David Wysochanski

Subject: Re: [PATCH 00/11] Add NFS readdir tracepoints and improve performance of reading directories

On Mon, Nov 2, 2020 at 9:29 AM Chuck Lever <[email protected]> wrote:
>
>
>
> > On Nov 2, 2020, at 8:50 AM, Dave Wysochanski <[email protected]> wrote:
> >
> > Note these patches are also on github at:
> > https://github.com/DaveWysochanskiRH/kernel/tree/5.9-nfs-readdir
> >
> > This patchset is mostly tracepoints but patch 9 fixes a performance
> > problem with NFSv4.x when listing a directory that is being modified.
> > After the NFSv4.x patch, for consistency, patch 11 is included which
> > makes NFSv3 behave the same as NFSv4, and given the inclusion of
> > patch 9 does not re-introduce the regression fixed in previous commit
> > 79f687a3de9e. As described in patch 11, NFSv3 does not currently have
> > the NFSv4 problem because it does not drop the pagecache when a
> > large directory read is in progress.
>
> Hi Dave-
>
> Replacing our community's deep dependence on wire capture is a good
> thing.
>
Chuck, I agree and thanks for your early response here.

> My initial concern with the tracepoint patches is that there are
> so many new tracepoints. In cases where we might want to enable all
> of them on an intensive workload, that generates an enormous amount
> of tracepoint traffic, which might overwhelm local file storage or
> analysis tools.
>
> Also, much of this infrastructure is useful for developers, but
> how much will be used in the field? The point of adding permanent
> tracepoints should be to capture just enough to enable sustaining
> engineers to diagnose problems remotely. Developers can add very
> specific tracepoints as they need to, but also they can use tools
> like systemtap or eBPF.
>
> Boiled down, I'm wondering if there's a way to prune some of this.
> Are the dprintk call sites you're replacing all that useful, for
> instance?
>
Yes, this is a good point. I struggled with this too; I actually had even
more tracepoints than are included here, and I even looked at the size of
nfs.ko before and after (it is about a 22k difference):
# size /lib/modules/$(uname -r)/kernel/fs/nfs/nfs.ko
   text    data    bss     dec    hex filename
 250820   67272    408  318500  4dc24 /lib/modules/5.9.0-nfs-readdir+/kernel/fs/nfs/nfs.ko
# size /lib/modules/5.9.0/kernel/fs/nfs/nfs.ko
   text    data    bss     dec    hex filename
 240232   55496    408  296136  484c8 /lib/modules/5.9.0/kernel/fs/nfs/nfs.ko

I can surely take another look at this and see which are the most
important, but off the top of my head, I'm thinking nfs_readdir() entry
and exit, tracing the actual READDIR ops, and probably something for
searching the NFS readdir cache. Are there tracepoints that look
unnecessary to you, or are you mainly asking me for more justification
for each one?

> And: Will it be easy to backport the fixes at the end? Maybe those
> should go earlier in the patch set.
>
Yes, it could be done that way. I also thought of separating the tracing
patches into one set and the fixes into another, but decided on a single
posting, mainly because the intro patch ties everything together.

FWIW, there is only one tiny conflict on 5.9 with the two fixes, and
it's with patch 9:
863c92f4328a... NFS: Improve performance of listing directories being modified
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@@ -949,7 -941,9 +952,13 @@@ static int nfs_readdir(struct file *fil
out:
if (res > 0)
res = 0;
++<<<<<<< HEAD
+ dfprintk(FILE, "NFS: readdir(%pD2) returns %d\n", file, res);
++=======
+ dir_ctx->page_index = desc->page_index;
+ trace_nfs_readdir_exit(inode, ctx->pos, dir_ctx->dir_cookie,
+ NFS_SERVER(inode)->dtsize, my_desc.plus, res);
++>>>>>>> 863c92f4328a... NFS: Improve performance of listing directories being modified
return res;
}



2020-11-02 15:16:43

by Chuck Lever

Subject: Re: [PATCH 00/11] Add NFS readdir tracepoints and improve performance of reading directories



> On Nov 2, 2020, at 10:07 AM, David Wysochanski <[email protected]> wrote:
>
> On Mon, Nov 2, 2020 at 9:29 AM Chuck Lever <[email protected]> wrote:
>>
>>
>>
>>> On Nov 2, 2020, at 8:50 AM, Dave Wysochanski <[email protected]> wrote:
>>>
>>> Note these patches are also on github at:
>>> https://github.com/DaveWysochanskiRH/kernel/tree/5.9-nfs-readdir
>>>
>>> This patchset is mostly tracepoints but patch 9 fixes a performance
>>> problem with NFSv4.x when listing a directory that is being modified.
>>> After the NFSv4.x patch, for consistency, patch 11 is included which
>>> makes NFSv3 behave the same as NFSv4, and given the inclusion of
>>> patch 9 does not re-introduce the regression fixed in previous commit
>>> 79f687a3de9e. As described in patch 11, NFSv3 does not currently have
>>> the NFSv4 problem because it does not drop the pagecache when a
>>> large directory read is in progress.
>>
>> Hi Dave-
>>
>> Replacing our community's deep dependence on wire capture is a good
>> thing.
>>
> Chuck, I agree and thanks for your early response here.
>
>> My initial concern with the tracepoint patches is that there are
>> so many new tracepoints. In cases where we might want to enable all
>> of them on an intensive workload, that generates an enormous amount
>> of tracepoint traffic, which might overwhelm local file storage or
>> analysis tools.
>>
>> Also, much of this infrastructure is useful for developers, but
>> how much will be used in the field? The point of adding permanent
>> tracepoints should be to capture just enough to enable sustaining
>> engineers to diagnose problems remotely. Developers can add very
>> specific tracepoints as they need to, but also they can use tools
>> like systemtap or eBPF.
>>
>> Boiled down, I'm wondering if there's a way to prune some of this.
>> Are the dprintk call sites you're replacing all that useful, for
>> instance?
>>
> Yes this is a good point. I struggled with this too, I actually had even
> more tracepoints than was included here, and even looked
> at the size of nfs.ko before and after (it is about 22k difference).
> # size /lib/modules/$(uname -r)/kernel/fs/nfs/nfs.ko
> text data bss dec hex filename
> 250820 67272 408 318500 4dc24
> /lib/modules/5.9.0-nfs-readdir+/kernel/fs/nfs/nfs.ko
> # size /lib/modules/5.9.0/kernel/fs/nfs/nfs.ko
> text data bss dec hex filename
> 240232 55496 408 296136 484c8 /lib/modules/5.9.0/kernel/fs/nfs/nfs.ko
>
> I can surely take another look at this and see what are the most
> important, but off the top of my head, I'm thinking nfs_readdir() entry
> and exit, as well as tracing the actual READDIR ops, probably something
> searching the NFS readdir cache. Are there tracepoints that look
> unnecessary to you, or are you mainly asking me for more justification
> for each one?

Entry and exit tracepoints are usually easily replaced by
"function_graph" or "function" tracing or with eBPF scripts.
So those especially would need some justification for keeping
them in the form of permanent tracepoints.

IMO any place where the tracepoint is simply "I got here" is
a place where I would raise a question of whether it is needed
in common case troubleshooting.
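For example, the entry/exit pair around nfs_readdir() can usually be
observed without a dedicated tracepoint; a sketch with the function_graph
tracer, where /mnt/nfs stands in for an NFS mount and nfs_readdir is
assumed to be visible in available_filter_functions:
trace-cmd record -p function_graph -g nfs_readdir -F ls -1f /mnt/nfs
trace-cmd report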



--
Chuck Lever



2020-11-02 15:59:54

by Mkrtchyan, Tigran

Subject: Re: [PATCH 00/11] Add NFS readdir tracepoints and improve performance of reading directories


Hi Dave,

thanks for addressing this issue!

Readdir on a large directory that is constantly being modified is a pain in the b...
We have solved it by creating a snapshot of the directory on the server and
identifying that snapshot by the cookie verifier. This is probably not POSIX
compliant, but it works for us.

I am happy to run some tests.

Tigran.


----- Original Message -----
> From: "David Wysochanski" <[email protected]>
> To: "trondmy" <[email protected]>, "Anna Schumaker" <[email protected]>
> Cc: "linux-nfs" <[email protected]>
> Sent: Monday, 2 November, 2020 14:50:00
> Subject: [PATCH 00/11] Add NFS readdir tracepoints and improve performance of reading directories
