2019-12-20 04:02:04

by Dai Ngo

Subject: Re: 'ls -lrt' performance issue on large dir while dir is being modified

Hi Anna, Trond,

I made a mistake with the 5.5 numbers. The VM that runs 5.5 has some
problems. There is no regression with 5.5, here are the new numbers:

Upstream Linux 5.5.0-rc1 [ORI] 93296: 3m10.917s   197891: 10m35.789s
Upstream Linux 5.5.0-rc1 [MOD] 98614: 1m59.649s   192801: 3m55.003s

My apologies for the mistake.

Now that there is no regression with 5.5, I'd like to get your opinion
regarding the change to revert the call from invalidate_mapping_pages
back to nfs_zap_mapping in nfs_force_use_readdirplus, to prevent the
current 'ls' from restarting the READDIRPLUS3 from cookie 0. I'm
not quite sure about the intention of the prior change from
nfs_zap_mapping to invalidate_mapping_pages, so that is why I'm
seeking advice. Or do you have any suggestions to achieve the same?
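
For context, the practical difference between the two calls is what this
thread turns on: invalidate_mapping_pages() drops the directory's cached
pages immediately, while nfs_zap_mapping() only flags the mapping as
invalid, deferring the flush until the mapping is next revalidated
(e.g. on the next opendir()/rewinddir()). A rough sketch of the deferred
variant, approximating the fs/nfs/inode.c of this era (illustrative, not
verbatim source):

    void nfs_zap_mapping(struct inode *inode, struct address_space *mapping)
    {
            /* Defer the flush: mark the cached data invalid and let
             * nfs_revalidate_mapping() drop the pages at the next open. */
            if (mapping->nrpages != 0) {
                    spin_lock(&inode->i_lock);
                    NFS_I(inode)->cache_validity |= NFS_INO_INVALID_DATA;
                    spin_unlock(&inode->i_lock);
            }
    }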

Thanks,
-Dai

On 12/17/19 4:34 PM, Dai Ngo wrote:
> Hi,
>
> I'd like to report an issue where 'ls -lrt' on an NFSv3 client takes
> a very long time to display the contents of a large directory
> (100k - 200k files) while the directory is being modified by
> another NFSv3 client.
>
> The problem can be reproduced using 3 systems. One system serves
> as the NFS server, one runs the client doing the 'ls -lrt', and
> another runs the client that creates files on the server.
> Client1 creates files using this simple script:
>
>> #!/bin/sh
>>
>> if [ $# -lt 2 ]; then
>>         echo "Usage: $0 number_of_files base_filename"
>>         exit
>> fi
>> nfiles=$1
>> fname=$2
>> echo "creating $nfiles files using filename[$fname]..."
>> i=0
>> while [ $i -lt $nfiles ]
>> do
>>         i=`expr $i + 1`
>>         echo "xyz" > $fname$i
>>         echo "$fname$i"
>> done
>
> Client2 runs 'time ls -lrt /tmp/mnt/bd1 |wc -l' in a loop.
>
> The network traces and dtrace probes showed numerous READDIRPLUS3
> requests restarting from cookie 0, which seemed to indicate that the
> cached pages of the directory were invalidated, causing the pages
> to be refilled starting from cookie 0 up to the currently requested
> cookie.  The cached page invalidation was traced to
> nfs_force_use_readdirplus().  To verify, I made the modification
> below, ran the test on various kernel versions, and captured the
> results shown below.
>
> The modification is:
>
>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>> index a73e2f8bd8ec..5d4a64555fa7 100644
>> --- a/fs/nfs/dir.c
>> +++ b/fs/nfs/dir.c
>> @@ -444,7 +444,7 @@ void nfs_force_use_readdirplus(struct inode *dir)
>>      if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
>>          !list_empty(&nfsi->open_files)) {
>>          set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
>> -        invalidate_mapping_pages(dir->i_mapping, 0, -1);
>> +        nfs_zap_mapping(dir, dir->i_mapping);
>>      }
>>  }
>
> Note that after this change, I did not see READDIRPLUS3 restarting
> with cookie 0 anymore.
>
> Below are the summary results of 'ls -lrt'.  For each kernel version
> compared, there is one row for the original kernel and one row for the
> kernel with the above modification.
>
> I cloned dtrace-linux from here:
> github.com/oracle/dtrace-linux-kernel
>
> dtrace-linux 5.1.0-rc4 [ORI] 89191: 2m59.32s   193071: 6m7.810s
> dtrace-linux 5.1.0-rc4 [MOD] 98771: 1m55.900s  191322: 3m48.668s
>
> I cloned upstream Linux from here:
> git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>
> Upstream Linux 5.5.0-rc1 [ORI] 87891: 5m11.089s  160974: 14m4.384s
> Upstream Linux 5.5.0-rc1 [MOD] 87075: 5m2.057s   161421: 14m33.615s
>
> Please note that these are relative performance numbers and are used
> to illustrate the issue only.
>
> For reference, on the original dtrace-linux it takes about 9s for
> 'ls -ltr' to complete on a directory with 200k files if the directory
> is not modified while 'ls' is running.
>
> The numbers for the original upstream Linux are *really* bad, and the
> modification did not seem to have any effect; I'm not sure why...
> it could be that something else is going on here.
>
> The cache invalidation in nfs_force_use_readdirplus seems too
> drastic and might need to be reviewed. Even though this change
> helps, it did not get the 'ls' performance to where it's
> expected to be. I think that even though READDIRPLUS3 was used, the
> attribute cache was invalidated due to the directory modification,
> causing attribute cache misses and resulting in the calls to
> nfs_force_use_readdirplus shown in this stack trace:
>
>   0  17586     page_cache_tree_delete:entry
>               vmlinux`remove_mapping+0x14
>               vmlinux`invalidate_inode_page+0x7c
>               vmlinux`invalidate_mapping_pages+0x1dd
>               nfs`nfs_force_use_readdirplus+0x47
>               nfs`__dta_nfs_lookup_revalidate_478+0x5dd
>               vmlinux`d_revalidate.part.24+0x10
>               vmlinux`lookup_fast+0x254
>               vmlinux`walk_component+0x49
>               vmlinux`path_lookupat+0x79
>               vmlinux`filename_lookup+0xaf
>               vmlinux`user_path_at_empty+0x36
>               vmlinux`vfs_statx+0x77
>               vmlinux`SYSC_newlstat+0x3d
>               vmlinux`SyS_newlstat+0xe
>               vmlinux`do_syscall_64+0x79
>               vmlinux`entry_SYSCALL_64+0x18d
>
> Besides the overhead of refilling the page cache from cookie 0,
> I think the reason 'ls' still takes so long to complete is that the
> client has to send a bunch of additional LOOKUP/ACCESS requests
> over the wire to service the stat(2) calls from 'ls', due to the
> attribute cache misses.
>
> Please let me know what you think and whether any additional
> information is needed.
>
> Thanks,
> -Dai
>
>


2020-01-15 18:11:44

by Dai Ngo

Subject: Re: 'ls -lrt' performance issue on large dir while dir is being modified

Hi Anna, Trond,

Would you please let me know your opinion regarding reverting the change in
nfs_force_use_readdirplus to call nfs_zap_mapping instead of invalidate_mapping_pages?
This change prevents the READDIRPLUS cookie from being reset to 0 while
an instance of 'ls' is running and the directory is being modified.

> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index a73e2f8bd8ec..5d4a64555fa7 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -444,7 +444,7 @@ void nfs_force_use_readdirplus(struct inode *dir)
>      if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
>          !list_empty(&nfsi->open_files)) {
>          set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
> -        invalidate_mapping_pages(dir->i_mapping, 0, -1);
> +        nfs_zap_mapping(dir, dir->i_mapping);
>      }
>  }


Thanks,
-Dai

On 12/19/19 8:01 PM, Dai Ngo wrote:
> [...]

2020-01-15 18:54:48

by Trond Myklebust

Subject: Re: 'ls -lrt' performance issue on large dir while dir is being modified

On Wed, 2020-01-15 at 10:11 -0800, Dai Ngo wrote:
> Hi Anna, Trond,
>
> Would you please let me know your opinion regarding reverting the
> change in nfs_force_use_readdirplus to call nfs_zap_mapping instead
> of invalidate_mapping_pages?
> This change prevents the READDIRPLUS cookie from being reset to 0
> while an instance of 'ls' is running and the directory is being
> modified.
>
> > [...]

This change is only reverting part of commit 79f687a3de9e. My problem
with that is as follows:

RFC1813 states that NFSv3 READDIRPLUS cookies and verifiers must match
those returned by previous READDIRPLUS calls, and READDIR cookies and
verifiers must match those returned by previous READDIR calls. It says
nothing about being able to assume cookies from READDIR and READDIRPLUS
calls are interchangeable. So the only reason I can see for the
invalidate_mapping_pages() is to ensure that we keep the two
cookie caches separate.

OTOH, for NFSv4, there is no separate READDIRPLUS function, so there
really does not appear to be any reason to clear the page cache at all
when we're switching between requesting attributes or not.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2020-01-15 19:06:38

by Trond Myklebust

Subject: Re: 'ls -lrt' performance issue on large dir while dir is being modified

On Wed, 2020-01-15 at 18:54 +0000, Trond Myklebust wrote:
> [...]
>
> This change is only reverting part of commit 79f687a3de9e. My problem
> with that is as follows:
>
> RFC1813 states that NFSv3 READDIRPLUS cookies and verifiers must match
> those returned by previous READDIRPLUS calls, and READDIR cookies and
> verifiers must match those returned by previous READDIR calls. It says
> nothing about being able to assume cookies from READDIR and READDIRPLUS
> calls are interchangeable. So the only reason I can see for the
> invalidate_mapping_pages() is to ensure that we keep the two
> cookie caches separate.
>
> OTOH, for NFSv4, there is no separate READDIRPLUS function, so there
> really does not appear to be any reason to clear the page cache at all
> when we're switching between requesting attributes or not.
>

Sorry... To spell out my objection to this change more clearly: The
call to nfs_zap_mapping() makes no sense in either case.
* It defers the cache invalidation until the next call to
rewinddir()/opendir(), so it does not address the NFSv3 concern.
* It would appear to be entirely superfluous for the NFSv4 case.

So a change that might be acceptable would be to keep the existing call
to invalidate_mapping_pages() for NFSv3, but to remove it for NFSv4.
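
For concreteness, a rough sketch of what that suggested change might look
like in nfs_force_use_readdirplus(); the NFS_PROTO(dir)->version check is
just one assumed way to distinguish v3 from v4 here, not necessarily what
a final patch would use:

    void nfs_force_use_readdirplus(struct inode *dir)
    {
            struct nfs_inode *nfsi = NFS_I(dir);

            if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
                !list_empty(&nfsi->open_files)) {
                    set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
                    /* NFSv3: READDIR and READDIRPLUS cookie caches must not
                     * be mixed, so drop the directory's page cache when we
                     * switch. NFSv4 has a single READDIR operation, so the
                     * invalidation is assumed unnecessary there. */
                    if (NFS_PROTO(dir)->version == 3)
                            invalidate_mapping_pages(dir->i_mapping, 0, -1);
            }
    }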

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2020-01-15 19:30:35

by Dai Ngo

Subject: Re: 'ls -lrt' performance issue on large dir while dir is being modified

On 1/15/20 11:06 AM, Trond Myklebust wrote:
> On Wed, 2020-01-15 at 18:54 +0000, Trond Myklebust wrote:
>> [...]
>> This change is only reverting part of commit 79f687a3de9e. My problem
>> with that is as follows:
>>
>> RFC1813 states that NFSv3 READDIRPLUS cookies and verifiers must match
>> those returned by previous READDIRPLUS calls, and READDIR cookies and
>> verifiers must match those returned by previous READDIR calls. It says
>> nothing about being able to assume cookies from READDIR and READDIRPLUS
>> calls are interchangeable. So the only reason I can see for the
>> invalidate_mapping_pages() is to ensure that we keep the two
>> cookie caches separate.
>>
>> OTOH, for NFSv4, there is no separate READDIRPLUS function, so there
>> really does not appear to be any reason to clear the page cache at all
>> when we're switching between requesting attributes or not.
>>
> Sorry... To spell out my objection to this change more clearly: The
> call to nfs_zap_mapping() makes no sense in either case.
> * It defers the cache invalidation until the next call to
> rewinddir()/opendir(), so it does not address the NFSv3 concern.
> * It would appear to be entirely superfluous for the NFSv4 case.
>
> So a change that might be acceptable would be to keep the existing call
> to invalidate_mapping_pages() for NFSv3, but to remove it for NFSv4.

Thank you Trond, I'll make your suggested change, test it and resubmit.

-Dai

>

2020-01-18 02:32:52

by Dai Ngo

Subject: Re: 'ls -lrt' performance issue on large dir while dir is being modified

Hi Trond,

On 1/15/20 11:06 AM, Trond Myklebust wrote:
> On Wed, 2020-01-15 at 18:54 +0000, Trond Myklebust wrote:
>> On Wed, 2020-01-15 at 10:11 -0800, Dai Ngo wrote:
>>> [...]
>> This change is only reverting part of commit 79f687a3de9e. My problem
>> with that is as follows:
>>
>> RFC1813 states that NFSv3 READDIRPLUS cookies and verifiers must match
>> those returned by previous READDIRPLUS calls, and READDIR cookies and
>> verifiers must match those returned by previous READDIR calls. It says
>> nothing about being able to assume cookies from READDIR and READDIRPLUS
>> calls are interchangeable. So the only reason I can see for the
>> invalidate_mapping_pages() is to ensure that we keep the two
>> cookie caches separate.

If I understand your concern correctly - that in NFSv3 the client must
maintain valid cookies and cookie verifiers when switching between
READDIR and READDIRPLUS, or vice versa - then I think the current client
code handles this condition ok.

On the client, both READDIR and READDIRPLUS requests use the cookie values
from the same cached pages of the directory, so I don't think they can get
out of sync when the client switches between READDIRPLUS and READDIR
requests across different nfs_readdir calls.

In fact, currently the first nfs_readdir uses READDIRPLUS to read
the entries, and if there is no LOOKUP/GETATTR on any of the directory
entries then the client reverts to READDIR for subsequent nfs_readdir
calls without invalidating any cached pages of the directory. If there
is a LOOKUP/GETATTR on one of the directory entries then
nfs_advise_use_readdirplus is called, which forces the client to use
READDIRPLUS again for the next nfs_readdir.

If the user mounts the export with the 'nordirplus' option, the
client uses only READDIR for all requests; no switching takes place.
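
For reference, a condensed view of the heuristic described above,
paraphrased from the nfs_use_readdirplus() logic in fs/nfs/dir.c of this
era (simplified control flow, illustrative rather than verbatim source):

    static bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
    {
            /* the 'nordirplus' mount option clears NFS_CAP_READDIRPLUS */
            if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
                    return false;
            /* a recent stat()/lookup on an entry advised using plus */
            if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS,
                                   &NFS_I(dir)->flags))
                    return true;
            /* first getdents() after opendir()/rewinddir() */
            if (ctx->pos == 0)
                    return true;
            return false;
    }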


Thanks,
-Dai

>

2020-01-18 15:59:23

by Trond Myklebust

Subject: Re: 'ls -lrt' performance issue on large dir while dir is being modified

On Fri, 2020-01-17 at 18:29 -0800, Dai Ngo wrote:
> Hi Trond,
>
> [...]
>
> If I understand your concern correctly - that in NFSv3 the client
> must maintain valid cookies and cookie verifiers when switching
> between READDIR and READDIRPLUS, or vice versa - then I think the
> current client code handles this condition ok.
>
> On the client, both READDIR and READDIRPLUS requests use the cookie
> values from the same cached pages of the directory, so I don't think
> they can get out of sync when the client switches between READDIRPLUS
> and READDIR requests across different nfs_readdir calls.
>
> In fact, currently the first nfs_readdir uses READDIRPLUS to read
> the entries, and if there is no LOOKUP/GETATTR on any of the
> directory entries then the client reverts to READDIR for subsequent
> nfs_readdir calls without invalidating any cached pages of the
> directory. If there is a LOOKUP/GETATTR on one of the directory
> entries then nfs_advise_use_readdirplus is called, which forces the
> client to use READDIRPLUS again for the next nfs_readdir.
>
> If the user mounts the export with the 'nordirplus' option, the
> client uses only READDIR for all requests; no switching takes place.


I don't understand your point. The issue is that
nfs_advise_use_readdirplus() can cause the behaviour to switch between
use of READDIRPLUS and use of READDIR from one getdents() syscall to
the next.
If the client is using the same page cache across those syscalls, then
it will end up caching a mixture of cookies. Furthermore, since the
cookie that is used as an argument to the next call to
READDIR/READDIRPLUS is taken from that page cache, we can end up
calling READDIRPLUS with a cookie that came from READDIR, and vice
versa.

As I said, I'm not convinced that is legal in RFC1813 (NFSv3).

That is why we want to clear the page cache when we swap between use of
READDIR and use of READDIRPLUS for the case of NFSv3.

>
> Thanks,
> -Dai
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2020-01-18 17:26:43

by Chuck Lever III

Subject: Re: 'ls -lrt' performance issue on large dir while dir is being modified



> On Jan 18, 2020, at 10:58 AM, Trond Myklebust <[email protected]> wrote:
>
> On Fri, 2020-01-17 at 18:29 -0800, Dai Ngo wrote:
>> Hi Trond,
>>
>>> [...]
>>
>> If I understand your concern correctly - that in NFSv3 the client
>> must maintain valid cookies and cookie verifiers when switching
>> between READDIR and READDIRPLUS, or vice versa - then I think the
>> current client code handles this condition ok.
>>
>> On the client, both READDIR and READDIRPLUS requests use the cookie
>> values from the same cached pages of the directory, so I don't think
>> they can get out of sync when the client switches between
>> READDIRPLUS and READDIR requests across different nfs_readdir calls.
>>
>> In fact, currently the first nfs_readdir uses READDIRPLUS to read
>> the entries, and if there is no LOOKUP/GETATTR on any of the
>> directory entries then the client reverts to READDIR for subsequent
>> nfs_readdir calls without invalidating any cached pages of the
>> directory. If there is a LOOKUP/GETATTR on one of the directory
>> entries then nfs_advise_use_readdirplus is called, which forces the
>> client to use READDIRPLUS again for the next nfs_readdir.
>>
>> If the user mounts the export with the 'nordirplus' option, the
>> client uses only READDIR for all requests; no switching takes place.
>
>
> I don't understand your point.

The original point was that the directory's page cache seems to
be cleared a little too often (quite apart from switching between
READDIRPLUS and READDIR).

I think Dai is saying that cache clearing is appropriate to defer
when the directory's mtime has changed but the READ method remains
the same. Otherwise repeatedly adding a new file to a very large
directory that is being read can trigger a situation where the
reading getdents loop never completes.

My two cents Euro.


> The issue is that
> nfs_advise_use_readdirplus() can cause the behaviour to switch between
> use of READDIRPLUS and use of READDIR from one getdents() syscall to
> the next.
> If the client is using the same page cache across those syscalls, then
> it will end up caching a mixture of cookies. Furthermore, since the
> cookie that is used as an argument to the next call to
> READDIR/READDIRPLUS is taken from that page cache, we can end up
> calling READDIRPLUS with a cookie that came from READDIR, and vice
> versa.
>
> As I said, I'm not convinced that is legal in RFC1813 (NFSv3).
>
> That is why we want to clear the page cache when we swap between use of
> READDIR and use of READDIRPLUS for the case of NFSv3.

Just curious, are you aware of an NFSv3 server implementation that
would have a problem with a client that mixes the cookies?



--
Chuck Lever



2020-01-18 17:32:15

by Trond Myklebust

Subject: Re: 'ls -lrt' performance issue on large dir while dir is being modified

On Sat, 2020-01-18 at 12:26 -0500, Chuck Lever wrote:
> > On Jan 18, 2020, at 10:58 AM, Trond Myklebust <
> > [email protected]> wrote:
> >
> > On Fri, 2020-01-17 at 18:29 -0800, Dai Ngo wrote:
> > > Hi Trond,
> > >
> > > On 1/15/20 11:06 AM, Trond Myklebust wrote:
> > > > On Wed, 2020-01-15 at 18:54 +0000, Trond Myklebust wrote:
> > > > > On Wed, 2020-01-15 at 10:11 -0800, Dai Ngo wrote:
> > > > > > Hi Anna, Trond,
> > > > > >
> > > > > > Would you please let me know your opinion regarding
> > > > > > reverting
> > > > > > the
> > > > > > change in
> > > > > > nfs_force_use_readdirplus to call nfs_zap_mapping instead
> > > > > > of
> > > > > > invalidate_mapping_pages.
> > > > > > This change is to prevent the cookie of the READDIRPLUS to
> > > > > > be
> > > > > > reset
> > > > > > to 0 while
> > > > > > an instance of 'ls' is running and the directory is being
> > > > > > modified.
> > > > > >
> > > > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index
> > > > > > > a73e2f8bd8ec..5d4a64555fa7 100644 --- a/fs/nfs/dir.c +++
> > > > > > > b/fs/nfs/dir.c @@ -444,7 +444,7 @@ void
> > > > > > > nfs_force_use_readdirplus(struct inode *dir) if
> > > > > > > (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
> > > > > > > !list_empty(&nfsi->open_files)) {
> > > > > > > set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags); -
> > > > > > > invalidate_mapping_pages(dir->i_mapping, 0, -1); +
> > > > > > > nfs_zap_mapping(dir, dir->i_mapping); } }
> > > > > > Thanks,
> > > > > > -Dai
> > > > > >
> > > > > This change is only reverting part of commit 79f687a3de9e. My
> > > > > problem with that is as follows:
> > > > >
> > > > > RFC1813 states that NFSv3 READDIRPLUS cookies and verifiers must
> > > > > match those returned by previous READDIRPLUS calls, and READDIR
> > > > > cookies and verifiers must match those returned by previous
> > > > > READDIR calls. It says nothing about being able to assume cookies
> > > > > from READDIR and READDIRPLUS calls are interchangeable. So the
> > > > > only reason I can see for the invalidate_mapping_pages() is to
> > > > > ensure that we do separate the two cookie caches.
> > >
> > > If I understand your concern correctly that in NFSv3 the client
> > > must maintain valid cookies and cookie verifiers when switching
> > > between READDIR and READDIRPLUS, or vice versa, then I think the
> > > current client code handles this condition OK.
> > >
> > > On the client, both READDIR and READDIRPLUS requests use the cookie
> > > values from the same cached pages of the directory, so I don't think
> > > they can be out of sync when the client switches between READDIRPLUS
> > > and READDIR requests for different nfs_readdir calls.
> > >
> > > In fact, currently the first nfs_readdir uses READDIRPLUS to read
> > > the entries, and if there is no LOOKUP/GETATTR on one of the
> > > directory entries then the client reverts to READDIR for subsequent
> > > nfs_readdir calls without invalidating any cached pages of the
> > > directory. If there is a LOOKUP/GETATTR done on one of the directory
> > > entries then nfs_advise_use_readdirplus is called, which forces the
> > > client to use READDIRPLUS again for the next nfs_readdir.
> > >
> > > Unless the user mounts the export with the 'nordirplus' option, in
> > > which case the client uses only READDIRs for all requests and no
> > > switching takes place.
> >
> > I don't understand your point.
>
> The original point was that the directory's page cache seems to
> be cleared a little too often (quite apart from switching between
> READDIRPLUS and READDIR).
>
> I think Dai is saying that it is appropriate to defer the cache
> clearing when the directory's mtime has changed but the READ method
> remains the same. Otherwise, repeatedly adding a new file to a very
> large directory that is being read can trigger a situation where the
> reading getdents loop never completes.
>
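
For illustration, the reading loop Chuck refers to is the standard
getdents64() pattern; the minimal userspace sketch below is hypothetical
code, not code from this thread. If the NFS client drops the directory's
page cache between iterations, every refill restarts from cookie 0, so a
directory that keeps growing can keep this loop from ever completing.

#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Walk a directory with raw getdents64 calls, the loop behind 'ls'. */
int main(int argc, char **argv)
{
	char buf[32768];
	long n, count = 0;
	int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);

	if (fd < 0)
		return 1;
	while ((n = syscall(SYS_getdents64, fd, buf, sizeof(buf))) > 0) {
		for (long off = 0; off < n; ) {
			struct dirent64 *d = (struct dirent64 *)(buf + off);

			count++;	/* 'ls -l' would stat each entry here */
			off += d->d_reclen;
		}
	}
	printf("%ld entries\n", count);
	close(fd);
	return n < 0 ? 1 : 0;
}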

Fair enough, but the section of code he was touching with his patch
should be only about switching between READDIR/PLUS methods.

The mtime monitoring happens elsewhere and is already set up to
invalidate the cache only on opendir()/rewinddir().
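
For context, the practical difference between the two calls in the patch,
written out as a rough sketch (this paraphrases the behaviour described in
the thread, not the exact upstream function bodies):

/*
 * invalidate_mapping_pages(dir->i_mapping, 0, -1) drops the directory's
 * clean page-cache pages immediately, so an in-flight 'ls' has to
 * refill the cache starting from cookie 0.
 *
 * nfs_zap_mapping(), by contrast, only marks the mapping invalid; the
 * pages survive until the next nfs_revalidate_mapping(), i.e. until a
 * later opendir()/rewinddir().  Roughly:
 */
static void nfs_zap_mapping_sketch(struct inode *dir,
				   struct address_space *mapping)
{
	if (mapping->nrpages != 0) {
		spin_lock(&dir->i_lock);
		NFS_I(dir)->cache_validity |= NFS_INO_INVALID_DATA;
		spin_unlock(&dir->i_lock);
	}
}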

> My two cents Euro.
>
>
> > The issue is that nfs_advise_use_readdirplus() can cause the
> > behaviour to switch between use of READDIRPLUS and use of READDIR
> > from one getdents() syscall to the next.
> > If the client is using the same page cache across those syscalls,
> > then it will end up caching a mixture of cookies. Furthermore, since
> > the cookie that is used as an argument to the next call to
> > READDIR/READDIRPLUS is taken from that page cache, we can end up
> > calling READDIRPLUS with a cookie that came from READDIR and vice
> > versa.
> >
> > As I said, I'm not convinced that is legal in RFC1813 (NFSv3).
> >
> > That is why we want to clear the page cache when we swap between use
> > of READDIR and use of READDIRPLUS for the case of NFSv3.
>
> Just curious, are you aware of an NFSv3 server implementation that
> would have a problem with a client that mixes the cookies?
>
>
> > > Thanks,
> > > -Dai
> > >
> > > > > OTOH, for NFSv4, there is no separate READDIRPLUS function, so
> > > > > there really does not appear to be any reason to clear the page
> > > > > cache at all as we're switching between requesting attributes
> > > > > or not.
> > > > >
> > > > Sorry... To spell out my objection to this change more clearly:
> > > > The call to nfs_zap_mapping() makes no sense in either case.
> > > > * It defers the cache invalidation until the next call to
> > > >   rewinddir()/opendir(), so it does not address the NFSv3 concern.
> > > > * It would appear to be entirely superfluous for the NFSv4 case.
> > > >
> > > > So a change that might be acceptable would be to keep the existing
> > > > call to invalidate_mapping_pages() for NFSv3, but to remove it for
> > > > NFSv4.
> > > >
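
A sketch of what that might look like (hypothetical, not a posted patch;
it assumes NFS_PROTO(dir)->version distinguishes NFSv3 from NFSv4):

/* Hypothetical variant: invalidate immediately only where the protocol
 * has a separate READDIRPLUS operation (NFSv3), and skip the
 * invalidation entirely for NFSv4. */
void nfs_force_use_readdirplus(struct inode *dir)
{
	struct nfs_inode *nfsi = NFS_I(dir);

	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
	    !list_empty(&nfsi->open_files)) {
		set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
		if (NFS_PROTO(dir)->version == 3)
			invalidate_mapping_pages(dir->i_mapping, 0, -1);
	}
}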
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > [email protected]
>
> --
> Chuck Lever
>
>
>
--
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
http://www.hammer.space


2020-01-18 18:04:33

by Dai Ngo

[permalink] [raw]
Subject: Re: 'ls -lrt' performance issue on large dir while dir is being modified


On 1/18/20 9:31 AM, Trond Myklebust wrote:
> On Sat, 2020-01-18 at 12:26 -0500, Chuck Lever wrote:
>>> On Jan 18, 2020, at 10:58 AM, Trond Myklebust <
>>> [email protected]> wrote:
>>>
>>> On Fri, 2020-01-17 at 18:29 -0800, Dai Ngo wrote:
>>>> Hi Trond,
>>>>
>>>> On 1/15/20 11:06 AM, Trond Myklebust wrote:
>>>>> On Wed, 2020-01-15 at 18:54 +0000, Trond Myklebust wrote:
>>>>>> On Wed, 2020-01-15 at 10:11 -0800, Dai Ngo wrote:
>>>>>>> Hi Anna, Trond,
>>>>>>>
>>>>>>> Would you please let me know your opinion regarding reverting the
>>>>>>> change in nfs_force_use_readdirplus to call nfs_zap_mapping
>>>>>>> instead of invalidate_mapping_pages.
>>>>>>> This change is to prevent the cookie of the READDIRPLUS from
>>>>>>> being reset to 0 while an instance of 'ls' is running and the
>>>>>>> directory is being modified.
>>>>>>>
>>>>>>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>>>>>>>> index a73e2f8bd8ec..5d4a64555fa7 100644
>>>>>>>> --- a/fs/nfs/dir.c
>>>>>>>> +++ b/fs/nfs/dir.c
>>>>>>>> @@ -444,7 +444,7 @@ void nfs_force_use_readdirplus(struct inode *dir)
>>>>>>>>      if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
>>>>>>>>          !list_empty(&nfsi->open_files)) {
>>>>>>>>          set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
>>>>>>>> -        invalidate_mapping_pages(dir->i_mapping, 0, -1);
>>>>>>>> +        nfs_zap_mapping(dir, dir->i_mapping);
>>>>>>>>      }
>>>>>>>>  }
>>>>>>> Thanks,
>>>>>>> -Dai
>>>>>>>
>>>>>> This change is only reverting part of commit 79f687a3de9e. My
>>>>>> problem with that is as follows:
>>>>>>
>>>>>> RFC1813 states that NFSv3 READDIRPLUS cookies and verifiers must
>>>>>> match those returned by previous READDIRPLUS calls, and READDIR
>>>>>> cookies and verifiers must match those returned by previous
>>>>>> READDIR calls. It says nothing about being able to assume cookies
>>>>>> from READDIR and READDIRPLUS calls are interchangeable. So the
>>>>>> only reason I can see for the invalidate_mapping_pages() is to
>>>>>> ensure that we do separate the two cookie caches.
>>>> If I understand your concern correctly that in NFSv3 the client
>>>> must maintain valid cookies and cookie verifiers when switching
>>>> between READDIR and READDIRPLUS, or vice versa, then I think the
>>>> current client code handles this condition OK.
>>>>
>>>> On the client, both READDIR and READDIRPLUS requests use the cookie
>>>> values from the same cached pages of the directory, so I don't think
>>>> they can be out of sync when the client switches between READDIRPLUS
>>>> and READDIR requests for different nfs_readdir calls.
>>>>
>>>> In fact, currently the first nfs_readdir uses READDIRPLUS to read
>>>> the entries, and if there is no LOOKUP/GETATTR on one of the
>>>> directory entries then the client reverts to READDIR for subsequent
>>>> nfs_readdir calls without invalidating any cached pages of the
>>>> directory. If there is a LOOKUP/GETATTR done on one of the directory
>>>> entries then nfs_advise_use_readdirplus is called, which forces the
>>>> client to use READDIRPLUS again for the next nfs_readdir.
>>>>
>>>> Unless the user mounts the export with the 'nordirplus' option, in
>>>> which case the client uses only READDIRs for all requests and no
>>>> switching takes place.
>>> I don't understand your point.
>> The original point was that the directory's page cache seems to
>> be cleared a little too often (quite apart from switching between
>> READDIRPLUS and READDIR).
>>
>> I think Dai is saying that it is appropriate to defer the cache
>> clearing when the directory's mtime has changed but the READ method
>> remains the same. Otherwise, repeatedly adding a new file to a very
>> large directory that is being read can trigger a situation where the
>> reading getdents loop never completes.
>>
> Fair enough, but the section of code he was touching with his patch
> should be only about switching between READDIR/PLUS methods.

The code that I'd like to patch, nfs_force_use_readdirplus, is called
when there is an attribute cache miss, which causes the client to make
sure READDIRPLUS is used on the directory, regardless of which READ
method is currently being used for the directory. So it's possible that
the call to nfs_force_use_readdirplus will cause the next
getdents/nfs_readdir to switch from READDIR to READDIRPLUS.
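
Roughly, the two hints differ as follows; this sketch paraphrases the
fs/nfs/dir.c of the kernels under discussion, with the "force" variant
shown in its pre-patch form:

/* Soft hint: ask the next nfs_readdir to prefer READDIRPLUS. */
void nfs_advise_use_readdirplus(struct inode *dir)
{
	set_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags);
}

/* Strong hint: set the same bit, but also invalidate the directory's
 * page cache so the next pass must refill it via READDIRPLUS.  The
 * invalidation is what restarts a concurrent 'ls' at cookie 0. */
void nfs_force_use_readdirplus(struct inode *dir)
{
	struct nfs_inode *nfsi = NFS_I(dir);

	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
	    !list_empty(&nfsi->open_files)) {
		set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
		invalidate_mapping_pages(dir->i_mapping, 0, -1);
	}
}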

>
> The mtime monitoring happens elsewhere and is already set up to
> invalidate the cache only on opendir()/rewinddir().
>
>> My two cents Euro.
>>
>>
>>> The issue is that nfs_advise_use_readdirplus() can cause the
>>> behaviour to switch between use of READDIRPLUS and use of READDIR
>>> from one getdents() syscall to the next.
>>> If the client is using the same page cache across those syscalls,
>>> then it will end up caching a mixture of cookies. Furthermore, since
>>> the cookie that is used as an argument to the next call to
>>> READDIR/READDIRPLUS is taken from that page cache, we can end up
>>> calling READDIRPLUS with a cookie that came from READDIR and vice
>>> versa.
>>>
>>> As I said, I'm not convinced that is legal in RFC1813 (NFSv3).

I think this is the contention point: the spec did not explicitly
mention anything regarding mixing of cookies from READDIR & READDIRPLUS.

However, as I mentioned, the current client implementation already mixes
cookies between READDIRPLUS and READDIR, every time the user does a
simple 'ls' on a large directory, without invalidating any mapping.

Also, as Chuck mentioned, we're not aware of any server implementation
that has problems with this mixing of cookies.

-Dai


2020-01-20 20:54:48

by Trond Myklebust

[permalink] [raw]
Subject: Re: 'ls -lrt' performance issue on large dir while dir is being modified

On Sat, 2020-01-18 at 10:03 -0800, Dai Ngo wrote:
>
> I think this is the contention point: the spec did not explicitly
> mention anything regarding mixing of cookies from READDIR &
> READDIRPLUS.
>
> However, as I mentioned, the current client implementation already
> mixes cookies between READDIRPLUS and READDIR, every time the user
> does a simple 'ls' on a large directory, without invalidating any
> mapping.
>
> Also, as Chuck mentioned, we're not aware of any server implementation
> that has problems with this mixing of cookies.
>

OK, I did a little time warp and went back to the original emails around
this behaviour:

https://linux-nfs.vger.kernel.narkive.com/O0Xhnqxe/readdir-vs-getattr


Part of the problem that needed solving at the time was that even when
the directory and its contents were not changing, clients still had to
issue reams of GETATTR calls just to satisfy an 'ls -l' or even an
'ls --color'. When we see that pattern, we want to switch from using
GETATTR on all these files to using READDIRPLUS.

The cache invalidation was introduced in order to force the NFS client
to do these READDIRPLUS calls so we avoid the GETATTRs.

So one optimisation we could definitely do is try to track the index of
the last page our descriptor accessed on readdir(), and truncate only
the remaining pages. That way we don't keep re-reading the beginning of
the directory.
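
A sketch of that idea (the context field and helper names here are
invented for illustration; nothing like this was posted in the thread):

/* Track the highest directory page the reader has consumed, then
 * invalidate only the pages beyond it, so a concurrent 'ls' never
 * has to refill from cookie 0. */
struct nfs_dir_ctx_sketch {
	/* ... existing nfs_open_dir_context fields ... */
	pgoff_t last_page_index;	/* highest page filled so far */
};

static void nfs_readdir_note_page(struct nfs_dir_ctx_sketch *ctx,
				  pgoff_t index)
{
	if (index > ctx->last_page_index)
		ctx->last_page_index = index;
}

static void nfs_force_readdirplus_tail(struct inode *dir,
				       struct nfs_dir_ctx_sketch *ctx)
{
	/* Keep pages [0, last_page_index]; drop only the tail. */
	invalidate_mapping_pages(dir->i_mapping,
				 ctx->last_page_index + 1, -1);
}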

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]