2016-12-07 13:29:03

by Benjamin Coddington

Subject: Concurrent `ls` takes out the thrash

I was asked to figure out why the listing of very large directories was
slow. More specifically, why concurrently listing the same large directory
is /very/ slow. It seems that sometimes a user's reaction to waiting for
'ls' to complete is to start a few more.. and then their machine takes a
very long time to complete that work.

I can reproduce that finding. As an example:

time ls -fl /dir/with/200000/entries/ >/dev/null

real 0m10.766s
user 0m0.716s
sys 0m0.827s

But..

for i in {1..10}; do time ls -fl /dir/with/200000/entries/ >/dev/null & done

Each of these ^^ 'ls' commands will take 4 to 5 minutes to complete.

The problem is that concurrent 'ls' commands stack up in nfs_readdir(),
both waiting on the next page and taking turns filling the next page with
xdr, but only one of them will have desc->plus set, because setting it
clears the flag on the directory. So if a page is filled by a process that
doesn't have desc->plus set, the next pass through lookup() dumps the
entire page cache with nfs_force_use_readdirplus(). Then the next readdir
starts all over filling the pagecache. Forward progress happens, but only
after many steps back re-filling the pagecache.
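
For reference, the heuristic in question looks roughly like this before
any fix (paraphrased from fs/nfs/dir.c; surrounding details may differ):

static
bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
{
	if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
		return false;
	/* Only the caller that wins this race sees the hint: the
	 * test_and_clear_bit() hides the advisory flag from every other
	 * concurrent reader of the same directory. */
	if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
		return true;
	/* The first read of the directory always uses readdirplus. */
	if (ctx->pos == 0)
		return true;
	return false;
}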

To me the most obvious fix would be to serialize nfs_readdir() on the
directory inode, so I'll follow up with a patch that does that with
nfsi->rwsem. With that, the above parallel 'ls' takes 12 seconds for each
'ls' to complete.

This only works because concurrent 'ls' commands use a consistent buffer
size, so a waiting nfs_readdir() started in the same place for an
unmodified directory should always hit the cache after waiting.
Serializing nfs_readdir() will not solve this problem for concurrent
callers with differing buffer sizes, or starting at different offsets,
since there's a good chance the waiting readdir() will not see the
readdirplus flag when it resumes and so will not prime the dcache.

While I think it's an OK fix, it feels bad to serialize. At the same time,
nfs_readdir() is already serialized on the pagecache when concurrent
callers need to go to the server. There might be other problems I haven't
thought about.

Maybe there's another way to fix this, or maybe we can just say "Don't do
ls more than once, you impatient bastards!"

Ben


2016-12-07 13:37:04

by Benjamin Coddington

Subject: [PATCH] NFS: Serialize nfs_readdir()

Optimizations to NFS to choose between READDIR and READDIRPLUS don't expect
concurrent users of nfs_readdir(), and can cause the pagecache to
repeatedly be invalidated unnecessarily for this case. Fix this by
serializing nfs_readdir() on the directory's rwsem.

Signed-off-by: Benjamin Coddington <[email protected]>
---
fs/nfs/dir.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 53e6d6a4bc9c..8321252a4c8d 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -900,6 +900,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct dentry *dentry = file_dentry(file);
 	struct inode *inode = d_inode(dentry);
+	struct nfs_inode *nfsi = NFS_I(inode);
 	nfs_readdir_descriptor_t my_desc,
 			*desc = &my_desc;
 	struct nfs_open_dir_context *dir_ctx = file->private_data;
@@ -917,6 +918,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	 */
 	memset(desc, 0, sizeof(*desc));
 
+	down_write(&nfsi->rwsem);
 	desc->file = file;
 	desc->ctx = ctx;
 	desc->dir_cookie = &dir_ctx->dir_cookie;
@@ -958,6 +960,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 		break;
 	} while (!desc->eof);
 out:
+	up_write(&nfsi->rwsem);
 	if (res > 0)
 		res = 0;
 	dfprintk(FILE, "NFS: readdir(%pD2) returns %d\n", file, res);
--
2.5.5


2016-12-07 15:46:37

by Trond Myklebust

Subject: Re: Concurrent `ls` takes out the thrash

> On Dec 7, 2016, at 08:28, Benjamin Coddington <[email protected]> wrote:
>
> I was asked to figure out why the listing of very large directories was
> slow. More specifically, why concurrently listing the same large
> directory is /very/ slow. It seems that sometimes a user's reaction to
> waiting for 'ls' to complete is to start a few more.. and then their
> machine takes a very long time to complete that work.
>
> I can reproduce that finding. As an example:
>
> time ls -fl /dir/with/200000/entries/ >/dev/null
>
> real 0m10.766s
> user 0m0.716s
> sys 0m0.827s
>
> But..
>
> for i in {1..10}; do time ls -fl /dir/with/200000/entries/ >/dev/null & done
>
> Each of these ^^ 'ls' commands will take 4 to 5 minutes to complete.
>
> The problem is that concurrent 'ls' commands stack up in nfs_readdir()
> both waiting on the next page and taking turns filling the next page
> with xdr, but only one of them will have desc->plus set because setting
> it clears the flag on the directory. So if a page is filled by a process
> that doesn't have desc->plus then the next pass through lookup(), it
> dumps the entire page cache with nfs_force_use_readdirplus(). Then the
> next readdir starts all over filling the pagecache. Forward progress
> happens, but only after many steps back re-filling the pagecache.

Yes, the readdir code was written well before Al's patches to parallelise
the VFS operations, and a lot of it did rely on the inode->i_mutex being
set on the directory by the VFS layer.

How about the following suggestion: instead of setting a flag on the
inode, we iterate through the entries in &nfsi->open_files, and set a flag
on the struct nfs_open_dir_context that the readdir processes can copy
into desc->plus. Does that help with your workload?


2016-12-07 16:30:26

by Christoph Hellwig

Subject: Re: [PATCH] NFS: Serialize nfs_readdir()

On Wed, Dec 07, 2016 at 08:37:02AM -0500, Benjamin Coddington wrote:
> Optimizations to NFS to choose between READDIR and READDIRPLUS don't expect
> concurrent users of nfs_readdir(), and can cause the pagecache to
> repeatedly be invalidated unnecessarily for this case. Fix this by
> serializing nfs_readdir() on the directory's rwsem.

Just implement .iterate instead of .iterate_shared and you'll get the
serialization for free. While doing that please add a comment on why
it's serialized.
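
A minimal sketch of that change, assuming the 4.9-era nfs_dir_operations
layout (the VFS takes the directory's i_rwsem exclusive for .iterate but
only shared for .iterate_shared):

const struct file_operations nfs_dir_operations = {
	.llseek		= nfs_llseek_dir,
	.read		= generic_read_dir,
	/*
	 * Use .iterate rather than .iterate_shared so the VFS serializes
	 * nfs_readdir() callers on the directory: the READDIR/READDIRPLUS
	 * heuristics can't cope with concurrent fillers of the pagecache.
	 */
	.iterate	= nfs_readdir,
	.open		= nfs_opendir,
	.release	= nfs_closedir,
	.fsync		= nfs_fsync_dir,
};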

2016-12-07 17:01:44

by kernel test robot

Subject: Re: [PATCH] NFS: Serialize nfs_readdir()

Hi Benjamin,

[auto build test ERROR on nfs/linux-next]
[also build test ERROR on v4.9-rc8 next-20161207]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Benjamin-Coddington/NFS-Serialize-nfs_readdir/20161208-001039
base: git://git.linux-nfs.org/projects/trondmy/linux-nfs.git linux-next
config: xtensa-common_defconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa

All errors (new ones prefixed by >>):

fs/nfs/dir.c: In function 'nfs_readdir':
>> fs/nfs/dir.c:921:18: error: 'struct nfs_inode' has no member named 'rwsem'
down_write(&nfsi->rwsem);
^
fs/nfs/dir.c:963:16: error: 'struct nfs_inode' has no member named 'rwsem'
up_write(&nfsi->rwsem);
^

vim +921 fs/nfs/dir.c

915 * *desc->dir_cookie has the cookie for the next entry. We have
916 * to either find the entry with the appropriate number or
917 * revalidate the cookie.
918 */
919 memset(desc, 0, sizeof(*desc));
920
> 921 down_write(&nfsi->rwsem);
922 desc->file = file;
923 desc->ctx = ctx;
924 desc->dir_cookie = &dir_ctx->dir_cookie;

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2016-12-07 19:40:18

by Benjamin Coddington

Subject: Re: [PATCH] NFS: Serialize nfs_readdir()

On 7 Dec 2016, at 11:30, Christoph Hellwig wrote:

> On Wed, Dec 07, 2016 at 08:37:02AM -0500, Benjamin Coddington wrote:
>> Optimizations to NFS to choose between READDIR and READDIRPLUS don't
>> expect
>> concurrent users of nfs_readdir(), and can cause the pagecache to
>> repeatedly be invalidated unnecessarily for this case. Fix this by
>> serializing nfs_readdir() on the directory's rwsem.
>
> Just implement .iterate instead of .iterate_shared and you'll get the
> serialization for free. While doing that please add a comment on why
> it's serialized.

I had it in my head that Al was trying to get rid of .iterate, otherwise
yes, this would be more sensible. I think it'll still benefit from
.iterate_shared in the regular case (the case where we're not trying to
use READDIRPLUS), so I'll try out Trond's suggestion to fix it up before
chasing this down.

Ben

2016-12-07 19:46:20

by Benjamin Coddington

Subject: Re: Concurrent `ls` takes out the thrash



On 7 Dec 2016, at 10:46, Trond Myklebust wrote:

>> On Dec 7, 2016, at 08:28, Benjamin Coddington <[email protected]> wrote:
>>
>> I was asked to figure out why the listing of very large directories was
>> slow. More specifically, why concurrently listing the same large
>> directory is /very/ slow. It seems that sometimes a user's reaction to
>> waiting for 'ls' to complete is to start a few more.. and then their
>> machine takes a very long time to complete that work.
>>
>> I can reproduce that finding. As an example:
>>
>> time ls -fl /dir/with/200000/entries/ >/dev/null
>>
>> real 0m10.766s
>> user 0m0.716s
>> sys 0m0.827s
>>
>> But..
>>
>> for i in {1..10}; do time ls -fl /dir/with/200000/entries/ >/dev/null & done
>>
>> Each of these ^^ 'ls' commands will take 4 to 5 minutes to complete.
>>
>> The problem is that concurrent 'ls' commands stack up in nfs_readdir()
>> both waiting on the next page and taking turns filling the next page
>> with xdr, but only one of them will have desc->plus set because setting
>> it clears the flag on the directory. So if a page is filled by a
>> process that doesn't have desc->plus then the next pass through
>> lookup(), it dumps the entire page cache with
>> nfs_force_use_readdirplus(). Then the next readdir starts all over
>> filling the pagecache. Forward progress happens, but only after many
>> steps back re-filling the pagecache.
>
> Yes, the readdir code was written well before Al's patches to
> parallelise the VFS operations, and a lot of it did rely on the
> inode->i_mutex being set on the directory by the VFS layer.
>
> How about the following suggestion: instead of setting a flag on the
> inode, we iterate through the entries in &nfsi->open_files, and set a
> flag on the struct nfs_open_dir_context that the readdir processes can
> copy into desc->plus. Does that help with your workload?

That should work.. I guess I'll hack it up and present it for dissection.

Thanks!
Ben

2016-12-07 22:34:40

by Benjamin Coddington

Subject: Re: Concurrent `ls` takes out the thrash

From: "Benjamin Coddington" <[email protected]>

So, I've got the following patch, and it seems to work fine for the
original workload. However, if I use ~20 procs started 2 seconds apart, I
can still sometimes get into the state where the machine takes a very long
time (5 - 8 minutes). I wonder if I am getting into a state where vmscan
is reclaiming pages while I'm trying to fill them up. So.. I'll do a bit
more debugging and re-post this if I feel like it is still the right
approach.

Adding an int to nfs_open_dir_context puts it at 60 bytes here. Probably
could have added an unsigned long flags and used bits for the duped stuff
as well, but it was uglier that way, so I left it. I like how duped
carries -1, 0, and >1 information without having flag defines cluttering
everywhere.

Ben

8<------------------------------------------------------------------------

2016-12-07 22:41:59

by Trond Myklebust

Subject: Re: Concurrent `ls` takes out the thrash


> On Dec 7, 2016, at 17:34, Benjamin Coddington <[email protected]> wrote:
>
> From: "Benjamin Coddington" <[email protected]>
>
> So, I've got the following patch, and it seems to work fine for the
> original workload. However, if I use ~20 procs started 2 seconds apart,
> I can still sometimes get into the state where the machine takes a very
> long time (5 - 8 minutes). I wonder if I am getting into a state where
> vmscan is reclaiming pages while I'm trying to fill them up. So.. I'll
> do a bit more debugging and re-post this if I feel like it is still the
> right approach.
>
> Adding an int to nfs_open_dir_context puts it at 60 bytes here. Probably
> could have added an unsigned long flags and used bits for the duped
> stuff as well, but it was uglier that way, so I left it. I like how
> duped carries -1, 0, and >1 information without having flag defines
> cluttering everywhere.
>
> Ben
>
> 8<------------------------------------------------------------------------
>
> From 5b3e747a984ec966e8c742d8f4fe4b08e1c93acd Mon Sep 17 00:00:00 2001
> Message-Id: <[email protected]>
> From: Benjamin Coddington <[email protected]>
> Date: Wed, 7 Dec 2016 16:23:30 -0500
> Subject: [PATCH] NFS: Move readdirplus flag to directory context
>
> For a concurrent 'ls -l' workload, processes can stack up in nfs_readdir()
> both waiting on the next page and taking turns filling the next page from
> XDR, but only one of them will have desc->plus set because setting it
> clears the flag on the directory. If a page is filled by a process that
> doesn't have desc->plus set then the next pass through lookup() will
> cause it to dump the entire page cache with nfs_force_use_readdirplus().
> Then the next readdir starts all over filling the pagecache. Forward
> progress happens, but only after many steps back re-filling the
> pagecache.
>
> Fix this by moving the flag to use readdirplus to each open directory
> context.
>
> Suggested-by: Trond Myklebust <[email protected]>
> Signed-off-by: Benjamin Coddington <[email protected]>
> ---
> fs/nfs/dir.c           | 39 ++++++++++++++++++++++++---------------
> include/linux/nfs_fs.h |  1 +
> 2 files changed, 25 insertions(+), 15 deletions(-)
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 7483722162fa..818172606fed 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -78,6 +78,7 @@ static struct nfs_open_dir_context *alloc_nfs_open_dir_context(struct inode *dir
> 		ctx->dir_cookie = 0;
> 		ctx->dup_cookie = 0;
> 		ctx->cred = get_rpccred(cred);
> +		ctx->use_readdir_plus = 0;
> 		spin_lock(&dir->i_lock);
> 		list_add(&ctx->list, &nfsi->open_files);
> 		spin_unlock(&dir->i_lock);
> @@ -443,17 +444,35 @@ int nfs_same_file(struct dentry *dentry, struct nfs_entry *entry)
> }
>
> static
> -bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
> +bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx,
> +			struct nfs_open_dir_context *dir_ctx)
> {
> 	if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
> 		return false;
> -	if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
> +	if (xchg(&dir_ctx->use_readdir_plus, 0))
> 		return true;
> 	if (ctx->pos == 0)
> 		return true;
> 	return false;
> }
>
> +static
> +void nfs_set_readdirplus(struct inode *dir, int force)
> +{
> +	struct nfs_inode *nfsi = NFS_I(dir);
> +	struct nfs_open_dir_context *ctx;
> +
> +	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
> +	    !list_empty(&nfsi->open_files)) {
> +		spin_lock(&dir->i_lock);
> +		list_for_each_entry(ctx, &nfsi->open_files, list)
> +			ctx->use_readdir_plus = 1;
> +		spin_unlock(&dir->i_lock);
> +		if (force)
> +			invalidate_mapping_pages(dir->i_mapping, 0, -1);
> +	}
> +}
> +
> /*
>  * This function is called by the lookup and getattr code to request the
>  * use of readdirplus to accelerate any future lookups in the same
> @@ -461,11 +480,7 @@ bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
>  */
> void nfs_advise_use_readdirplus(struct inode *dir)
> {
> -	struct nfs_inode *nfsi = NFS_I(dir);
> -
> -	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
> -	    !list_empty(&nfsi->open_files))
> -		set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
> +	nfs_set_readdirplus(dir, 0);
> }
>
> /*
> @@ -478,13 +493,7 @@ void nfs_advise_use_readdirplus(struct inode *dir)
>  */
> void nfs_force_use_readdirplus(struct inode *dir)
> {
> -	struct nfs_inode *nfsi = NFS_I(dir);
> -
> -	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
> -	    !list_empty(&nfsi->open_files)) {
> -		set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
> -		invalidate_mapping_pages(dir->i_mapping, 0, -1);
> -	}
> +	nfs_set_readdirplus(dir, 1);
> }
>
> static
> @@ -921,7 +930,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
> 	desc->ctx = ctx;
> 	desc->dir_cookie = &dir_ctx->dir_cookie;
> 	desc->decode = NFS_PROTO(inode)->decode_dirent;
> -	desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
> +	desc->plus = nfs_use_readdirplus(inode, ctx, dir_ctx) ? 1 : 0;

This fixes desc->plus at the beginning of the readdir() call. Perhaps we
should instead check the value of ctx->use_readdir_plus in the call to
nfs_readdir_xdr_filler(), and just reset ctx->use_readdir_plus at the very
end of nfs_readdir()?

>
> 	if (ctx->pos == 0 || nfs_attribute_cache_expired(inode))
> 		res = nfs_revalidate_mapping(inode, file->f_mapping);
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index cb631973839a..fe06c1dd1737 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -89,6 +89,7 @@ struct nfs_open_dir_context {
> 	__u64 dir_cookie;
> 	__u64 dup_cookie;
> 	signed char duped;
> +	int use_readdir_plus;
> };
>
> /*
> --
> 2.5.5
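
A rough sketch of that suggestion, reusing the names from the patch above
(the two helpers here are hypothetical; they would be called from
nfs_readdir_xdr_filler() before each page fill and from the end of
nfs_readdir() respectively):

static bool nfs_dir_ctx_wants_plus(struct nfs_open_dir_context *dir_ctx)
{
	/* Read without clearing, so "use readdirplus" feedback from a
	 * concurrent GETATTR or LOOKUP is picked up even after this
	 * readdir() call has started. */
	return READ_ONCE(dir_ctx->use_readdir_plus) != 0;
}

static void nfs_dir_ctx_plus_done(struct nfs_open_dir_context *dir_ctx)
{
	/* Reset once, at the very end of nfs_readdir(). */
	WRITE_ONCE(dir_ctx->use_readdir_plus, 0);
}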

2016-12-07 22:55:55

by Benjamin Coddington

Subject: Re: Concurrent `ls` takes out the thrash

On 7 Dec 2016, at 17:41, Trond Myklebust wrote:

>> On Dec 7, 2016, at 17:34, Benjamin Coddington <[email protected]> wrote:
>> static
>> @@ -921,7 +930,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
>> 	desc->ctx = ctx;
>> 	desc->dir_cookie = &dir_ctx->dir_cookie;
>> 	desc->decode = NFS_PROTO(inode)->decode_dirent;
>> -	desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
>> +	desc->plus = nfs_use_readdirplus(inode, ctx, dir_ctx) ? 1 : 0;
>
> This fixes desc->plus at the beginning of the readdir() call. Perhaps we
> should instead check the value of ctx->use_readdir_plus in the call to
> nfs_readdir_xdr_filler(), and just reset ctx->use_readdir_plus at the
> very end of nfs_readdir()?

I don't understand the functional difference. Is this just a preference?

There must be something else happening.. dcache is getting pruned under
pressure maybe, and that causes a miss, and then the process starts over?

Ben

2016-12-07 22:59:31

by Trond Myklebust

Subject: Re: Concurrent `ls` takes out the thrash

> On Dec 7, 2016, at 17:55, Benjamin Coddington <[email protected]> wrote:
>
> On 7 Dec 2016, at 17:41, Trond Myklebust wrote:
>
>>> On Dec 7, 2016, at 17:34, Benjamin Coddington <[email protected]> wrote:
>>> static
>>> @@ -921,7 +930,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
>>> 	desc->ctx = ctx;
>>> 	desc->dir_cookie = &dir_ctx->dir_cookie;
>>> 	desc->decode = NFS_PROTO(inode)->decode_dirent;
>>> -	desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
>>> +	desc->plus = nfs_use_readdirplus(inode, ctx, dir_ctx) ? 1 : 0;
>>
>> This fixes desc->plus at the beginning of the readdir() call. Perhaps we
>> should instead check the value of ctx->use_readdir_plus in the call to
>> nfs_readdir_xdr_filler(), and just reset ctx->use_readdir_plus at the
>> very end of nfs_readdir()?
>
> I don't understand the functional difference. Is this just a preference?

No. The functional difference is that we take into account the fact that a
process may be in the readdir() code while a GETATTR or LOOKUP from a
second process is triggering “use readdirplus” feedback.

> There must be something else happening.. dcache is getting pruned under
> pressure maybe, and that causes a miss, and then the process starts over?
>
> Ben

2016-12-07 23:10:51

by Benjamin Coddington

Subject: Re: Concurrent `ls` takes out the thrash

On 7 Dec 2016, at 17:59, Trond Myklebust wrote:

>> On Dec 7, 2016, at 17:55, Benjamin Coddington <[email protected]> wrote:
>>
>> On 7 Dec 2016, at 17:41, Trond Myklebust wrote:
>>
>>>> On Dec 7, 2016, at 17:34, Benjamin Coddington <[email protected]> wrote:
>>>> static
>>>> @@ -921,7 +930,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
>>>> 	desc->ctx = ctx;
>>>> 	desc->dir_cookie = &dir_ctx->dir_cookie;
>>>> 	desc->decode = NFS_PROTO(inode)->decode_dirent;
>>>> -	desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
>>>> +	desc->plus = nfs_use_readdirplus(inode, ctx, dir_ctx) ? 1 : 0;
>>>
>>> This fixes desc->plus at the beginning of the readdir() call. Perhaps
>>> we should instead check the value of ctx->use_readdir_plus in the call
>>> to nfs_readdir_xdr_filler(), and just reset ctx->use_readdir_plus at
>>> the very end of nfs_readdir()?
>>
>> I don't understand the functional difference. Is this just a preference?
>
> No. The functional difference is that we take into account the fact that
> a process may be in the readdir() code while a GETATTR or LOOKUP from a
> second process is triggering “use readdirplus” feedback.

This should only matter if the concurrent processes have different buffer
sizes or start at different offsets -- which shouldn't happen with plain
old 'ls -l'.

.. or maybe I'm wrong if, hmm.. if acdirmin ran out (maybe?).. or if we mix
'ls -l' and 'ls'.. or if we have pages getting reclaimed.. OK. I'll try it.

Thanks for your suggestions.
Ben

2016-12-08 14:19:25

by Benjamin Coddington

Subject: Re: Concurrent `ls` takes out the thrash


On 7 Dec 2016, at 18:10, Benjamin Coddington wrote:

> On 7 Dec 2016, at 17:59, Trond Myklebust wrote:
>
>>> On Dec 7, 2016, at 17:55, Benjamin Coddington <[email protected]> wrote:
>>>
>>> On 7 Dec 2016, at 17:41, Trond Myklebust wrote:
>>>
>>>>> On Dec 7, 2016, at 17:34, Benjamin Coddington <[email protected]> wrote:
>>>>> static
>>>>> @@ -921,7 +930,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
>>>>> 	desc->ctx = ctx;
>>>>> 	desc->dir_cookie = &dir_ctx->dir_cookie;
>>>>> 	desc->decode = NFS_PROTO(inode)->decode_dirent;
>>>>> -	desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
>>>>> +	desc->plus = nfs_use_readdirplus(inode, ctx, dir_ctx) ? 1 : 0;
>>>>
>>>> This fixes desc->plus at the beginning of the readdir() call. Perhaps
>>>> we should instead check the value of ctx->use_readdir_plus in the
>>>> call to nfs_readdir_xdr_filler(), and just reset
>>>> ctx->use_readdir_plus at the very end of nfs_readdir()?
>>>
>>> I don't understand the functional difference. Is this just a
>>> preference?
>>
>> No. The functional difference is that we take into account the fact
>> that a process may be in the readdir() code while a GETATTR or LOOKUP
>> from a second process is triggering “use readdirplus” feedback.
>
> This should only matter if the concurrent processes have different
> buffer sizes or start at different offsets -- which shouldn't happen
> with plain old 'ls -l'.
>
> .. or maybe I'm wrong if, hmm.. if acdirmin ran out (maybe?).. or if we
> mix 'ls -l' and 'ls'.. or if we have pages getting reclaimed.. OK. I'll
> try it.

This doesn't help.

The issue is that anything more than a couple of processes causes
contention on the directory's i_lock. The i_lock is taken (number of
entries) x (number of processes) times. The inode flags and serialization
worked far better.

If we add another inode flag, then we can add a DOING_RDPLUS. A process
entering nfs_readdir() that sees ADVISE and not DOING clears it and sets
DOING, and remembers to clear DOING on exit of nfs_readdir(). Any process
that sees DOING uses plus.

Ben
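
A rough sketch of that idea (hypothetical: NFS_INO_DOING_RDPLUS is an
assumed new bit alongside the existing NFS_INO_ADVISE_RDPLUS flag):

static bool nfs_readdir_start_plus(struct nfs_inode *nfsi, bool *cleanup)
{
	*cleanup = false;
	/* Someone else is mid-readdirplus: join in. */
	if (test_bit(NFS_INO_DOING_RDPLUS, &nfsi->flags))
		return true;
	/* We saw ADVISE and not DOING: claim it, and remember to clear
	 * DOING when we leave nfs_readdir(). */
	if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags)) {
		set_bit(NFS_INO_DOING_RDPLUS, &nfsi->flags);
		*cleanup = true;
		return true;
	}
	return false;
}

static void nfs_readdir_end_plus(struct nfs_inode *nfsi, bool cleanup)
{
	if (cleanup)
		clear_bit(NFS_INO_DOING_RDPLUS, &nfsi->flags);
}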

2016-12-08 16:13:20

by Benjamin Coddington

Subject: Re: Concurrent `ls` takes out the thrash

From: "Benjamin Coddington" <[email protected]>

Ehh.. I think it's kinda ugly, but this is what I came up with.

It works well, though, and handles everything I throw at it. It's not as
simple as Christoph's suggestion to just go back to .iterate, which would
solve this in a simpler way.

Ben


8<---------------------------------------------------------------------------------------

2016-12-08 21:48:47

by Benjamin Coddington

Subject: Re: Concurrent `ls` takes out the thrash

> Ehh.. I think it's kinda ugly, but this is what I came up with.
>
> It works well, though, and handles everything I throw at it. It's not as
> simple as Christoph's suggestion to just go back to .iterate, which
> would solve this in a simpler way.

After much testing, I can't find any way this is faster than using the old
serialized .iterate readdir. Additionally, at around 1M entries, this way
frequently needs to oom-kill things. Using .iterate, that doesn't happen,
and it is much faster at that scale.

Maybe my feeble brain is unable to think of the right workload to make
.iterate_shared preferable for NFS. Suggestions welcome.

Otherwise, I think serialized is the way to go.

Ben