I was asked to figure out why the listing of very large directories was
slow.  More specifically, why concurrently listing the same large directory
is /very/ slow.  It seems that sometimes a user's reaction to waiting for
'ls' to complete is to start a few more.. and then their machine takes a
very long time to complete that work.
I can reproduce that finding. As an example:
time ls -fl /dir/with/200000/entries/ >/dev/null
real 0m10.766s
user 0m0.716s
sys 0m0.827s
But..
for i in {1..10}; do time ls -fl /dir/with/200000/entries/ >/dev/null & done
Each of these ^^ 'ls' commands will take 4 to 5 minutes to complete.
The problem is that concurrent 'ls' commands stack up in nfs_readdir(), both
waiting on the next page and taking turns filling the next page with xdr,
but only one of them will have desc->plus set, because setting it clears the
flag on the directory.  So if a page is filled by a process that doesn't
have desc->plus set, then the next pass through lookup() dumps the entire
page cache with nfs_force_use_readdirplus().  Then the next readdir starts
all over filling the pagecache.  Forward progress happens, but only after
many steps back re-filling the pagecache.
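For reference, the hint consumption happens in nfs_use_readdirplus().
Roughly, the current logic looks like this (a reconstructed sketch based on
the diff hunks later in this thread, not a verbatim copy):

static
bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
{
	if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
		return false;
	/* test_and_clear: the first of N concurrent readers consumes the
	 * hint, so the other N-1 keep filling pages without readdirplus */
	if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
		return true;
	if (ctx->pos == 0)
		return true;
	return false;
}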
To me the most obvious fix would be to serialize nfs_readdir() on the
directory inode, so I'll follow up with a patch that does that using
nfsi->rwsem.  With that, each of the above parallel 'ls' commands completes
in about 12 seconds.
This only works because concurrent 'ls' processes use a consistent buffer
size, so a waiting nfs_readdir() that started in the same place on an
unmodified directory should always hit the cache after waiting.  Serializing
nfs_readdir() will not solve this problem for concurrent callers with
differing buffer sizes, or callers starting at different offsets, since
there's a good chance the waiting readdir() will not see the readdirplus
flag when it resumes, and so will not prime the dcache.
While I think it's an OK fix, it feels bad to serialize.  At the same time,
nfs_readdir() is already serialized on the pagecache when concurrent callers
need to go to the server.  There might be other problems I haven't thought
about.

Maybe there's another way to fix this, or maybe we can just say "Don't do ls
more than once, you impatient bastards!"
Ben
Optimizations to NFS to choose between READDIR and READDIRPLUS don't expect
concurrent users of nfs_readdir(), and can cause the pagecache to
repeatedly be invalidated unnecessarily for this case. Fix this by
serializing nfs_readdir() on the directory's rwsem.
Signed-off-by: Benjamin Coddington <[email protected]>
---
fs/nfs/dir.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 53e6d6a4bc9c..8321252a4c8d 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -900,6 +900,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
{
struct dentry *dentry = file_dentry(file);
struct inode *inode = d_inode(dentry);
+ struct nfs_inode *nfsi = NFS_I(inode);
nfs_readdir_descriptor_t my_desc,
*desc = &my_desc;
struct nfs_open_dir_context *dir_ctx = file->private_data;
@@ -917,6 +918,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
*/
memset(desc, 0, sizeof(*desc));
+ down_write(&nfsi->rwsem);
desc->file = file;
desc->ctx = ctx;
desc->dir_cookie = &dir_ctx->dir_cookie;
@@ -958,6 +960,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
break;
} while (!desc->eof);
out:
+ up_write(&nfsi->rwsem);
if (res > 0)
res = 0;
dfprintk(FILE, "NFS: readdir(%pD2) returns %d\n", file, res);
--
2.5.5
> On Dec 7, 2016, at 08:28, Benjamin Coddington <[email protected]> wrote:
>
> [...]
>
> The problem is that concurrent 'ls' commands stack up in nfs_readdir() both
> waiting on the next page and taking turns filling the next page with xdr,
> but only one of them will have desc->plus set because setting it clears the
> flag on the directory.  So if a page is filled by a process that doesn't
> have desc->plus set, then the next pass through lookup() dumps the entire
> page cache with nfs_force_use_readdirplus().  Then the next readdir starts
> all over filling the pagecache.  Forward progress happens, but only after
> many steps back re-filling the pagecache.

Yes, the readdir code was written well before Al's patches to parallelise
the VFS operations, and a lot of it did rely on the inode->i_mutex being
set on the directory by the VFS layer.

How about the following suggestion: instead of setting a flag on the inode,
we iterate through the entries in &nfsi->open_files, and set a flag on the
struct nfs_open_dir_context that the readdir processes can copy into
desc->plus.  Does that help with your workload?
On Wed, Dec 07, 2016 at 08:37:02AM -0500, Benjamin Coddington wrote:
> Optimizations to NFS to choose between READDIR and READDIRPLUS don't expect
> concurrent users of nfs_readdir(), and can cause the pagecache to
> repeatedly be invalidated unnecessarily for this case. Fix this by
> serializing nfs_readdir() on the directory's rwsem.
Just implement .iterate instead of .iterate_shared and you'll get the
serialization for free. While doing that please add a comment on why
it's serialized.
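For reference, a minimal sketch of Christoph's suggestion, assuming the
v4.9-era nfs_dir_operations table.  With .iterate the VFS takes the
directory's inode lock exclusively around each call, so concurrent readers
are serialized without any NFS-private locking:

const struct file_operations nfs_dir_operations = {
	.llseek		= nfs_llseek_dir,
	.read		= generic_read_dir,
	/* .iterate rather than .iterate_shared: the VFS then holds the
	 * directory's inode lock exclusively across nfs_readdir(), which
	 * serializes concurrent readers and keeps the READDIRPLUS
	 * heuristics single-threaded. */
	.iterate	= nfs_readdir,
	.open		= nfs_opendir,
	.release	= nfs_closedir,
	.fsync		= nfs_fsync_dir,
};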
Hi Benjamin,
[auto build test ERROR on nfs/linux-next]
[also build test ERROR on v4.9-rc8 next-20161207]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Benjamin-Coddington/NFS-Serialize-nfs_readdir/20161208-001039
base: git://git.linux-nfs.org/projects/trondmy/linux-nfs.git linux-next
config: xtensa-common_defconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa
All errors (new ones prefixed by >>):
fs/nfs/dir.c: In function 'nfs_readdir':
>> fs/nfs/dir.c:921:18: error: 'struct nfs_inode' has no member named 'rwsem'
down_write(&nfsi->rwsem);
^
fs/nfs/dir.c:963:16: error: 'struct nfs_inode' has no member named 'rwsem'
up_write(&nfsi->rwsem);
^
vim +921 fs/nfs/dir.c
915 * *desc->dir_cookie has the cookie for the next entry. We have
916 * to either find the entry with the appropriate number or
917 * revalidate the cookie.
918 */
919 memset(desc, 0, sizeof(*desc));
920
> 921 down_write(&nfsi->rwsem);
922 desc->file = file;
923 desc->ctx = ctx;
924 desc->dir_cookie = &dir_ctx->dir_cookie;
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On 7 Dec 2016, at 11:30, Christoph Hellwig wrote:
> On Wed, Dec 07, 2016 at 08:37:02AM -0500, Benjamin Coddington wrote:
>> Optimizations to NFS to choose between READDIR and READDIRPLUS don't
>> expect concurrent users of nfs_readdir(), and can cause the pagecache to
>> repeatedly be invalidated unnecessarily for this case.  Fix this by
>> serializing nfs_readdir() on the directory's rwsem.
>
> Just implement .iterate instead of .iterate_shared and you'll get the
> serialization for free. While doing that please add a comment on why
> it's serialized.
I had it in my head that Al was trying to get rid of .iterate, otherwise
yes, this would be more sensible.  I think it'll still benefit from
.iterate_shared in the regular case (the case where we're not trying to use
READDIRPLUS), so I'll try out Trond's suggestion to fix it up before chasing
this down.
Ben
On 7 Dec 2016, at 10:46, Trond Myklebust wrote:
>> On Dec 7, 2016, at 08:28, Benjamin Coddington <[email protected]> wrote:
>>
>> I was asked to figure out why the listing of very large directories was
>> slow.  More specifically, why concurrently listing the same large
>> directory is /very/ slow.  It seems that sometimes a user's reaction to
>> waiting for 'ls' to complete is to start a few more.. and then their
>> machine takes a very long time to complete that work.
>>
>> I can reproduce that finding.  As an example:
>>
>> time ls -fl /dir/with/200000/entries/ >/dev/null
>>
>> real 0m10.766s
>> user 0m0.716s
>> sys 0m0.827s
>>
>> But..
>>
>> for i in {1..10}; do time ls -fl /dir/with/200000/entries/ >/dev/null & done
>>
>> Each of these ^^ 'ls' commands will take 4 to 5 minutes to complete.
>>
>> The problem is that concurrent 'ls' commands stack up in nfs_readdir(),
>> both waiting on the next page and taking turns filling the next page with
>> xdr, but only one of them will have desc->plus set, because setting it
>> clears the flag on the directory.  So if a page is filled by a process
>> that doesn't have desc->plus set, then the next pass through lookup()
>> dumps the entire page cache with nfs_force_use_readdirplus().  Then the
>> next readdir starts all over filling the pagecache.  Forward progress
>> happens, but only after many steps back re-filling the pagecache.
>
> Yes, the readdir code was written well before Al's patches to parallelise
> the VFS operations, and a lot of it did rely on the inode->i_mutex being
> set on the directory by the VFS layer.
>
> How about the following suggestion: instead of setting a flag on the inode,
> we iterate through the entries in &nfsi->open_files, and set a flag on the
> struct nfs_open_dir_context that the readdir processes can copy into
> desc->plus.  Does that help with your workload?
That should work.. I guess I'll hack it up and present it for
dissection.
Thanks!
Ben
From: "Benjamin Coddington" <[email protected]>
So, I've got the following patch, and it seems to work fine for the original
workload.  However, if I use ~20 procs started 2 seconds apart, I can still
sometimes get into the state where the machine takes a very long time (5 - 8
minutes).  I wonder if I am getting into a state where vmscan is reclaiming
pages while I'm trying to fill them up.  So.. I'll do a bit more debugging
and re-post this if I feel like it is still the right approach.

Adding an int to nfs_open_dir_context puts it at 60 bytes here.  Probably
could have added an unsigned long flags and used bits for the duped stuff as
well, but it was uglier that way, so I left it.  I like how duped carries
-1, 0, and >1 information without having flag defines cluttering everywhere.
Ben
8<------------------------------------------------------------------------
> On Dec 7, 2016, at 17:34, Benjamin Coddington <[email protected]> wrote:
>
> From: "Benjamin Coddington" <[email protected]>
>
> So, I've got the following patch, and it seems to work fine for the original
> workload.  However, if I use ~20 procs started 2 seconds apart, I can still
> sometimes get into the state where the machine takes a very long time (5 - 8
> minutes).  I wonder if I am getting into a state where vmscan is reclaiming
> pages while I'm trying to fill them up.  So.. I'll do a bit more debugging
> and re-post this if I feel like it is still the right approach.
>
> Adding an int to nfs_open_dir_context puts it at 60 bytes here.  Probably
> could have added an unsigned long flags and used bits for the duped stuff as
> well, but it was uglier that way, so I left it.  I like how duped carries
> -1, 0, and >1 information without having flag defines cluttering everywhere.
>
> Ben
>
> 8<------------------------------------------------------------------------
>
> From 5b3e747a984ec966e8c742d8f4fe4b08e1c93acd Mon Sep 17 00:00:00 2001
> Message-Id: <5b3e747a984ec966e8c742d8f4fe4b08e1c93acd.1481149380.git.[email protected]>
> From: Benjamin Coddington <[email protected]>
> Date: Wed, 7 Dec 2016 16:23:30 -0500
> Subject: [PATCH] NFS: Move readdirplus flag to directory context
>
> For a concurrent 'ls -l' workload, processes can stack up in nfs_readdir()
> both waiting on the next page and taking turns filling the next page from
> XDR, but only one of them will have desc->plus set because setting it
> clears the flag on the directory.  If a page is filled by a process that
> doesn't have desc->plus set then the next pass through lookup() will cause
> it to dump the entire page cache with nfs_force_use_readdirplus().  Then
> the next readdir starts all over filling the pagecache.  Forward progress
> happens, but only after many steps back re-filling the pagecache.
>
> Fix this by moving the flag to use readdirplus to each open directory
> context.
>
> Suggested-by: Trond Myklebust <[email protected]>
> Signed-off-by: Benjamin Coddington <[email protected]>
> ---
> fs/nfs/dir.c           | 39 ++++++++++++++++++++++++---------------
> include/linux/nfs_fs.h |  1 +
> 2 files changed, 25 insertions(+), 15 deletions(-)
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 7483722162fa..818172606fed 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -78,6 +78,7 @@ static struct nfs_open_dir_context *alloc_nfs_open_dir_context(struct inode *dir
> 		ctx->dir_cookie = 0;
> 		ctx->dup_cookie = 0;
> 		ctx->cred = get_rpccred(cred);
> +		ctx->use_readdir_plus = 0;
> 		spin_lock(&dir->i_lock);
> 		list_add(&ctx->list, &nfsi->open_files);
> 		spin_unlock(&dir->i_lock);
> @@ -443,17 +444,35 @@ int nfs_same_file(struct dentry *dentry, struct nfs_entry *entry)
> }
>
> static
> -bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
> +bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx,
> +			struct nfs_open_dir_context *dir_ctx)
> {
> 	if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
> 		return false;
> -	if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
> +	if (xchg(&dir_ctx->use_readdir_plus, 0))
> 		return true;
> 	if (ctx->pos == 0)
> 		return true;
> 	return false;
> }
>
> +static
> +void nfs_set_readdirplus(struct inode *dir, int force)
> +{
> +	struct nfs_inode *nfsi = NFS_I(dir);
> +	struct nfs_open_dir_context *ctx;
> +
> +	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
> +	    !list_empty(&nfsi->open_files)) {
> +		spin_lock(&dir->i_lock);
> +		list_for_each_entry(ctx, &nfsi->open_files, list)
> +			ctx->use_readdir_plus = 1;
> +		spin_unlock(&dir->i_lock);
> +		if (force)
> +			invalidate_mapping_pages(dir->i_mapping, 0, -1);
> +	}
> +}
> +
> /*
>  * This function is called by the lookup and getattr code to request the
>  * use of readdirplus to accelerate any future lookups in the same
> @@ -461,11 +480,7 @@ bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
>  */
> void nfs_advise_use_readdirplus(struct inode *dir)
> {
> -	struct nfs_inode *nfsi = NFS_I(dir);
> -
> -	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
> -	    !list_empty(&nfsi->open_files))
> -		set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
> +	nfs_set_readdirplus(dir, 0);
> }
>
> /*
> @@ -478,13 +493,7 @@ void nfs_advise_use_readdirplus(struct inode *dir)
>  */
> void nfs_force_use_readdirplus(struct inode *dir)
> {
> -	struct nfs_inode *nfsi = NFS_I(dir);
> -
> -	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
> -	    !list_empty(&nfsi->open_files)) {
> -		set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
> -		invalidate_mapping_pages(dir->i_mapping, 0, -1);
> -	}
> +	nfs_set_readdirplus(dir, 1);
> }
>
> static
> @@ -921,7 +930,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
> 	desc->ctx = ctx;
> 	desc->dir_cookie = &dir_ctx->dir_cookie;
> 	desc->decode = NFS_PROTO(inode)->decode_dirent;
> -	desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
> +	desc->plus = nfs_use_readdirplus(inode, ctx, dir_ctx) ? 1 : 0;

This fixes desc->plus at the beginning of the readdir() call.  Perhaps we
should instead check the value of ctx->use_readdir_plus in the call to
nfs_readdir_xdr_filler(), and just reset ctx->use_readdir_plus at the very
end of nfs_readdir()?

>
> 	if (ctx->pos == 0 || nfs_attribute_cache_expired(inode))
> 		res = nfs_revalidate_mapping(inode, file->f_mapping);
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index cb631973839a..fe06c1dd1737 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -89,6 +89,7 @@ struct nfs_open_dir_context {
> 	__u64 dir_cookie;
> 	__u64 dup_cookie;
> 	signed char duped;
> +	int use_readdir_plus;
> };
>
> /*
> --
> 2.5.5
>
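A rough sketch of that refinement (hypothetical and untested, assuming the
v4.9-era nfs_readdir_xdr_filler() signature; the RPC plumbing is elided):

static
int nfs_readdir_xdr_filler(struct page **pages, nfs_readdir_descriptor_t *desc,
			struct nfs_entry *entry, struct file *file,
			struct inode *inode)
{
	struct nfs_open_dir_context *dir_ctx = file->private_data;

	/* Feedback from a concurrent LOOKUP/GETATTR may arrive after
	 * nfs_readdir() sampled the flag once at entry; re-sample it on
	 * every trip to the server instead. */
	if (READ_ONCE(dir_ctx->use_readdir_plus))
		desc->plus = 1;

	/* ... issue READDIR or READDIRPLUS based on desc->plus, as the
	 * existing code does ... */
	return 0;
}

The flag would then be cleared just once, at the very end of nfs_readdir(),
rather than consumed by the first caller to test it.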
On 7 Dec 2016, at 17:41, Trond Myklebust wrote:
>> On Dec 7, 2016, at 17:34, Benjamin Coddington <[email protected]>
>> wrote:
>> static
>> @@ -921,7 +930,7 @@ static int nfs_readdir(struct file *file, struct
>> dir_context *ctx)
>> desc->ctx = ctx;
>> desc->dir_cookie = &dir_ctx->dir_cookie;
>> desc->decode = NFS_PROTO(inode)->decode_dirent;
>> - desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
>> + desc->plus = nfs_use_readdirplus(inode, ctx, dir_ctx) ? 1 : 0;
>
> This fixes desc->plus at the beginning of the readdir() call.  Perhaps we
> should instead check the value of ctx->use_readdir_plus in the call to
> nfs_readdir_xdr_filler(), and just reset ctx->use_readdir_plus at the very
> end of nfs_readdir()?
I don't understand the functional difference.  Is this just a preference?

There must be something else happening.. maybe the dcache is getting pruned
under pressure, which causes a miss, and then the process starts over?

Ben
> On Dec 7, 2016, at 17:55, Benjamin Coddington <[email protected]> wrote:
>
> [...]
>
> I don't understand the functional difference.  Is this just a preference?

No.  The functional difference is that we take into account the fact that a
process may be in the readdir() code while a GETATTR or LOOKUP from a second
process is triggering "use readdirplus" feedback.
On 7 Dec 2016, at 17:59, Trond Myklebust wrote:
>> On Dec 7, 2016, at 17:55, Benjamin Coddington <[email protected]> wrote:
>>
>> On 7 Dec 2016, at 17:41, Trond Myklebust wrote:
>>
>>>> On Dec 7, 2016, at 17:34, Benjamin Coddington <[email protected]> wrote:
>>>> static
>>>> @@ -921,7 +930,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
>>>> 	desc->ctx = ctx;
>>>> 	desc->dir_cookie = &dir_ctx->dir_cookie;
>>>> 	desc->decode = NFS_PROTO(inode)->decode_dirent;
>>>> -	desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
>>>> +	desc->plus = nfs_use_readdirplus(inode, ctx, dir_ctx) ? 1 : 0;
>>>
>>> This fixes desc->plus at the beginning of the readdir() call.  Perhaps we
>>> should instead check the value of ctx->use_readdir_plus in the call to
>>> nfs_readdir_xdr_filler(), and just reset ctx->use_readdir_plus at the
>>> very end of nfs_readdir()?
>>
>> I don't understand the functional difference.  Is this just a preference?
>
> No.  The functional difference is that we take into account the fact that
> a process may be in the readdir() code while a GETATTR or LOOKUP from a
> second process is triggering "use readdirplus" feedback.

This should only matter if the concurrent processes have different buffer
sizes or start at different offsets -- which shouldn't happen with plain old
'ls -l'.

.. or maybe I'm wrong if, hmm.. if acdirmin ran out (maybe?).. or if we mix
'ls -l' and 'ls'.. or if we have pages getting reclaimed.. OK.  I'll try it.

Thanks for your suggestions.

Ben
On 7 Dec 2016, at 18:10, Benjamin Coddington wrote:
> On 7 Dec 2016, at 17:59, Trond Myklebust wrote:
>
>>> On Dec 7, 2016, at 17:55, Benjamin Coddington <[email protected]> wrote:
>>>
>>> On 7 Dec 2016, at 17:41, Trond Myklebust wrote:
>>>
>>>>> On Dec 7, 2016, at 17:34, Benjamin Coddington <[email protected]> wrote:
>>>>> static
>>>>> @@ -921,7 +930,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
>>>>> 	desc->ctx = ctx;
>>>>> 	desc->dir_cookie = &dir_ctx->dir_cookie;
>>>>> 	desc->decode = NFS_PROTO(inode)->decode_dirent;
>>>>> -	desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
>>>>> +	desc->plus = nfs_use_readdirplus(inode, ctx, dir_ctx) ? 1 : 0;
>>>>
>>>> This fixes desc->plus at the beginning of the readdir() call.  Perhaps
>>>> we should instead check the value of ctx->use_readdir_plus in the call
>>>> to nfs_readdir_xdr_filler(), and just reset ctx->use_readdir_plus at
>>>> the very end of nfs_readdir()?
>>>
>>> I don't understand the functional difference.  Is this just a preference?
>>
>> No.  The functional difference is that we take into account the fact that
>> a process may be in the readdir() code while a GETATTR or LOOKUP from a
>> second process is triggering "use readdirplus" feedback.
>
> This should only matter if the concurrent processes have different buffer
> sizes or start at different offsets -- which shouldn't happen with plain
> old 'ls -l'.
>
> .. or maybe I'm wrong if, hmm.. if acdirmin ran out (maybe?).. or if we mix
> 'ls -l' and 'ls'.. or if we have pages getting reclaimed.. OK.  I'll try it.
This doesn't help.

The issue is that anything more than a couple of processes causes contention
on the directory's i_lock.  The i_lock is taken on the order of (entries x
processes) times.  The inode flags and serialization worked far better.

If we add another inode flag, DOING_RDPLUS, then a process entering
nfs_readdir() that sees ADVISE and not DOING clears ADVISE, sets DOING, and
remembers to clear DOING on exit from nfs_readdir().  Any process that sees
DOING uses plus.  Something like the sketch below.

Ben
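A sketch of that two-flag handoff (entirely hypothetical; NFS_INO_DOING_RDPLUS
is an invented bit, not an existing flag):

/* Returns true if this pass through nfs_readdir() should use readdirplus.
 * The caller that claimed the ADVISE hint becomes the "owner" and must
 * clear NFS_INO_DOING_RDPLUS when it leaves nfs_readdir(). */
static
bool nfs_readdir_use_plus(struct inode *dir, bool *plus_owner)
{
	struct nfs_inode *nfsi = NFS_I(dir);

	/* Someone else is mid-readdirplus: follow along with plus. */
	if (test_bit(NFS_INO_DOING_RDPLUS, &nfsi->flags))
		return true;
	/* First to see the advice: claim it and mark a pass in progress. */
	if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags)) {
		set_bit(NFS_INO_DOING_RDPLUS, &nfsi->flags);
		*plus_owner = true;
		return true;
	}
	return false;
}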
From: "Benjamin Coddington" <[email protected]>
Ehh.. I think it's kinda ugly, but this is what I came up with.

It works well, though, and handles everything I throw at it.  It's not as
simple as Christoph's suggestion to just go back to .iterate, which would
solve this more directly.

Ben
8<---------------------------------------------------------------------------------------
> Ehh.. I think it's kinda ugly, but this is what I came up with.
>
> It works well, though, and handles everything I throw at it.  It's not as
> simple as Christoph's suggestion to just go back to .iterate, which would
> solve this more directly.

After much testing, I can't find any way this is faster than using the old
serialized .iterate readdir.  Additionally, at around 1M entries, this
approach frequently needs to OOM-kill things.  Using .iterate, that doesn't
happen, and it is much faster at that scale.

Maybe my feeble brain is unable to think of the right workload that would
make .iterate_shared preferable for NFS.  Suggestions welcome.

Otherwise, I think serialized is the way to go.

Ben