From: "Benjamin Coddington"
To: "linux-nfs list"
Subject: Concurrent `ls` takes out the thrash
Date: Wed, 07 Dec 2016 08:28:41 -0500

I was asked to figure out why the listing of very large directories was
slow. More specifically, why concurrently listing the same large
directory is /very/ slow. It seems that sometimes a user's reaction to
waiting for 'ls' to complete is to start a few more.. and then their
machine takes a very long time to complete that work. I can reproduce
that finding.

As an example:

time ls -fl /dir/with/200000/entries/ >/dev/null

real	0m10.766s
user	0m0.716s
sys	0m0.827s

But..

for i in {1..10}; do time ls -fl /dir/with/200000/entries/ >/dev/null & done

Each of these ^^ 'ls' commands will take 4 to 5 minutes to complete.

The problem is that concurrent 'ls' commands stack up in nfs_readdir(),
both waiting on the next page and taking turns filling the next page
with xdr, but only one of them will have desc->plus set, because setting
it clears the flag on the directory. So if a page is filled by a process
that doesn't have desc->plus set, the next pass through lookup() dumps
the entire page cache via nfs_force_use_readdirplus(), and the next
readdir starts all over re-filling the pagecache. Forward progress
happens, but only after many steps back re-filling the pagecache.

To me the most obvious fix would be to serialize nfs_readdir() on the
directory inode, so I'll follow up with a patch that does that with
nfsi->rwsem. With that, each of the above parallel 'ls' commands takes
12 seconds to complete. This only works because concurrent 'ls'
processes use a consistent buffer size, so a waiting nfs_readdir() that
started in the same place on an unmodified directory should always hit
the cache after waiting. Serializing nfs_readdir() will not solve this
problem for concurrent callers with differing buffer sizes, or callers
starting at different offsets, since there's a good chance the waiting
readdir() will not see the readdirplus flag when it resumes and so will
not prime the dcache.

While I think it's an OK fix, it feels bad to serialize. At the same
time, nfs_readdir() is already serialized on the pagecache when
concurrent callers need to go to the server. There might be other
problems I haven't thought about. Maybe there's another way to fix this,
or maybe we can just say "Don't do ls more than once, you impatient
bastards!"

Ben
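
P.S. To make the idea concrete, here's a rough, untested sketch of the
kind of serialization I mean. This is not the actual patch:
__nfs_do_readdir is a made-up name standing in for the existing body of
nfs_readdir(), and it assumes nfsi->rwsem is free to take here.

/* Sketch only: hold the directory's rwsem for the duration of readdir
 * so that only one task fills the pagecache at a time, and a reader
 * that has desc->plus set can't have its pages thrown away by a
 * concurrent reader that doesn't. */
static int nfs_readdir(struct file *file, struct dir_context *ctx)
{
	struct inode *inode = file_inode(file);
	struct nfs_inode *nfsi = NFS_I(inode);
	int res;

	down_write(&nfsi->rwsem);
	res = __nfs_do_readdir(file, ctx);	/* existing readdir loop */
	up_write(&nfsi->rwsem);
	return res;
}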