Subject: Fwd: Prioritizing readdirplus/getattr/lookup
From: Chuck Lever
Date: Mon, 4 Apr 2011 16:59:44 -0700
To: Andrew Klaassen
Cc: Linux NFS Mailing List
Message-Id: <4104737B-ECEA-46F7-BD26-39241DB8EBAF@oracle.com>
References: <8114CB12-77A9-47DB-A396-E30A4BF3742A@oracle.com>

Try again

Begin forwarded message:

> From: Chuck Lever
> Date: April 4, 2011 8:09:05 AM PDT
> To: Andrew Klaassen
> Cc: linux-nfs@vger.kernel.org
> Subject: Re: Prioritizing readdirplus/getattr/lookup
>
> On Apr 4, 2011, at 6:31 AM, Andrew Klaassen wrote:
>
>> How difficult would it be to make nfsd give priority to the calls generated by "ls -l" (i.e. readdirplus, getattr, lookup) over read and write calls?  Is it a matter of tweaking a couple of sysctls or changing a few lines of code, or would it mean a major re-write?
>>
>> I'm working in an environment where it's important to have reasonably good throughput for the HPC farm (50-200 machines reading and writing 5-10MB files as fast as they can pump them through), while simultaneously providing snappy responses to "ls -l" and equivalents for people reviewing file sizes and times, browsing through the filesystem, constructing new jobs, and so on.
>>
>> I've tried a small handful of server OSes (Solaris, Exastore, various Linux flavours, tunings, and nfsd counts) that do great on the throughput side but horribly on the "ls -l" under load side (as mentioned in my previous emails).
>>
>> However, I know that what I need is possible, because NetApp GX on very similar hardware (similar processor, memory, and spindle count) does slightly worse (20% or so) on total throughput but much better (10-100 times better than Solaris/Linux/Exastore) on under-load "ls -l" responsiveness.
>>
>> In the Linux case, I think I've narrowed the problem down to nfsd rather than the filesystem or VM system.  It's not the filesystem or VM system, because when the server is under heavy local load equivalent to my HPC farm load, both local and remote "ls -l" commands are fast.  It's not that the NFS load overwhelms the server, because when the server is under heavy HPC farm load, local "ls -l" commands are still fast.
>>
>> It's only when there's an NFS load and an NFS "ls -l" that the "ls -l" is slow.  Like so:
>>
>>                                      throughput   ls -l
>>                                      ==========   =====
>>   Heavy local load, local ls -l      fast         fast
>>   Heavy local load, NFS ls -l        fast         fast
>>   Heavy NFS load, local ls -l        fast         fast
>>   Heavy NFS load, NFS ls -l          fast         very slow
>
> I suspect this is a result of the POSIX requirement that stat(2) must return an mtime and size that reflect the very latest write to a file.  To meet this requirement, the client flushes all dirty data for a file before performing the GETATTR.
>
> The Linux VM allows a large amount of dirty data to be outstanding when writing a file.  This means it can take quite a while before that GETATTR can be done.
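
To put a number on that from the writing side, here is a rough sketch (untested; the path and sizes below are made up for illustration) that dirties a few hundred megabytes of page cache on an NFS mount and then times the stat(2) that follows; the flush described above should show up as fstat() latency on the writing client.

/*
 * Rough sketch only (untested; path and sizes are made up): dirty a
 * few hundred MB of page cache on an NFS mount, then time the fstat()
 * that follows.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/mnt/nfs/dirty-test";
	static char buf[1 << 20];		/* 1 MiB per write() */
	struct timespec t0, t1;
	struct stat st;
	int fd, i;

	memset(buf, 'x', sizeof(buf));
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (i = 0; i < 256; i++)		/* ~256 MiB of dirty data */
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			perror("write");
			return 1;
		}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (fstat(fd, &st) < 0) {		/* client flushes dirty data, then GETATTR */
		perror("fstat");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("fstat() after writing took %.3f seconds (size now %lld)\n",
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9,
	       (long long)st.st_size);
	close(fd);
	return 0;
}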
>
> As an experiment, you could try mounting the NFS file systems under test with the "sync" mount option.  This effectively caps the amount of dirty data the client can cache when writing a file.
>
>> This suggests to me that it's nfsd that's slowing down the "ls -l" response times, rather than the filesystem or VM system.
>>
>> Would fixing the bottom-right-corner case - even if it meant a modest throughput slowdown - be an easy tweak/patch?  Or a major re-write?  (Or just a kernel upgrade?)
>>
>> I know it's doable because the NetApp does it; the question is how large a job it would be on Linux.
>
> At a guess, the NetApp is faster because it commits data to permanent storage immediately during each WRITE operation, so the client doesn't have to keep much dirty data in its cache.  NFS writes are two-phase, and NetApps cut out one of those phases by telling the client that the second phase is not needed.  With the other server implementations, the client has to continue caching that data until it sends a COMMIT (the second phase).  If the server hasn't already committed that data to permanent storage at that point, the client has to wait.
>
>> Thanks again.
>>
>> FWIW, here's what I've tried so far, without success, to make this problem go away:
>>
>> Server side:
>>
>> kernels (all x86_64): 2.6.32-[something] on Scientific Linux 6.0, 2.6.32.4 on Slackware, 2.6.37.5 on Slackware
>> filesystems: xfs, ext4
>> nfsd counts: 8,32,64,127,128,256,1024
>> schedulers: cfq,deadline
>> export options: async,no_root_squash
>>
>> Client side:
>>
>> kernel: 2.6.31.14-0.6-desktop, x86_64, from openSUSE 11.3
>> hard,intr,noatime,vers=3,mountvers=3  # always on
>> rsize,wsize: 32768,65536
>> proto: tcp,udp
>> nolock                                # on or off
>> noac                                  # on or off
>> actimeo: 3,5,60,240,600               # I had really hoped this would help
>>
>> Andrew
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
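
P.S. For anyone following along, the two phases above are the NFSv3 WRITE and COMMIT operations.  RFC 1813 defines the "stable_how" values that a WRITE request and its reply carry to describe how far the server has committed the data; roughly, in C terms:

	enum stable_how {
		UNSTABLE  = 0,	/* server may cache the data; client must COMMIT later      */
		DATA_SYNC = 1,	/* data is on stable storage; some metadata may not be      */
		FILE_SYNC = 2	/* data and metadata are committed; no COMMIT is needed     */
	};

A server that answers every WRITE with committed = FILE_SYNC, as described above for the NetApp, lets the client drop its copy of the data right away instead of holding it until a later COMMIT succeeds.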