Try again
Begin forwarded message:
> From: Chuck Lever <[email protected]>
> Date: April 4, 2011 8:09:05 AM PDT
> To: Andrew Klaassen <[email protected]>
> Cc: [email protected]
> Subject: Re: Prioritizing readdirplus/getattr/lookup
>
>
> On Apr 4, 2011, at 6:31 AM, Andrew Klaassen wrote:
>
>> How difficult would it be to make nfsd give priority to the calls generated by "ls -l" (i.e. readdirplus, getattr, lookup) over read and write calls? Is it a matter of tweaking a couple of sysctls or changing a few lines of code, or would it mean a major re-write?
>>
>> I'm working in an environment where it's important to have reasonably good throughput for the HPC farm (50-200 machines reading and writing 5-10MB files as fast as they can pump them through), while simultaneously providing snappy responses to "ls -l" and equivalents for people reviewing file sizes and times and browsing through the filesystem and constructing new jobs and whatnot.
>>
>> I've tried a small handful of server OSes (Solaris, Exastore, various Linux flavours and tunings and nfsd counts) that do great on the throughput side but horrible on the "ls -l" under load side (as mentioned in my previous emails).
>>
>> However, I know what I need is possible, because Netapp GX on very similar hardware (similar processor, memory, and spindle count), does slightly worse (20% or so) on total throughput but much better (10-100 times better than Solaris/Linux/Exastore) on under-load "ls -l" responsiveness.
>>
>> In the Linux case, I think I've narrowed the problem down to nfsd rather than the filesystem or VM system. It's not the filesystem or VM system, because when the server is under heavy local load equivalent to my HPC farm load, both local and remote "ls -l" commands are fast. It's not that the NFS load overwhelms the server, because when the server is under heavy HPC farm load, local "ls -l" commands are still fast.
>>
>> It's only when there's an NFS load and an NFS "ls -l" that the "ls -l" is slow. Like so:
>>
>>                                  throughput   ls -l
>>                                  ==========   =====
>> Heavy local load, local ls -l    fast         fast
>> Heavy local load, NFS ls -l      fast         fast
>> Heavy NFS load, local ls -l      fast         fast
>> Heavy NFS load, NFS ls -l        fast         very slow
>
> I suspect this is a result of the POSIX requirement that stat(2) must return mtime and size that reflects the very latest write to a file. To meet this requirement, the client flushes all dirty data for a file before performing the GETATTR.
>
> The Linux VM allows a large amount of dirty data outstanding when writing a file. This means it could take quite a while before that GETATTR can be done.
>
> As an experiment, you could try mounting the NFS file systems under test with the "sync" mount option. This effectively caps the amount of dirty data the client can cache when writing a file.
>
>> This suggests to me that it's nfsd that's slowing down the ls -l response times rather than the filesystem or VM system.
>>
>> Would fixing the bottom-right-corner case - even if it meant a modest throughput slowdown - be an easy tweak/patch? Or major re-write? (Or just a kernel upgrade?)
>>
>> I know it's doable because the Netapp does it; the question is how large a job would it be on Linux.
>
> At a guess, NetApp is faster because it commits data to permanent storage immediately during each WRITE operation, so the client doesn't have to keep much dirty data in its cache. NFS writes are two-phase, and NetApps cut out one of those phases by telling the client the second phase is not needed. With the other server implementations, the client has to continue caching that data until it does a COMMIT (the second phase). If the servers haven't already committed that data to permanent storage at that point, the client has to wait.
>
>> Thanks again.
>>
>>
>> FWIW, here's what I've tried so far to try to make this problem go away without success:
>>
>> Server side:
>>
>> kernels (all x86_64): 2.6.32-[something] on Scientific Linux 6.0, 2.6.32.4 on Slackware, 2.6.37.5 on Slackware
>> filesystems: xfs, ext4
>> nfsd counts: 8,32,64,127,128,256,1024
>> schedulers: cfq,deadline
>> export options: async,no_root_squash
>>
>> Client side:
>> kernel: 2.6.31.14-0.6-desktop, x86_64, from openSUSE 11.3
>> hard,intr,noatime,vers=3,mountvers=3 # always on
>> rsize,wsize: 32768,65536
>> proto: tcp,udp
>> nolock # on or off
>> noac # on or off
>> actimeo: 3,5,60,240,600 # I had really hoped this would help
>>
>> Andrew
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
--- On Mon, 4/4/11, Chuck Lever <[email protected]> wrote:
> I suspect this is a result of the POSIX requirement
> that stat(2) must return mtime and size that reflects the
> very latest write to a file. To meet this requirement,
> the client flushes all dirty data for a file before
> performing the GETATTR.
>
> The Linux VM allows a large amount of dirty data
> outstanding when writing a file. This means it could
> take quite a while before that GETATTR can be done.
I've done some more benchmarking, and in my case writes appear to *not* be the culprit.
Having the HPC farm only reading files (with noatime set everywhere, of course) actually makes "ls -l" over NFS slightly ~slower~.
Having the HPC farm only reading files that all fit in the server's cache makes "ls -l" over NFS yet slower. (I watched iostat while this was running to make sure that nothing was being written to or read from disk.)
So I've eliminated the disk as a bottleneck, and (as per my earlier emails) I've eliminated the filesystem and VM system.
It really does look at this point like nfsd is the choke point.
Andrew
On Tue, Apr 05, 2011 at 09:32:38AM -0700, Andrew Klaassen wrote:
> I've done some more benchmarking, and in my case writes appear to *not* be the culprit.
>
> Having the HPC farm only reading files (with noatime set everywhere, of course) actually makes "ls -l" over NFS slightly ~slower~.
Do you know which part of 'ls -l' is taking longer? The readdir, or the
stats? (Would an strace determine it?) Probably the latter, I guess.
(Apologies if you already said.)
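One rough way to split the two phases without strace is to time them separately from a client. The sketch below is only an approximation of what "ls -l" does: the function name time_ls_phases is made up for illustration, os.listdir stands in for the readdir phase and os.lstat for the per-entry stat phase. It assumes the client's attribute cache is cold; if attributes are already cached (for example from a READDIRPLUS reply), the lstat calls may never reach the server and the numbers will understate the GETATTR cost.

```python
import os
import time


def time_ls_phases(path):
    """Return (readdir_seconds, stat_seconds, entry_count) for one directory.

    Hypothetical helper: approximates the two phases of "ls -l".
    """
    t0 = time.perf_counter()
    names = os.listdir(path)  # roughly the readdir phase
    t1 = time.perf_counter()
    for name in names:
        try:
            os.lstat(os.path.join(path, name))  # roughly the per-entry stat phase
        except FileNotFoundError:
            pass  # entry vanished between the listing and the stat
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, len(names)
```

Calling time_ls_phases on a directory under the NFS mount, once with the farm idle and once under load, should show which of the two timings grows.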
> Having the HPC farm only reading files that all fit in the server's cache makes "ls -l" over NFS yet slower. (I watched iostat while this was running to make sure that nothing was being written to or read from disk.)
>
> So I've eliminated the disk as a bottleneck, and (as per my earlier emails) I've eliminated the filesystem and VM system.
>
> It really does look at this point like nfsd is the choke point.
So I suppose the replies are likely waiting to be sent back over the tcp
connection to the client, and waiting behind big read replies?
As a check I wonder if there's any cheesy way we could prioritize
threads waiting to send back getattr requests over threads waiting to
send back reads.
Such threads are waiting on the xpt_mutex. Is there some mutex or
semaphore variant that gives us control over the order things are woken
up in, or would we need our own?
--b.
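To make the "control over the order things are woken up in" idea concrete, here is a userspace sketch in Python, not kernel code and not a proposed patch. PriorityLock, the GETATTR/READ priority constants, and simulate_replies are all hypothetical names. The lock keeps its waiters in a heap and only lets the best-ranked waiter proceed, so a small reply queued behind large ones goes out first.

```python
import heapq
import itertools
import threading
import time

GETATTR, READ = 0, 1  # hypothetical priorities: lower number is served first


class PriorityLock:
    """Mutex whose waiters proceed in priority order (FIFO within a level)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._held = False
        self._waiters = []                 # heap of (priority, ticket)
        self._tickets = itertools.count()  # tie-breaker preserves arrival order

    def acquire(self, priority):
        with self._cond:
            me = (priority, next(self._tickets))
            heapq.heappush(self._waiters, me)
            # Proceed only when the lock is free and we are the best waiter.
            while self._held or self._waiters[0] != me:
                self._cond.wait()
            heapq.heappop(self._waiters)
            self._held = True

    def release(self):
        with self._cond:
            self._held = False
            self._cond.notify_all()  # every waiter re-checks; only the best wins


def simulate_replies():
    """Queue two READ replies and one GETATTR reply behind a held lock."""
    lock = PriorityLock()
    order = []

    def send_reply(name, priority):
        lock.acquire(priority)
        order.append(name)  # stands in for writing the reply to the socket
        lock.release()

    lock.acquire(READ)  # pretend a large READ reply is currently being sent
    threads = []
    for name, priority in [("read-1", READ), ("read-2", READ), ("getattr", GETATTR)]:
        t = threading.Thread(target=send_reply, args=(name, priority))
        t.start()
        threads.append(t)
        while len(lock._waiters) < len(threads):  # wait until it is queued
            time.sleep(0.001)
    lock.release()
    for t in threads:
        t.join()
    return order
```

simulate_replies() returns ["getattr", "read-1", "read-2"] even though the getattr reply was queued last. Strict priority like this can starve the low-priority waiters under a steady stream of high-priority ones, so a real implementation would probably need some aging or a cap on how often READs can be bypassed.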