Message-ID: <13701.83906.qm@web65411.mail.ac4.yahoo.com>
Date: Mon, 4 Apr 2011 09:45:07 -0700 (PDT)
From: Andrew Klaassen <clawsoon@yahoo.com>
Subject: Re: Prioritizing readdirplus/getattr/lookup
To: Steven Procter <steven@void.org>
Cc: linux-nfs@vger.kernel.org
In-Reply-To: <4694.1301934124@risen.void.org>
Content-Type: text/plain; charset=iso-8859-1
Sender: linux-nfs-owner@vger.kernel.org
MIME-Version: 1.0

Hi Steven,

Packet sniffing is exactly what I did (to the limit of my current abilities, anyway); it's what led to my questions.  If you read further down, you'll see the packet sniffer results I got that led me to ask about having the server nfsd processes prioritize getattr/readdirplus/lookup queries.

In brief: Under the same load, on similar hardware, with a similar number of disks, our NetApp pumps back getattr/readdirplus/lookup replies at a rate of thousands per second, compared to the tens per second (or fewer, plus periodic long delays) averaged by our Linux server under that load.
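
For anyone who wants to check numbers like these against their own capture, here is a rough, purely illustrative sketch of counting replies per second and spotting long silences in tshark's text output.  It assumes summary lines laid out like the trace quoted further down (time, source, "->", destination, "NFS V3", procedure, Call/Reply); the function names and the 5-second gap threshold are made up for the example and are not part of tshark or any other tool.

```python
import re
from collections import Counter

# Assumed line layout, e.g.:
#   28.593943 192.168.10.5 -> 192.168.10.158 NFS V3 READDIRPLUS Reply (Call In 358) ...
LINE_RE = re.compile(
    r"^\s*(?P<time>\d+\.\d+)\s+\S+\s+->\s+\S+\s+NFS\s+V3\s+"
    r"(?P<proc>[A-Z]+)\s+(?P<kind>Call|Reply)"
)


def parse(lines):
    """Yield (time, procedure, kind) for each line that looks like an NFSv3 packet."""
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            yield float(m.group("time")), m.group("proc"), m.group("kind")


def replies_per_second(packets, procs=("GETATTR", "READDIRPLUS", "LOOKUP")):
    """Count replies for the given procedures in each whole second of the trace."""
    counts = Counter()
    for t, proc, kind in packets:
        if kind == "Reply" and proc in procs:
            counts[int(t)] += 1
    return counts


def gaps(packets, threshold=5.0):
    """Return (start, end) time pairs with no NFS packet for more than `threshold` seconds."""
    times = sorted(t for t, _, _ in packets)
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > threshold]
```

Typical use would be `packets = list(parse(open("trace.txt")))` (the file name is hypothetical), then passing that list to both functions; the list() matters because parse() is a generator and can only be consumed once.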

The Linux filesystem and VM system don't seem to be the problem, because the same load locally doesn't cause the "ls -l" problem, and the NFS load doesn't cause the problem for local "ls -l" runs.
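
To make the local-versus-NFS comparison repeatable, something like the following could stand in for "ls -l".  It is only a sketch under my own assumptions: it does one directory read followed by an lstat() on every entry, which is roughly what ls -l does, and times the whole pass.  The name time_listing is invented for the example.

```python
import os
import sys
import time


def time_listing(path):
    """Return (entry_count, seconds) for one directory read plus an lstat() per entry."""
    start = time.monotonic()
    names = os.listdir(path)  # directory read; READDIR or READDIRPLUS on an NFS mount
    for name in names:
        # Attribute fetch per entry; on NFS this may turn into GETATTR/LOOKUP calls
        # unless the client can answer from its attribute cache.
        os.lstat(os.path.join(path, name))
    return len(names), time.monotonic() - start


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    count, elapsed = time_listing(target)
    print(f"{count} entries in {elapsed:.3f} s")
```

Running it once against the directory on the server's local disk and once against the same directory through an NFS mount, under each kind of load, gives the four cases in the table quoted below.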

I guess I could start trying to trace nfsd processes to find the contention, but I really don't feel qualified for that; I was hoping someone familiar with the code could say, "Sure, that's an easy fix," or, "Not gonna happen, because everything would have to be re-written to get the kernel to prioritize nfsd threads based on what they're doing."

I can gladly provide more details about my test setup and/or packet traces.

Thanks.

Andrew


--- On Mon, 4/4/11, Steven Procter <steven@risen.void.org> wrote:

> I'd recommend using a packet sniffer to see what is going on at the
> protocol level when there are performance issues.  I've found that
> wireshark works well for this kind of investigation.
> 
> --Steven
> 
> > X-Mailer: YahooMailClassic/12.0.2 YahooMailWebService/0.8.109.295617
> > Date: Mon, 4 Apr 2011 06:31:09 -0700 (PDT)
> > From: Andrew Klaassen <clawsoon@yahoo.com>
> > Subject: Prioritizing readdirplus/getattr/lookup
> > To: linux-nfs@vger.kernel.org
> > Sender: linux-nfs-owner@vger.kernel.org
> > 
> > How difficult would it be to make nfsd give priority to the calls
> > generated by "ls -l" (i.e. readdirplus, getattr, lookup) over read
> > and write calls?  Is it a matter of tweaking a couple of sysctls or
> > changing a few lines of code, or would it mean a major re-write?
> > 
> > I'm working in an environment where it's important to have
> > reasonably good throughput for the HPC farm (50-200 machines
> > reading and writing 5-10MB files as fast as they can pump them
> > through), while simultaneously providing snappy responses to
> > "ls -l" and equivalents for people reviewing file sizes and times
> > and browsing through the filesystem and constructing new jobs and
> > whatnot.
> > 
> > I've tried a small handful of server OSes (Solaris, Exastore,
> > various Linux flavours and tunings and nfsd counts) that do great
> > on the throughput side but horrible on the "ls -l" under load side
> > (as mentioned in my previous emails).
> > 
> > However, I know what I need is possible, because Netapp GX on very
> > similar hardware (similar processor, memory, and spindle count)
> > does slightly worse (20% or so) on total throughput but much better
> > (10-100 times better than Solaris/Linux/Exastore) on under-load
> > "ls -l" responsiveness.
> > 
> > In the Linux case, I think I've narrowed the problem down to nfsd
> > rather than the filesystem or VM system.  It's not the filesystem
> > or VM system, because when the server is under heavy local load
> > equivalent to my HPC farm load, both local and remote "ls -l"
> > commands are fast.  It's not that the NFS load overwhelms the
> > server, because when the server is under heavy HPC farm load, local
> > "ls -l" commands are still fast.
> > 
> > It's only when there's an NFS load and an NFS "ls -l" that the
> > "ls -l" is slow.  Like so:
> > 
> >                                  throughput     ls -l
> >                                  ==========     =====
> > Heavy local load, local ls -l    fast           fast
> > Heavy local load, NFS ls -l      fast           fast
> > Heavy NFS load, local ls -l      fast           fast
> > Heavy NFS load, NFS ls -l        fast           very slow
> > 
> > This suggests to me that it's nfsd that's slowing down the ls -l
> > response times rather than the filesystem or VM system.
> > 
> > Would fixing the bottom-right-corner case - even if it meant a
> > modest throughput slowdown - be an easy tweak/patch?  Or major
> > re-write?  (Or just a kernel upgrade?)
> > 
> > I know it's doable because the Netapp does it; the question is how
> > large a job would it be on Linux.
> > 
> > Thanks again.
> > 
> > 
> > FWIW, here's what I've tried so far, without success, to make this
> > problem go away:
> > 
> > Server side:
> > 
> > kernels (all x86_64): 2.6.32-[something] on Scientific Linux 6.0,
> >                       2.6.32.4 on Slackware, 2.6.37.5 on Slackware
> > filesystems: xfs, ext4
> > nfsd counts: 8,32,64,127,128,256,1024
> > schedulers: cfq,deadline
> > export options: async,no_root_squash
> > 
> > Client side:
> > kernel: 2.6.31.14-0.6-desktop, x86_64, from openSUSE 11.3
> > hard,intr,noatime,vers=3,mountvers=3 # always on
> > rsize,wsize:  32768,65536
> > proto:        tcp,udp
> > nolock        # on or off
> > noac          # on or off
> > actimeo:      3,5,60,240,600  # I had really hoped this would help
> > 
> > Andrew
> > 
> > 
> > --- On Thu, 3/31/11, Andrew Klaassen <clawsoon@yahoo.com> wrote:
> > 
> > > Setting actimeo=600 gave me part of the behaviour I expected; on
> > > the first directory listing, the calls were all readdirplus and
> > > no getattr.
> > > 
> > > However, there were now long stretches where nothing was
> > > happening.  During a single directory listing to a loaded server,
> > > there'd be:
> > > 
> > >  ~10 seconds of readdirplus calls and replies, followed by
> > >  ~70 seconds of nothing, followed by
> > >  ~10 seconds of readdirplus calls and replies, followed by
> > >  ~100 seconds of nothing, followed by
> > >  ~10 seconds of readdirplus calls and replies, followed by
> > >  ~110 seconds of nothing, followed by
> > >  ~2 seconds of readdirplus calls and replies
> > > 
> > > Why the long stretches of nothing?  If I'm reading my tshark
> > > output properly, it doesn't seem like the client was waiting for
> > > a server response.  Here are a couple of lines before and after a
> > > long stretch of nothing:
> > > 
> > >  28.575537 192.168.10.158 -> 192.168.10.5 NFS V3 READDIRPLUS Call, FH:0xa216e302
> > >  28.593943 192.168.10.5 -> 192.168.10.158 NFS V3 READDIRPLUS Reply (Call In 358)
> > >     random_1168.exr random_2159.exr random_2188.exr random_0969.exr
> > >     random_1662.exr random_0022.exr random_0785.exr random_2316.exr
> > >     random_0831.exr random_0443.exr random_1203.exr random_1907.exr
> > >  28.594006 192.168.10.158 -> 192.168.10.5 NFS V3 READDIRPLUS Call, FH:0xa216e302
> > >  28.623736 192.168.10.5 -> 192.168.10.158 NFS V3 READDIRPLUS Reply (Call In 362)
> > >     random_1575.exr random_0492.exr random_0335.exr random_2460.exr
> > >     random_0754.exr random_1114.exr random_2001.exr random_2298.exr
> > >     random_1858.exr random_1889.exr random_2249.exr random_0782.exr
> > > 103.811801 192.168.10.158 -> 192.168.10.5 NFS V3 READDIRPLUS Call, FH:0xa216e302
> > > 103.883930 192.168.10.5 -> 192.168.10.158 NFS V3 READDIRPLUS Reply (Call In 2311)
> > >     random_0025.exr random_1665.exr random_2311.exr random_1204.exr
> > >     random_0444.exr random_0836.exr random_0332.exr random_0495.exr
> > >     random_1572.exr random_1900.exr random_2467.exr random_1113.exr
> > > 103.884014 192.168.10.158 -> 192.168.10.5 NFS V3 READDIRPLUS Call, FH:0xa216e302
> > > 103.965167 192.168.10.5 -> 192.168.10.158 NFS V3 READDIRPLUS Reply (Call In 2316)
> > >     random_0753.exr random_2006.exr random_0216.exr random_1824.exr
> > >     random_1456.exr random_1790.exr random_1037.exr random_0677.exr
> > >     random_2122.exr random_0101.exr random_1741.exr random_2235.exr
> > > 
> > > Calls are sent and replies received at the 28 second mark, and
> > > then... nothing... until the 103 second mark.  I'm sure the
> > > server must be somehow telling the client that it's busy, but -
> > > at least with the tools I'm looking at - I don't see how.  Is
> > > tshark just hiding TCP delays and retransmits from me?
> > > 
> > > Thanks again.
> > > 
> > > Andrew
> > > 
> > > 
> > > --- On Thu, 3/31/11, Andrew Klaassen <clawsoon@yahoo.com> wrote:
> > > 
> > > > Interesting.  So the reason it's switching back and forth
> > > > between readdirplus and getattr during the same ls command is
> > > > because the command is taking so long to run that the cache is
> > > > periodically expiring as the command is running?
> > > > 
> > > > I'll do some playing with actimeo to see if I'm actually
> > > > understanding this.
> > > > 
> > > > Thanks!
> > > > 
> > > > Andrew
> > > > 
> > > > 
> > > > --- On Thu, 3/31/11, Steven Procter <steven@risen.void.org> wrote:
> > > > 
> > > > > This is due to client caching.  When the second ls -l runs
> > > > > the cache contains an entry for the directory.  The client
> > > > > can check if the cached directory data is still valid by
> > > > > issuing a GETATTR on the directory.
> > > > > 
> > > > > But this only validates the names, not the attributes, which
> > > > > are not actually part of the directory.  Those must be
> > > > > refetched.  So the client issues a GETATTR for each entry in
> > > > > the directory.  It issues them sequentially, probably as ls
> > > > > calls readdir() and then stat() sequentially on the directory
> > > > > entries.
> > > > > 
> > > > > This takes so long that the cache entry times out and the
> > > > > next time you run ls -l the client reloads the directory
> > > > > using READDIRPLUS.
> > > > > 
> > > > > --Steven
> > > > > 
> > > > > > X-Mailer: YahooMailClassic/12.0.2 YahooMailWebService/0.8.109.295617
> > > > > > Date: Thu, 31 Mar 2011 15:24:15 -0700 (PDT)
> > > > > > From: Andrew Klaassen <clawsoon@yahoo.com>
> > > > > > Subject: readdirplus/getattr
> > > > > > To: linux-nfs@vger.kernel.org
> > > > > > Sender: linux-nfs-owner@vger.kernel.org
> > > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > I've been trying to get my Linux NFS clients to be a little
> > > > > > snappier about listing large directories from
> > > > > > heavily-loaded servers.  I found the following fascinating
> > > > > > behaviour (this is with 2.6.31.14-0.6-desktop, x86_64, from
> > > > > > openSUSE 11.3, Solaris Express 11 NFS server):
> > > > > > 
> > > > > > With "ls -l --color=none" on a directory with 2500 files:
> > > > > > 
> > > > > >              |     rdirplus    |    nordirplus   |
> > > > > >              |1st  |2nd  |1st  |1st  |2nd  |1st  |
> > > > > >              |run  |run  |run  |run  |run  |run  |
> > > > > >              |light|light|heavy|light|light|heavy|
> > > > > >              |load |load |load |load |load |load |
> > > > > > --------------------------------------------------
> > > > > > readdir      |   0 |   0 |   0 |  25 |   0 |  25 |
> > > > > > readdirplus  | 209 |   0 | 276 |   0 |   0 |   0 |
> > > > > > lookup       |  16 |   0 |  10 |2316 |   0 |2473 |
> > > > > > getattr      |   1 |2501 |2452 |   1 |2465 |   1 |
> > > > > > 
> > > > > > The most interesting case is with rdirplus specified as a
> > > > > > mount option to a heavily loaded server.  The NFS client
> > > > > > keeps switching back and forth between readdirplus and
> > > > > > getattr:
> > > > > > 
> > > > > >  ~10 seconds doing ~70 readdirplus calls, followed by
> > > > > >  ~150 seconds doing ~800 getattr calls, followed by
> > > > > >  ~12 seconds doing ~70 readdirplus calls, followed by
> > > > > >  ~200 seconds doing ~800 getattr calls, followed by
> > > > > >  ~20 seconds doing ~130 readdirplus calls, followed by
> > > > > >  ~220 seconds doing ~800 getattr calls
> > > > > > 
> > > > > > All the calls appear to get reasonably prompt replies
> > > > > > (never more than a second or so), which makes me wonder why
> > > > > > it keeps switching back and forth between the strategies.
> > > > > > (Especially since I've specified rdirplus as a mount
> > > > > > option.)
> > > > > > 
> > > > > > Is it supposed to do that?
> > > > > > 
> > > > > > I'd really like to see how it does with readdirplus ~only~,
> > > > > > no getattr calls, since it's spending only 40 seconds in
> > > > > > total on readdirplus calls compared to 570 seconds in total
> > > > > > on (redundant, I think, based on the lightly-loaded case)
> > > > > > getattr calls.
> > > > > > 
> > > > > > It'd also be nice to be able to force readdirplus calls
> > > > > > instead of getattr calls for second and subsequent listings
> > > > > > of a directory.
> > > > > > 
> > > > > > I saw a recent thread talking about readdirplus changes in
> > > > > > 2.6.37, so I'll give that a try when I get a chance to see
> > > > > > how it behaves.
> > > > > > 
> > > > > > Andrew
> > > > > > 
> > > > > > 
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html