2011-04-04 13:31:14

by Andrew Klaassen

Subject: Prioritizing readdirplus/getattr/lookup

How difficult would it be to make nfsd give priority to the calls generated by "ls -l" (i.e. readdirplus, getattr, lookup) over read and write calls? Is it a matter of tweaking a couple of sysctls or changing a few lines of code, or would it mean a major re-write?
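
For context, "ls -l" on the client boils down to roughly the loop below (a simplified sketch, not the actual coreutils code). Over NFS the readdir pass maps to READDIRPLUS (or READDIR plus per-entry LOOKUPs), and every lstat() that isn't answered from the client's attribute cache becomes a GETATTR:

  #include <dirent.h>
  #include <stdio.h>
  #include <sys/stat.h>

  /* Sketch of the "ls -l" pattern: one readdir pass over the directory,
   * then one lstat() per entry. On an NFS mount the readdir pass turns
   * into READDIRPLUS (or READDIR + LOOKUP), and each lstat() that misses
   * the attribute cache turns into a GETATTR. */
  int main(int argc, char **argv)
  {
      const char *dirpath = argc > 1 ? argv[1] : ".";
      DIR *dir = opendir(dirpath);
      struct dirent *de;

      if (!dir)
          return 1;
      while ((de = readdir(dir)) != NULL) {
          char path[4096];
          struct stat st;

          snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
          if (lstat(path, &st) == 0)          /* GETATTR on a cache miss */
              printf("%10lld  %s\n", (long long)st.st_size, de->d_name);
      }
      closedir(dir);
      return 0;
  }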

I'm working in an environment where it's important to have reasonably good throughput for the HPC farm (50-200 machines reading and writing 5-10MB files as fast as they can pump them through), while simultaneously providing snappy responses to "ls -l" and equivalents for people reviewing file sizes and times and browsing through the filesystem and constructing new jobs and whatnot.

I've tried a small handful of server OSes (Solaris, Exastore, various Linux flavours and tunings and nfsd counts) that do great on the throughput side but horrible on the "ls -l" under load side (as mentioned in my previous emails).

However, I know what I need is possible, because Netapp GX on very similar hardware (similar processor, memory, and spindle count) does slightly worse (20% or so) on total throughput but much better (10-100 times better than Solaris/Linux/Exastore) on under-load "ls -l" responsiveness.

In the Linux case, I think I've narrowed the problem down to nfsd rather than the filesystem or VM system. It's not the filesystem or VM system, because when the server is under heavy local load equivalent to my HPC farm load, both local and remote "ls -l" commands are fast. It's not that the NFS load overwhelms the server, because when the server is under heavy HPC farm load, local "ls -l" commands are still fast.

It's only when there's an NFS load and an NFS "ls -l" that the "ls -l" is slow. Like so:

                                   throughput    ls -l
                                   ==========    =====
Heavy local load, local ls -l      fast          fast
Heavy local load, NFS ls -l        fast          fast
Heavy NFS load, local ls -l        fast          fast
Heavy NFS load, NFS ls -l          fast          very slow

This suggests to me that it's nfsd that's slowing down the ls -l response times rather than the filesystem or VM system.

Would fixing the bottom-right-corner case - even if it meant a modest throughput slowdown - be an easy tweak/patch? Or a major re-write? (Or just a kernel upgrade?)

I know it's doable because the Netapp does it; the question is how large a job would it be on Linux.

Thanks again.


FWIW, here's what I've tried so far, without success, to make this problem go away:

Server side:

kernels (all x86_64): 2.6.32-[something] on Scientific Linux 6.0, 2.6.32.4 on Slackware, 2.6.37.5 on Slackware
filesystems: xfs, ext4
nfsd counts: 8,32,64,127,128,256,1024
schedulers: cfq,deadline
export options: async,no_root_squash

Client side:
kernel: 2.6.31.14-0.6-desktop, x86_64, from openSUSE 11.3
hard,intr,noatime,vers=3,mountvers=3 # always on
rsize,wsize:  32768,65536
proto:        tcp,udp
nolock        # on or off
noac          # on or off
actimeo:      3,5,60,240,600  # I had really hoped this would help

Andrew


--- On Thu, 3/31/11, Andrew Klaassen <[email protected]> wrote:

> Setting actimeo=600 gave me part of
> the behaviour I expected; on the first directory listing,
> the calls were all readdirplus and no getattr.
>
> However, there were now long stretches where nothing was
> happening. During a single directory listing to a
> loaded server, there'd be:
>
>  ~10 seconds of readdirplus calls and replies, followed by
>  ~70 seconds of nothing, followed by
>  ~10 seconds of readdirplus calls and replies, followed by
>  ~100 seconds of nothing, followed by
>  ~10 seconds of readdirplus calls and replies, followed by
>  ~110 seconds of nothing, followed by
>  ~2 seconds of readdirplus calls and replies
>
> Why the long stretches of nothing? If I'm reading my
> tshark output properly, it doesn't seem like the client was
> waiting for a server response. Here are a couple of
> lines before and after a long stretch of nothing:
>
>  28.575537 192.168.10.158 -> 192.168.10.5 NFS V3 READDIRPLUS Call, FH:0xa216e302
>  28.593943 192.168.10.5 -> 192.168.10.158 NFS V3 READDIRPLUS Reply (Call In 358) random_1168.exr random_2159.exr random_2188.exr random_0969.exr random_1662.exr random_0022.exr random_0785.exr random_2316.exr random_0831.exr random_0443.exr random_1203.exr random_1907.exr
>  28.594006 192.168.10.158 -> 192.168.10.5 NFS V3 READDIRPLUS Call, FH:0xa216e302
>  28.623736 192.168.10.5 -> 192.168.10.158 NFS V3 READDIRPLUS Reply (Call In 362) random_1575.exr random_0492.exr random_0335.exr random_2460.exr random_0754.exr random_1114.exr random_2001.exr random_2298.exr random_1858.exr random_1889.exr random_2249.exr random_0782.exr
> 103.811801 192.168.10.158 -> 192.168.10.5 NFS V3 READDIRPLUS Call, FH:0xa216e302
> 103.883930 192.168.10.5 -> 192.168.10.158 NFS V3 READDIRPLUS Reply (Call In 2311) random_0025.exr random_1665.exr random_2311.exr random_1204.exr random_0444.exr random_0836.exr random_0332.exr random_0495.exr random_1572.exr random_1900.exr random_2467.exr random_1113.exr
> 103.884014 192.168.10.158 -> 192.168.10.5 NFS V3 READDIRPLUS Call, FH:0xa216e302
> 103.965167 192.168.10.5 -> 192.168.10.158 NFS V3 READDIRPLUS Reply (Call In 2316) random_0753.exr random_2006.exr random_0216.exr random_1824.exr random_1456.exr random_1790.exr random_1037.exr random_0677.exr random_2122.exr random_0101.exr random_1741.exr random_2235.exr
>
> Calls are sent and replies received at the 28 second mark,
> and then... nothing... until the 103 second mark. I'm
> sure the server must be somehow telling the client that it's
> busy, but - at least with the tools I'm looking at - I don't
> see how. Is tshark just hiding TCP delays and
> retransmits from me?
>
> Thanks again.
>
> Andrew
>
>
> --- On Thu, 3/31/11, Andrew Klaassen <[email protected]>
> wrote:
>
> > Interesting. So the reason it's
> > switching back and forth between readdirplus and getattr
> > during the same ls command is because the command is taking
> > so long to run that the cache is periodically expiring as
> > the command is running?
> >
> > I'll do some playing with actimeo to see if I'm actually
> > understanding this.
> >
> > Thanks!
> >
> > Andrew
> >
> >
> > --- On Thu, 3/31/11, Steven Procter <[email protected]>
> > wrote:
> >
> > > This is due to client caching. When the second ls -l runs, the cache
> > > contains an entry for the directory. The client can check if the cached
> > > directory data is still valid by issuing a GETATTR on the directory.
> > >
> > > But this only validates the names, not the attributes, which are not
> > > actually part of the directory. Those must be refetched. So the client
> > > issues a GETATTR for each entry in the directory. It issues them
> > > sequentially, probably as ls calls readdir() and then stat()
> > > sequentially on the directory entries.
> > >
> > > This takes so long that the cache entry times out and the next time you
> > > run ls -l the client reloads the directory using READDIRPLUS.
> > >
> > > --Steven
> > >
> > > > X-Mailer: YahooMailClassic/12.0.2 YahooMailWebService/0.8.109.295617
> > > > Date: Thu, 31 Mar 2011 15:24:15 -0700 (PDT)
> > > > From: Andrew Klaassen <[email protected]>
> > > > Subject: readdirplus/getattr
> > > > To: [email protected]
> > > > Sender: [email protected]
> > > >
> > > > Hi,
> > > >
> > > > I've been trying to get my Linux NFS clients to be a
> > > > little snappier about listing large directories from
> > > > heavily-loaded servers. I found the following
> > > > fascinating behaviour (this is with 2.6.31.14-0.6-desktop,
> > > > x86_64, from openSUSE 11.3, Solaris Express 11 NFS server):
> > > >
> > > > With "ls -l --color=none" on a directory with 2500
> > > > files:
> > > >
> > > >              |     rdirplus    |    nordirplus   |
> > > >              |1st  |2nd  |1st  |1st  |2nd  |1st  |
> > > >              |run  |run  |run  |run  |run  |run  |
> > > >              |light|light|heavy|light|light|heavy|
> > > >              |load |load |load |load |load |load |
> > > > --------------------------------------------------
> > > > readdir      |    0|    0|    0|   25|    0|   25|
> > > > readdirplus  |  209|    0|  276|    0|    0|    0|
> > > > lookup       |   16|    0|   10| 2316|    0| 2473|
> > > > getattr      |    1| 2501| 2452|    1| 2465|    1|
> > > >
> > > > The most interesting case is with rdirplus specified
> > > > as a mount option to a heavily loaded server. The NFS
> > > > client keeps switching back and forth between readdirplus
> > > > and getattr:
> > > >
> > > >  ~10 seconds doing ~70 readdirplus calls, followed by
> > > >  ~150 seconds doing ~800 getattr calls, followed by
> > > >  ~12 seconds doing ~70 readdirplus calls, followed by
> > > >  ~200 seconds doing ~800 getattr calls, followed by
> > > >  ~20 seconds doing ~130 readdirplus calls, followed by
> > > >  ~220 seconds doing ~800 getattr calls
> > > >
> > > > All the calls appear to get reasonably prompt replies
> > > > (never more than a second or so), which makes me wonder why
> > > > it keeps switching back and forth between the
> > > > strategies. (Especially since I've specified rdirplus
> > > > as a mount option.)
> > > >
> > > > Is it supposed to do that?
> > > >
> > > > I'd really like to see how it does with readdirplus
> > > > ~only~, no getattr calls, since it's spending only 40
> > > > seconds in total on readdirplus calls compared to 570
> > > > seconds in total on (redundant, I think, based on the
> > > > lightly-loaded case) getattr calls.
> > > >
> > > > It'd also be nice to be able to force readdirplus
> > > > calls instead of getattr calls for second and subsequent
> > > > listings of a directory.
> > > >
> > > > I saw a recent thread talking about readdirplus
> > > > changes in 2.6.37, so I'll give that a try when I get a
> > > > chance to see how it behaves.
> > > >
> > > > Andrew





2011-04-05 17:14:49

by Garth Gibson

Subject: Re: Prioritizing readdirplus/getattr/lookup

The OpenGroup HECE proposals for extending the application/filesystem interface did not have a team of implementers behind them. At the time some of the parallel file system vendors that added modules to the kernel were willing to work toward supporting these interfaces, but not a broader community.

I encourage the pNFS community to consider the use cases that led to those proposals.

One example is lazy attributes. Folks running large parallel jobs have a nasty habit of monitoring the progress of the job by running on their desktop a looping script doing ls -l on output files. What is the length of a file that is open and being written to by other nodes? Much of the time you want to be able to ask for a recently accurate value of attributes without recalling layouts, but perhaps some of the time you would like layouts to be recalled, or at least committed.
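
Concretely, the monitoring habit is a polling loop like this (a trivial sketch; it watches one output file by path, where the real scripts just loop "ls -l" over the output directory):

  #include <stdio.h>
  #include <unistd.h>
  #include <sys/stat.h>

  /* Trivial sketch of the "watch the output file grow" pattern: poll
   * st_size once a second. Over NFS each stat() is either answered from
   * the client's attribute cache or becomes a GETATTR; the question
   * raised above is how fresh that size can be while other nodes still
   * hold the file open and are writing to it. */
  int main(int argc, char **argv)
  {
      struct stat st;

      if (argc != 2)
          return 1;
      for (;;) {
          if (stat(argv[1], &st) == 0)
              printf("%s: %lld bytes\n", argv[1], (long long)st.st_size);
          sleep(1);
      }
  }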

garth

On Apr 5, 2011, at 9:30 AM, Benny Halevy wrote:
> On 2011-04-04 17:39, Jim Rees wrote:
>> Benny Halevy wrote:
>>
>> Like this? :)
>> http://www.opengroup.org/platform/hecewg/uploads/40/10903/posix_io_readdir+.pdf
>>
>> Yes, exactly, but that never went anywhere, did it?
>
> Correct. It did not go anywhere AFAIK.
>
> Benny


2011-04-07 14:34:25

by Andrew Klaassen

Subject: Re: Prioritizing readdirplus/getattr/lookup

--- On Thu, 4/7/11, Chuck Lever <[email protected]> wrote:

> On Apr 6, 2011, at 3:25 PM, Andrew Klaassen wrote:
>
> > I have, for now, hacked my way around my problem by
> > splitting the server's uplink into two separate bonds.
> > This seems to have moderated the transaction rate of the HPC
> > farm and allowed the "ls -l" calls to be served at an
> > acceptable rate.
>
> Have you created enough nfsd threads on your Linux NFS
> servers?

That was one of the things I varied as part of my testing; almost every power of 2 from 8 threads up to 1024 threads. It actually made things slightly worse, not better, but I can certainly give it a try again.

Andrew



2011-04-05 00:26:52

by Benny Halevy

Subject: Re: Prioritizing readdirplus/getattr/lookup

On 2011-04-04 16:05, Jim Rees wrote:
> Andrew Klaassen wrote:
>
> I've confirmed my earlier results using 2.6.37.5 on the server, though now
> the results are closer to 10 times worse than the Netapp on similar
> hardware rather than 100 times worse.
>
> A big improvement, but I'd still be interested to know if the server is
> capable of the getattr/readdirplus/lookup versus read/write tradeoffs that
> I'm looking for to bring "ls -l" speeds under load down to levels that
> won't make my users yell at me.
>
> What would be nice is if we could add a new system call that would do
> roughly what the nfs4 readdir rpc does. It would take a mask of requested
> file attributes, then return those attributes along with the directory
> entries. This would help "ls -l" but would also help all those chatty new
> gnome/kde user interface things that love to stat every file in every
> directory.

Like this? :)
http://www.opengroup.org/platform/hecewg/uploads/40/10903/posix_io_readdir+.pdf

See this for discussion:
http://www.spinics.net/lists/linux-fsdevel/msg03276.html

And this for other ideas ;-)
http://www.opengroup.org/platform/hecewg/

Benny

2011-04-05 19:48:09

by Myklebust, Trond

Subject: Re: Prioritizing readdirplus/getattr/lookup

On Tue, 2011-04-05 at 12:11 -0700, Benny Halevy wrote:
> On 2011-04-05 10:14, Garth Gibson wrote:
> > The OpenGroup HECE proposals for extending the application/filesystem interface did not have a team of implementers behind them. At the time some of the parallel file system vendors that added modules to the kernel were willing to work toward supporting these interfaces, but not a broader community.
> >
> > I encourage the pNFS community to consider the use cases that led to those proposals.
> >
> > One example is lazy attributes. Folks running large parallel jobs have a nasty habit of monitoring the progress of the job by running on their desktop a looping script doing ls -l on output files. What is the length of a file that is open and being written to by other nodes? Much of the time you want to be able to ask for a recently accurate value of attributes without recalling layouts, but perhaps some of the time you would like layouts to be recalled, or at least committed.
>
> Right now the pNFS server does not have to recall the layout on GETATTR
> so lazy would be the default behavior for most implementations. Even if
> a client holds a delegation the server could send a CB_GETATTR to it to
> get the latest attributes without recalling the layout. At any rate,
> the broader issue is that the posix system call API assumes a local file
>> system and is not network or cluster file-system aware.

Recalling the outstanding layouts in a directory on every 'ls -l' sounds
like the perfect recipe for poor performance. I can't see why any
servers would want to do this.

In any case, a layout recall does not trigger client writeback: layouts
do not define a caching protocol.

Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-04-05 19:52:43

by Benny Halevy

Subject: Re: Prioritizing readdirplus/getattr/lookup

On 2011-04-05 12:48, Trond Myklebust wrote:
> On Tue, 2011-04-05 at 12:11 -0700, Benny Halevy wrote:
>> On 2011-04-05 10:14, Garth Gibson wrote:
>>> The OpenGroup HECE proposals for extending the application/filesystem interface did not have a team of implementers behind them. At the time some of the parallel file system vendors that added modules to the kernel were willing to work toward supporting these interfaces, but not a broader community.
>>>
>>> I encourage the pNFS community to consider the use cases that led to those proposals.
>>>
>>> One example is lazy attributes. Folks running large parallel jobs have a nasty habit of monitoring the progress of the job by running on their desktop a looping script doing ls -l on output files. What is the length of a file that is open and being written to by other nodes? Much of the time you want to be able to ask for a recently accurate value of attributes without recalling layouts, but perhaps some of the time you would like layouts to be recalled, or at least committed.
>>
>> Right now the pNFS server does not have to recall the layout on GETATTR
>> so lazy would be the default behavior for most implementations. Even if
>> a client holds a delegation the server could send a CB_GETATTR to it to
>> get the latest attributes without recalling the layout. At any rate,
>> the broader issue is that the posix system call API assumes a local file
>> system and is not network or cluster file-system aware.
>
> Recalling the outstanding layouts in a directory on every 'ls -l' sounds
> like the perfect recipe for poor performance. I can't see why any
> servers would want to do this.
>
> In any case, a layout recall does not trigger client writeback: layouts
> do not define a caching protocol.

Yet for already written (DATA_SYNC) data, a layout recall should trigger
a LAYOUTCOMMIT and that will update the visible attrs.

Benny

>
> Trond


2011-04-06 19:25:24

by Andrew Klaassen

Subject: Re: Prioritizing readdirplus/getattr/lookup

I have, for now, hacked my way around my problem by splitting the server's uplink into two separate bonds. This seems to have moderated the transaction rate of the HPC farm and allowed the "ls -l" calls to be served at an acceptable rate.

If someone smart thinks that giving the problem a real solution would be worthwhile, I'd be glad to offer more details about my test setup.

Thanks again for the various suggestions offered.

Andrew


--- On Tue, 4/5/11, Andrew Klaassen <[email protected]> wrote:

> From: Andrew Klaassen <[email protected]>
> Subject: Re: Prioritizing readdirplus/getattr/lookup
> To: "Benny Halevy" <[email protected]>, "Jim Rees" <[email protected]>
> Cc: "Garth Gibson" <[email protected]>, [email protected]
> Received: Tuesday, April 5, 2011, 5:06 PM
> --- On Tue, 4/5/11, Jim Rees <[email protected]>
> wrote:
>
> > It would be interesting to see someone who is affected by
> > this implement the readdirplus system call, modify ls to
> > use it, and report the results. I suspect the system call
> > won't go anywhere in linux without some evidence
> > that it solves a real problem for real users.
>
> I can't speak to a readdirplus system call; my own
> (admittedly naive) testing suggests that the system calls
> are JustFineThankYou, since local "ls -l" with NFS loading
> (and vice versa) performs very well.
>
> What I can say for certain is that whatever NetApp GX is
> doing to respond to getattr and lookup calls completely
> wipes the floor with whatever the Linux nfsd is doing...
>
> ...*even when Linux is serving out a completely cached,
> completely read-only workload*.
>
> I wish I had the know-how and intelligence to figure out
> exactly why.
>
> Andrew
>
>

2011-04-05 19:57:48

by Myklebust, Trond

Subject: Re: Prioritizing readdirplus/getattr/lookup

On Tue, 2011-04-05 at 12:52 -0700, Benny Halevy wrote:
> On 2011-04-05 12:48, Trond Myklebust wrote:
> > On Tue, 2011-04-05 at 12:11 -0700, Benny Halevy wrote:
> >> On 2011-04-05 10:14, Garth Gibson wrote:
> >>> The OpenGroup HECE proposals for extending the application/filesystem interface did not have a team of implementers behind them. At the time some of the parallel file system vendors that added modules to the kernel were willing to work toward supporting these interfaces, but not a broader community.
> >>>
> >>> I encourage the pNFS community to consider the use cases that led to those proposals.
> >>>
> >>> One example is lazy attributes. Folks running large parallel jobs have a nasty habit of monitoring the progress of the job by running on their desktop a looping script doing ls -l on output files. What is the length of a file that is open and being written to by other nodes? Much of the time you want to be able to ask for a recently accurate value of attributes without recalling layouts, but perhaps some of the time you would like layouts to be recalled, or at least committed.
> >>
> >> Right now the pNFS server does not have to recall the layout on GETATTR
> >> so lazy would be the default behavior for most implementations. Even if
> >> a client holds a delegation the server could send a CB_GETATTR to it to
> >> get the latest attributes without recalling the layout. At any rate,
> >> the broader issue is that the posix system call API assumes a local file
> >> system and is not network or cluster file-system aware.
> >
> > Recalling the outstanding layouts in a directory on every 'ls -l' sounds
> > like the perfect recipe for poor performance. I can't see why any
> > servers would want to do this.
> >
> > In any case, a layout recall does not trigger client writeback: layouts
> > do not define a caching protocol.
>
> Yet for already written (DATA_SYNC) data, a layout recall should trigger
> a LAYOUTCOMMIT and that will update the visible attrs.

Sure, but if the client still has the file open and has not flushed its
writes, then you are proposing a very expensive way to update attributes
that may bear no relevance to the true state of the file.

pNFS will suck _badly_ if you start recalling layouts willy nilly.
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-04-07 14:01:13

by Chuck Lever III

Subject: Re: Prioritizing readdirplus/getattr/lookup


On Apr 6, 2011, at 3:25 PM, Andrew Klaassen wrote:

> I have, for now, hacked my way around my problem by splitting the server's uplink into two separate bonds. This seems to have moderated the transaction rate of the HPC farm and allowed the "ls -l" calls to be served at an acceptable rate.

Have you created enough nfsd threads on your Linux NFS servers?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2011-04-05 20:57:52

by Benny Halevy

Subject: Re: Prioritizing readdirplus/getattr/lookup

On 2011-04-05 12:57, Trond Myklebust wrote:
> On Tue, 2011-04-05 at 12:52 -0700, Benny Halevy wrote:
>> On 2011-04-05 12:48, Trond Myklebust wrote:
>>> On Tue, 2011-04-05 at 12:11 -0700, Benny Halevy wrote:
>>>> On 2011-04-05 10:14, Garth Gibson wrote:
>>>>> The OpenGroup HECE proposals for extending the application/filesystem interface did not have a team of implementers behind them. At the time some of the parallel file system vendors that added modules to the kernel were willing to work toward supporting these interfaces, but not a broader community.
>>>>>
>>>>> I encourage the pNFS community to consider the use cases that led to those proposals.
>>>>>
>>>>> One example is lazy attributes. Folks running large parallel jobs have a nasty habit of monitoring the progress of the job by running on their desktop a looping script doing ls -l on output files. What is the length of a file that is open and being written to by other nodes? Much of the time you want to be able to ask for a recently accurate value of attributes without recalling layouts, but perhaps some of the time you would like layouts to be recalled, or at least committed.
>>>>
>>>> Right now the pNFS server does not have to recall the layout on GETATTR
>>>> so lazy would be the default behavior for most implementations. Even if
>>>> a client holds a delegation the server could send a CB_GETATTR to it to
>>>> get the latest attributes without recalling the layout. At any rate,
>>>> the broader issue is that the posix system call API assumes a local file
>>>> system and is not network or cluster file-system aware.
>>>
>>> Recalling the outstanding layouts in a directory on every 'ls -l' sounds
>>> like the perfect recipe for poor performance. I can't see why any
>>> servers would want to do this.
>>>
>>> In any case, a layout recall does not trigger client writeback: layouts
>>> do not define a caching protocol.
>>
>> Yet for already written (DATA_SYNC) data, a layout recall should trigger
>> a LAYOUTCOMMIT and that will update the visible attrs.
>
> Sure, but if the client still has the file open and has not flushed its
> writes, then you are proposing a very expensive way to update attributes
> that may bear no relevance to the true state of the file.
>
> pNFS will suck _badly_ if you start recalling layouts willy nilly.

Agreed. Supporting CB_GETATTR seems to be a better choice.

Benny

2011-04-08 01:22:39

by Andrew Klaassen

Subject: RE: Prioritizing readdirplus/getattr/lookup

--- On Thu, 4/7/11, Murata, Dennis <[email protected]> wrote:

> Did you set each ksoftirqd to a
> specific cpu?
> Wayne

I didn't do that explicitly, but /proc/interrupts shows that each CPU is handling the interrupts for only one ethernet card, so I assume that must be default behaviour.

Andrew




2011-04-04 16:45:10

by Andrew Klaassen

Subject: Re: Prioritizing readdirplus/getattr/lookup

Hi Steven,

Packet sniffing is exactly what I did (to the limit of my current abilities, anyway); it's what led to my questions. If you read further down, you'll see the packet sniffer results I got that led me to ask about having the server nfsd processes prioritize getattr/readdirplus/lookup queries.

In brief: Under the same load, on similar hardware, with a similar number of disks, our NetApp pumps back getattr/readdirplus/lookup replies at a rate of thousands per second, compared to the tens per second (or less, plus periodic long delays) averaged by our Linux server under that load.

The Linux filesystem and VM system don't seem to be the problem, because the same load locally doesn't cause the "ls -l" problem, and the NFS load doesn't cause the problem for local "ls -l" runs.

I guess I could start trying to trace nfsd processes to find the contention, but I really don't feel qualified for that; I was hoping someone familiar with the code could say, "Sure, that's an easy fix," or, "Not gonna happen, because everything would have to be re-written to get the kernel to prioritize nfsd threads based on what they're doing."

I can gladly provide more details about my test setup and/or packet traces.

Thanks.

Andrew


--- On Mon, 4/4/11, Steven Procter <[email protected]> wrote:

> I'd recommend using a packet sniffer to see what is going on at the
> protocol level when there are performance issues. I've found that
> wireshark works well for this kind of investigation.
>
> --Steven
>

2011-04-07 21:49:49

by Andrew Klaassen

Subject: Re: Prioritizing readdirplus/getattr/lookup

--- On Thu, 4/7/11, Andrew Klaassen <[email protected]> wrote:

> I do notice that ksoftirqd is eating up 100% of a core when
> I'm loading the server heavily. I assume that's
> because I'm not using jumbo frames and the ethernet cards
> are spitting out interrupts as fast as they're able.

I just got myself edjumicated on smp_affinity, and now I'm able to achieve 99% CPU usage by 8 nfsd processes on 8 cores on a read-only, fully-cached workload, with ksoftirqd processes only using 1% CPU per core.

Unfortunately, this doesn't help the "ls -l" speed.

Andrew



2011-04-08 15:17:55

by Chuck Lever III

Subject: Re: Prioritizing readdirplus/getattr/lookup


On Apr 7, 2011, at 5:49 PM, Andrew Klaassen wrote:

> --- On Thu, 4/7/11, Andrew Klaassen <[email protected]> wrote:
>
>> I do notice that ksoftirqd is eating up 100% of a core when
>> I'm loading the server heavily. I assume that's
>> because I'm not using jumbo frames and the ethernet cards
>> are spitting out interrupts as fast as they're able.
>
> I just got myself edjumicated on smp_affinity, and now I'm able to achieve 99% CPU usage by 8 nfsd processes on 8 cores on a read-only, fully-cached workload, with ksoftirqd processes only using 1% CPU per core.
>
> Unfortunately, this doesn't help the "ls -l" speed.

Serving NFS files is generally not CPU intensive. The problem may be lock contention on the server, but that's about as far as my expertise goes.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2011-04-07 14:36:46

by Chuck Lever III

Subject: Re: Prioritizing readdirplus/getattr/lookup


On Apr 7, 2011, at 10:34 AM, Andrew Klaassen wrote:

> --- On Thu, 4/7/11, Chuck Lever <[email protected]> wrote:
>
>> On Apr 6, 2011, at 3:25 PM, Andrew Klaassen wrote:
>>
>>> I have, for now, hacked my way around my problem by
>>> splitting the server's uplink into two separate bonds.
>>> This seems to have moderated the transaction rate of the HPC
>>> farm and allowed the "ls -l" calls to be served at an
>>> acceptable rate.
>>
>> Have you created enough nfsd threads on your Linux NFS
>> servers?
>
> That was one of the things I varied as part of my testing; almost every power of 2 from 8 threads up to 1024 threads. It actually made things slightly worse, not better, but I can certainly give it a try again.

Is the server an SMP system? Do you see high CPU load during times of slow performance?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2011-04-05 19:11:13

by Benny Halevy

Subject: Re: Prioritizing readdirplus/getattr/lookup

On 2011-04-05 10:14, Garth Gibson wrote:
> The OpenGroup HECE proposals for extending the application/filesystem interface did not have a team of implementers behind them. At the time some of the parallel file system vendors that added modules to the kernel were willing to work toward supporting these interfaces, but not a broader community.
>
> I encourage the pNFS community to consider the use cases that led to those proposals.
>
> One example is lazy attributes. Folks running large parallel jobs have a nasty habit of monitoring the progress of the job by running on their desktop a looping script doing ls -l on output files. What is the length of a file that is open and being written to by other nodes? Much of the time you want to be able to ask for a recently accurate value of attributes without recalling layouts, but perhaps some of the time you would like layouts to be recalled, or at least committed.

Right now the pNFS server does not have to recall the layout on GETATTR
so lazy would be the default behavior for most implementations. Even if
a client holds a delegation the server could send a CB_GETATTR to it to
get the latest attributes without recalling the layout. At any rate,
the broader issue is that the posix system call API assumes a local file
system and is not network or cluster file-system aware.

Benny

> garth
>
> On Apr 5, 2011, at 9:30 AM, Benny Halevy wrote:
>> On 2011-04-04 17:39, Jim Rees wrote:
>>> Benny Halevy wrote:
>>>
>>> Like this? :)
>>> http://www.opengroup.org/platform/hecewg/uploads/40/10903/posix_io_readdir+.pdf
>>>
>>> Yes, exactly, but that never went anywhere, did it?
>> Correct. It did not go anywhere AFAIK.
>>
>> Benny


2011-04-05 19:34:26

by Jim Rees

Subject: Re: Prioritizing readdirplus/getattr/lookup

Benny Halevy wrote:

At any rate, the broader issue is that the posix system call API assumes a
local file system and is not network or cluster file-system aware.

It would be interesting to see someone who is affected by this implement the
readdirplus system call, modify ls to use it, and report the results. I
suspect the system call won't go anywhere in linux without some evidence
that it solves a real problem for real users.

2011-04-07 23:03:31

by Murata, Dennis

Subject: RE: Prioritizing readdirplus/getattr/lookup

Did you set each ksoftirqd to a specific cpu?
Wayne

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Andrew Klaassen
> Sent: Thursday, April 07, 2011 4:50 PM
> To: Chuck Lever
> Cc: Benny Halevy; Jim Rees; Garth Gibson; [email protected]
> Subject: Re: Prioritizing readdirplus/getattr/lookup
>
> --- On Thu, 4/7/11, Andrew Klaassen <[email protected]> wrote:
>
> > I do notice that ksoftirqd is eating up 100% of a core when I'm
> > loading the server heavily. I assume that's because I'm not using
> > jumbo frames and the ethernet cards are spitting out interrupts as
> > fast as they're able.
>
> I just got myself edjumicated on smp_affinity, and now I'm
> able to achieve 99% CPU usage by 8 nfsd processes on 8 cores
> on a read-only, fully-cached workload, with ksoftirqd
> processes only using 1% CPU per core.
>
> Unfortunately, this doesn't help the "ls -l" speed.
>
> Andrew
>
>

2011-04-06 00:06:48

by Andrew Klaassen

Subject: Re: Prioritizing readdirplus/getattr/lookup

--- On Tue, 4/5/11, Jim Rees <[email protected]> wrote:

> It would be interesting to see someone who is affected by
> this implement the readdirplus system call, modify ls to
> use it, and report the results. I suspect the system call
> won't go anywhere in linux without some evidence
> that it solves a real problem for real users.

I can't speak to a readdirplus system call; my own (admittedly naive) testing suggests that the system calls are JustFineThankYou, since local "ls -l" with NFS loading (and vice versa) performs very well.

What I can say for certain is that whatever NetApp GX is doing to respond to getattr and lookup calls completely wipes the floor with whatever the Linux nfsd is doing...

...*even when Linux is serving out a completely cached, completely read-only workload*.

I wish I had the know-how and intelligence to figure out exactly why.

Andrew



2011-04-04 23:05:32

by Jim Rees

Subject: Re: Prioritizing readdirplus/getattr/lookup

Andrew Klaassen wrote:

I've confirmed my earlier results using 2.6.37.5 on the server, though now
the results are closer to 10 times worse than the Netapp on similar
hardware rather than 100 times worse.

A big improvement, but I'd still be interested to know if the server is
capable of the getattr/readdirplus/lookup versus read/write tradeoffs that
I'm looking for to bring "ls -l" speeds under load down to levels that
won't make my users yell at me.

What would be nice is if we could add a new system call that would do
roughly what the nfs4 readdir rpc does. It would take a mask of requested
file attributes, then return those attributes along with the directory
entries. This would help "ls -l" but would also help all those chatty new
gnome/kde user interface things that love to stat every file in every
directory.
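
Something along these lines, purely as an illustration of the shape such a call could take; the names, types, and mask bits below are made up, not a worked-out proposal:

  #include <stddef.h>
  #include <stdint.h>
  #include <sys/stat.h>

  /* Illustrative only: a readdirplus()-style call that, like the NFSv4
   * READDIR RPC, takes a mask of wanted attributes and returns them
   * together with the directory entries, so a long listing needs one
   * call per batch of entries rather than one stat() per entry. */

  #define RDP_WANT_MODE   0x1u          /* made-up attribute mask bits */
  #define RDP_WANT_OWNER  0x2u
  #define RDP_WANT_SIZE   0x4u
  #define RDP_WANT_TIMES  0x8u

  struct dirent_plus {
      uint64_t    d_ino;
      char        d_name[256];
      struct stat d_stat;               /* only the masked fields are valid */
  };

  /* Fills up to 'nentries' entries into 'buf' and returns the number
   * filled (0 at end of directory, -1 on error); 'cookie' resumes where
   * the previous call left off. A sketch, not a real kernel interface. */
  long readdirplus(int dirfd, uint32_t attr_mask,
                   struct dirent_plus *buf, size_t nentries,
                   uint64_t *cookie);

With an interface like that, "ls -l" on the 2500-file directory discussed earlier becomes a handful of batched calls instead of one stat() per entry, which is the tradeoff READDIRPLUS already makes on the wire.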

2011-04-11 13:31:37

by Andrew Klaassen

Subject: Re: Prioritizing readdirplus/getattr/lookup

--- On Fri, 4/8/11, Chuck Lever <[email protected]> wrote:

> On Apr 7, 2011, at 5:49 PM, Andrew Klaassen wrote:
>
> > I just got myself edjumicated on smp_affinity, and now
> > I'm able to achieve 99% CPU usage by 8 nfsd processes on 8
> > cores on a read-only, fully-cached workload, with ksoftirqd
> > processes only using 1% CPU per core.
> >
> > Unfortunately, this doesn't help the "ls -l" speed.
>
> Serving NFS files is generally not CPU intensive. The
> problem may be lock contention on the server, but that's
> about as far as my expertise goes.

In that case the test workload was fully cached in memory, so I'm not completely surprised.

Andrew



2011-04-04 22:10:12

by Andrew Klaassen

Subject: Re: Prioritizing readdirplus/getattr/lookup

I've confirmed my earlier results using 2.6.37.5 on the server, though now the results are closer to 10 times worse than the Netapp on similar hardware rather than 100 times worse.

A big improvement, but I'd still be interested to know if the server is capable of the getattr/readdirplus/lookup versus read/write tradeoffs that I'm looking for to bring "ls -l" speeds under load down to levels that won't make my users yell at me.

Is this the wrong place to ask the question?

Is there more knowledge of Linux NFS server internals on a different mailing list that I should contact?

Thanks once more.

Andrew


--- On Mon, 4/4/11, Andrew Klaassen <[email protected]> wrote:

> Hi Steven,
>
> Packet sniffing is exactly what I did (to the limit of my
> current abilities, anyway); it's what led to my
> questions.? If you read further down, you'll see the
> packet sniffer results I got that led me to ask about having
> the server nfsd processes prioritize
> getattr/readdirplus/lookup queries.
>
> In brief: Under the same load, on similar hardware, with a
> similar number of disks, our NetApp pumps back
> getattr/readdirplus/lookup replies at a rate of thousands
> per second, compared to the tens per second (or less, plus
> periodic long delays) averaged by our Linux server under
> that load.
>
> The Linux filesystem and VM system don't seem to be the
> problem, because the same load locally doesn't cause the "ls
> -l" problem, and the NFS load doesn't cause the problem for
> local "ls -l" runs.
>
> I guess I could start trying to trace nfsd processes to
> find the contention, but I really don't feel qualified for
> that; I was hoping someone familiar with the code could say,
> "Sure, that's an easy fix," or, "Not gonna happen, because
> everything would have to be re-written to get the kernel to
> prioritize nfsd threads based on what they're doing."
>
> I can gladly provide more details about my test setup
> and/or packet traces.
>
> Thanks.
>
> Andrew
>
>
> --- On Mon, 4/4/11, Steven Procter <[email protected]>
> wrote:
>
> > I'd recommend using a packet sniffer to see what is
> going on at the
> > protocol level when there are performance issues.?
> I've found that
> > wireshark works well for this kind of investigation.
> >
> > --Steven
> >
> > > X-Mailer: YahooMailClassic/12.0.2
> > YahooMailWebService/0.8.109.295617
> > > Date:??? Mon, 4 Apr 2011 06:31:09 -0700
> > (PDT)
> > > From:??? Andrew Klaassen <[email protected]>
> > > Subject: Prioritizing readdirplus/getattr/lookup
> > > To:??? [email protected]
> > > Sender:??? [email protected]
> > >
> > > How difficult would it be to make nfsd give
> priority
> > to the calls generated by "ls -l" (i.e. readdirplus,
> > getattr, lookup) over read and write calls?? Is it a
> matter
> > of tweaking a couple of sysctls or changing a few
> lines of
> > code, or would it mean a major re-write?
> > >
> > > I'm working in an environment where it's
> important to
> > have reasonably good throughput for the HPC farm
> (50-200
> > machines reading and writing 5-10MB files as fast as
> they
> > can pump them through), while simultaneously
> providing
> > snappy responses to "ls -l" and equivalents for
> people
> > reviewing file sizes and times and browsing through
> the
> > filesystem and constructing new jobs and whatnot.
> > >
> > > I've tried a small handful of server OSes
> (Solaris,
> > Exastore, various Linux flavours and tunings and nfsd
> > counts) that do great on the throughput side but
> horrible on
> > the "ls -l" under load side (as mentioned in my
> previous
> > emails).
> > >
> > > However, I know what I need is possible, because
> > Netapp GX on very similar hardware (similar
> processor,
> > memory, and spindle count), does slightly worse (20%
> or so)
> > on total throughput but much better (10-100 times
> better
> > than Solaris/Linux/Exastore) on under-load "ls -l"
> > responsiveness.
> > >
> > > In the Linux case, I think I've narrowed to the
> > problem down to nfsd rather than filesystem or VM
> system.?
> > It's not filesystem or VM system, because when the
> server is
> > under heavy local load equivalent to my HPC farm load,
> both
> > local and remote "ls -l" commands are fast.? It's not
> that
> > the NFS load overwhelms the server, because when the
> server
> > is under heavy HPC farm load, local "ls -l" commands
> are
> > still fast.
> > >
> > > It's only when there's an NFS load and an NFS "ls
> -l"
> > that the "ls -l" is slow.? Like so:
> > >
> > > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
> ?
> > throughput? ???ls -l
> > > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
> ?
> > ==========? ???=====
> > > Heavy local load, local ls -l? ???fast? ?
> ? ?
> > ???fast
> > > Heavy local load, NFS ls -l? ? ???fast? ?
> ? ?
> > ???fast
> > > Heavy NFS load, local ls -l? ? ???fast? ?
> ? ?
> > ???fast
> > > Heavy NFS load, NFS ls -l? ? ? ???fast? ?
> ?
> > ? ???very slow
> > >
> > > This suggests to me that it's nfsd that's slowing
> down
> > the ls -l response times rather than the filesystem or
> VM
> > system.
> > >
> > > Would fixing the bottom-right-corner case - even
> if it
> > meant a modest throughput slowdown - be an easy
> > tweak/patch?? Or major re-write?? (Or just a kernel
> > upgrade?)
> > >
> > > I know it's doable because the Netapp does it;
> the
> > question is how large a job would it be on Linux.
> > >
> > > Thanks again.
> > >
> > >
> > > FWIW, here's what I've tried so far to try to
> make
> > this problem go away without success:
> > >
> > > Server side:
> > >
> > > kernels (all x86_64): 2.6.32-[something] on
> Scientific
> > Linux 6.0, 2.6.32.4 on Slackware, 2.6.37.5 on
> Slackware
> > > filesystems: xfs, ext4
> > > nfsd counts: 8,32,64,127,128,256,1024
> > > schedulers: cfq,deadline
> > > export options: async,no_root_squash
> > >
> > > Client side:
> > > kernel: 2.6.31.14-0.6-desktop, x86_64, from
> openSUSE
> > 11.3
> > > hard,intr,noatime,vers=3,mountvers=3 # always on
> > > rsize,wsize:? 32768,65536
> > > proto:? ? ? ? tcp,udp
> > > nolock? ? ? ? # on or off
> > > noac? ? ? ? ? # on or off
> > > actimeo:? ? ? 3,5,60,240,600? # I had really
> hoped
> > this would help
> > >
> > > Andrew
> > >
> > >
> > > --- On Thu, 3/31/11, Andrew Klaassen <[email protected]> wrote:
> > >
> > > > Setting actimeo=600 gave me part of the behaviour I expected;
> > > > on the first directory listing, the calls were all readdirplus
> > > > and no getattr.
> > > >
> > > > However, there were now long stretches where nothing was
> > > > happening.  During a single directory listing to a loaded
> > > > server, there'd be:
> > > >
> > > >  ~10 seconds of readdirplus calls and replies, followed by
> > > >  ~70 seconds of nothing, followed by
> > > >  ~10 seconds of readdirplus calls and replies, followed by
> > > >  ~100 seconds of nothing, followed by
> > > >  ~10 seconds of readdirplus calls and replies, followed by
> > > >  ~110 seconds of nothing, followed by
> > > >  ~2 seconds of readdirplus calls and replies
> > > >
> > > > Why the long stretches of nothing?  If I'm reading my tshark
> > > > output properly, it doesn't seem like the client was waiting
> > > > for a server response.  Here are a couple of lines before and
> > > > after a long stretch of nothing:
> > > >
> > > >  28.575537 192.168.10.158 -> 192.168.10.5 NFS V3 READDIRPLUS
> > > >    Call, FH:0xa216e302
> > > >  28.593943 192.168.10.5 -> 192.168.10.158 NFS V3 READDIRPLUS
> > > >    Reply (Call In 358) random_1168.exr random_2159.exr
> > > >    random_2188.exr random_0969.exr random_1662.exr
> > > >    random_0022.exr random_0785.exr random_2316.exr
> > > >    random_0831.exr random_0443.exr random_1203.exr
> > > >    random_1907.exr
> > > >  28.594006 192.168.10.158 -> 192.168.10.5 NFS V3 READDIRPLUS
> > > >    Call, FH:0xa216e302
> > > >  28.623736 192.168.10.5 -> 192.168.10.158 NFS V3 READDIRPLUS
> > > >    Reply (Call In 362) random_1575.exr random_0492.exr
> > > >    random_0335.exr random_2460.exr random_0754.exr
> > > >    random_1114.exr random_2001.exr random_2298.exr
> > > >    random_1858.exr random_1889.exr random_2249.exr
> > > >    random_0782.exr
> > > >  103.811801 192.168.10.158 -> 192.168.10.5 NFS V3 READDIRPLUS
> > > >    Call, FH:0xa216e302
> > > >  103.883930 192.168.10.5 -> 192.168.10.158 NFS V3 READDIRPLUS
> > > >    Reply (Call In 2311) random_0025.exr random_1665.exr
> > > >    random_2311.exr random_1204.exr random_0444.exr
> > > >    random_0836.exr random_0332.exr random_0495.exr
> > > >    random_1572.exr random_1900.exr random_2467.exr
> > > >    random_1113.exr
> > > >  103.884014 192.168.10.158 -> 192.168.10.5 NFS V3 READDIRPLUS
> > > >    Call, FH:0xa216e302
> > > >  103.965167 192.168.10.5 -> 192.168.10.158 NFS V3 READDIRPLUS
> > > >    Reply (Call In 2316) random_0753.exr random_2006.exr
> > > >    random_0216.exr random_1824.exr random_1456.exr
> > > >    random_1790.exr random_1037.exr random_0677.exr
> > > >    random_2122.exr random_0101.exr random_1741.exr
> > > >    random_2235.exr
> > > >
> > > > Calls are sent and replies received at the 28 second mark, and
> > > > then... nothing... until the 103 second mark.  I'm sure the
> > > > server must be somehow telling the client that it's busy, but -
> > > > at least with the tools I'm looking at - I don't see how.  Is
> > > > tshark just hiding TCP delays and retransmits from me?
> > > >
> > > > Thanks again.
> > > >
> > > > Andrew
> > > >
> > > >
> > > > --- On Thu, 3/31/11, Andrew Klaassen <[email protected]> wrote:
> > > >
> > > > > Interesting.  So the reason it's switching back and forth
> > > > > between readdirplus and getattr during the same ls command is
> > > > > because the command is taking so long to run that the cache
> > > > > is periodically expiring as the command is running?
> > > > >
> > > > > I'll do some playing with actimeo to see if I'm actually
> > > > > understanding this.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Andrew
> > > > >
> > > > >
> > > > > --- On Thu, 3/31/11, Steven Procter <[email protected]> wrote:
> > > > >
> > > > > > This is due to client caching.  When the second ls -l runs,
> > > > > > the cache contains an entry for the directory.  The client
> > > > > > can check if the cached directory data is still valid by
> > > > > > issuing a GETATTR on the directory.
> > > > > >
> > > > > > But this only validates the names, not the attributes,
> > > > > > which are not actually part of the directory.  Those must
> > > > > > be refetched.  So the client issues a GETATTR for each
> > > > > > entry in the directory.  It issues them sequentially,
> > > > > > probably as ls calls readdir() and then stat() sequentially
> > > > > > on the directory entries.
> > > > > >
> > > > > > This takes so long that the cache entry times out and the
> > > > > > next time you run ls -l the client reloads the directory
> > > > > > using READDIRPLUS.
> > > > > >
> > > > > > --Steven
> > > > > >
> > > > > > > X-Mailer: YahooMailClassic/12.0.2 YahooMailWebService/0.8.109.295617
> > > > > > > Date:    Thu, 31 Mar 2011 15:24:15 -0700 (PDT)
> > > > > > > From:    Andrew Klaassen <[email protected]>
> > > > > > > Subject: readdirplus/getattr
> > > > > > > To:      [email protected]
> > > > > > > Sender:  [email protected]
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I've been trying to get my Linux NFS clients to be a
> > > > > > > little snappier about listing large directories from
> > > > > > > heavily-loaded servers.  I found the following
> > > > > > > fascinating behaviour (this is with
> > > > > > > 2.6.31.14-0.6-desktop, x86_64, from openSUSE 11.3,
> > > > > > > Solaris Express 11 NFS server):
> > > > > > >
> > > > > > > With "ls -l --color=none" on a directory with 2500 files:
> > > > > > >
> > > > > > >              |     rdirplus    |    nordirplus   |
> > > > > > >              |1st  |2nd  |1st  |1st  |2nd  |1st  |
> > > > > > >              |run  |run  |run  |run  |run  |run  |
> > > > > > >              |light|light|heavy|light|light|heavy|
> > > > > > >              |load |load |load |load |load |load |
> > > > > > > --------------------------------------------------
> > > > > > > readdir      |   0 |   0 |   0 |  25 |   0 |  25 |
> > > > > > > readdirplus  | 209 |   0 | 276 |   0 |   0 |   0 |
> > > > > > > lookup       |  16 |   0 |  10 |2316 |   0 |2473 |
> > > > > > > getattr      |   1 |2501 |2452 |   1 |2465 |   1 |
> > > > > > >
> > > > > > > The most interesting case is with rdirplus specified as a
> > > > > > > mount option to a heavily loaded server.  The NFS client
> > > > > > > keeps switching back and forth between readdirplus and
> > > > > > > getattr:
> > > > > > >
> > > > > > >  ~10 seconds doing ~70 readdirplus calls, followed by
> > > > > > >  ~150 seconds doing ~800 getattr calls, followed by
> > > > > > >  ~12 seconds doing ~70 readdirplus calls, followed by
> > > > > > >  ~200 seconds doing ~800 getattr calls, followed by
> > > > > > >  ~20 seconds doing ~130 readdirplus calls, followed by
> > > > > > >  ~220 seconds doing ~800 getattr calls
> > > > > > >
> > > > > > > All the calls appear to get reasonably prompt replies
> > > > > > > (never more than a second or so), which makes me wonder
> > > > > > > why it keeps switching back and forth between the
> > > > > > > strategies.  (Especially since I've specified rdirplus as
> > > > > > > a mount option.)
> > > > > > >
> > > > > > > Is it supposed to do that?
> > > > > > >
> > > > > > > I'd really like to see how it does with readdirplus
> > > > > > > ~only~, no getattr calls, since it's spending only 40
> > > > > > > seconds in total on readdirplus calls compared to 570
> > > > > > > seconds in total on (redundant, I think, based on the
> > > > > > > lightly-loaded case) getattr calls.
> > > > > > >
> > > > > > > It'd also be nice to be able to force readdirplus calls
> > > > > > > instead of getattr calls for second and subsequent
> > > > > > > listings of a directory.
> > > > > > >
> > > > > > > I saw a recent thread talking about readdirplus changes
> > > > > > > in 2.6.37, so I'll give that a try when I get a chance to
> > > > > > > see how it behaves.
> > > > > > >
> > > > > > > Andrew
> > > > > > >
> > > > > > >

2011-04-05 19:46:14

by Benny Halevy

[permalink] [raw]
Subject: Re: Prioritizing readdirplus/getattr/lookup

On 2011-04-05 12:34, Jim Rees wrote:
> Benny Halevy wrote:
>
> At any rate, the broader issue is that the posix system call API assumes a
> local file system and is neither network- nor cluster-file-system aware.
>
> It would be interesting to see someone who is affected by this implement the
> readdirplus system call, modify ls to use it, and report the results. I
> suspect the system call won't go anywhere in linux without some evidence
> that it solves a real problem for real users.

How hard would it be to implement a micro-benchmark with pynfs to
compare performance of readdir + stats vs. readdirplus as a first cut
estimate of the benefits?

Benny
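
As a very rough first cut that needs no pynfs at all, the two call patterns
can be timed from an ordinary client against an already-mounted export: a
bare directory scan versus a scan that also stats every entry. The sketch
below only measures aggregate user-space latency, not the wire traffic, and
the default mount-point path in it is purely a placeholder.

#!/usr/bin/env python
# Rough first-cut comparison of "readdir only" vs. "readdir + stat of every
# entry" against a directory on an already-mounted NFS export.  This is not
# a pynfs benchmark; it only shows the aggregate latency the two call
# patterns see from user space.
import os
import sys
import time

def scan_only(path):
    # Bare directory scan: a single pass of readdir()/READDIR(PLUS).
    return len(os.listdir(path))

def scan_and_stat(path):
    # The "ls -l" pattern: readdir() followed by stat() on every entry,
    # which is where the per-entry GETATTR/LOOKUP traffic can come from.
    names = os.listdir(path)
    for name in names:
        os.stat(os.path.join(path, name))
    return len(names)

def timed(fn, path):
    start = time.time()
    count = fn(path)
    return count, time.time() - start

if __name__ == "__main__":
    # Placeholder path; point it at a large directory on the NFS mount.
    path = sys.argv[1] if len(sys.argv) > 1 else "/mnt/nfs/bigdir"
    for fn in (scan_only, scan_and_stat):
        count, elapsed = timed(fn, path)
        print("%-14s %5d entries in %8.2f s" % (fn.__name__, count, elapsed))

Run it once against a cold client cache and once warm; the interesting number
for this thread is how far the second pattern falls behind the first when the
server is also carrying the farm load.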


2011-04-07 21:25:16

by Andrew Klaassen

[permalink] [raw]
Subject: Re: Prioritizing readdirplus/getattr/lookup

--- On Thu, 4/7/11, Chuck Lever <[email protected]> wrote:

> On Apr 7, 2011, at 10:34 AM, Andrew Klaassen wrote:
>
> > --- On Thu, 4/7/11, Chuck Lever <[email protected]>
> wrote:
> >> Have you created enough nfsd threads on your Linux NFS
> >> servers?
> >
> > That was one of the things I varied as part of my
> > testing; almost every power of 2 from 8 threads up to 1024
> > threads.  It actually made things slightly worse, not
> > better, but I can certainly give it a try again.
>
> Is the server an SMP system?

Yes.

> Do you see high CPU load during times of slow performance?

I've found two cases when throughput is high and "ls -l" is slow:

- disk-bound, when I'm (purposely) pumping through more data than can be held in the cache. In this case, CPU idle time generally stays in the 30-60% range.

- fully-cached read-only, when... well, it looks like I'll have to get back to you on this one, since ext4lazyinit has to finish its thing before I can reproduce this case.

I do notice that ksoftirqd is eating up 100% of a core when I'm loading the server heavily. I assume that's because I'm not using jumbo frames and the ethernet cards are spitting out interrupts as fast as they're able.
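
One cheap way to check that assumption is to sample /proc/softirqs (available
since 2.6.31) while the farm is loading the server and see which softirq
counters climb on which CPU. A minimal sketch, with an arbitrary one-second
sampling interval:

#!/usr/bin/env python
# Sample /proc/softirqs twice and print the per-CPU increase for each softirq
# class, to see whether NET_RX (or anything else) really is pinning one core.
import time

def read_softirqs():
    with open("/proc/softirqs") as f:
        lines = f.read().splitlines()
    cpus = lines[0].split()
    counts = {}
    for line in lines[1:]:
        fields = line.split()
        name = fields[0].rstrip(":")
        counts[name] = [int(v) for v in fields[1:1 + len(cpus)]]
    return cpus, counts

if __name__ == "__main__":
    cpus, before = read_softirqs()
    time.sleep(1.0)  # arbitrary sampling interval
    cpus, after = read_softirqs()
    print("%-10s %s" % ("softirq", " ".join("%10s" % c for c in cpus)))
    for name in after:
        old = before.get(name, [0] * len(cpus))
        deltas = [a - b for a, b in zip(after[name], old)]
        print("%-10s %s" % (name, " ".join("%10d" % d for d in deltas)))

If the NET_RX column is climbing on a single CPU while the others sit mostly
idle, spreading the interrupt load (irqbalance, NIC multiqueue, or the jumbo
frames mentioned above) is worth trying before anything NFS-specific.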

Unfortunately, upgrading the hardware - either disks or CPU - is not an option. This is why I was hoping there was some way to send readdirplus/getattr/lookup calls to the front of the queue on the server side even when the hardware has reached its limits.

Andrew



2011-04-04 16:22:08

by Steven Procter

[permalink] [raw]
Subject: Re: Prioritizing readdirplus/getattr/lookup


I'd recommend using a packet sniffer to see what is going on at the
protocol level when there are performance issues. I've found that
wireshark works well for this kind of investigation.

--Steven
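
Going one step beyond eyeballing the capture, the one-line tshark summaries
already quoted in this thread can be post-processed into per-procedure
response times, which helps separate slow server replies from gaps where the
client simply sends nothing. A small sketch, assuming the capture was saved
as text in that same summary format; the default filename is hypothetical and
the call/reply pairing is only a heuristic.

#!/usr/bin/env python
# Pair NFSv3 Call/Reply lines from saved tshark one-line-per-packet output
# (the format quoted earlier in this thread) and print response-time figures
# per procedure.  Replies are matched FIFO against outstanding calls of the
# same procedure, which is reasonable for a mostly sequential "ls -l"
# workload but not for heavily pipelined traffic.
import re
import sys
from collections import defaultdict, deque

LINE = re.compile(r"^\s*(?P<time>\d+\.\d+)\s+\S+\s+->\s+\S+\s+NFS\s+V3\s+"
                  r"(?P<proc>\w+)\s+(?P<kind>Call|Reply)")

def main(path):
    pending = defaultdict(deque)  # procedure -> timestamps of unanswered calls
    rtts = defaultdict(list)      # procedure -> observed response times
    for line in open(path):
        m = LINE.match(line)
        if not m:
            continue
        stamp = float(m.group("time"))
        proc, kind = m.group("proc"), m.group("kind")
        if kind == "Call":
            pending[proc].append(stamp)
        elif pending[proc]:
            rtts[proc].append(stamp - pending[proc].popleft())
    for proc in sorted(rtts):
        times = sorted(rtts[proc])
        print("%-12s n=%-5d median=%.3fs max=%.3fs"
              % (proc, len(times), times[len(times) // 2], times[-1]))

if __name__ == "__main__":
    # Hypothetical filename; save the capture by redirecting tshark's
    # one-line summary output to a file.
    main(sys.argv[1] if len(sys.argv) > 1 else "nfs-trace.txt")

On the trace quoted earlier, this would show whether the ~75-second gaps are
made of slow replies or of the client not issuing calls at all.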


2011-04-05 00:39:47

by Jim Rees

[permalink] [raw]
Subject: Re: Prioritizing readdirplus/getattr/lookup

Benny Halevy wrote:

Like this? :)
http://www.opengroup.org/platform/hecewg/uploads/40/10903/posix_io_readdir+.pdf

Yes, exactly, but that never went anywhere, did it?

2011-04-05 13:30:10

by Benny Halevy

[permalink] [raw]
Subject: Re: Prioritizing readdirplus/getattr/lookup

On 2011-04-04 17:39, Jim Rees wrote:
> Benny Halevy wrote:
>
> Like this? :)
> http://www.opengroup.org/platform/hecewg/uploads/40/10903/posix_io_readdir+.pdf
>
> Yes, exactly, but that never went anywhere, did it?

Correct. It did not go anywhere AFAIK.

Benny