2008-08-01 07:23:27

by Dave Chinner

Subject: Re: high latency NFS

On Thu, Jul 31, 2008 at 05:03:05PM +1000, Neil Brown wrote:
> On Wednesday July 30, [email protected] wrote:
> > >
> > > I was only using the default 8 nfsd threads on the server. When I raised
> > > this to 256, the read bandwidth went from about 6 MB/sec to about 95
> > > MB/sec, at 100ms of netem-induced latency.
> >
> > So this is yet another reminder that someone needs to implement some
> > kind of automatic tuning of the number of threads.
> >
> > I guess the first question is what exactly the policy for that should
> > be? How do we decide when to add another thread? How do we decide when
> > there are too many?
>
> Or should the first question be "what are we trying to achieve?"?
>
> Do we want to:
> Automatically choose a number of threads that would match what a
> well informed sysadmin might choose
> or
> regularly adjust the number of threads to find an optimal balance
> between prompt request processing (minimal queue length),
> minimal resource usage (idle threads waste memory)
> and not overloading the filesystem (how much concurrency does the
> filesystem/storage subsystem realistically support?)
>
> And then we need to think about how this relates to NUMA situations
> where we have different numbers of threads on each node.
>
>
> I think we really want to aim for the first of the above options, but
> that the result will end up looking a bit like a very simplistic
> attempt at the second. "simplistic" is key - we don't want
> "complex".

Having implemented the second option on a different NUMA aware
OS and NFS server, I can say that it isn't that complex, nor that
easy to screw up.

1. spawn a new thread only if all NFSDs are busy and there
are still requests queued to be serviced.
2. rate limit the speed at which you spawn new NFSD threads.
About 5/s per node was about right.
3. define an idle time for each thread before they
terminate. That is, if a thread has not been asked to
do any work for 30s, exit.
4. use the NFSD thread pools to allow per-pool independence.
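
In rough C, the per-pool spawn policy ends up looking something like
this (an illustrative sketch only - the struct, names and numbers here
are made up for the example, not lifted from a real implementation):

  #include <time.h>

  /* hypothetical per-pool state */
  struct nfsd_pool {
      int     nr_threads;          /* threads in this pool */
      int     nr_busy;             /* threads servicing a request */
      int     nr_queued;           /* requests awaiting a thread */
      time_t  spawn_window;        /* current one-second window */
      int     spawned_in_window;   /* spawns in that window */
  };

  #define SPAWN_PER_SEC  5    /* ~5 new threads/s per node was about right */
  #define IDLE_TIMEOUT   30   /* idle seconds before a thread exits */

  extern void spawn_nfsd(struct nfsd_pool *pool);   /* stub */

  /* called when a request is queued to this pool */
  static void maybe_spawn(struct nfsd_pool *pool, time_t now)
  {
      /* 1. only when every thread is busy and work is still queued */
      if (pool->nr_busy < pool->nr_threads || pool->nr_queued == 0)
          return;

      /* 2. rate limit: at most SPAWN_PER_SEC new threads a second */
      if (now != pool->spawn_window) {
          pool->spawn_window = now;
          pool->spawned_in_window = 0;
      }
      if (pool->spawned_in_window >= SPAWN_PER_SEC)
          return;

      pool->spawned_in_window++;
      spawn_nfsd(pool);
  }

  /* 3. each thread exits after IDLE_TIMEOUT seconds without work,
   * 4. and all of the above runs independently per pool. */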

> I think that in the NUMA case we probably want to balance each node
> independently.
>
> The difficulties - I think - are:
> - make sure we can handle a sudden surge of requests, certainly a
> surge up to levels that we have previously seen.
> I think that means we either don't kill excess threads, or
> only kill them up to a limit: e.g. never fewer than 50% of
> the maximum number of threads

You only want to increase the number of threads for sustained
loads or regular peaks of load. You don't want simple transients
to cause massive numbers of threads to spawn, so rate limiting
the spawning is needed.

> - make sure we don't create too many threads if something clags up
> and nothing is getting through. This means we need to monitor the
> number of requests dequeued and not make new threads when that is
> zero.

That second case is easy - only allow a new thread to be spawned when a
request is dequeued. Hence if all the NFSDs are clagged, then we
won't waste resources clagging more of them.
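
In sketch form (hypothetical fields, extending the sketch above), the
gate is just:

  /* only allow a spawn if requests have actually been dequeued since
   * we last looked - if nothing is making progress, adding threads
   * just gives us more clagged threads */
  static int may_spawn(struct nfsd_pool *pool)
  {
      unsigned long done = pool->nr_dequeued;    /* running counter */

      if (done == pool->nr_dequeued_at_last_check)
          return 0;
      pool->nr_dequeued_at_last_check = done;
      return 1;
  }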

> So how about:
> For each node we watch the length of the queue of
> requests-awaiting-threads and the queue of threads
> awaiting requests and maintain these values:
> - max number of threads ever concurrently running
> - number of requests dequeued
> - min length request queue
> - min length of thread queue
>
> Then every few (5?) seconds we sample these numbers and reset them
> (except the first).
> If
> the min request queue length is non-zero and
> the number of requests dequeued is non-zero
> Then
> start a new thread
> If
> the number of threads exceeds half the maximum and
> the min length of the thread queue exceeds 0
> Then
> stop one (idle) thread

The rate of adjustment there is really too slow to be useful - a single
extra thread per interval is meaningless if you go from 8 to 9 when you
really need 30 or 40 nfsds. Taking minutes to get to the required
number is far too slow. You want to go from 8 to 40 within a few
seconds of that load starting....

> You might want to track the max length of the request queue too and
> start more threads if the queue is long, to allow a quick ramp-up.

Right, but even request queue depth is not a good indicator. You
need to keep track of how many NFSDs are actually doing useful
work. That is, if you've got an NFSD on the CPU that is hitting
the cache and not blocking, you don't need more NFSDs to handle
that load because they can't do any more work than the NFSD
that is currently running is.

i.e. take the solution that Greg Banks used for the CPU scheduler
overload issue (limiting the number of nfsds woken but not yet on
the CPU), and apply that criterion to spawning new threads. i.e.
we've tried to wake an NFSD, but there are none available so that
means more NFSDs are needed for the given load. If we've already
tried to wake one and it hasn't run yet, then we've got enough
NFSDs....
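
As a sketch (hypothetical names, not the actual sunrpc code), the
decision at enqueue time is roughly:

  /* called when a request arrives at the pool */
  void nfsd_enqueue(struct nfsd_pool *pool)
  {
      if (pool->nr_idle > 0) {
          wake_one_nfsd(pool);          /* someone is free - use them */
      } else if (pool->pending_wakeups == 0) {
          /* nothing idle and nothing already woken-but-not-yet-running:
           * the pool is genuinely short of threads for this load */
          request_new_nfsd(pool);
      }
      /* else: a woken nfsd hasn't hit the CPU yet, so more threads
       * can't make the queued work go any faster */
  }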

Also, NFSD scheduling needs to be LIFO so that unused NFSDs
accumulate idle time and so can be culled easily. If you RR the
nfsds, they'll all appear to be doing useful work so it's hard to
tell if you've got any idle at all.
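
Something like this (kernel-style lists, the rest of the names are
hypothetical):

  /* idle nfsds are pushed on and woken from the head of the list, so
   * a genuinely surplus thread sinks to the tail and its idle time
   * keeps growing - that's the one the reaper culls after 30s */
  static void nfsd_idle(struct nfsd_pool *pool, struct nfsd_thread *t)
  {
      t->idle_since = jiffies;
      list_add(&t->idle_list, &pool->idle_threads);       /* LIFO push */
  }

  static struct nfsd_thread *nfsd_pick(struct nfsd_pool *pool)
  {
      return list_first_entry(&pool->idle_threads,
                              struct nfsd_thread, idle_list);  /* LIFO pop */
  }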

HTH.

Cheers,

Dave.
--
Dave Chinner
[email protected]


2008-08-01 19:16:06

by J. Bruce Fields

Subject: Re: high latency NFS

On Fri, Aug 01, 2008 at 05:23:20PM +1000, Dave Chinner wrote:
> On Thu, Jul 31, 2008 at 05:03:05PM +1000, Neil Brown wrote:
> > You might want to track the max length of the request queue too and
> > start more threads if the queue is long, to allow a quick ramp-up.
>
> Right, but even request queue depth is not a good indicator. You
> need to keep track of how many NFSDs are actually doing useful
> work. That is, if you've got an NFSD on the CPU that is hitting
> the cache and not blocking, you don't need more NFSDs to handle
> that load because they can't do any more work than the NFSD
> that is currently running is.
>
> i.e. take the solution that Greg Banks used for the CPU scheduler
> overload issue (limiting the number of nfsds woken but not yet on
> the CPU),

I don't remember that, or wasn't watching when it happened.... Do you
have a pointer?

> and apply that criterion to spawning new threads. i.e.
> we've tried to wake an NFSD, but there are none available so that
> means more NFSDs are needed for the given load. If we've already
> tried to wake one and it hasn't run yet, then we've got enough
> NFSDs....

OK, so you do that instead of trying to directly measure how many
nfsds are doing useful work.

> Also, NFSD scheduling needs to be LIFO so that unused NFSDs
> accumulate idle time and so can be culled easily. If you RR the
> nfsds, they'll all appear to be doing useful work so it's hard to
> tell if you've got any idle at all.

Those all sound like good ideas, thanks.

(Still waiting for a volunteer for now, alas.)

--b.

2008-08-01 19:23:50

by J. Bruce Fields

Subject: Re: high latency NFS

On Fri, Aug 01, 2008 at 05:23:20PM +1000, Dave Chinner wrote:
> Having implemented the second option on a different NUMA aware
> OS and NFS server, I can say that it isn't that complex, nor that
> easy to screw up.
>
> 1. spawn a new thread only if all NFSDs are busy and there
> are still requests queued to be serviced.
> 2. rate limit the speed at which you spawn new NFSD threads.
> About 5/s per node was about right.
> 3. define an idle time for each thread before they
> terminate. That is, if a thread has not been asked to
> do any work for 30s, exit.
> 4. use the NFSD thread pools to allow per-pool independence.

Actually, I lost you on #4. You mean that you apply 1-3 independently
on each thread pool? Or something else?

--b.

2008-08-04 00:32:12

by Dave Chinner

Subject: Re: high latency NFS

On Fri, Aug 01, 2008 at 03:15:59PM -0400, J. Bruce Fields wrote:
> On Fri, Aug 01, 2008 at 05:23:20PM +1000, Dave Chinner wrote:
> > On Thu, Jul 31, 2008 at 05:03:05PM +1000, Neil Brown wrote:
> > > You might want to track the max length of the request queue too and
> > > start more threads if the queue is long, to allow a quick ramp-up.
> >
> > Right, but even request queue depth is not a good indicator. You
> > need to keep track of how many NFSDs are actually doing useful
> > work. That is, if you've got an NFSD on the CPU that is hitting
> > the cache and not blocking, you don't need more NFSDs to handle
> > that load because they can't do any more work than the NFSD
> > that is currently running is.
> >
> > i.e. take the solution that Greg Banks used for the CPU scheduler
> > overload issue (limiting the number of nfsds woken but not yet on
> > the CPU),
>
> I don't remember that, or wasn't watching when it happened.... Do you
> have a pointer?

Ah, I thought that had been sent to mainline because it was
mentioned in his LCA talk at the start of the year. Slides
65-67 here:

http://mirror.linux.org.au/pub/linux.conf.au/2007/video/talks/41.pdf

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-08-04 00:39:02

by Dave Chinner

Subject: Re: high latency NFS

On Fri, Aug 01, 2008 at 03:23:43PM -0400, J. Bruce Fields wrote:
> On Fri, Aug 01, 2008 at 05:23:20PM +1000, Dave Chinner wrote:
> > Having implemented the second option on a different NUMA aware
> > OS and NFS server, I can say that it isn't that complex, nor that
> > easy to screw up.
> >
> > 1. spawn a new thread only if all NFSDs are busy and there
> > are still requests queued to be serviced.
> > 2. rate limit the speed at which you spawn new NFSD threads.
> > About 5/s per node was about right.
> > 3. define an idle time for each thread before they
> > terminate. That is, if a thread has not been asked to
> > do any work for 30s, exit.
> > 4. use the NFSD thread pools to allow per-pool independence.
>
> Actually, I lost you on #4. You mean that you apply 1-3 independently
> on each thread pool? Or something else?

The former. i.e. when you have a NUMA machine with a pool-per-node or
an SMP machine with a pool-per-cpu configuration, you can configure
the pools differently according to the hardware config and
interrupt vectoring. This is especially useful if you want to prevent
NFSDs from dominating the CPU taking disk interrupts or running user
code....
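
i.e. something like this (purely illustrative - these knobs don't
exist as such today):

  /* hypothetical per-pool tuning, one instance per node or per CPU */
  struct nfsd_pool_tunables {
      int max_threads;      /* upper bound for this pool */
      int spawn_per_sec;    /* thread spawn rate limit */
      int idle_timeout;     /* seconds before an idle nfsd exits */
  };

so a node that also fields all the disk interrupts, or that runs the
userspace daemons, can be given a lower max_threads than a node doing
nothing but NFS serving.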

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-08-04 01:12:09

by J. Bruce Fields

Subject: Re: high latency NFS

On Mon, Aug 04, 2008 at 10:32:06AM +1000, Dave Chinner wrote:
> On Fri, Aug 01, 2008 at 03:15:59PM -0400, J. Bruce Fields wrote:
> > On Fri, Aug 01, 2008 at 05:23:20PM +1000, Dave Chinner wrote:
> > > On Thu, Jul 31, 2008 at 05:03:05PM +1000, Neil Brown wrote:
> > > > You might want to track the max length of the request queue too and
> > > > start more threads if the queue is long, to allow a quick ramp-up.
> > >
> > > Right, but even request queue depth is not a good indicator. You
> > > need to keep track of how many NFSDs are actually doing useful
> > > work. That is, if you've got an NFSD on the CPU that is hitting
> > > the cache and not blocking, you don't need more NFSDs to handle
> > > that load because they can't do any more work than the NFSD
> > > that is currently running is.
> > >
> > > i.e. take the solution that Greg Banks used for the CPU scheduler
> > > overload issue (limiting the number of nfsds woken but not yet on
> > > the CPU),
> >
> > I don't remember that, or wasn't watching when it happened.... Do you
> > have a pointer?
>
> Ah, I thought that had been sent to mainline because it was
> mentioned in his LCA talk at the start of the year. Slides
> 65-67 here:
>
> http://mirror.linux.org.au/pub/linux.conf.au/2007/video/talks/41.pdf

OK, so to summarize: when the rate of incoming rpc's is very high (and,
I guess, when we're serving everything out of cache and don't have IO
wait), all the nfsd threads will stay runnable all the time. That keeps
userspace processes from running (possibly for "minutes"). And that's a
problem even on a server dedicated only to nfs, since it affects portmap
and rpc.mountd.

The solution is given just as "limit the # of nfsd's woken but not yet
on CPU." It'd be interesting to see more details.

Off hand, this seems like it should be at least partly the scheduler's
job. E.g. could we tell it to schedule all the nfsd threads as a group?
I suppose the disadvantage to that is that we'd lose information about
how many threads are actually needed, hence lose the chance to reap
unneeded threads?

--b.

2008-08-04 01:29:55

by NeilBrown

Subject: Re: high latency NFS

On Mon, August 4, 2008 10:32 am, Dave Chinner wrote:
> On Fri, Aug 01, 2008 at 03:15:59PM -0400, J. Bruce Fields wrote:
>> On Fri, Aug 01, 2008 at 05:23:20PM +1000, Dave Chinner wrote:

>> > i.e. take the solution that Greg Banks used for the CPU scheduler
>> > overload issue (limiting the number of nfsds woken but not yet on
>> > the CPU),
>>
>> I don't remember that, or wasn't watching when it happened.... Do you
>> have a pointer?
>
> Ah, I thought that had been sent to mainline because it was
> mentioned in his LCA talk at the start of the year. Slides
> 65-67 here:
>
> http://mirror.linux.org.au/pub/linux.conf.au/2007/video/talks/41.pdf

Ahh... I remembered Greg talking about that, went looking, and
couldn't find it. I couldn't even find any mail about it, yet I'm
sure I saw a patch...

Greg: Do you remember what happened to this? Did I reject it for some
reason, or did it never get sent? or ...

NeilBrown


2008-08-04 02:14:33

by Dave Chinner

Subject: Re: high latency NFS

On Sun, Aug 03, 2008 at 09:11:58PM -0400, J. Bruce Fields wrote:
> On Mon, Aug 04, 2008 at 10:32:06AM +1000, Dave Chinner wrote:
> > On Fri, Aug 01, 2008 at 03:15:59PM -0400, J. Bruce Fields wrote:
> > > On Fri, Aug 01, 2008 at 05:23:20PM +1000, Dave Chinner wrote:
> > > > On Thu, Jul 31, 2008 at 05:03:05PM +1000, Neil Brown wrote:
> > > > > You might want to track the max length of the request queue too and
> > > > > start more threads if the queue is long, to allow a quick ramp-up.
> > > >
> > > > Right, but even request queue depth is not a good indicator. You
> > > > need to keep track of how many NFSDs are actually doing useful
> > > > work. That is, if you've got an NFSD on the CPU that is hitting
> > > > the cache and not blocking, you don't need more NFSDs to handle
> > > > that load because they can't do any more work than the NFSD
> > > > that is currently running is.
> > > >
> > > > i.e. take the solution that Greg Banks used for the CPU scheduler
> > > > overload issue (limiting the number of nfsds woken but not yet on
> > > > the CPU),
> > >
> > > I don't remember that, or wasn't watching when it happened.... Do you
> > > have a pointer?
> >
> > Ah, I thought that had been sent to mainline because it was
> > mentioned in his LCA talk at the start of the year. Slides
> > 65-67 here:
> >
> > http://mirror.linux.org.au/pub/linux.conf.au/2007/video/talks/41.pdf
>
> OK, so to summarize: when the rate of incoming rpc's is very high (and,
> I guess, when we're serving everything out of cache and don't have IO
> wait), all the nfsd threads will stay runnable all the time. That keeps
> userspace processes from running (possibly for "minutes"). And that's a
> problem even on a server dedicated only to nfs, since it affects portmap
> and rpc.mountd.

In a nutshell.

> The solution is given just as "limit the # of nfsd's woken but not yet
> on CPU." It'd be interesting to see more details.

Simple counters, IIRC (memory hazy so it might be a bit different).
Basically, when we queue a request we check a wakeup counter. If
the wakeup counter is less than a certain threshold (e.g. 5) we
issue a wakeup to get another NFSD running. When the NFSD first
runs and dequeues a request, it then decrements the wakeup counter,
effectively marking that NFSD as busy doing work. IIRC a small
threshold was necessary to ensure we always had enough NFSDs ready
to run if there was some I/O going on (i.e. a mixture of blocking
and non-blocking RPCs).

i.e. we need to track the wakeup-to-run latency to prevent waking too
many NFSDs and loading the run queue unnecessarily.
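
As a rough sketch (hypothetical names, Linux-style atomics used purely
for illustration):

  #define MAX_PENDING_WAKEUPS   5   /* the small threshold */

  /* nfsds woken but not yet run far enough to dequeue a request */
  static atomic_t pending_wakeups = ATOMIC_INIT(0);

  void on_request_queued(struct nfsd_pool *pool)
  {
      if (atomic_read(&pending_wakeups) < MAX_PENDING_WAKEUPS) {
          atomic_inc(&pending_wakeups);
          wake_one_nfsd(pool);
      }
      /* else: enough nfsds are already on their way to the CPU */
  }

  /* called on the first dequeue after a thread has been woken */
  void on_request_dequeued(void)
  {
      atomic_dec(&pending_wakeups);
  }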

> Off hand, this seems like it should be at least partly the scheduler's
> job.

Partly, yes, in that the scheduler overhead shouldn't increase when we
do this. However, from an efficiency point of view, if we are blindly
waking NFSDs when it is not necessary then (IMO) we've got an NFSD
problem....

> E.g. could we tell it to schedule all the nfsd threads as a group?
> I suppose the disadvantage to that is that we'd lose information about
> how many threads are actually needed, hence lose the chance to reap
> unneeded threads?

I don't know enough about how the group scheduling works to be able
to comment in detail. In theory it sounds like it would prevent
the starvation problems, but if it prevents implementation of
dynamic NFSD pools then I don't think it's a good idea....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-08-04 09:18:21

by Bernd Schubert

Subject: Re: high latency NFS

On Monday 04 August 2008 03:11:58 J. Bruce Fields wrote:
> On Mon, Aug 04, 2008 at 10:32:06AM +1000, Dave Chinner wrote:
> > On Fri, Aug 01, 2008 at 03:15:59PM -0400, J. Bruce Fields wrote:
> > > On Fri, Aug 01, 2008 at 05:23:20PM +1000, Dave Chinner wrote:
> > > > On Thu, Jul 31, 2008 at 05:03:05PM +1000, Neil Brown wrote:
> > > > > You might want to track the max length of the request queue too and
> > > > > start more threads if the queue is long, to allow a quick ramp-up.
> > > >
> > > > Right, but even request queue depth is not a good indicator. You
> > > > need to keep track of how many NFSDs are actually doing useful
> > > > work. That is, if you've got an NFSD on the CPU that is hitting
> > > > the cache and not blocking, you don't need more NFSDs to handle
> > > > that load because they can't do any more work than the NFSD
> > > > that is currently running is.
> > > >
> > > > i.e. take the solution that Greg Banks used for the CPU scheduler
> > > > overload issue (limiting the number of nfsds woken but not yet on
> > > > the CPU),
> > >
> > > I don't remember that, or wasn't watching when it happened.... Do you
> > > have a pointer?
> >
> > Ah, I thought that had been sent to mainline because it was
> > mentioned in his LCA talk at the start of the year. Slides
> > 65-67 here:
> >
> > http://mirror.linux.org.au/pub/linux.conf.au/2007/video/talks/41.pdf
>
> OK, so to summarize: when the rate of incoming rpc's is very high (and,
> I guess, when we're serving everything out of cache and don't have IO
> wait), all the nfsd threads will stay runnable all the time. That keeps
> userspace processes from running (possibly for "minutes"). And that's a
> problem even on a server dedicated only to nfs, since it affects portmap
> and rpc.mountd.

Even worse, it affects user-space HA software such as heartbeat, and
anyone with reasonable timeouts will see spurious 'failures'.


--
Bernd Schubert
Q-Leap Networks GmbH