We have a NAS server currently running a basically stock 2.4.22 kernel,
on a single 2.4GHz Xeon HT with 512MB RAM and 4 SATA 200GB disks on a
3ware 8506-12 in RAID-5.
It serves two machines via switched gigabit (1500 MTU; the switch doesn't
support jumbo frames), each of which runs a large number of User-mode
Linux processes.
Each kernel (there are ~60 total) has one or more open files at all
times, backing the normal ext3 filesystems for these virtual machines.
They use copy-on-write such that the main distro files are on local
disks and only the differences against this are stored, per machine, on
the NAS. /home filesystems and such are flat, sparse files. No attempt
is made at this point to turn atime off (but I can patch the kernel to
change the default if necessary).
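(A kernel patch shouldn't actually be needed for that: assuming the
server-side reiserfs array is the filesystem in question and that it's
mounted at /array, a remount would do it:

    mount -o remount,noatime /array

The ext3 filesystems inside the UMLs would each need noatime added to
their own fstabs, though.)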
The problem we're having is that every once in a while the entire system
grinds to a screeching halt with the load average on the NAS box spiking
to 17-18 (with 16 nfsd processes, this means every last one is wedged),
which quickly causes the load on the two client machines to spike as
requests they're making get stuck. This eventually clears up, but can
last anywhere from 15 seconds to 15+ minutes. In the meantime, however,
any disk-based operation inside the virtual machines can take a minute
or more to complete.
I've been trying for a long time to track this down with no luck, so now
it's time to see if anyone here has any ideas.
First major datapoint: early in the debugging cycle a large-ish number
of RRD datasets were kept on the NAS box, being updated regularly in an
attempt to spot the culprit. This instead made the problem
significantly more frequent. Moving the archives to another machine and
off NFS entirely immediately trimmed 100-200 I/O's per second average
off the NAS box, and the problem eased greatly.
Second: the whole process can easily be replicated by running bonnie++
on any of the machines (the NAS, the client, or a virtual machine), and
it appears clearly related to the I/O's per second, but only in cases
where I/O's are not linear. *Reading* a huge file either locally or
over NFS will cause a very mild form of the overload, but *writing* can
cause it almost instantaneously.
I've tried playing around with bdflush parameters, but without a
dramatically clearer mental picture of how that whole subsystem works, I
have no real chance of coming up with the best direction to move. A
gradual search isn't really feasible because the spikes are
unpredictable, and artificially generated loads (writing huge files) are
*too* stressful to see any differences.
I've graphed this thing utterly to death, and anyone interested in
checking it out can see tonight's fiasco at:
http://narsil.pdxcolo.net/graphs/?start=200403022200&duration=1hr
The aforementioned switch away from NAS-based RRD archives can be seen
quite easily at:
http://narsil.pdxcolo.net/graphs/?start=20040208&duration=1week
The graph pages are designed for a full 1600x1200 screen (mine), so it
may be hard to see everything clearly on smaller screens. Try adding
&width=100&height=50 maybe. The most relevant link is the NAS debug
page (nasdebug.php?...), which shows more information than the main
graphs page.
What I'd like to know is if anyone has any idea what's really going on
here, or suggestions as to what other data I might gather that would
help diagnose the problem. Easy solutions (add RAM, tweak a sysctl,
etc.) would be *greatly* appreciated ;-)
--
- Omega
aka Erik Walthinsen
[email protected]
Greg Banks said:
> Are you using the "async" export option on the server? It causes
> similar symptoms when used with large NFS writes. Use "sync".
Mount options as reported by /proc/mounts are:
rw,noatime,rsize=4096,wsize=4096,intr,soft,noac,tcp
I'm pretty sure the default here is async, as I had sync on there
earlier and it actually caused a noticeable drop in performance.
What I'm wondering is if the default bdflush settings are putting a hard
cap on how much data can be write-cached, forcing the system to block
writes too early. With 512MB of RAM, say half available as write-cache,
even at the rate of 5MB/sec, we should be able to run for almost a
minute with complete disk starvation before things start to wedge. And
since this doesn't look like complete starvation at all (graphs show
I/O's are completing the whole time), it should last even longer.
If anyone has any ideas on what to tweak in bdflush, it seems that there
*is* some pattern in the spikes, with them occurring at 11:25pm and
12:00am every day for at least the last 3 days.
Philippe Gramoulié said:
> Is there anything that prevents you from running a 2.4.25 kernel?
It's a production machine with those 60+ virtual machines running on it,
so the only opportunity I have to change anything of this sort is during
our quarterly downtime, the next one being early April.
Williamson, Jay (John G) said:
> Hi. I have no experience with your particular setup but have had
> similar problems when our clients were running a pre-2.4.20 kernel and
> using UDP for the NFS mounts. If that fits your client setup then try
> either upgrading the kernel or switching to TCP.
We're using TCP, as it also had performance advantages in our early
tests.
David Dougall said:
> My experience is that ext3 is dreadfully slow and RAID5 is dreadfully
> slow. These 2 combined can cause significant problems. The
> suggestions that have come from the list before are to change to
> RAID10 and use another filesystem such as reiserfs or xfs. I saw
> significant speedup moving away from ext3.
The NAS itself is using reiserfs, only the virtual machines are using
ext3. The question there is what kind of read/write load differences
one might have between the two. Certainly there's a possibility that
the journaling writes have something to do with it, but I wouldn't expect
them to cluster to the degree these spikes suggest.
RAID 1+0 is an option with the 8506-12, but the migration is extremely
painful. We have to acquire a whole new set of disks (would probably
get 6), construct the array, then copy half a TB of data across. Much
of the data is sparse files, so the process would take even longer. At
least a large chunk of it is non-production files (mirrors), so probably
1/2 to 2/3 can be done without downtime.
- Omega
aka Erik Walthinsen
[email protected]
On Wed, 2004-03-03 at 14:20, Lever, Charles wrote:
> careful: greg said "export options" not "mount options."
> what export options are you using on your NFS server?
Hmm, async is specified in /etc/exports. However, in my early testing I
found that there was no discernible performance difference (with
bonnie++ iirc) between setting the exports one way and setting the
client mount option instead. Either one set to sync would cause
performance drops, as expected, but to the same degree.
I can't change either the exports or client mount options for another
month however, until our quarterly downtime.
- Omega
aka Erik Walthinsen
[email protected]
On Wed, Mar 03, 2004 at 02:02:23PM -0800, Erik Walthinsen wrote:
> Greg Banks said:
> > Are you using the "async" export option on the server? It causes
> > similar symptoms when used with large NFS writes. Use "sync".
>
> Mount options as reported by /proc/mounts are:
> rw,noatime,rsize=4096,wsize=4096,intr,soft,noac,tcp
And the export options are? cat /etc/exports on the server.
The noatime option has no effect over NFS. Your [rw]sizes are
really quite small, try 8K. Also, try turning off noac.
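(Concretely, something along these lines on the clients -- the hostname
and mount point are placeholders:

    mount -t nfs -o rw,rsize=8192,wsize=8192,intr,soft,tcp \
        nas:/array/01-moria /mnt/nas

i.e. noac and noatime dropped, transfer size doubled.)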
> I'm pretty sure the default here is async, as I had sync on there
> earlier and it actually caused a noticeable drop in performance.
So did you get the collapse with sync ?
> What I'm wondering is if the default bdflush settings are putting a hard
> cap on how much data can be write-cached, forcing the system to block
> writes too early. With 512MB of RAM, say half available as write-cache,
> even at the rate of 5MB/sec, we should be able to run for almost a
> minute with complete disk starvation before things start to wedge. And
> since this doesn't look like complete starvation at all (graphs show
> I/O's are completing the whole time), it should last even longer.
The problem I've seen is that the data is written out from the page
cache with the BKL held, which prevents any nfsd thread from waking up
and responding to incoming requests, and NFS traffic drops to zero.
In addition, if any of the nfsd's owned some other lock when this
happened, some local processes can be blocked too. This is an
inevitable result of the "async" export option.
> If anyone has any ideas on what to tweak in bdflush, it seems that there
> *is* some pattern in the spikes, with them occurring at 11:25pm and
> 12:00am every day for at least the last 3 days.
You could try reducing the 1st parameter in /proc/sys/vm/bdflush
to say 5 and decrease the 5th parameter by a similar factor. This
will activate kupdated more frequently and it will write data out
earlier. But, did I mention the "sync" export option?
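(A sketch of that -- the exact defaults vary between 2.4 releases, so
check yours first:

    # show the current nine bdflush parameters
    cat /proc/sys/vm/bdflush
    # typical 2.4 defaults are roughly: 30 500 0 0 500 3000 60 20 0
    # drop the 1st (dirty-buffer threshold) to 5 and cut the 5th
    # (kupdated interval, in jiffies) by a similar factor
    echo "5 500 0 0 100 3000 60 20 0" > /proc/sys/vm/bdflush

The change takes effect immediately and doesn't survive a reboot.)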
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
On Wed, 2004-03-03 at 16:04, Greg Banks wrote:
> And the export options are? cat /etc/exports on the server.
/array/01-moria *.nas.pdxcolo.net(rw,no_root_squash,async)
> The noatime option has no effect over NFS. Your [rw]sizes are
> really quite small, try 8K. Also, try turning off noac.
The rwsizes were set based on a set of experiments with different sizes,
with 4k yielding by far the best bandwidth performance. The limiting
factor there is that the gig switch we have atm doesn't handle jumbo
frames. If increasing the rwsizes will reduce actual IO/sec load, then
it's worth trying even if it does reduce max bandwidth. In reality, the
theoretical max is almost never approached, so it's probably not a huge
deal.
> So did you get the collapse with sync ?
The sync/async testing was done before migrating everything to the NAS,
and these spikes only started showing up as load increased, many months
later. I can only try sync after the next downtime, coming up in about a
month.
> The problem I've seen is that the data is written out from the page
> cache with the BKL held, which prevents any nfsd thread from waking up
> and responding to incoming requests, and NFS traffic drops to zero.
> In addition, if any of the nfsd's owned some other lock when this
> happened, some local processes can be blocked too. This is an
> inevitable result of the "async" export option.
That sounds like the kind of scenario I've been imagining. Are there
any (stable) patches to get rid of the BKL in this case, or do I have to
wait until we move to 2.6 for that? Alternately, would reducing the
number of nfsd's help? Since there are only 2 heavy and 2-3 light
physical clients, is 16 overkill?
> You could try reducing the 1st parameter in /proc/sys/vm/bdflush
> to say 5 and decrease the 5th parameter by a similar factor. This
> will activate kupdated more frequently and it will write data out
> earlier. But, did I mention the "sync" export option?
OK, I'll give that a shot and see if it makes a dent in tonight's
spike(s). . . . Oddly enough, I just checked the graphs and a spike is
going on now, but looks like I caught the tail end only. I made the
bdflush changes, but cannot determine whether it's the end of the spike
or a bdflush-related termination.
- Omega
aka Erik Walthinsen
[email protected]
Erik Walthinsen wrote:
>
> On Wed, 2004-03-03 at 16:04, Greg Banks wrote:
> > And the export options are? cat /etc/exports on the server.
> /array/01-moria *.nas.pdxcolo.net(rw,no_root_squash,async)
Aha.
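(So the suggested change, to be made at a time when re-exporting is safe,
would be roughly:

    /array/01-moria *.nas.pdxcolo.net(rw,no_root_squash,sync)

in /etc/exports, followed by exportfs -ra.)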
> > The noatime option has no effect over NFS. Your [rw]sizes are
> > really quite small, try 8K. Also, try turning off noac.
> The rwsizes were set based on a set of experiments with different sizes,
> with 4k yielding by far the best bandwidth performance. The limiting
> factor there is that the gig switch we have atm doesn't handle jumbo
> frames.
That really shouldn't make a difference; both 4K and 8K IOs will end
up being split over multiple ethernet frames.
> If increasing the rwsizes will reduce actual IO/sec load, then
> it's worth trying even if it does reduce max bandwidth. In reality, the
> theoretical max is almost never approached, so it's probably not a huge
> deal.
Generally speaking, larger IOs are more efficient for heavy streaming
reads and writes. The best parameters will depend on your workload.
> > The problem I've seen is that the data is written out from the page
> > cache with the BKL held, which prevents any nfsd thread from waking up
> > and responding to incoming requests, and NFS traffic drops to zero.
> > In addition, if any of the nfsd's owned some other lock when this
> > happened, some local processes can be blocked too. This is an
> > inevitable result of the "async" export option.
> That sounds like the kind of scenario I've been imagining. Are there
> any (stable) patches to get rid of the BKL in this case,
Not AFAIK. Trond? It would in any case be a fairly adventurous patch.
> or do I have to
> wait until we move to 2.6 for that?
Sorry, I haven't tried this on 2.6 yet.
> Alternately, would reducing the
> number of nfsd's help?
No, that will either have no effect or reduce the throughput when things
are going well.
> Since there are only 2 heavy and 2-3 light
> physical clients, is 16 overkill?
Probably not.
You need more than 1 nfsd per client, because (assuming you don't have
the sync *mount* option on the clients) the clients will be issuing
multiple (up to 16 for 2.4.x) rpc calls in parallel each. The number
that is useful is limited (at least on my machines) by the nfsd's doing
time consuming memcpy()s with the BKL; absent this even more nfsds
would be useful.
> > You could try reducing the 1st parameter in /proc/sys/vm/bdflush[...]
> OK, I'll give that a shot and see if it makes a dent in tonight's
> spike(s). . . . Oddly enough, I just checked the graphs and a spike is
> going on now, but looks like I caught the tail end only. I made the
> bdflush changes, but cannot determine whether it's the end of the spike
> or a bdflush-related termination.
With the bdflush changes I mentioned you'll still get spikes, just hopefully
they'll be short enough that you won't notice. What you want to do is get
them so short that the clients don't hit a major RPC timeout.
Also, making the changes won't help a spike in progress.
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
> > > The noatime option has no effect over NFS. Your [rw]sizes are
> > > really quite small, try 8K. Also, try turning off noac.
> > The rwsizes were set based on a set of experiments with different sizes,
> > with 4k yielding by far the best bandwidth performance. The limiting
> > factor there is that the gig switch we have atm doesn't handle jumbo
> > frames.
>
> That really shouldn't make a difference; both 4K and 8K IOs will end
> up being split over multiple ethernet frames.
it does make a difference if you are using UDP and your network
suffers from bursty congestion or buffer overruns.
erik, you should look for network problems so you can boost your
transfer size without suffering a loss in performance.
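(A quick sketch of where to look, on both the server and the clients --
eth0 being whichever interface carries the NFS traffic:

    # interface-level errors, drops and overruns
    ifconfig eth0
    # TCP retransmits and socket buffer pruning
    netstat -s | egrep -i 'retrans|prune|overflow'
    # RPC retransmissions as seen by the NFS clients
    nfsstat -rc

Rising counters in any of these during a spike would point at the network
rather than the disks.)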
On Wed, 03/03/2004 at 20:40, Greg Banks wrote:
> > > The problem I've seen is that the data is written out from the page
> > > cache with the BKL held, which prevents any nfsd thread from waking up
> > > and responding to incoming requests, and NFS traffic drops to zero.
> > > In addition, if any of the nfsd's owned some other lock when this
> > > happened, some local processes can be blocked too. This is an
> > > inevitable result of the "async" export option.
> > That sounds like the kind of scenario I've been imagining. Are there
> > any (stable) patches to get rid of the BKL in this case,
>
> Not AFAIK. Trond? It would in any case be a fairly adventurous patch.
I'm not aware of anything, but this is more of a question for Neil.
I *really* doubt anyone has bothered to do it for 2.4.x...
Cheers,
Trond
On Thu, 4 Mar 2004, Greg Banks wrote:
>
> > Alternately, would reducing the
> > number of nfsd's help?
>
> No, that will either have no effect or reduce the throughput when things
> are going well.
>
> > Since there are only 2 heavy and 2-3 light
> > physical clients, is 16 overkill?
>
> Probably not.
I've never seen a situation where fewer NFS threads were useful, but I
haven't worked much with Linux as a server.
>
> You need more than 1 nfsd per client, because (assuming you don't have
> the sync *mount* option on the clients) the clients will be issuing
> multiple (up to 16 for 2.4.x) rpc calls in parallel each. The number
> that is useful is limited (at least on my machines) by the nfsd's doing
> time consuming memcpy()s with the BKL; absent this even more nfsds
> would be useful.
Bumping the number of threads to 32 or 64 or more would be the first
thing I would try.
Ian
On Wed, 2004-03-03 at 20:39, Ian Kent wrote:
> Bumping the number of threads to 32 or 64 or more would be the first
> thing I would try.
Only problem is that if the spike occurs anyway, the load will be 33 or
65 instead of 17. I don't know if it's ctxswitching itself to death
while all these nfsd's are blocking, but if so, doubling or quadrupling
the number of blocked processes will just make the spike worse.
Also, I've learned to be extremely careful about messing with nfsd while
clients are connected. I've had cases where just running exportfs
*seems* to have permanently killed the client's session, yet the client
of course still has the filesystem mounted. This required a complete
*hard* kill of every UML instance on the client, which did a significant
amount of damage to the filesystems that were mounted at the time.
- Omega
aka Erik Walthinsen
[email protected]
Erik Walthinsen wrote:
>
> On Wed, 2004-03-03 at 20:39, Ian Kent wrote:
> > Bumping the number of threads to 32 or 64 or more would be the first
> > thing I would try.
>
> Only problem is that if the spike occurs anyway, the load will be 33 or
> 65 instead of 17. I don't know if it's ctxswitching itself to death
> while all these nfsd's are blocking,
If it's the same as I've seen, an nfsd will be spinning in schedule()
trying to reacquire the BKL. Changing the number of nfsds is not
going to affect the amount of CPU wasted (i.e. pretty much all of
it) just the number of runnable processes.
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
On Wed, 3 Mar 2004, Erik Walthinsen wrote:
> On Wed, 2004-03-03 at 20:39, Ian Kent wrote:
> > Bumping the number of threads to 32 or 64 or more would be the first
> > thing I would try.
>
> Only problem is that if the spike occurs anyway, the load will be 33 or
> 65 instead of 17. I don't know if it's ctxswitching itself to death
> while all these nfsd's are blocking, but if so, doubling or quadrupling
> the number of blocked processes will just make the spike worse.
I'm assuming there isn't a problem with the NFS implementation, of course.
Nevertheless, you generally need many more than 16 threads for a busy
NFS server.
>
> Also, I've learned to be extremely careful about messing with nfsd while
> clients are connected. I've had cases where just running exportfs
I wouldn't change this on the fly either. I'm suggesting this type
of change be made during scheduled downtime.
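(The thread count is just the argument to rpc.nfsd, so at that downtime
something like

    rpc.nfsd 32

would do it; most distributions also read the value from a variable such
as RPCNFSDCOUNT in the nfs init script, which is the place to set it if
it should survive a reboot.)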
Ian