2002-06-27 20:05:33

by Poul Petersen

Subject: NFSD seems to stall

Our primary NFS server handles several file-systems that are used
24 hours a day to build software. We have been seeing a problem in which,
at least once a night, different hosts fail their builds with strange
errors - usually some header file is reported missing (the compilers are
all installed on NFS). On the advice of a previous post here and section
5.4 of the NFS FAQ, we increased the rec-q for nfsd to 1MB. I then began
monitoring the "rec-q" (netstat -anu), as well as /proc/net/rpc/nfsd,
system load (sar -q) and the ethernet traffic (sar -n DEV) at 5-second
intervals. Last night, with the network moving its usual fairly consistent
4MB/s and the system load at about 2, the rec-q suddenly shot up to its
maximum (1048572). Rx traffic fell to about 100KB/s and /proc/net/rpc/nfsd
showed no activity. System load began falling. This state continued for
about 40 seconds, at which point the rec-q dropped abruptly to zero and
/proc/net/rpc/nfsd showed an unreasonable amount of activity for a
five-second interval (the far-left column is seconds since the epoch):

1025147631:th 8 110295 4567.090 367.210 101.080 0.000 88.970 33.010 23.320 12.600 0.000 150.380
1025147636:th 8 110662 4568.240 367.410 101.080 0.000 88.970 33.010 23.320 12.600 0.000 194.300

Since those data points are 5 seconds apart, the jump in the 10th
histogram column (150.380 to 194.300, i.e. +43.92) would indicate that
during that 5-second interval all 8 nfsd threads were busy for roughly 44
seconds - clearly /proc/net/rpc/nfsd was not being updated during the
prior 40-second stall, and the accumulated time was flushed in all at once.
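For anyone who wants to repeat the arithmetic, here is a small sketch
(it assumes the "th" line layout I believe knfsd uses: thread count, a
count of how many times all threads were needed, then ten cumulative
busy-seconds buckets covering 0-10% up to 90-100% of threads busy):

    #!/usr/bin/env python
    # diff_th.py - diff two timestamped "th" samples from /proc/net/rpc/nfsd
    def parse_th(sample):
        stamp, rest = sample.split(":", 1)
        fields = rest.split()  # ["th", threads, all-busy count, 10 buckets]
        return (int(stamp), int(fields[1]), int(fields[2]),
                [float(x) for x in fields[3:]])

    def report(a, b):
        t1, n1, full1, hist1 = parse_th(a)
        t2, n2, full2, hist2 = parse_th(b)
        print("%d wall seconds elapsed, all-threads-needed count +%d"
              % (t2 - t1, full2 - full1))
        for i in range(10):
            print("bucket %d (%d-%d%% of threads busy): +%.3f s"
                  % (i, i * 10, (i + 1) * 10, hist2[i] - hist1[i]))

    s1 = ("1025147631:th 8 110295 4567.090 367.210 101.080 0.000 "
          "88.970 33.010 23.320 12.600 0.000 150.380")
    s2 = ("1025147636:th 8 110662 4568.240 367.410 101.080 0.000 "
          "88.970 33.010 23.320 12.600 0.000 194.300")
    report(s1, s2)

Run on the two samples above, the last bucket comes out at +43.920
seconds of all-threads-busy time inside a 5-second window.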

About 10 seconds later, the Rx traffic spiked up to 20-30MB/s and the
system load shot up to 4 for 15 seconds, then everything settled back to
4MB/s and a load of 2. There were some smaller spikes in the rec-q (400k)
and in the /proc/net/rpc/nfsd data during those 15 seconds.

And indeed we lost some builds during this time period - one in
particular gave the error:

Error 5: "/package/1/compilers/hp/aCC333/opt/aCC/include_std/rw/rwlocale",
line 745 # Unable to open file
/package/1/compilers/hp/aCC333/opt/aCC/include_std/rw/vendor; Connection
timed out (238).

The server did not report any errors during this time period
(nothing in /var/log/messages or dmesg). Based on this information, it
would seem that for some reason the nfs daemons stopped responding. I'm
presuming that the Rx traffic subsided because the clients were waiting
for some kind of NFS reply (the server is primarily an NFS server, but it
should be noted that the network port was still active, just not for NFS).
After 40 seconds, the nfs daemons came back, processed the queue, and
everything went back to normal - except that some of our builds had
already failed due to the temporary unresponsiveness. What would cause
this? Are there other things I could monitor to help isolate the problem?
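For reference, the 5-second sampler behind the numbers above amounts to
something like the following - a reconstruction rather than the literal
script, and it assumes the server's NFS socket is UDP port 2049 and that
Recv-Q is the second column of netstat -anu:

    #!/usr/bin/env python
    # sample.py - log the nfsd "th" line and the NFS socket's Recv-Q
    import os, time

    while 1:
        now = int(time.time())
        for line in open("/proc/net/rpc/nfsd").readlines():
            if line.startswith("th"):
                print("%d:%s" % (now, line.strip()))
        for line in os.popen("netstat -anu").readlines():
            fields = line.split()
            if len(fields) > 3 and fields[3].endswith(":2049"):
                print("%d:recq %s" % (now, fields[1]))
        time.sleep(5)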

I suppose some hardware specifics would help: the NFS server is a
Dell 2550, dual PIII-933, running Red Hat 7.2 with kernel 2.4.18 and LVM
1.0.4. The network card is an Intel PRO/1000 (GigE Fiber) running the Intel
version 4.0.7 e1000 driver. There is also a Qlogic qla2200 card running the
Qlogic 6.0b20 driver which is connected to a Zzyzx RocketStor 2000 RAID
device with 2.2TB of disk space. That space is served up using LVM 1.0.4 and
ext3, though we saw the problem back in our XFS days as well. We are running
nfs-utils-0.3.3-1. Other than the LVM 1.0.4 patch and a pvmove patch, the
2.4.18 kernel is stock.

Ah, one other thing - the network card reports errors in
/proc/net/PRO_LAN_Adapters, and the only non-zero error counters are:

Rx_Errors 547
Rx_FIFO_Errors 547
Rx_Missed_Errors 547

I also noticed this description in the Intel driver documentation:

RxIntDelay
Valid Range: 0-65535 (0=off)
Default Value: 64
This value delays the generation of receive interrupts in units of 1.024
microseconds. Receive interrupt reduction can improve CPU efficiency
if properly tuned for specific network traffic. Increasing this value
adds extra latency to frame reception and can end up decreasing the
throughput of TCP traffic. If the system is reporting dropped receives,
this value may be set too high, causing the driver to run out of
available receive descriptors.
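If it does turn out to be relevant, my understanding is that this knob
is set as a module option at load time - something like the line below in
/etc/modules.conf (the syntax is my reading of the driver README, not
something I have verified; RxIntDelay=0 disables the delay entirely):

    options e1000 RxIntDelay=0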

Now this would seem more suspicious, but we are not running NFS over
TCP, and the note appears to apply to "TCP traffic". Also, if this were
the cause, why would nfsd stall while the rec-q was full? I suppose part
of my confusion stems from not really understanding what the rec-q is and
who manages it (who fills it, and how is the program attached to the
socket notified that data is ready?).
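As far as I can tell (and the toy demonstration below is my own, nothing
NFS-specific), the kernel appends each arriving datagram to the socket's
receive buffer, the Recv-Q column in netstat is just the bytes sitting
there unread, and the queue drains only when the owning process reads the
socket (a blocking read or select() wakes it when data is ready). Under
that model, a rec-q pinned at its 1MB limit would mean the daemons had
stopped reading the socket, not that the network had stopped delivering:

    #!/usr/bin/env python
    # recvq_demo.py - watch Recv-Q grow while a UDP reader is stalled
    import socket, time

    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1048576)
    rx.bind(("127.0.0.1", 12345))

    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for i in range(100):
        tx.sendto(b"x" * 1024, ("127.0.0.1", 12345))  # kernel queues these

    time.sleep(30)  # reader "stalled": netstat -anu shows Recv-Q ~100K
    for i in range(100):
        rx.recvfrom(2048)  # reader wakes up: Recv-Q drains back to zero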

Many thanks for any insight,

-poul




2002-06-27 21:09:51

by NeilBrown

Subject: Re: NFSD seems to stall

On Thursday June 27, [email protected] wrote:
>
> Error 5: "/package/1/compilers/hp/aCC333/opt/aCC/include_std/rw/rwlocale",
> line 745 # Unable to open file
> /package/1/compilers/hp/aCC333/opt/aCC/include_std/rw/vendor; Connection
> timed out (238).

"Connection timed out" messages shouldn't be propagated up to
userspace by the NFS client.... unless you are using "soft" mounts.
You aren't doing that are you? Please say you aren't
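For the record: "soft" hands errors like ETIMEDOUT straight to the
application once its retransmissions run out, while "hard" blocks and
retries until the server responds. In fstab terms the difference is
roughly this (server name and paths made up for illustration):

    # soft mount - server stalls can surface as errors and fail builds
    server:/package  /package  nfs  soft,timeo=7,retrans=3  0 0
    # hard mount - clients block through a stall and then carry on
    server:/package  /package  nfs  hard,intr  0 0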

> That space is served up using LVM 1.0.4 and
> ext3, though we saw the problem back in our XFS days as well.


I have seen exactly these symptoms caused by ext3 bugs. The bugs
have been present for a while, but seem to manifest more readily in 2.4.18.

The patch Andrew Morton posted at
http://www.redhat.com/mailing-lists/ext3-users/msg03635.html

should fix it for you. Maybe XFS has a similar problem... maybe it
was a totally different problem that time.

NeilBrown

