2001-04-03 18:57:31

by Caleb Epstein

Subject: NFS client code slow in 2.4.3


I am having problems with timeouts and general throughput in
the 2.4.3 NFS client side code which are not present in the
2.4.2 kernel running in the same configuration on the same
hardware. The machines are on a 100 Mbit switched local
network with essentially no other traffic.

In both cases I am testing against a 2.4.3 NFS server (using
knfsd). My tests involve using "dd" to read a large file on
an NFS mounted directory and running the "connectathon" NFS
test suite.

When I boot my client machine with 2.4.3, reading a 327 Mbyte
file over NFS takes on the order of 5-6 minutes to complete.
If I run the same command with the client running kernel
2.4.2, the command completes in about 1 minute.
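
For comparison with the "dd" timings above, a minimal sketch of the same
measurement: read a file sequentially and report the throughput. The
function names and the block size are illustrative assumptions, not part
of the original test. The last lines only redo the arithmetic for the
figures quoted above (327 Mbyte in 5-6 minutes versus about 1 minute).

```python
import time


def read_throughput(path, block_size=1024 * 1024):
    """Read `path` sequentially; return (bytes_read, seconds, MB_per_s).

    Hypothetical helper: block_size is an arbitrary choice, not the
    value used with dd in the original test.
    """
    total = 0
    start = time.monotonic()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.monotonic() - start
    rate = total / elapsed / 1e6 if elapsed > 0 else float("inf")
    return total, elapsed, rate


def mbytes_per_second(size_mbytes, seconds):
    """Average throughput for a transfer of size_mbytes in `seconds`."""
    return size_mbytes / seconds


# Figures from the report: 327 Mbyte in 5-6 minutes vs. about 1 minute.
for label, seconds in (("2.4.3, 5 min", 300), ("2.4.3, 6 min", 360),
                       ("2.4.2, 1 min", 60)):
    print(label, round(mbytes_per_second(327, seconds), 2), "Mbyte/s")
```

When repeating a read test like this, note that a second run on the same
client may be served partly from the page cache rather than over NFS.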

Running the "cthon01" test suite, the 2.4.3 client machine
basically hangs in the "read + write" test section and I
didn't bother waiting for it to finish. Again, when switching
back to 2.4.2, the client runs through the tests quite
quickly.

From my tests I'm pretty convinced that something in either
the NFS client code or the networking layer has changed which
has drastically reduced NFS client speeds in 2.4.3.

Is this a known problem? Can I provide any additional
information to help debug it?

--
cae at bklyn dot org | Caleb Epstein | bklyn . org | Brooklyn Dust Bunny Mfg.


2001-04-03 19:10:06

by Caleb Epstein

Subject: Re: NFS client code slow in 2.4.3

On Tue, Apr 03, 2001 at 02:56:15PM -0400, Caleb Epstein wrote:

> I am having problems with timeouts and general throughput in
> the 2.4.3 NFS client side code which are not present in the 2.4.2
> kernel running in the same configuration on the same hardware. The
> machines are on a 100 Mbit switched local network with essentially
> no other traffic.

On second thought, it looks like 2.4.2 may also exhibit the
same behavior after a little while. Now that the machine has
been up for a half hour or so, NFS traffic has become slow on
my 2.4.2 client again. I am seeing messages like this in my
kernel log:

Apr 3 15:01:54 hagrid kernel: nfs: server tela not responding, still trying
Apr 3 15:01:54 hagrid kernel: nfs: server tela OK

The machines are *not* having any connectivity problems, at
least judging from TCP sessions I have open between them.

So it would seem that NFS performance degrades over a very
short window in 2.4.2+. It seems to fairly fly when the
machine is freshly booted, but after 30 minutes or less, the
performance is severely degraded.

Is anyone using 2.4.2+ as a NFS server/client with success?
Am I missing something?

--
cae at bklyn dot org | Caleb Epstein | bklyn . org | Brooklyn Dust Bunny Mfg.

2001-04-03 20:01:54

by Trond Myklebust

Subject: Re: NFS client code slow in 2.4.3

>>>>> " " == Caleb Epstein <[email protected]> writes:

> On Tue, Apr 03, 2001 at 02:56:15PM -0400, Caleb Epstein wrote:
>> I am having problems with timeouts and general throughput in
>> the 2.4.3 NFS client side code which are not present in the
>> 2.4.2 kernel running in the same configuration on the same
>> hardware. The machines are on a 100 Mbit switched local
>> network with essentially no other traffic.

> On second thought, it looks like 2.4.2 may also exhibit the
> same behavior after a little while. Now that the machine has
> been up for a half hour or so, NFS traffic has become slow on
> my 2.4.2 client again. I am seeing messages like this in my
> kernel log:

> Apr 3 15:01:54 hagrid kernel: nfs: server tela not responding, still trying
> Apr 3 15:01:54 hagrid kernel: nfs: server tela OK

The above is a generic message that simply is stating that NFS traffic
is congested because the server isn't responding for whatever reason.

In 99% of all cases, this means that the server is not seeing all the
packets that the client is sending it. This forces the client to
throttle back the number of requests it can have in flight, and then
to wait until the given packet times out, and then to resend.

Try checking whether or not the server is seeing all the packets that
the client is sending by comparing the output of tcpdump/ethereal
between the client and the server.
If the packet loss is large, try fiddling with the hardware: typically
stuff such as overriding the NIC autoconfiguration, swapping the NIC,
checking for noisy cables,...
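
One way to do the comparison suggested above, as a rough sketch: assume
each capture has already been reduced to a list of RPC transaction IDs
(XIDs), one per line in a text file. That preprocessing step, the file
layout, and the function names are all assumptions for illustration.
Requests seen leaving the client but never arriving at the server are
the ones lost on the way.

```python
def load_xids(path):
    """Load one XID per line from a text file (hypothetical format)."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}


def compare_captures(client_xids, server_xids):
    """Return (missing, loss_fraction).

    `missing` is the set of XIDs the client sent that the server
    capture never shows; `loss_fraction` is their share of all
    distinct XIDs sent by the client.
    """
    client = set(client_xids)
    server = set(server_xids)
    missing = client - server
    loss = len(missing) / len(client) if client else 0.0
    return missing, loss


# Toy data standing in for two real captures.
sent = ["0x1a", "0x1b", "0x1c", "0x1d"]
seen = ["0x1a", "0x1c"]
missing, loss = compare_captures(sent, seen)
print(sorted(missing), loss)
```

Because retransmissions reuse the XID, a set comparison like this only
flags requests that never arrived at all; counting duplicates on the
client side would be needed to see how often a resend was required.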

If you're unable to trace the problem, try playing around with
rsize/wsize, timeo and retrans (man 5 nfs). The smaller the packet,
the lower the chance of UDP fragments getting lost.
You might also want to try out the NFS ping patch from
http://www.fys.uio.no/~trondmy/src/2.4.2
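
To make the point about rsize/wsize and UDP fragments concrete, here is
a back-of-the-envelope sketch. It assumes a 1500-byte Ethernet MTU, a
20-byte IP header with no options, roughly 150 bytes of RPC/NFS header
per request (a guess, not a measured value), and independent loss of
each fragment. Since a UDP datagram is only delivered if every one of
its IP fragments arrives, the chance of losing the whole datagram grows
with the fragment count.

```python
import math

ETHERNET_MTU = 1500   # assumed link MTU in bytes
IP_HEADER = 20        # IPv4 header without options
UDP_HEADER = 8
RPC_OVERHEAD = 150    # assumption: rough RPC + NFS header size


def fragments_per_datagram(data_size, mtu=ETHERNET_MTU,
                           rpc_overhead=RPC_OVERHEAD):
    """Number of IP fragments needed for one NFS-over-UDP datagram."""
    # Fragment payloads are carried in multiples of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    datagram = UDP_HEADER + rpc_overhead + data_size
    return math.ceil(datagram / per_fragment)


def datagram_loss_probability(n_fragments, fragment_loss):
    """Chance that at least one of n independent fragments is lost."""
    return 1.0 - (1.0 - fragment_loss) ** n_fragments


# Compare a few rsize/wsize values at an assumed 1% fragment loss.
for size in (8192, 4096, 1024):
    n = fragments_per_datagram(size)
    p = datagram_loss_probability(n, 0.01)
    print(size, n, round(p, 4))
```

Under these assumptions a single-fragment request is lost about as often
as a single packet, while a multi-fragment one is lost several times as
often, and each such loss costs a full timeout before the resend.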

Cheers,
Trond