2002-04-21 13:09:13

by Gavin Woodhatch

Subject: nfs performance: read only/gigE/nolock/1Tb per day

Hi Jason,

I have seen (as posted before) our Linux NFS boxes receiving 9 - 10
MByte/s over a 100 Mbit network while I was doing some testing.

It has to be noted that this was a sequential read from a 500 MB file.
When reading the "real" data, I hit about 1 - 2 MByte/s.

In my setup, I have not seen a great speed increase from using TCP.
I also don't know how good the Linux NFS server is at that; I am only
using the client, against a dedicated NAS server.

The block sizes with NFSv2 range from 1024 to 8192 bytes. With NFSv3 the
maximum is 32768. The block size depends on the NFS version, not on the
TCP or UDP transport. I am using a stock 2.4.17 kernel with Trond's
NFS-ALL patch.
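For example (hostname and paths are placeholders, and the NAS has to
support NFSv3), a mount requesting the larger block size would look
roughly like this:

  mount -t nfs -o ro,nfsvers=3,rsize=32768,wsize=32768,hard,intr \
      nas:/export /mnt/data

  # verify what the client actually negotiated
  grep nfs /proc/mounts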


Kind Regards

Gavin Woodhatch

NetZone Ltd.


> is it possible to change NFS mount size from 1024 to 8192 (especially with GigE).
> i have tried this and was seeing slowdowns in nfs access, so reverted back to
> 1024 block size.

> the NFS clients mount the filesystems with:
> ro,rsize=1024,nolock,hard,bg,intr,retrans=8,nfsvers=3,timeo=10




2002-04-22 14:50:11

by Lever, Charles

Subject: RE: nfs performance: read only/gigE/nolock/1Tb per day

i'm looking at a similar problem (1K rsize works, but 8K rsize
doesn't behave under load; only a server reboot will fix the
problem). the environment is also a web server running an
NFS client, but the back-end is a NetApp filer. the NFS traffic
goes over a private switched 100MB network.

try with NFSv3 and TCP. my guess is you have a network problem
of some kind that causes packet loss. this triggers the UDP
timeout/recovery mechanism which will slow you down and maybe
even get the server and client out of sync with each other.
you might also check your GbE settings -- flow control should
be enabled, and make sure both ends of all your links have
identically configured autonegotiation parameters.
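as a concrete (hypothetical) starting point, a v3-over-TCP mount plus a
quick link sanity check might look like this -- what ethtool/mii-tool can
report depends on your NIC driver:

  mount -t nfs -o ro,nfsvers=3,tcp,rsize=8192,wsize=8192,hard,intr \
      server:/export /mnt/data

  # confirm speed/duplex/autonegotiation on each end of every link
  mii-tool -v eth1
  ethtool eth1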

(trond- losing sync may be a client problem since it appears to
happen with different server implementations. what can we do
to get better information about this?)

also, jason, can you post the output of "nfsstat -c" ?

if your network is behaving, r/wsize=8K and jumbo packets over
GbE should work well, as long as you have the CPU power on both
client and server to handle the interrupt load. before trying
this, though, you should ensure that your network is healthy
with regular frame size.
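if you do try jumbo frames, they have to be enabled end to end (client
NIC, switch, and server). a sketch, assuming the driver supports a 9000
byte MTU:

  ifconfig eth1 mtu 9000

  # check that large unfragmented packets actually get through the switch
  ping -s 8000 server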

> I've been trying to sort out NFS issues for the last 12 months due to
> the increase in traffic we have at our opensource archive
> (planetmirror.com).
>
> We started with a RedHat 7.0 deployment and the 2.2 kernel series and
> moved onto 2.4 to try to address some performance issues.
>
> We are now using a RedHat 7.2 deployment and have recently upgraded to
> the 2.4.18-0.22 kernel tree in an effort to deal with NFS lockups and
> performance issues.
>
> At peak we are pushing between 700-1000 gigabytes of traffic daily. I am
> not sure if that's at the upper boundaries of what NFS testing is done
> at or not.
>
> I don't believe there are any back-end bandwidth issues from the disk -
> there are two QLA2200 HBAs, each with 2 LUNs coming from a separate
> fibrechannel RAID server (PV650F) with 10 disks in each LUN (36G
> 10,000RPM fibrechannel).
>
> Testing has shown the ability to exceed 50 Mbyte/sec from the disk
> subsystem.
>
>
> Some questions/queries:
>
> o we have upgraded our backbone so that the server and all clients have
> gigE cards (previously the server had gigE and the clients had 100BT)
> into an unmanaged switch on a private NFS backbone (i.e. separate
> physical interface for nfs exports/client mounts from the "outbound"
> application interface)
>
> is there any benefit in jumbo packets and setting the MTU to 9000?
>
> o we have periodic lockups - these were pretty bad with 2.4.9 or older,
> with a lockup almost twice a day. restarting the NFS subsystem made no
> difference and only a reboot of the server would clear it.
>
> we have been able to reproduce this with the 2.4.9-31 kernel though it
> is much rarer (once every 2-3 days).
>
> in an effort to avoid this, we've upgraded to the 2.4.18-0.22 redhat
> rawhide kernel and i will monitor it over the next few days to see how
> it goes.
>
> o we are using read-only nfs - are there any optimizations or other
> tweaks that can be done knowing our front end boxes only mount
> filesystems as read only?
>
> i have already turned off NFS locking on the server and client.
>
>
> o is it possible to change NFS mount size from 1024 to 8192 (especially
> with gigE)? i have tried this and was seeing slowdowns in nfs access,
> so reverted back to 1024 block size.
>
> o is there any benefit of nfs over TCP rather than UDP, when using a
> local gigE switch between server and clients? and any benefit in an
> increased block size (16K or 32K) if using tcp?
>
>
> o is there an easy way to work out what, if any, patches by Trond or
> Neil Brown are applied to redhat kernels? i'm having a hard time
> figuring out if i should be applying NFS_ALL patches to redhat rawhide
> trees. in particular, Neil has a patch that should make a significant
> performance improvement to SMP NFS servers which i'd like to see.
> trying to track stuff through bugzilla, the various changelogs and
> manually is proving difficult.
>
>
> o current config/performance.
>
> currently, top shows me that "system" uses about 70% of resources on
> both cpus, with the system around 30% idle. i have 256 nfsds running on
> the server. exports are ro with no_subtree_check.
>
> on the client, about 50-60% of cpu is spent in system, with average
> load around 10-25. at times it will spike to 100-200. the front end box
> is attempting to service > 1000 apache clients and > 250 ftp clients.
>
>
> the NFS server filesystems are mounted ext3 with:
> rw,async,auto,noexec,nosuid,nouser,noatime,nodev
>
> the NFS clients mount the filesystems with:
> ro,rsize=1024,nolock,hard,bg,intr,retrans=8,nfsvers=3,timeo=10
>
>
> cheers,
>
> -jason


2002-04-22 15:33:08

by Trond Myklebust

Subject: Re: nfs performance: read only/gigE/nolock/1Tb per day

>>>>> " " == Charles Lever <Lever> writes:

> i'm looking at a similar problem (1K rsize works, but 8K rsize
> doesn't behave under load; only a server reboot will fix the
> problem). the environment is also a web server running an NFS
> client, but the back-end is a NetApp filer. the NFS traffic
> goes over a private switched 100MB network.

> try with NFSv3 and TCP. my guess is you have a network problem
> of some kind that causes packet loss. this triggers the UDP
> timeout/recovery mechanism which will slow you down and maybe
> even get the server and client out of sync with each other.
> you might also check your GbE settings -- flow control should
> be enabled, and make sure both ends of all your links have
> identically configured autonegotiation parameters.

> (trond- losing sync may be a client problem since it appears to
> happen with different server implementations. what can we do
> to get better information about this?)

I'm not sure what you mean by this. There is no 'sync' with UDP: each
packet going down the wire is simply a complete UDP datagram or an IP
fragment of one.

What might perhaps be happening is that the cards are somehow getting
messed up due to data flooding. Have you tried playing around with
driver parameters such as 'max_interrupt_work', 'max_rx_desc' and/or
other interrupt-related variables? (see 'modinfo -p <module>' for the
list of supported parameters)
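For reference (parameter names and sensible values vary from driver to
driver, so treat these as placeholders), the check-and-set dance looks
something like:

  modinfo -p 3c59x                # list the parameters this driver accepts

  # then, in /etc/modules.conf, something along the lines of:
  options 3c59x max_interrupt_work=64

and reload the module (or reboot) for it to take effect.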

Cheers,
Trond


2002-04-22 16:26:43

by Andrew Ryan

Subject: RE: nfs performance: read only/gigE/nolock/1Tb per day

Using NFSv3/TCP (with Trond's patches!) is good advice, the performance is
generally better from my tests, and if UDP is hanging on you, trying TCP
can't seriously hurt. Note that with NFSv3/TCP, you may experience hangs
under load as well, as I did, unless you use the latest 2.4.19-pre kernel
with Trond's patches.

Jason, as to your earlier question about applying Trond's patches to RH
kernels, the short answer is that yes, you can get them to apply (at least
the last time I checked, which was the 2.4.9 RPM). But RH already includes
some NFS patches, so you'd need to remove those and put in Trond's. You
will need to be comfortable hacking up an RPM specfile and have some
patience and diligence to get the resulting kernel RPM to build, however.
And when you're done you won't have a strictly RH kernel, which won't be a
problem unless you pay for technical support and expect to ever get it. But
since RH seems to give very little attention to a stable, reliable NFS
client implementation in their kernels, if you're stuck using NFS on linux,
it may be your only choice.
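A rough sketch of the rebuild with the 7.2-era tooling (the package,
spec, and patch names below are only illustrative):

  rpm -ivh kernel-2.4.9-31.src.rpm
  cd /usr/src/redhat/SPECS
  # edit the kernel spec file: comment out Red Hat's NFS patches and add
  # Trond's NFS_ALL diff as a new PatchNN: / %patchNN pair
  rpm -ba --target i686 kernel-2.4.spec
  # the resulting packages land under /usr/src/redhat/RPMS/i686/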


andrew



2002-04-22 18:06:34

by Pedro M. Rodrigues

Subject: RE: nfs performance: read only/gigE/nolock/1Tb per day


Indeed. The NFS client part of RH kernels is really lacking. They
work pretty well at NFS serving though - well enough for me to have them
in several servers without complaints.


/Pedro

On 22 Apr 2002 at 9:23, Andrew Ryan wrote:

> technical support and expect to ever get it. But since RH seems to
> give very little attention to a stable, reliable NFS client
> implementation in their kernels, if you're stuck using NFS on linux,
> it may be your only choice.
>
>
> andrew
>
>



2002-04-22 18:52:35

by Bogdan Costescu

Subject: Re: nfs performance: read only/gigE/nolock/1Tb per day

On 22 Apr 2002, Trond Myklebust wrote:

> What might perhaps be happening is that the cards are somehow getting
> messed up due to data flooding. Have you tried playing around with
> driver parameters such as 'max_interrupt_work', 'max_rx_desc' and/or
> other interrupt-related variables? (see 'modinfo -p <module>' for the
> list of supported parameters)

In case of network problems, some more info can be obtained from
/proc/net/; e.g. /proc/net/dev can give some idea about low-level
(driver) problems, where the most interesting counter might be "Rx
overruns" (the computer can't process packets as fast as they arrive and
has to drop them as soon as the Rx ring becomes full - if your network
driver has a parameter like "max_rx_desc", it should be increased). I
don't know how to interpret all the data that is there, but either using
the source or asking the Linux network developers at
[email protected] might help.

"max_interrupt_work" should not be modified unless a message like "ethx:
Too much work in interrupt!" is logged by the kernel. In some cases,
increasing "max_interrupt_work" without also increasing the Rx ring size
would not help...
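A quick way to watch for this (the interface name is just an example) is
to compare the receive-side drop/overrun counters before and after a
burst of load:

  cat /proc/net/dev                 # per-interface packet, drop and fifo counters
  ifconfig eth0 | grep -i overruns  # RX errors/dropped/overruns/frame

If the overrun numbers climb while traffic is running, the receive side
is the bottleneck.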

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: [email protected]




2002-04-22 21:46:02

by Heflin, Roger A.

Subject: Re: nfs performance: read only/gigE/nolock/1Tb per day



> Date: Mon, 22 Apr 2002 20:52:23 +0200 (CEST)
> From: Bogdan Costescu <[email protected]>
> To: [email protected]
> cc: "Lever, Charles" <[email protected]>,
> "'jason andrade'" <[email protected]>
> Subject: Re: [NFS] nfs performance: read only/gigE/nolock/1Tb per day
>
> On 22 Apr 2002, Trond Myklebust wrote:
>
> > What might perhaps be happening is that the cards are somehow getting
> > messed up due to data flooding. Have you tried playing around with
> > driver parameters such as 'max_interrupt_work', 'max_rx_desc' and/or
> > other interrupt-related variables? (see 'modinfo -p <module>' for the
> > list of supported parameters)
>
> In case of network problems, some more info can be obtained from
> /proc/net/; e.g. /proc/net/dev can give some idea about low-level
> (driver) problems, where the most interesting counter might be "Rx
> overruns" (the computer can't process packets as fast as they arrive and
> has to drop them as soon as the Rx ring becomes full - if your network
> driver has a parameter like "max_rx_desc", it should be increased). I
> don't know how to interpret all the data that is there, but either using
> the source or asking the Linux network developers at
> [email protected] might help.
>
> "max_interrupt_work" should not be modified unless a message like "ethx:
> Too much work in interrupt!" is logged by the kernel. In some cases,
> increasing "max_interrupt_work" without also increasing the Rx ring size
> would not help.
>
I would suggest using "netstat -s", as the output is a bit easier to read
and most of the counters you need to watch (maybe all of them) are there.
Errors, timeouts, invalids, and fails rising too quickly are signs of
problems with the underlying network: packets are getting misplaced. I
have found that if you lose even a small percentage of the packets you
take a large speed hit, and it is quite a bit worse with UDP than with
TCP; the larger the UDP packet size, the worse it gets, because the
entire datagram has to be retransmitted with UDP.
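One way to see which counters are actually moving under load (the file
paths here are arbitrary):

  netstat -s > /tmp/ns.before
  sleep 60
  netstat -s > /tmp/ns.after
  diff /tmp/ns.before /tmp/ns.after | egrep -i 'error|timeout|retrans|fail'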

Roger


2002-04-23 10:40:01

by Trond Myklebust

Subject: Re: nfs performance: read only/gigE/nolock/1Tb per day

>>>>> " " == Bogdan Costescu <[email protected]> writes:

> "max_interrupt_work" should not be modified unless a message
> like "ethx: Too much work in interrupt!" is logged by the
> kernel. In some cases, increasing "max_interrupt_work" without
> also increasing the Rx ring size would not help...

So what would an avalanche of ICMP Time Exceeded messages usually
indicate as far as the driver/card is concerned?

At the networking level, a single Time Exceeded message means that some
fragment(s) got dropped and/or lost, so some datagram never got
reassembled within /proc/sys/net/ipv4/ipfrag_time seconds
(as per RFC 1122).
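For reference, the relevant knobs on the reassembly side (the values in
parentheses are the usual 2.4 defaults; treat this as a sketch, not a
tuning recipe):

  cat /proc/sys/net/ipv4/ipfrag_time         # reassembly timeout in seconds (30)
  cat /proc/sys/net/ipv4/ipfrag_high_thresh  # memory ceiling for the fragment queue, bytes (262144)
  cat /proc/sys/net/ipv4/ipfrag_low_thresh   # level the queue is pruned back to (196608)

  # a client doing lots of 32K reads over UDP may want more reassembly memory:
  echo 1048576 > /proc/sys/net/ipv4/ipfrag_high_thresh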

In the avalanching case that I've sometimes observed, it looks as if
*no* datagrams are getting rebuilt.
IOW: the client is just sitting there sending off ICMP messages, and
never reading the reply. Changing the card/driver did not help in the
cases I observed, but shutting down the network and then bringing it up
again sometimes did. Any suggestions?

Cheers,
Trond


2002-04-23 15:14:41

by Bogdan Costescu

Subject: Re: nfs performance: read only/gigE/nolock/1Tb per day

On 23 Apr 2002, Trond Myklebust wrote:

> So what would an avalanche of ICMP Time Exceeded messages usually
> indicate as far as the driver/card is concerned?

Many things and nothing 8-) As you say, this message is issued when the
datagram couldn't be reassembled. There can be many low-level
(driver/card/switch) reasons why a packet doesn't make it to the
destination in time; these are the ones I can think of:

1. the server can't send the packet
1.1 it's slower in producing packets than the NIC can handle -> Tx underrun
usually associated with bus (PCI) problems.
1.2 it produces too many packets (usually small ones and for datagram
protocols) and the NIC can't send them as fast -> Tx queue full,
in extreme cases (5 seconds in most drivers in 2.4 kernels) a Tx
timeout occurs.
1.3 (actually could be included in the previous one) the NIC can't send
packets because of network congestion, usually happens on
half-duplex links (and mostly with hubs) because of collisions ->
Tx queue full, then maybe Tx timeout. Some cards/drivers can
continue to try sending the packet indefinitely, some can just
drop the packet, and some stall the transmission path after some number
of collisions, so resetting it can take some time.
1.4 link speed mismatch between NIC and hub/switch -> packets are
randomly dropped, there are frame errors, etc.
1.5 the server has interrupt problems (APIC errors) and Tx interrupts
can be missed, such that the Tx queue is not emptied in time
(with interrupt mitigation)-> Tx timeout.
2. the hub/switch doesn't send the packet
2.1 dual speed hub/switches have to buffer the packet(s) coming from
the fast ports and send them with lower speeds; in some cases this
buffer can be filled and packets are dropped.
2.2 switches that have to deal with oversized (Jumbo) frames and split
them in normal (max. 1500 bytes payload) packets. Depending on how
well the splitting is handled (usually directly proportional
with how much the switch costs), packets can be dropped.
2.3 switches under broadcast storms act just like hubs, packets can be
dropped.
3. the client can't receive the packet
3.1 the client is too loaded or there are bus (PCI) problems and
the CPU cannot process packets as fast as they arrive -> Rx
overruns. As soon as the Rx ring is full, packets are dropped by
the NIC. If this happens only occasionally, a larger Rx ring helps
absorb the peaks.
3.2 the client has interrupt problems (APIC errors) and it uses
Rx interrupt mitigation, such that a missed interrupt doesn't
start the processing of the packets. It's less likely than the Tx
interrupt mitigation case, because there is usually also a timer
based interrupt (in 2.4 only hardware support, in 2.5 also
software support from NAPI).
3.3 the client has interrupt problems which manifest as some device
(other than the NIC) keeping interrupts disabled for too long (IDE
is one such example). The NIC generates the interrupt, but the
driver receives it with delay, such that the Rx ring can be
already full and Rx overruns occur. This situation is usually
associated with the "Too much work in interrupt" message, as the
driver has to process the Rx ring plus maybe some Tx interrupts,
media related interrupts, statistics interrupts, etc. (although
usually the Rx processing produces the highest number of loops,
that's why I included it here and not on the server/Tx side).
3.4 link speed mismatch between NIC and hub/switch (see 1.4)
4. different fragments take different times to travel
4.1 a router/switch with higher layer processing somewhere in the middle
might delay/drop packets
4.2 even for computers connected to the same switch, it might happen
with channel bonding

Of course, the roles of server and client are here depicted only as
transmitter and receiver respectively. In a bidirectional protocol, the
roles alternate.
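A few quick checks for the most common of the cases above (all of them
driver-dependent, so take this as a sketch):

  mii-tool -v eth0                  # negotiated speed/duplex; works for most 10/100 MII PHYs
  dmesg | grep -i 'APIC error'      # interrupt problems (cases 1.5 / 3.2 / 3.3)
  dmesg | grep -i 'too much work'   # driver hitting its interrupt work limit
  netstat -i                        # per-interface RX-ERR / RX-DRP / RX-OVR counters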

Again I have to state the obvious: the above situations can happen alone
or in combination. When they are combined it is much harder to cure all
of them, so some people simply say "it just doesn't work" or give up too
soon (e.g. "I fixed the link speed autonegotiation problem, but I still
get dropped packets", where the remaining drops may be due to
congestion).

> In the avalanching case that I've sometimes observed, then it looks as
> if *no* datagrams are getting rebuilt.

How big are the datagrams compared with the MTU? With 32K datagrams over
Ethernet, you're talking about roughly a full Rx ring worth of packets
(32 is a common Rx ring size)...
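The arithmetic, roughly: with a 1500-byte MTU each fragment carries at
most 1480 bytes of payload, so

  echo $(( (32768 + 8 + 1479) / 1480 ))   # ~23 fragments per 32K NFS-over-UDP datagram

i.e. a single datagram arriving back to back can nearly fill a 32-entry
Rx ring on its own.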

> IOW: the client is just sitting there sending off ICMP messages, and
> never reading the reply.

Does the other side see these messages? If so, are there any response
messages sent out (which don't make it back to the client)?
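A capture on both ends would show this directly; something along these
lines (interface and host names are placeholders):

  tcpdump -n -i eth0 'icmp or host nfsserver'

run on client and server at the same time should show whether the ICMP
Time Exceeded messages arrive, and whether retransmitted fragments go
out in response.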

> Changing card/driver did not help in the
> cases I observed, but shutting down the network, and then bringing it
> up again sometimes did. Any suggestions?

Was the down/up on the sending or on the receiving/reassembling side?
Shutting down an interface should clear all buffers/queues associated
with it, so a restart gets a "clean" state. For reassembly, it probably
means dropping all incomplete datagrams, but I'm not 100% sure; it may
get more complicated when packets can take different paths between
sender and receiver.

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: [email protected]







2002-04-23 16:36:44

by Trond Myklebust

Subject: Re: nfs performance: read only/gigE/nolock/1Tb per day

>>>>> " " == Bogdan Costescu <[email protected]> writes:

> How big are the datagrams compared with the MTU ? With 32K
> datagrams over Ethernet, you're talking about roughly a full Rx
> ring worth of packets (32 is common for the Rx ring size)...

It was a while ago (I've since mothballed the machine), but I saw it on
a Pentium 90 with only 8k write sizes: 4k was fine, 8k gave avalanches.

>> IOW: the client is just sitting there sending off ICMP
>> messages, and never reading the reply.

> Does the other side sees these messages ? If so, are there any
> response messages sent out (but which don't make it back to the
> client) ?

IIRC, yes, and the server was resending the datagrams. From the code, it
looks as if there is no attempt to stop loopback situations from
occurring when this goes on:
i.e. it appears possible for the client to resend an ICMP each time the
server resends a datagram that times out again. This might be what was
happening...

> Down/up was on the sending or receiving/reassembling side ?

Down/up on the receiving/reassembling side.

Cheers,
Trond


2002-04-23 18:16:15

by Bogdan Costescu

Subject: Re: nfs performance: read only/gigE/nolock/1Tb per day


[ cc-ed to netdev; the discussion was about receiving bursts of ICMP Time
Exceeded messages after some large NFS datagrams could not be reassembled;
sometimes down/up the interface on the receiver/reassembly side cures it ]

On 23 Apr 2002, Trond Myklebust wrote:

> > How big are the datagrams compared with the MTU ? With 32K
> > datagrams over Ethernet, you're talking about roughly a full Rx
> > ring worth of packets (32 is common for the Rx ring size)...
>
> It has been a while ago (I've since mothballed the machine) but I saw
> it on a Pentium 90 with only 8k write sizes. 4k was fine, 8k gave
> avalanches.

IMHO you can't completely eliminate hardware-related problems: apart
from having a slow CPU, some early PCI implementations were buggy
(although you don't say whether it's PCI or ISA, or what the link speed
is).

> > Does the other side sees these messages ?
>
> IIRC, yes, and the server was resending the datagrams. From the code,
> it looks as if there is no attempt to stop loopback situations
> occurring when this goes on:
> i.e. resending an ICMP when the server resends a datagram which times
> out again appears to be possible. This might be what was happening...

That's why I cc-ed netdev. My knowledge above the driver level is close
to non-existent...

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: [email protected]



