Hello
The NFS server on a PC with a 2.6.0 (release) kernel slows down to a crawl
or stops completely.
I searched the archives, but nothing matches closely enough.
The server (PC1) is a 900MHz Duron with 384M RAM and a tulip-driven 10/100
LinkSys card (Linksys Network Everywhere Fast Ethernet 10/100, model NC100,
rev 17).
Clients:
- PC2: a Pentium 133MHz with 24M RAM and an onboard Lance 79C970 10Mbps
network card;
- an SA1100 platform (Tuxscreen / Shannon) with 16M RAM and a PCMCIA Netgear
10/100Mbps NE2000-compatible card (pcnet_cs + 8390);
- a PXA250 platform (Inphinity / Triton starter-kit) with 64M RAM and an
onboard SMC91C11xFD (smc91x driver) 10/100 chip.
In the tests below I was copying a 4M file from an NFS-mounted directory to
a RAM-based fs (ramfs / tmpfs); a sketch of the test commands follows the
results. Here are the results:
server with 2.6.0 kernel:
fast:2.6.0-test11 2m21s (*)
fast:2.4.20 16.5s
SA1100:2.4 never finishes (*)
PXA:2.4.21-rmk1-pxa1 as above
PXA:2.6.0-rmk1-pxa as above
server: 2.4.21
fast:2.6.0-test11 6s
fast:2.4.20 5s
SA1100:2.4.19-rmk7 3.22s
PXA:2.4.21-rmk1-pxa1 7s
PXA:2.6.0-rmk2-pxa (***)   1) 50s (**)   2) 27s (**)
(*) Messages "NFS server not responding" / "NFS server OK", "mount version
older than kernel" on mount
(**) Messages "NFS server not responding" / "NFS server OK", "mount version
older than kernel" on mount; the traffic shows up as several peaks
(***) 2.6.0-rmk2-pxa corresponds to the 2.6.0-rmk2 kernel with a PXA-patch
forward-ported from diff-2.6.0-test2-rmk1-pxa1.
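(A sketch of the test itself, with the mount points and file name assumed:
the RAM-backed target was mounted with something like
mount -t tmpfs none /mnt/ram
and the timing was simply
time cp /nfs/testfile-4M /mnt/ram/
i.e. a plain cp from the NFS mount into ramfs/tmpfs.)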
I bought the LinkSys card recently; before that I used an older 3c59x card
only capable of 10Mbps. I never saw such problems with it, but then I
probably never tried NFS under 2.6.0 with it either - I have to try that too.
It is not just a problem of 2.6 with those specific network configurations
- ftp / http / tftp transfers work fine. E.g. wget of the same file on the
PXA with 2.6.0 from the PC1 with 2.4.21 over http takes about 2s. So, it
is 2.6 + NFS.
Is it fixed somewhere (2.6.1-rcx?), or what should I try / what further
information is required?
Thanks
Guennadi
---
Guennadi Liakhovetski
On Tue, 6 Jan 2004, Guennadi Liakhovetski wrote:
> server with 2.6.0 kernel:
>
> fast:2.6.0-test11 2m21s (*)
> fast:2.4.20 16.5s
> SA1100:2.4 never finishes (*)
> PXA:2.4.21-rmk1-pxa1 as above
> PXA:2.6.0-rmk1-pxa as above
>
> server: 2.4.21
>
> fast:2.6.0-test11 6s
> fast:2.4.20 5s
> SA1100:2.4.19-rmk7 3.22s
> PXA:2.4.21-rmk1-pxa1 7s
> PXA:2.6.0-rmk2-pxa 1) 50s (**)
> (***) 2) 27s (**)
s/fast/PC2/
Further, I tried the old 3c59x card - the same problems persist. I also tried
PC2 as the server - same. nfs-utils version is 1.0.6 (Debian Sarge). I sent a
copy of yesterday's email plus the new details to [email protected],
[email protected], [email protected].
It's strange that nobody else is seeing this problem, but it looks pretty bad
here. Unless I missed some necessary update somewhere? The only one from
Documentation/Changes that seemed relevant - nfs-utils on the server(s) - I
did check.
Thanks
Guennadi
---
Guennadi Liakhovetski
On Tue, Jan 06, 2004 at 01:46:30AM +0100, Guennadi Liakhovetski wrote:
> It is not just a problem of 2.6 with those specific network configurations
> - ftp / http / tftp transfers work fine. E.g. wget of the same file on the
> PXA with 2.6.0 from the PC1 with 2.4.21 over http takes about 2s. So, it
> is 2.6 + NFS.
>
> Is it fixed somewhere (2.6.1-rcx?), or what should I try / what further
> information is required?
You will probably need to look at some tcpdump output to debug the problem...
On Wed, 7 Jan 2004, Mike Fedyk wrote:
> On Tue, Jan 06, 2004 at 01:46:30AM +0100, Guennadi Liakhovetski wrote:
> > It is not just a problem of 2.6 with those specific network configurations
> > - ftp / http / tftp transfers work fine. E.g. wget of the same file on the
> > PXA with 2.6.0 from the PC1 with 2.4.21 over http takes about 2s. So, it
> > is 2.6 + NFS.
> >
> > Is it fixed somewhere (2.6.1-rcx?), or what should I try / what further
> > information is required?
>
> You will probably need to look at some tcpdump output to debug the problem...
Yep, I've just done that - well, they differ... The first obvious thing I
noticed is that 2.6 tries to read bigger blocks (32K instead of 8K), but
beyond that I cannot yet interpret what happens after the start of the
actual file read. 2.6 starts getting big delays immediately, even in cases
where the file is eventually transferred (the two PCs with 2.6). If someone
can get some information from the logs, I'll happily send them. The bz2
tarball is 50K, so not too bad for the list either, but it is not common
practice to send compressed attachments to the list, right? It's 5M
uncompressed.
Thanks
Guennadi
---
Guennadi Liakhovetski
On Wed, Jan 07, 2004 at 07:13:46PM +0100, Guennadi Liakhovetski wrote:
> On Wed, 7 Jan 2004, Mike Fedyk wrote:
>
> > On Tue, Jan 06, 2004 at 01:46:30AM +0100, Guennadi Liakhovetski wrote:
> > > It is not just a problem of 2.6 with those specific network configurations
> > > - ftp / http / tftp transfers work fine. E.g. wget of the same file on the
> > > PXA with 2.6.0 from the PC1 with 2.4.21 over http takes about 2s. So, it
> > > is 2.6 + NFS.
> > >
> > > Is it fixed somewhere (2.6.1-rcx?), or what should I try / what further
> > > information is required?
> >
> > You will probably need to look at some tcpdump output to debug the problem...
>
> Yep, just have done that - well, they differ... First obvious thing that I
> noticed is that 2.6 is trying to read bigger blocks (32K instead of 8K),
You mean it's trying to do 32K nfs block size on the wire?
> The bz2 tarball is 50k big, so, not too bad for the list either, but it is
> not a common practice to send compressed attachments to the list, right?
> It's 5M uncompressed.
Just post a few samples of the lines that differ. Any files should be sent
off-list.
On Wed, 7 Jan 2004, Mike Fedyk wrote:
> On Wed, Jan 07, 2004 at 07:13:46PM +0100, Guennadi Liakhovetski wrote:
> > noticed is that 2.6 is trying to read bigger blocks (32K instead of 8K),
>
> You mean it's trying to do 32K nfs block size on the wire?
Hmm, no, if I understand it correctly. The NFS client requests 32K of data
at a time, but that is sent in several fragments. Actually, the client is
the same (2.6 kernel), and it requests 32K or 8K depending on the kernel
version of the server...
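(Rough arithmetic, assuming the standard 1500-byte Ethernet MTU: a
32768-byte read reply plus its RPC/UDP headers is one UDP datagram of a bit
over 32K, which the server splits into 22 full IP fragments of 1480 payload
bytes each plus a final partial one - and every one of them has to arrive
for the datagram to be reassembled.)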
> Just post a few samples of the lines that differ. Any files should be sent
> off-list.
Well, I am afraid I won't be able to identify the important differing
packets. I ran
tcpdump -l -i eth0 -exX -vvv -s0
(-s0 captures full packets), so the log contains complete packet dumps. OK,
I'll try to quote just the headers. poirot is the server (PC1, 2.4 / 2.6),
fast is the client (PC2, 2.6). Following the first request for data (which
differs only in length):
2.6:
18:42:28.374430 0:80:5f:d2:53:f0 0:50:bf:a4:59:71 ip 162:
fast.grange.462443716 > poirot.grange.nfs: 120 read fh Unknown/1 32768
bytes @ 0x000008000 (DF) (ttl 64, id 15, len 148)
2.4:
18:48:57.794687 0:80:5f:d2:53:f0 0:50:bf:a4:59:71 ip 162:
fast.grange.1972393156 > poirot.grange.nfs: 120 read fh Unknown/1 8192
bytes @ 0x000002000 (DF) (ttl 64, id 6, len 148)
the server (PC1) sends the following packets:
2.6:
18:42:28.374554 0:50:bf:a4:59:71 0:80:5f:d2:53:f0 ip 1514:
poirot.grange.nfs > fast.grange.445666500: reply ok 1472 read REG 100644
ids 0/0 sz 0x00007a120 nlink 1 rdev 0/0 fsid 0x000000000 nodeid
0x000000000 a/m/ctime 1073497348.374212040 2477.000000 1064093242.000000
32768 bytes (frag 40553:1480@0+) (ttl 64, len 1500)
18:42:28.374560 0:50:bf:a4:59:71 0:80:5f:d2:53:f0 ip 1514: poirot.grange >
fast.grange: (frag 40553:1480@1480+) (ttl 64, len 1500)
2.4:
18:48:57.806270 0:50:bf:a4:59:71 0:80:5f:d2:53:f0 ip 962: poirot.grange >
fast.grange: (frag 39126:928@7400) (ttl 64, len 948)
18:48:57.806291 0:50:bf:a4:59:71 0:80:5f:d2:53:f0 ip 1514: poirot.grange >
fast.grange: (frag 39126:1480@5920+) (ttl 64, len 1500)
Maybe this place in the 2.6 log is important - where it got the first
(2.5s) delay:
18:42:28.414903 1:80:c2:0:0:1 1:80:c2:0:0:1 8808 60:
18:42:31.033837 0:80:5f:d2:53:f0 0:50:bf:a4:59:71 ip 162:
fast.grange.479220932 > poirot.grange.nfs: 120 read fh Unknown/1 32768
bytes @ 0x000010000 (DF) (ttl 64, id 18, len 148)
18:42:31.034244 0:50:bf:a4:59:71 0:80:5f:d2:53:f0 ip 1514:
poirot.grange.nfs > fast.grange.479220932: reply ok 1472 read REG 100644
ids 0/0 sz 0x00007a120 nlink 1 rdev 0/0 fsid 0x000000000 nodeid
0x000000000 a/m/ctime 1073497351.33807720 2477.000000 1064093242.000000
32768 bytes (frag 40557:1480@0+) (ttl 64, len 1500)
So, does it say anything?
Thanks
Guennadi
---
Guennadi Liakhovetski
In article <[email protected]>,
Guennadi Liakhovetski <[email protected]> wrote:
| On Tue, 6 Jan 2004, Guennadi Liakhovetski wrote:
|
| > server with 2.6.0 kernel:
| >
| > fast:2.6.0-test11 2m21s (*)
| > fast:2.4.20 16.5s
| > SA1100:2.4 never finishes (*)
| > PXA:2.4.21-rmk1-pxa1 as above
| > PXA:2.6.0-rmk1-pxa as above
| >
| > server: 2.4.21
| >
| > fast:2.6.0-test11 6s
| > fast:2.4.20 5s
| > SA1100:2.4.19-rmk7 3.22s
| > PXA:2.4.21-rmk1-pxa1 7s
| > PXA:2.6.0-rmk2-pxa 1) 50s (**)
| > (***) 2) 27s (**)
|
| s/fast/PC2/
|
| Further, I tried the old 3c59x card - same problems persist. Also tried
| PC2 as the server - same. nfs-utils version 1.0.6 (Debian Sarge). I sent a
| copy of the yesterday's email + new details to [email protected],
| [email protected], [email protected].
|
| Strange, that nobody is seeing this problem, but it looks pretty bad here.
| Unless I missed some necessary update somewhere? The only one that seemed
| relevant - nfs-utils on the server(s) from Documentation/Changes I
| checked.
I'm sure you checked this, but does mii-tool show that you have
negotiated the proper connection to the hub or switch? I found that my
3cXXX and eepro100 cards were negotiating half duplex with the switches
and cable modems, causing the throughput to go forth and conjugate the
verb "to suck" until I fixed it.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On 7 Jan 2004, bill davidsen wrote:
> I'm sure you checked this, but does mii-tool show that you have
> negotiated the proper connection to the hub or switch? I found that my
> 3cXXX and eepro100 cards were negotiating half duplex with the switches
> and cable modems, causing the throughput to go forth and conjugate the
> verb "to suck" until I fixed it.
Actually, I didn't. Just tried - mii-tool says
SIOCGMIIPHY on 'eth0' failed: Operation not supported
no MII interfaces found
And if you look in the tulip driver you'll see the same - the ADMtek Comet
doesn't have the HAS_MII flag set :-( Is it really that bad? It was a cheap
card (damn it, when will I learn not to buy cheap stuff, even if it says
"Linux supported"...), with no technical documentation, and none on
http://www.sitecom.com either. I've sent them a service request though. Also
funny: on their site they say it's a Realtek 8139 chip... I have even less
hope that the other 3c59x card, which can only do half-duplex (I think)
10Mbps, has MII... But the lights on the hub and on the card say it's
100Mbps full-duplex. Actually, if something were wrong with the network
settings, ftp wouldn't work reliably either, right? Well, maybe it somehow
affects only UDP. Well, yes - with TCP it seems to work.
Now, though, I have another problem with my PC2 under 2.6 (Pentium 133MHz,
24M). It swaps like mad already under very mild load... I still have to
narrow it down... The 4M file I was able to copy to tmpfs without problems;
copying 120M to the disk already takes many minutes (~30) with hard
swapping. The problem is I can't use the terminals in such a situation -
they become nearly unresponsive, so only sysrq works... Hm, I just looked at
the backtrace of the cp process - it looks completely sick.
__wake_up_common
preempt_schedule
__wake_up
wakeup_kswapd
__alloc_pages
read_swap_cache_async
read_swap_cache_async
swapin_readahead
do_swap_page
handle_mm_fault
do_page_fault
do_page_fault
do_DC390_Interrupt
handle_IRQ_event
end_8259A_irq
do_IRQ
error_code
So, since TCP works - shall we consider the case closed, or shall UDP also
be fixed? OK, presumably Linux-to-Linux can always use TCP, but what about
other UNIXes? Can they also do mount -o tcp?
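(For reference, on Linux that is just a mount option - a sketch, with the
server name and paths assumed:
mount -t nfs -o tcp poirot:/export /mnt/nfs
or equivalently proto=tcp where the mount utility understands that
spelling.)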
...and what is this:
RPC request reserved 0 but used 116
Thanks
Guennadi
---
Guennadi Liakhovetski
On Wed, 7 Jan 2004, Mike Fedyk wrote:
> Just post a few samples of the lines that differ. Any files should be sent
> off-list.
OK, this is the problem:
10:38:30.867306 0:40:f4:23:ac:91 0:50:bf:a4:59:71 ip 590: tuxscreen.grange > poirot.grange: icmp: ip reassembly time exceeded [tos 0xc0]
A similar effect was reported in 1999 with kernel 2.3.13, also between two
100Mbps cards. It was also occurring with UDP NFS:
http://www.ussg.iu.edu/hypermail/linux/net/9908.2/0039.html
But there were no answers, so I am CC-ing Sam, hoping to hear whether he
found the reason and a cure for his problem. Apart from this message I
didn't find any other relevant hits with Google.
Is it some physical network problem which somehow only becomes visible now,
under 2.6, with UDP (NFS) at 100Mbps?
Guennadi
---
Guennadi Liakhovetski
On Fri, Jan 09, 2004 at 11:08:29AM +0100, Guennadi Liakhovetski wrote:
> On Wed, 7 Jan 2004, Mike Fedyk wrote:
>
> > Just post a few samples of the lines that differ. Any files should be sent
> > off-list.
>
> Ok, This is the problem:
>
> 10:38:30.867306 0:40:f4:23:ac:91 0:50:bf:a4:59:71 ip 590: tuxscreen.grange > poirot.grange: icmp: ip reassembly time exceeded [tos 0xc0]
>
> A similar effect was reported in 1999 with kernel 2.3.13, also between 2
> 100mbps cards. It also was occurring with UDP NFS:
>
> http://www.ussg.iu.edu/hypermail/linux/net/9908.2/0039.html
>
> But there were no answers, so, I am CC-ing Sam, hoping to hear, if he's
> found the reason and a cure for his problem. Apart from this message I
> didn't find any other relevant hits with Google.
>
> Is it some physical network problem, which somehow only becomes visible
> under 2.6 now, with UDP (NFS) with 100mbps?
Find out how many packets are being dropped on your two hosts with 2.4 and
2.6.
If they're not dropping packets, maybe the ordering with a large backlog has
changed between 2.4 and 2.6 in a way that keeps some of the fragments from
being sent in time...
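(A sketch of one way to check, assuming the standard tools; the exact
counter names vary a little between versions:
cat /proc/net/dev    # per-interface RX dropped / error counts
netstat -s           # IP reassembly failures, UDP receive errors
taking the counters before and after a transfer, on both the server and the
client.)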
On Fri, 9 Jan 2004, Mike Fedyk wrote:
> On Fri, Jan 09, 2004 at 11:08:29AM +0100, Guennadi Liakhovetski wrote:
> > On Wed, 7 Jan 2004, Mike Fedyk wrote:
> >
> > > Just post a few samples of the lines that differ. Any files should be sent
> > > off-list.
> >
> > Ok, This is the problem:
> >
> > 10:38:30.867306 0:40:f4:23:ac:91 0:50:bf:a4:59:71 ip 590: tuxscreen.grange > poirot.grange: icmp: ip reassembly time exceeded [tos 0xc0]
>
> Find out how many packets are being dropped on your two hosts with 2.4 and
> 2.6.
So, I've run two tcpdumps - on the server and on the client. Woooo... Looks
bad.
With 2.4 (_on the server_) the client reads about 8K at a time, which is
sent in 5 fragments of 1500 (MTU) bytes each. And that works. Also
interesting: the fragments are sent in reverse order.
With 2.6 (on the server, same client) the client reads about 16K at a time,
split into 11 fragments, and then fragments number 9 and 10 get lost... That
is all with a StrongARM client and a PCMCIA network card. With a PXA client
(400MHz, compared to the 200MHz SA) and an on-board smc91x Ethernet chip, it
gets the first 5 fragments and then misses every other fragment. Again, in
both cases I was copying files to RAM. And yes, 2.6 sends the fragments in
forward order.
Thanks
Guennadi
---
Guennadi Liakhovetski
On Sat, Jan 10, 2004 at 01:38:00AM +0100, Guennadi Liakhovetski wrote:
> On Fri, 9 Jan 2004, Mike Fedyk wrote:
> > Find out how many packets are being dropped on your two hosts with 2.4 and
> > 2.6.
>
> So, I've run 2 tcpdumps - on server and on client. Woooo... Looks bad.
>
> With 2.4 (_on the server_) the client reads about 8K at a time, which is
> sent in 5 fragments 1500 (MTU) bytes each. And that works. Also
> interesting, that fragments are sent in the reverse order.
>
> With 2.6 (on the server, same client) the client reads about 16K at a
> time, split into 11 fragments, and then packets number 9 and 10 get
> lost... This all with a StrongARM client and a PCMCIA network-card. With a
> PXA-client (400MHz compared to 200MHz SA) and an on-board eth smc91x, it
> gets the first 5 fragments, and then misses every other fragment. Again -
> in both cases I was copying files to RAM. Yes, 2.6 sends fragments in
> direct order.
Is that an x86 server, and an arm client?
On Fri, 9 Jan 2004, Mike Fedyk wrote:
> On Sat, Jan 10, 2004 at 01:38:00AM +0100, Guennadi Liakhovetski wrote:
> > On Fri, 9 Jan 2004, Mike Fedyk wrote:
> > > Find out how many packets are being dropped on your two hosts with 2.4 and
> > > 2.6.
> >
> > So, I've run 2 tcpdumps - on server and on client. Woooo... Looks bad.
> >
> > With 2.4 (_on the server_) the client reads about 8K at a time, which is
> > sent in 5 fragments 1500 (MTU) bytes each. And that works. Also
> > interesting, that fragments are sent in the reverse order.
> >
> > With 2.6 (on the server, same client) the client reads about 16K at a
> > time, split into 11 fragments, and then packets number 9 and 10 get
> > lost... This all with a StrongARM client and a PCMCIA network-card. With a
> > PXA-client (400MHz compared to 200MHz SA) and an on-board eth smc91x, it
> > gets the first 5 fragments, and then misses every other fragment. Again -
> > in both cases I was copying files to RAM. Yes, 2.6 sends fragments in
> > direct order.
>
> Is that an x86 server, and an arm client?
Yes. The reason for the problem seems to be the increased default NFS
transfer size from 2.4 to 2.6. 8K under 2.4 was still OK; 16K is too much -
only the first 5 fragments get through fine, then data starts to get lost.
If it is a hardware limitation (not all platforms can manage 16K), it should
probably be set back to 8K. If the reason is that some buffer size was not
increased correspondingly, then that should be done.
Just checked - mounting with rsize=8192,wsize=8192 fixes the problem: there
are again 5 fragments and they are all received properly.
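(For reference, that is simply - server name and paths assumed:
mount -t nfs -o rsize=8192,wsize=8192 poirot:/export /mnt/nfs
or the same options in the corresponding fstab entry.)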
Anyway, I think the default values should be safe on all platforms, with
further optimisation possible where it is safe.
Thanks
Guennadi
---
Guennadi Liakhovetski
On Sat, 10/01/2004 at 06:10, Guennadi Liakhovetski wrote:
> Yes. The reason for the problem seems to be the increased default size of
> the transfer unit of NFS from 2.4 to 2.6. 8K under 2.4 was still ok, 16K
> is too much - only the first 5 fragments pass fine, then data starts to
> get lost. If it is a hardware limitation (not all platforms can manage
> 16K), it should be probably set back to 8K. If the reason is that some
> buffer size was not increased correspondingly, then this should be done.
No! People who have problems with the support for large rsize/wsize
under UDP due to lost fragments can
a) Reduce r/wsize themselves using mount
b) Use TCP instead
The correct solution to this problem is (b). I.e. we convert mount to
use TCP as the default if it is available. That is consistent with what
all other modern implementations do.
Changing a hard maximum on the server in order to fit the lowest common
denominator client is simply wrong.
Cheers,
Trond
Trond Myklebust <[email protected]> writes:
> The correct solution to this problem is (b). I.e. we convert mount to
> use TCP as the default if it is available. That is consistent with what
> all other modern implementations do.
Please do that. Fragmented UDP with a 16-bit IP ID is just Russian roulette
at today's network speeds.
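(Rough numbers: at 100Mbit/s a host can send on the order of 8000 full-size
frames per second, so the 16-bit IP ID space of 65536 values wraps in under
ten seconds; with fragments allowed to sit in the reassembly queue for up to
the default 30-second timeout, fragments from different datagrams that
happen to share an ID can get glued together, and only the 16-bit UDP
checksum stands between that and silently corrupted data.)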
One disadvantage is that some older (early 2.4) Linux nfsd servers that have
TCP enabled can cause problems. But I guess we can live with that; they
should be updated anyway.
-Andi
On Sat, 10/01/2004 at 11:08, Andi Kleen wrote:
> Trond Myklebust <[email protected]> writes:
>
> > The correct solution to this problem is (b). I.e. we convert mount to
> > use TCP as the default if it is available. That is consistent with what
> > all other modern implementations do.
>
> Please do that. Fragmented UDP with 16bit ipid is just russian roulette at
> today's network speeds.
I fully agree.
Chuck Lever recently sent an update for the NFS 'mount' utility to
Andries. Among other things, that update changes this default. We're
still waiting for his comments.
Cheers,
Trond
On Sat, 10 Jan 2004, Trond Myklebust wrote:
> On Sat, 10/01/2004 at 06:10, Guennadi Liakhovetski wrote:
> > Yes. The reason for the problem seems to be the increased default size of
> > the transfer unit of NFS from 2.4 to 2.6. 8K under 2.4 was still ok, 16K
> > is too much - only the first 5 fragments pass fine, then data starts to
> > get lost. If it is a hardware limitation (not all platforms can manage
> > 16K), it should be probably set back to 8K. If the reason is that some
> > buffer size was not increased correspondingly, then this should be done.
>
> No! People who have problems with the support for large rsize/wsize
> under UDP due to lost fragments can
>
> a) Reduce r/wsize themselves using mount
> b) Use TCP instead
>
> The correct solution to this problem is (b). I.e. we convert mount to
> use TCP as the default if it is available. That is consistent with what
> all other modern implementations do.
>
> Changing a hard maximum on the server in order to fit the lowest common
> denominator client is simply wrong.
Not change - keep it (from 2.4). You see, the problem might be that somebody
updates the NFS server from 2.4 to 2.6 and then suddenly some clients fail
to work with it. It seems a non-obvious fact that after upgrading the server,
the clients' configuration might have to be changed. At the very least this
must be documented in Kconfig.
Thanks
Guennadi
---
Guennadi Liakhovetski
On Sat, 10/01/2004 at 15:04, Guennadi Liakhovetski wrote:
> Not change - keep (from 2.4). You see, the problem might be - somebody
> updates the NFS-server from 2.4 to 2.6 and then suddenly some clients fail
> to work with it. Seems a non-obvious fact, that after upgrading the server
> clients' configuration might have to be changed. At the very least this
> must be documented in Kconfig.
Non-obvious????? You have to change modutils, you have to upgrade
nfs-utils, glibc, gcc... and that's only the beginning of the list.
2.6.x is a new kernel; it differs from 2.4.x, which again differs from
2.2.x, ... Get over it! There are workarounds for your problem, so use
them.
Trond
On Sat, Jan 10, 2004 at 04:57:36PM -0500, Trond Myklebust wrote:
> On Sat, 10/01/2004 at 15:04, Guennadi Liakhovetski wrote:
> > Not change - keep (from 2.4). You see, the problem might be - somebody
> > updates the NFS-server from 2.4 to 2.6 and then suddenly some clients fail
> > to work with it. Seems a non-obvious fact, that after upgrading the server
> > clients' configuration might have to be changed. At the very least this
> > must be documented in Kconfig.
>
> Non-obvious????? You have to change modutils, you have to upgrade
> nfs-utils, glibc, gcc... and that's only the beginning of the list.
>
> 2.6.x is a new kernel it differs from 2.4.x, which again differs from
> 2.2.x, ... Get over it! There are workarounds for your problem, so use
> them.
I have to admit I haven't been following NFS over TCP very much. Is the code
in the stock 2.4 and 2.6 kernels ready for production use? From what I read
it seemed it was still experimental (and even marked as such in the
config).
On Sat, Jan 10, 2004 at 12:10:46PM +0100, Guennadi Liakhovetski wrote:
> On Fri, 9 Jan 2004, Mike Fedyk wrote:
>
> > On Sat, Jan 10, 2004 at 01:38:00AM +0100, Guennadi Liakhovetski wrote:
> > > With 2.6 (on the server, same client) the client reads about 16K at a
> > > time, split into 11 fragments, and then packets number 9 and 10 get
> > > lost... This all with a StrongARM client and a PCMCIA network-card. With a
> > > PXA-client (400MHz compared to 200MHz SA) and an on-board eth smc91x, it
> > > gets the first 5 fragments, and then misses every other fragment. Again -
> > > in both cases I was copying files to RAM. Yes, 2.6 sends fragments in
> > > direct order.
> >
> > Is that an x86 server, and an arm client?
>
> Yes. The reason for the problem seems to be the increased default size of
> the transfer unit of NFS from 2.4 to 2.6. 8K under 2.4 was still ok, 16K
> is too much - only the first 5 fragments pass fine, then data starts to
> get lost. If it is a hardware limitation (not all platforms can manage
> 16K), it should be probably set back to 8K. If the reason is that some
> buffer size was not increased correspondingly, then this should be done.
>
> Just checked - mounting with rsize=8192,wsize=8192 fixes the problem -
> there are again 5 fragments and they all are received properly.
What version is the arm kernel you're running on the client, and where is it
from?
On Sat, 10 Jan 2004, Trond Myklebust wrote:
> On Sat, 10/01/2004 at 15:04, Guennadi Liakhovetski wrote:
> > Not change - keep (from 2.4). You see, the problem might be - somebody
> > updates the NFS-server from 2.4 to 2.6 and then suddenly some clients fail
> > to work with it. Seems a non-obvious fact, that after upgrading the server
> > clients' configuration might have to be changed. At the very least this
> > must be documented in Kconfig.
>
> Non-obvious????? You have to change modutils, you have to upgrade
> nfs-utils, glibc, gcc... and that's only the beginning of the list.
>
> 2.6.x is a new kernel it differs from 2.4.x, which again differs from
> 2.2.x, ... Get over it! There are workarounds for your problem, so use
> them.
Please, calm down :-)) I am not fighting, I am just thinking aloud; I have
no intention whatsoever to attack your or anybody else's work / ideas /
decisions, etc.
My only doubt was: yes, you upgrade the __server__, so you look in
Changes, upgrade all the necessary stuff, or just upgrade a distribution
blindly (as does happen sometimes, I believe) - and the server works, fine.
What I find non-obvious is that on updating the server you have to
re-configure the __clients__, see? Just think about a network somewhere in a
uni / company / whatever. Sysadmins update the server, and then the
NFS clients suddenly cannot use NFS any more...
Thanks
Guennadi
---
Guennadi Liakhovetski
On Sat, 10/01/2004 at 17:14, Mike Fedyk wrote:
> I have to admit, I haven't been following NFS on TCP very much. Is the code
> in the stock 2.4 and 2.6 kernels ready for production use? It seemed from
> what I read it was still experemental (and even marked as such in the
> config).
The client code has been very heavily tested. It is not marked as
experimental.
The server code is marked as "officially experimental, but seems to work
well". You'll have to talk to Neil to find out what that means. In
practice, though, it performs at least as well as the UDP code.
If you are in a production environment and really don't want to trust
the TCP code, you can disable it, and use the option I mentioned earlier
of setting a low value of r/wsize.
Or better still: fix your network setup so that you don't lose all those
UDP fragments (check switches, NICs, drivers,...). The icmp time
exceeded error is a sign of a lossy network, NOT a broken NFS
implementation.
Trond
On Sat, 10 Jan 2004, Mike Fedyk wrote:
> What version is the arm kernel you're running on the client, and where is it
> from?
2.4.19-rmk7, 2.4.21-rmk1-pxa1, 2.6.0-rmk2-pxa. All self-compiled with
self-ported platform-specific patches. Of course, none of those patches
touches any general NFS / network code. They might modify some drivers
(including network drivers) and, of course, the core functionality
(interrupt handling, memory, DMA, etc.). The first two also had real-time
patches (RTAI); 2.6 on the PXA didn't. The pxa patch for 2.6 was self-ported
from 2.6.0-rmk1-test2, IIRC. So, theoretically, you can blame any of those
modifications, but I highly doubt that I managed to mess up all 3 kernels on
2 different platforms to produce the same error, while all the other network
protocols (of course, those that I checked, i.e. ftp, http, telnet, tftp,
tcp-nfs) work.
Guennadi
---
Guennadi Liakhovetski
On Sat, Jan 10, 2004 at 11:52:16PM +0100, Guennadi Liakhovetski wrote:
> On Sat, 10 Jan 2004, Mike Fedyk wrote:
>
> > What version is the arm kernel you're running on the client, and where is it
> > from?
>
> 2.4.19-rmk7, 24.4.21-rmk1-pxa1, 2.6.0-rmk2-pxa. All self-compiled with
> self-ported platform-specific patches. Sure, none of those patches touches
> any NFS / network general code. It might modify some (including network)
> drivers, and, of course the core functionality (interrupt-handling,
> memory, DMA, etc.) The first 2 also had real-time patches (RTAI), 2.6 on
> PXA didn't. The pxa-patch for 2.6 was self-ported from 2.6.0-rmk1-test2,
> IIRC. So, theoretically, you can blame any of those modifications, but I
> highly doubt, that I managed to mess up all 3 kernels on 2 different
> platforms to produce the same error, whereas all the rest (of course,
> those, that I checked, i.e. ftp, http, telnet, tftp, tcp-nfs) network
> protocols work.
Can you double check with a vanilla kernel.org 2.4.24 x86 client?
On Sat, 10 Jan 2004, Mike Fedyk wrote:
> Can you double check with a vanilla kernel.org 2.4.24 x86 client?
Hm, I would have to get hold of one more 100mbps card... So, not
immediately, unfortunately.
Guennadi
---
Guennadi Liakhovetski
On Sat, 10 Jan 2004, Guennadi Liakhovetski wrote:
> On Sat, 10 Jan 2004, Trond Myklebust wrote:
>
> > On Sat, 10/01/2004 at 15:04, Guennadi Liakhovetski wrote:
> > > Not change - keep (from 2.4). You see, the problem might be - somebody
> > > updates the NFS-server from 2.4 to 2.6 and then suddenly some clients fail
> > > to work with it. Seems a non-obvious fact, that after upgrading the server
> > > clients' configuration might have to be changed. At the very least this
> > > must be documented in Kconfig.
> >
> > Non-obvious????? You have to change modutils, you have to upgrade
> > nfs-utils, glibc, gcc... and that's only the beginning of the list.
> >
> > 2.6.x is a new kernel it differs from 2.4.x, which again differs from
> > 2.2.x, ... Get over it! There are workarounds for your problem, so use
> > them.
>
> Please, calm down:-)), I am not fighting, I am just thinking aloud, I have
> no intention whatsoever to attack your aor anybody else's work / ideas /
> decisions, etc.
>
> The only my doubt was - yes, you upgrade the __server__, so, you look in
> Changes, upgrade all necessary stuff, or just upgrade blindly (as does
> happen sometimes, I believe) a distribution - and the server works, fine.
> What I find non-obvious, is that on updating the server you have to
> re-configure __clients__, see? Just think about a network somewhere in a
> uni / company / whatever. Sysadmins update the server, and then
> NFS-clients suddenly cannot use NFS any more...
>
Ever tried upgrading a WinNT server to Win2k or Win2003 Server? Don't
expect all your Win95, Win98 and WinNT clients to just work the same as
they did previously...
Same goes with other OSs. Software that has requirements on both the client and
server side naturally has to be kept in sync, and NFS is not the only case
where not everything is 100% backwards compatible.
This shouldn't really be surprising.
-- Jesper Juhl
On Sat, Jan 10, 2004 at 11:42:45PM +0100, Guennadi Liakhovetski wrote:
>
> The only my doubt was - yes, you upgrade the __server__, so, you look in
> Changes, upgrade all necessary stuff, or just upgrade blindly (as does
> happen sometimes, I believe) a distribution - and the server works, fine.
> What I find non-obvious, is that on updating the server you have to
> re-configure __clients__, see? Just think about a network somewhere in a
If you upgrade the server and read "Changes", then a note in changes might
say that "you need to configure carefully or some clients could get in trouble."
(If the current "Changes" don't have that - post a documentation patch.)
If you use a distro, then hopefully the distro takes care of the
problem for you. Or at least brings it to your attention somehow.
It should not come as a surprise that changing a server might have an
effect on the clients - clients and servers are connected after all!
Helge Hafting
On Sun, Jan 11, 2004 at 02:18:57PM +0100, Helge Hafting wrote:
> On Sat, Jan 10, 2004 at 11:42:45PM +0100, Guennadi Liakhovetski wrote:
> > The only my doubt was - yes, you upgrade the __server__, so, you look in
> > Changes, upgrade all necessary stuff, or just upgrade blindly (as does
> > happen sometimes, I believe) a distribution - and the server works, fine.
> > What I find non-obvious, is that on updating the server you have to
> > re-configure __clients__, see? Just think about a network somewhere in a
>
> If you upgrade the server and read "Changes", then a note in changes might
> say that "you need to configure carefully or some clients could get in trouble."
> (If the current "Changes" don't have that - post a documentation patch.)
[This is more to Guennadi than Helge]
I don't see why such a patch to "Changes" should be necessary. The
problem is most definitely with the client hardware, and not the
server software.
The crux of this problem comes down to the SMC91C111 having only a
small on-board packet buffer, which is capable of storing only about
4 packets (both TX and RX). This means that if you receive 8 packets
with high enough interrupt latency, you _will_ drop some of those
packets.
Note that this is independent of whether you're using DMA mode with
the SMC91C111 - DMA mode only allows you to offload the packets from
the chip faster once you've discovered, via an interrupt, that you have a
packet to offload.
It won't be just NFS that's affected - e.g., if you have 4kB NFS packets
and several machines broadcast an ARP at the same time, you'll again
run out of packet space on the SMC91C111. Does that mean you should
somehow change the way ARP works?
Sure, reducing the NFS packet size relieves the problem, but that's
just a work around for the symptom and nothing more. It's exactly
the same type of work around as switching the SMC91C111 to operate at
10mbps only - both work by reducing the rate at which packets are
received by the target, thereby offsetting the interrupt latency
and packet unload times.
Basically, the SMC91C111 is great for use on small, *well controlled*
embedded networks, but anything else is asking for trouble.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core
On Sun, 11 Jan 2004, Russell King wrote:
> Basically, the SMC91C111 is great for use on small, *well controlled*
> embedded networks, but anything else is asking for trouble.
Ok, thanks. Well, just out of curiosity (and this is also why I concluded it
might have been a more general problem - because I had it on both my ARM
boards): where is the bottleneck likely to be on an SA system with a Netgear
FA411 PCMCIA card (NE2000-compatible)? Just a slow CPU?
Thanks
Guennadi
---
Guennadi Liakhovetski
Trond Myklebust wrote:
> No! People who have problems with the support for large rsize/wsize
> under UDP due to lost fragments can
>
> a) Reduce r/wsize themselves using mount
> b) Use TCP instead
>
> The correct solution to this problem is (b). I.e. we convert mount to
> use TCP as the default if it is available. That is consistent with what
> all other modern implementations do.
>
> Changing a hard maximum on the server in order to fit the lowest common
> denominator client is simply wrong.
So set the default buffer size to 8k if UDP is being used. Other than
getting people to believe 2.6 is broken, you buy nothing. People running
UDP are probably not cutting edge, so let the default be small and the
client negotiate up if desired.
Why do so many Linux people have the idea that because a standard says
they CAN do something, it's fine to do it in a way which doesn't conform
to common practice? And Linux 2.4 practice should count, even if you
pretend that Solaris, AIX, Windows and BSD don't count...
--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
On Mon, 12/01/2004 at 00:06, Bill Davidsen wrote:
> Why do so many Linux people have the idea that because a standard says
> they CAN do something, it's fine to do it in a way which doesn't conform
> to common practice. And Linux 2.4 practice should count even if you
> pretend that Solaris, AIX, Windows and BSD don't count...
Wake up and smell the new millennium. Networking has all grown up while
you were asleep. We have these new cool things called "switches", NICs
with bigger buffers,...
The 8k limit that you find in RFC1094 was an ad-hoc "limit" based purely
on testing with pre-1989 hardware. AFAIK most, if not all, of the
commercial vendors (Solaris, AIX, Windows/Hummingbird, EMC and Netapp)
currently set the defaults to 32k block sizes for both TCP
and UDP.
Most of them want to bump that to a couple of Mbyte in the very near
future.
Linux 2.4 didn't have support for anything beyond 8k. BSD sets 32k for
TCP, and 8k for UDP for some reason.
Trond
Trond Myklebust <[email protected]> writes:
> On Sat, 10/01/2004 at 11:08, Andi Kleen wrote:
> > Trond Myklebust <[email protected]> writes:
> >
> > > The correct solution to this problem is (b). I.e. we convert mount to
> > > use TCP as the default if it is available. That is consistent with what
> > > all other modern implementations do.
> >
> > Please do that. Fragmented UDP with 16bit ipid is just russian roulette at
> > today's network speeds.
>
> I fully agree.
>
> Chuck Lever recently sent an update for the NFS 'mount' utility to
> Andries. Among other things, that update changes this default. We're
> still waiting for his comments.
If mount defaults to trying TCP first then UDP if the TCP mount fails,
should there be separate options for [rw]size depending on what type of
mount actually takes place? e.g. 'ursize' and 'uwsize' for UDP and
'trsize' and 'twsize' for TCP ?
James Pearson
> The 8k limit that you find in RFC1094 was an ad-hoc "limit" based purely
> on testing using pre-1989 hardware. AFAIK most if not all of the
> commercial vendors (Solaris, AIX, Windows/Hummingbird, EMC and Netapp)
> are all currently setting the defaults to 32k block sizes for both TCP
> and UDP.
> Most of them want to bump that to a couple of Mbyte in the very near
> future.
Note: the future Mbyte sizes can, of course, only be supported on TCP
since UDP has an inherent limit at 64k. The de-facto limit on UDP is
therefore likely to remain at 32k (although I think at least one vendor
has already tried pushing it to 48k).
Trond
On Mon, 12/01/2004 at 09:40, James Pearson wrote:
> If mount defaults to trying TCP first then UDP if the TCP mount fails,
> should there be separate options for [rw]size depending on what type of
> mount actually takes place? e.g. 'ursize' and 'uwsize' for UDP and
> 'trsize' and 'twsize' for TCP ?
No. The number of "mount" options is complex enough as it is. I don't
see the above as being useful.
If you need the above tweak, you should be able to get round the above
problem by first attempting to force the TCP protocol yourself, and then
retrying using UDP if it fails.
Changing the default r/wsize should normally be unnecessary. You only
want to play with them if you actually see performance problems under
testing and find that you are unable to fix the cause of the packets
being dropped.
Cheers,
Trond
Hi!
> > > It is not just a problem of 2.6 with those specific network configurations
> > > - ftp / http / tftp transfers work fine. E.g. wget of the same file on the
> > > PXA with 2.6.0 from the PC1 with 2.4.21 over http takes about 2s. So, it
> > > is 2.6 + NFS.
> > >
> > > Is it fixed somewhere (2.6.1-rcx?), or what should I try / what further
> > > information is required?
> >
> > You will probably need to look at some tcpdump output to debug the problem...
>
> Yep, just have done that - well, they differ... First obvious thing that I
> noticed is that 2.6 is trying to read bigger blocks (32K instead of 8K),
> but then - so far I cannot interpret what happens after the start of the
I've seen a slow machine (a 386sx with an ne1000) that could not receive 7
full-sized packets back-to-back. You are sending 22 full packets
back-to-back. I'd expect some of them to be (almost deterministically) lost,
and no progress ever made.
In the same scenario, TCP detects "congestion" and works mostly okay.
On the ne1000 machine, TCP was still able to do 200KB/sec on a
10Mbps network. Check whether your slow machines are seeing all the packets
you send.
Pavel
On Thu, 8 Jan 2004, Pavel Machek wrote:
> I've seen slow machine (386sx with ne1000) that could not receive 7 full-sized packets
> back-to-back. You are sending 22 full packets back-to-back.
> I'd expect some of them to be (almost deterministicaly) lost,
> and no progress ever made.
As you have probably already seen from further emails in this thread, we
did find out that packets were indeed being lost, for various performance
reasons. And the best solution does seem to be switching to TCP-NFS;
making it the default choice for mount (where available) seems to be a
very good idea.
Thanks for replying anyway.
> In same scenario, TCP detects "congestion" and works mostly okay.
Hm, as long as we are on this - can you give me a hint / pointer on how TCP
_detects_ congestion? Does it adjust packet sizes, or some other parameters?
Just for curiosity's sake.
Thanks
Guennadi
---
Guennadi Liakhovetski
On Tue, 13 Jan 2004, Guennadi Liakhovetski wrote:
> On Thu, 8 Jan 2004, Pavel Machek wrote:
>
> > In same scenario, TCP detects "congestion" and works mostly okay.
>
> Hm, as long as we are already on this - can you give me a hint / pointer
> how does TCP _detect_ a congestion? Does it adjust packet sizes, some
> other parameters? Just for the curiousity sake.
>
RFC 2581 describes this :
http://www.rfc-editor.org/cgi-bin/rfcdoctype.pl?loc=RFC&letsgo=2581&type=ftp&file_format=txt
3390 updates 2581 :
http://www.rfc-editor.org/cgi-bin/rfcdoctype.pl?loc=RFC&letsgo=3390&type=ftp&file_format=txt
-- Jesper Juhl
Hi!
> > I've seen slow machine (386sx with ne1000) that could not receive 7 full-sized packets
> > back-to-back. You are sending 22 full packets back-to-back.
> > I'd expect some of them to be (almost deterministicaly) lost,
> > and no progress ever made.
>
> As you, probably, have already seen from further emails on this thread, we
> did find out that packets were indeed lost due to various performance
> reasons. And the best solution does seem to be switching to TCP-NFS, and
> making it the default choice for mount (where available) seems to be a
> very good idea.
>
> Thanks for replying anyway.
>
> > In same scenario, TCP detects "congestion" and works mostly okay.
>
> Hm, as long as we are already on this - can you give me a hint / pointer
> how does TCP _detect_ a congestion? Does it adjust packet sizes, some
> other parameters? Just for the curiousity sake.
If TCP sees that packets are lost, it says "oh, congestion" and starts
sending packets more slowly, i.e. it introduces delays between packets.
When they no longer get lost, it speeds back up to full speed.
Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]
Hey,
I have experienced extremely poor NFS performance over wireless. When I scp
a file from the server to the laptop, the transfer rate stays stable and the
file transfers; when I use NFS, the transfer rate jumps around constantly,
and most of the time the file is NOT transferring.
I have searched all over the NFS options, enabled higher caching, enabled
the use of TCP, and tried to pass 'hard', but transfer rates are still very
poor, and only for NFS - so it doesn't seem my network configuration is
wrong, as scp, http and ftp all work at full speed.
On the other machines on the network (non-wireless) running the same kernel
(2.6.0) everything seems fine.
Can anyone suggest what I could test to trace this problem?
On Mon, 12/01/2004 at 20:55, Roman Gaufman wrote:
> I have searched all over the nfs, enabled higher caching on nfs, enabled the
> usage of tcp, tried to pass hard, but transfer rates very poor, and only for
> nfs transfer, so it doesn't seem my network configurations are wrong as scp,
> html, ftp seem to work on full speed.
You should definitely enable TCP in this case.
Most likely causes: you may have a problem with echos on your wireless,
or you may have a faulty driver for your NIC.
Try looking at 'netstat -s' on both the server and the client. Monitor
the number of TCP segments sent out, number retransmitted, and number of
segments received on both ends of the connection while doing a set of
writes, then do the same for a set of reads.
Also try monitoring the wireless rates (iwlist <interface> rate), and
quality of link (iwlist <interface> ap) while this is going on. Note: if
your driver doesn't support iwlist, then just typing 'iwconfig
<interface>' might also give you these numbers.
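(As a concrete sketch, with the interface name assumed:
netstat -s > /tmp/before        # on both client and server
# ... run the set of reads or writes ...
netstat -s > /tmp/after
diff /tmp/before /tmp/after     # compare segment / retransmit counters
iwlist eth1 rate
iwconfig eth1
and compare the before/after counters on both ends.)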
Cheers,
Trond
On Mon, Jan 12, 2004 at 10:22:31AM -0500, Trond Myklebust wrote:
> On Mon, 12/01/2004 at 09:40, James Pearson wrote:
> > If mount defaults to trying TCP first then UDP if the TCP mount fails,
> > should there be separate options for [rw]size depending on what type of
> > mount actually takes place? e.g. 'ursize' and 'uwsize' for UDP and
> > 'trsize' and 'twsize' for TCP ?
>
> No. The number of "mount" options is complex enough as it is. I don't
> see the above as being useful.
> If you need the above tweak, you should be able to get round the above
> problem by first attempting to force the TCP protocol yourself, and then
> retrying using UDP if it fails.
I have a patch, sent to the util-linux maintainer, that adds a couple
of new mount options to nfsmount. Those allow you to force either of
tcp, udp, tcp then udp, and udp then tcp, using the existing proto=xxx
syntax. It's available at
http://www.mulix.org/code/patches/util-linux/tcp-udp-mount-ordering-A3.diff
It also cleans up nfsmount() somewhat, although it could certainly do
with further rewrite^Wcleanups. I'm waiting to hear from the
util-linux maintainer before embarking on that, though.
Cheers,
Muli
--
Muli Ben-Yehuda
http://www.mulix.org | http://mulix.livejournal.com/
"the nucleus of linux oscillates my world" - gccbot@#offtopic
On Mon, 2004-01-12 at 20:55, Roman Gaufman wrote:
> Hey,
>
> I have experienced extremely poor NFS performance over wireless, when I scp a
> piece of information from server to laptop, transfer rates stay stable and
> file transfers, when I use NFS transfer rates jump constantly, and most of
> the time file is NOT transfering.
I've noticed a similar problem here since upgrading to 2.6. In my case
not only has NFS performance gone through the floor vs 2.4.22, but so has
machine performance during the transfer. In 2.4 the machine would be a
bit sluggish but usable... under 2.6 the machine is more or less
*unusable* until the NFS transfer completes. Trying to, say, open up
Evolution may take upwards of ten minutes. Unfortunately, due to the
extreme performance problem it's not even possible to do any diagnostics on
the machine while it's happening.
On Tue, 13/01/2004 at 15:25, Joshua M. Thompson wrote:
> I've noticed a similar problem here since upgrading to 2.6. In my case
> not has NFS performance gone through the floor vs 2.4.22 but so has
> machine performance during the transfer. In 2.4 the machine would be a
> bit sluggish but usable...under 2.6 the machine is more or less
> *unusable* until the NFS transfer completes. Trying to say, open up
> Evolution may take upwards of ten minutes to complete. Unfortunately due
> to the extreme performance problem it's not even possible to do any
> diagnostics on the machine while it's happening.
There are a couple of performance related patches that should be applied
to stock 2.6.0/2.6.1. One handles a problem with remove_suid()
generating a whole load of SETATTR calls if you are writing to a file
that has the "x" bit set. The other handles an efficiency issue related
to random write + read combinations.
Either look for them on my website (under
http://www.fys.uio.no/~trondmy/src), or apply Andrew's 2.6.1-mm2 patch.
Cheers,
Trond
On Tue, Jan 13, 2004 at 01:39:08AM +0100, Pavel Machek wrote:
> > Hm, as long as we are already on this - can you give me a hint / pointer
> > how does TCP _detect_ a congestion? Does it adjust packet sizes, some
> > other parameters? Just for the curiousity sake.
>
> If TCP sees packets are lost, it says "oh, congestion", and starts
> sending packets more slowly ie introduces delays
> between packets. When they no longer get lost, it
> speeds up to full speed.
You missed the important part... TCP measures latency and adjusts to
that. TCP overreacts to sudden unexpected packet loss by shrinking the
window down.
This is why traffic "policing" sucks for TCP, and "shaping" (queueing)
works much better (latency rises as the limit is reached, and the TCP
sender adapts by sending more slowly, thus preventing packet loss).
Regards,
Daniel
On Tue, 13 Jan 2004, Pavel Machek wrote:
> If TCP sees packets are lost, it says "oh, congestion", and starts
> sending packets more slowly ie introduces delays
> between packets. When they no longer get lost, it
> speeds up to full speed.
Thanks to all!
Guennadi
---
Guennadi Liakhovetski
In article <[email protected]>,
Trond Myklebust <[email protected]> wrote:
>There are a couple of performance related patches that should be applied
>to stock 2.6.0/2.6.1. One handles a problem with remove_suid()
>generating a whole load of SETATTR calls if you are writing to a file
>that has the "x" bit set. The other handles an efficiency issue related
>to random write + read combinations.
>
>Either look for them on my website (under
>http://www.fys.uio.no/~trondmy/src), or apply Andrew's 2.6.1-mm2 patch.
If one runs bonnie on an NFS-mounted share, what should the rewrite
throughput be?
On the NFS server locally (2.6.1-mm3) I get write/rewrite/read speeds of
107 / 25 / 110 MB/sec, with CPU load of a few percent.
On an NFS client (2.6.1-mm3, filesystem mounted with options
udp,nfsvers=3,rsize=32768,wsize=32768) I get, for the same share,
write/rewrite/read speeds of 36 / 4 / 38 MB/sec. CPU load is also
very high on the client in the rewrite case (80%).
That's with back-to-back GigE, full duplex, MTU 9000, P IV 3.0 GHz.
(I tried MTU 5000 and 1500 as well; it doesn't really matter.)
Is that what would be expected?
Mike.
On Thu, 15 Jan 2004 02:33:12, Mike Fedyk wrote:
> On Thu, Jan 15, 2004 at 01:12:07AM +0000, Miquel van Smoorenburg wrote:
> > On an NFS client (2.6.1-mm3, filesystem mounted with options
> > udp,nfsvers=3,rsize=32768,wsize=32768) I get for the same share as
> > write/rewrite/read speeds 36 / 4 / 38 MB/sec. CPU load is also
> > very high on the client for the rewrite case (80%).
> >
>
> What is your throughput on the wire?
Oh, the network is just fine.
# tcpspray -n 100000 192.168.29.132
Transmitted 102400000 bytes in 0.960200 seconds (104144.970 kbytes/s)
> And retry with tcp instead of udp...
I did test with TCP, results are comparable to UDP.
Mike.
On Wed, 14/01/2004 at 20:12, Miquel van Smoorenburg wrote:
> On an NFS client (2.6.1-mm3, filesystem mounted with options
> udp,nfsvers=3,rsize=32768,wsize=32768) I get for the same share as
> write/rewrite/read speeds 36 / 4 / 38 MB/sec. CPU load is also
> very high on the client for the rewrite case (80%).
>
> That's with back-to-back GigE, full duplex, MTU 9000, P IV 3.0 Ghz.
> (I tried MTU 5000 and 1500 as well, doesn't really matter).
>
> Is that what would be expected ?
Err.. no...
I didn't have a 2.6.1-mm3 machine ready to go in our GigE testbed (I'm
busy compiling one up right now). However I did run a quick test on
2.6.0-test11. Iozone rather than bonnie, but the results should be
comparable:
Iozone: Performance Test of File I/O
Version $Revision: 3.169 $
Compiled for 32 bit mode.
Build: linux
Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
Al Slater, Scott Rhine, Mike Wisner, Ken Goss
Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
Randy Dunlap, Mark Montague, Dan Million,
Jean-Marc Zucconi, Jeff Blomberg.
Run began: Wed Jan 14 21:32:08 2004
Include close in write timing
File size set to 2097152 KB
Record Size 128 KB
Command line used: /plymouth/trondmy/public/programs/fs/iozone -c -t1 -s 2048m -r 128k -i0 -i1
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 1 process
Each process writes a 2097152 Kbyte file in 128 Kbyte records
Children see throughput for 1 initial writers = 109333.84 KB/sec
Parent sees throughput for 1 initial writers = 109326.48 KB/sec
Min throughput per process = 109333.84 KB/sec
Max throughput per process = 109333.84 KB/sec
Avg throughput per process = 109333.84 KB/sec
Min xfer = 2097152.00 KB
Children see throughput for 1 rewriters = 111377.63 KB/sec
Parent sees throughput for 1 rewriters = 111370.21 KB/sec
Min throughput per process = 111377.63 KB/sec
Max throughput per process = 111377.63 KB/sec
Avg throughput per process = 111377.63 KB/sec
Min xfer = 2097152.00 KB
Children see throughput for 1 readers = 123864.27 KB/sec
Parent sees throughput for 1 readers = 123854.96 KB/sec
Min throughput per process = 123864.27 KB/sec
Max throughput per process = 123864.27 KB/sec
Avg throughput per process = 123864.27 KB/sec
Min xfer = 2097152.00 KB
Children see throughput for 1 re-readers = 167226.50 KB/sec
Parent sees throughput for 1 re-readers = 167222.79 KB/sec
Min throughput per process = 167226.50 KB/sec
Max throughput per process = 167226.50 KB/sec
Avg throughput per process = 167226.50 KB/sec
Min xfer = 2097152.00 KB
That is admittedly with a (very fast) NetApp filer on the receiving end,
so it is only a Linux client test. However as you can see, I'm basically
flat w.r.t. rereads and rewrites. Client is BTW a PowerEdge 2650 w/
built-in Broadcom BCM5703 (no jumbo frames). Note: with TCP, the numbers
degrade a bit to 81MB/sec write, 82MB/sec rewrite, 135MB/sec read and
144MB/sec reread.
Against a Sun server, I get something a lot slower: 32MB/sec write,
22MB/sec rewrite, 38MB/sec read, 28MB/sec reread using UDP, 29/21/29/26
using TCP. There I do indeed see a slight dip in both the rewrite and
the reread figures.
One question:
- Is bonnie doing a close() or an fsync() of the file after it finishes
the write, and before it goes on to testing rewrites? I suspect not, in
which case your numbers will be strongly skewed.
Cheers,
Trond
Hi!
> > > The only my doubt was - yes, you upgrade the __server__, so, you look in
> > > Changes, upgrade all necessary stuff, or just upgrade blindly (as does
> > > happen sometimes, I believe) a distribution - and the server works, fine.
> > > What I find non-obvious, is that on updating the server you have to
> > > re-configure __clients__, see? Just think about a network somewhere in a
> >
> > If you upgrade the server and read "Changes", then a note in changes might
> > say that "you need to configure carefully or some clients could get in trouble."
> > (If the current "Changes" don't have that - post a documentation patch.)
>
> [This is more to Guennadi than Helge]
>
> I don't see why such a patch to "Changes" should be necessary. The
> problem is most definitely with the client hardware, and not the
> server software.
>
> The crux of this problem comes down to the SMC91C111 having only a
> small on-board packet buffer, which is capable of storing only about
> 4 packets (both TX and RX). This means that if you receive 8 packets
> with high enough interrupt latency, you _will_ drop some of those
> packets.
I believe the problem is in software... basically, UDP is broken. I don't
think you can call the hardware broken just because of a small RX ring. The
RX ring has to have some fixed size, and if the OS is not fast enough, well,
some packets go on the floor.
I believe the software should cope even with an RX ring just one packet big,
and that UDP is to blame...
Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]
On Wed, 14/01/2004 at 21:35, Trond Myklebust wrote:
> Err.. no...
>
> I didn't have a 2.6.1-mm3 machine ready to go in our GigE testbed (I'm
> busy compiling one up right now). However I did run a quick test on
> 2.6.0-test11. Iozone rather than bonnie, but the results should be
> comparable:
Hah.... They turned out not to be...
The changeset with key
[email protected][torvalds]|ChangeSet|20031230234945|63435
# ChangeSet
# 2003/12/30 15:49:45-08:00 [email protected]
# [PATCH] readahead: multiple performance fixes
#
# From: Ram Pai <[email protected]>
Has a devastating effect on NFS readahead. Using iozone in sequential
read mode, I get a 90% drop in performance on my GigE against a fast
NetApp filer. See the following test results.
Cheers,
Trond
------------------------
Iozone: Performance Test of File I/O
Version $Revision: 3.169 $
Compiled for 32 bit mode.
Build: linux
Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
Al Slater, Scott Rhine, Mike Wisner, Ken Goss
Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
Randy Dunlap, Mark Montague, Dan Million,
Jean-Marc Zucconi, Jeff Blomberg.
Run began: Thu Jan 15 13:38:42 2004
Include close in write timing
File size set to 2097152 KB
Record Size 128 KB
Command line used: /plymouth/trondmy/public/programs/fs/iozone -c -t1 -s 2048m -r 128k -i0 -i1
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 1 process
Each process writes a 2097152 Kbyte file in 128 Kbyte records
Children see throughput for 1 initial writers = 107477.06 KB/sec
Parent sees throughput for 1 initial writers = 107469.93 KB/sec
Min throughput per process = 107477.06 KB/sec
Max throughput per process = 107477.06 KB/sec
Avg throughput per process = 107477.06 KB/sec
Min xfer = 2097152.00 KB
Children see throughput for 1 rewriters = 108333.84 KB/sec
Parent sees throughput for 1 rewriters = 108326.78 KB/sec
Min throughput per process = 108333.84 KB/sec
Max throughput per process = 108333.84 KB/sec
Avg throughput per process = 108333.84 KB/sec
Min xfer = 2097152.00 KB
Children see throughput for 1 readers = 39179.22 KB/sec
Parent sees throughput for 1 readers = 39178.28 KB/sec
Min throughput per process = 39179.22 KB/sec
Max throughput per process = 39179.22 KB/sec
Avg throughput per process = 39179.22 KB/sec
Min xfer = 2097152.00 KB
Children see throughput for 1 re-readers = 14926.65 KB/sec
Parent sees throughput for 1 re-readers = 14926.62 KB/sec
Min throughput per process = 14926.65 KB/sec
Max throughput per process = 14926.65 KB/sec
Avg throughput per process = 14926.65 KB/sec
Min xfer = 2097152.00 KB
-------------------------------------------------------------------------------
If I remove just that one patch, then read performance is back to what
is expected:
Iozone: Performance Test of File I/O
Version $Revision: 3.169 $
Compiled for 32 bit mode.
Build: linux
Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
Al Slater, Scott Rhine, Mike Wisner, Ken Goss
Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
Randy Dunlap, Mark Montague, Dan Million,
Jean-Marc Zucconi, Jeff Blomberg.
Run began: Thu Jan 15 13:57:51 2004
Include close in write timing
File size set to 2097152 KB
Record Size 128 KB
Command line used: /plymouth/trondmy/public/programs/fs/iozone -c -t1 -s 2048m -r 128k -i0 -i1
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 1 process
Each process writes a 2097152 Kbyte file in 128 Kbyte records
Children see throughput for 1 initial writers = 109917.64 KB/sec
Parent sees throughput for 1 initial writers = 109915.87 KB/sec
Min throughput per process = 109917.64 KB/sec
Max throughput per process = 109917.64 KB/sec
Avg throughput per process = 109917.64 KB/sec
Min xfer = 2097152.00 KB
Children see throughput for 1 rewriters = 110838.23 KB/sec
Parent sees throughput for 1 rewriters = 110830.83 KB/sec
Min throughput per process = 110838.23 KB/sec
Max throughput per process = 110838.23 KB/sec
Avg throughput per process = 110838.23 KB/sec
Min xfer = 2097152.00 KB
Children see throughput for 1 readers = 146490.67 KB/sec
Parent sees throughput for 1 readers = 146487.63 KB/sec
Min throughput per process = 146490.67 KB/sec
Max throughput per process = 146490.67 KB/sec
Avg throughput per process = 146490.67 KB/sec
Min xfer = 2097152.00 KB
Children see throughput for 1 re-readers = 163164.72 KB/sec
Parent sees throughput for 1 re-readers = 163148.56 KB/sec
Min throughput per process = 163164.72 KB/sec
Max throughput per process = 163164.72 KB/sec
Avg throughput per process = 163164.72 KB/sec
Min xfer = 2097152.00 KB
Cheers,
Trond
On Thu, 2004-01-15 at 11:00, Trond Myklebust wrote:
> On Wed, 14/01/2004 at 21:35, Trond Myklebust wrote:
>
> > Err.. no...
> >
> > I didn't have a 2.6.1-mm3 machine ready to go in our GigE testbed (I'm
> > busy compiling one up right now). However I did run a quick test on
> > 2.6.0-test11. Iozone rather than bonnie, but the results should be
> > comparable:
>
> Hah.... They turned out not to be...
>
> The changeset with key
>
> [email protected][torvalds]|ChangeSet|20031230234945|63435
>
> # ChangeSet
> # 2003/12/30 15:49:45-08:00 [email protected]
> # [PATCH] readahead: multiple performance fixes
> #
> # From: Ram Pai <[email protected]>
>
Yes, this problem has been reported earlier. Attaching Andrew's patch,
which reverts a change. This should work - please confirm.
Thanks,
RP
On Thu, 15/01/2004 at 14:53, Ram Pai wrote:
> Yes this problem has been reported earlier. Attaching Andrew's patch
> that reverts a change. This should work. Please confirm.
Sorry to disappoint you, but that change already appears in 2.6.1-mm3,
and does not suffice to fix the problem.
Cheers,
Trond
On Thu, 15/01/2004 at 15:16, Trond Myklebust wrote:
> On Thu, 15/01/2004 at 14:53, Ram Pai wrote:
> > Yes this problem has been reported earlier. Attaching Andrew's patch
> > that reverts a change. This should work. Please confirm.
>
> Sorry to disappoint you, but that change already appears in 2.6.1-mm3,
> and does not suffice to fix the problem.
The following reversion is what fixes my regression. That puts the
sequential read numbers back to the 2.6.0 values of ~140MB/sec (from the
current 2.6.1 values of 14MB/second)...
Cheers,
Trond
diff -u --recursive --new-file linux-2.6.1-mm3/mm/readahead.c linux-2.6.1-fixed/mm/readahead.c
--- linux-2.6.1-mm3/mm/readahead.c 2004-01-09 01:59:07.000000000 -0500
+++ linux-2.6.1-fixed/mm/readahead.c 2004-01-15 15:41:35.118000000 -0500
@@ -474,13 +474,9 @@
/*
* This read request is within the current window. It is time
* to submit I/O for the ahead window while the application is
- * about to step into the ahead window.
- * Heuristic: Defer reading the ahead window till we hit
- * the last page in the current window. (lazy readahead)
- * If we read in earlier we run the risk of wasting
- * the ahead window.
+ * crunching through the current window.
*/
- if (ra->ahead_start == 0 && offset == (ra->start + ra->size -1)) {
+ if (ra->ahead_start == 0) {
ra->ahead_start = ra->start + ra->size;
ra->ahead_size = ra->next_size;
actual = do_page_cache_readahead(mapping, filp,
On Thu, 15 Jan 2004 03:35:35, Trond Myklebust wrote:
> On Wed, 14/01/2004 at 20:12, Miquel van Smoorenburg wrote:
> > On an NFS client (2.6.1-mm3, filesystem mounted with options
> > udp,nfsvers=3,rsize=32768,wsize=32768) I get for the same share as
> > write/rewrite/read speeds 36 / 4 / 38 MB/sec. CPU load is also
> > very high on the client for the rewrite case (80%).
> >
> > That's with back-to-back GigE, full duplex, MTU 9000, P IV 3.0 Ghz.
> > (I tried MTU 5000 and 1500 as well, doesn't really matter).
> >
> > Is that what would be expected ?
>
> Err.. no...
>
> I didn't have a 2.6.1-mm3 machine ready to go in our GigE testbed (I'm
> busy compiling one up right now). However I did run a quick test on
> 2.6.0-test11. Iozone rather than bonnie, but the results should be
> comparable:
Hmm OK. After the readahead fixes, my read speeds have increased
dramatically. Write speed is still slower than on local disk (30-40
MB/sec over NFS vs 80-100 MB/sec locally), but that might be a problem
with the 3ware driver; I'm looking at it. Or is it expected that NFS
writes are slower over the net (yes, I'm using fsync() at the end
both locally and remotely)?
Anyway, the bonnie rewrite case is still slow. It's something other than
the iozone rewrite, because that one is pretty fast. From the README:
Rewrite. Each BUFSIZ of the file is read with read(2), dirtied, and
rewritten with write(2), requiring an lseek(2). Since no space
allocation is done, and the I/O is well-localized, this should test the
effectiveness of the filesystem cache and the speed of data transfer.
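In other words, the rewrite pass boils down to a loop like the following (a
minimal sketch reconstructed from the README text above, not bonnie++'s
actual source; the open flags and error handling are assumptions):

/* Sketch of bonnie's "Rewrite" pass as described in the README:
 * read a block, dirty it, seek back over it, and write it out again.
 * Illustration only -- not bonnie++ source code. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char buf[BUFSIZ];
	ssize_t n;
	int fd;

	if (argc < 2 || (fd = open(argv[1], O_RDWR)) < 0)
		return 1;		/* test file created by the write pass */

	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		buf[0]++;			/* "dirty" the block        */
		lseek(fd, -n, SEEK_CUR);	/* step back over it        */
		if (write(fd, buf, n) != n)	/* rewrite it in place      */
			return 1;
	}
	close(fd);
	return 0;
}

Note how every block is read, modified and immediately written back in
place; over NFS that presumably mixes cached reads with a stream of small
dirtying writes to the same pages, which may be why it behaves so
differently from iozone's rewrite.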
I created a 1 GB ext2 image on ramdisk on the NFS server and exported
that to the client, then ran bonnie++ on the client:
# bonnie++ -u root:root -f -n 0 -r 256
Using uid:0, gid:0.
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
meghan 512M 118985 30 5173 99 294404 25 +++++ +++
The client has 512 MB RAM, and bonnie uses files of 2*RAM, but that wouldn't
fit on the 1 GB ramdisk, so I told bonnie I have 256 MB of RAM (-r 256);
that's why the sequential input numbers are a bit skewed.
I'm not sure if this operation is supposed to be that slow, but if you
want something to hack on during the weekend, install bonnie++ and see
why its rewrite test is so slow over NFS ;) Note the 99% CPU.
Oh, the rewrite test is also slow on local filesystems; I get about
25% of sequential write speed locally.
Mike.
On Mon, Jan 12, 2004 at 10:12:03AM -0500, Trond Myklebust wrote:
>
> > The 8k limit that you find in RFC1094 was an ad-hoc "limit" based purely
> > on testing using pre-1989 hardware. AFAIK most if not all of the
> > commercial vendors (Solaris, AIX, Windows/Hummingbird, EMC and Netapp)
> > are all currently setting the defaults to 32k block sizes for both TCP
> > and UDP.
> > Most of them want to bump that to a couple of Mbyte in the very near
> > future.
>
> Note: the future Mbyte sizes can, of course, only be supported on TCP
> since UDP has an inherent limit at 64k. The de-facto limit on UDP is
> therefore likely to remain at 32k (although I think at least one vendor
> has already tried pushing it to 48k).
Does the RPC max size limit change with memory or filesystem?
I have one system (K7 2200, 1.5GB, ext3) where it uses 32K RPCs, and another
(P2 300, 168MB, reiserfs3) where it uses 8k RPCs, even if I request larger max
sizes, and they're both running 2.6.1-bk2.
Strange...
On Fri, 16/01/2004 at 00:44, Mike Fedyk wrote:
> Does the RPC max size limit change with memory or filesystem?
>
> I have one system (K7 2200, 1.5GB, ext3) where it uses 32K RPCs, and another
> (P2 300, 168MB, reiserfs3) where it uses 8k RPCs, even if I request larger max
> sizes, and they're both running 2.6.1-bk2.
The maximum allowable size is set by the server. If the server is
running 2.6.1, then it should normally support 32k reads and writes
(unless there is a bug somewhere).
Cheers,
Trond
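As a rough mental model of that negotiation (a sketch only, not the actual
kernel code; the function and variable names here are made up for
illustration): the client asks for whatever rsize/wsize the mount options
specify, the server advertises its own maximum, and the effective transfer
size is the smaller of the two - so a server that only offers 8k silently
caps the client at 8k no matter what the mount options request.

/* Illustrative clamp only -- not the Linux NFS client's implementation. */
static unsigned int effective_xfer_size(unsigned int mount_opt,
					unsigned int server_max)
{
	/* e.g. mount_opt = 32768 from "rsize=32768",
	 *      server_max = 8192 advertised by an older/limited server */
	return mount_opt < server_max ? mount_opt : server_max;
}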
On Fri, Jan 16, 2004 at 01:05:45AM -0500, Trond Myklebust wrote:
> On Fri, 16/01/2004 at 00:44, Mike Fedyk wrote:
> > Does the RPC max size limit change with memory or filesystem?
> >
> > I have one system (K7 2200, 1.5GB, ext3) where it uses 32K RPCs, and another
> > (P2 300, 168MB, reiserfs3) where it uses 8k RPCs, even if I request larger max
> > sizes, and they're both running 2.6.1-bk2.
>
> The maximum allowable size is set by the server. If the server is
> running 2.6.1, then it should normally support 32k reads and writes
> (unless there is a bug somewhere).
The two systems above are nfs servers.
Mike
Ever since I upgraded to 2.6, my NFS performance has dropped.
I have an OpenBSD server running nfsd. The other boxes on my LAN,
running Windows and/or the 2.4.x kernel, have no performance problems.
Files transfer at normal speeds for a 100mbit connection. My main workstation,
which is running 2.6.1-mm4 (I have also had 2.6.1 and 2.6.0 on it), has
almost zero NFS performance. I get at most 500K/s. Anyone have ideas?
I upgraded to -mm4 having seen some NFS fixes in the patch, but none of them
seem to apply to my situation. Thanks in advance.
--Greg
On Sun, Jan 18, 2004 at 01:04:04AM -0500, Greg Fitzgerald wrote:
> Ever since I upgraded to 2.6, my NFS performance has dropped.
> I have an OpenBSD server running nfsd. The other boxes on my LAN,
> running Windows and/or the 2.4.x kernel, have no performance problems.
> Files transfer at normal speeds for a 100mbit connection. My main workstation,
> which is running 2.6.1-mm4 (I have also had 2.6.1 and 2.6.0 on it), has
> almost zero NFS performance. I get at most 500K/s. Anyone have ideas?
> I upgraded to -mm4 having seen some NFS fixes in the patch, but none of them
> seem to apply to my situation. Thanks in advance.
Are you using nfsd on 2.6 too?
Check /proc/mounts to see what your wsize and rsize are for the NFS mounts,
and lower them to see if that helps; 2.4 uses 8192.
Try NFS over TCP as well.