Hello list,
Setup;
NFS server (dual opteron, HW RAID, SCA disk enclosure) on 2.6.11.6
NFS client (dual PIII) on 2.6.11.6
Both on switched gigabit ethernet - I use NFSv3 over UDP (tried TCP but
this makes no difference).
Problem; during simple tests such as a 'cp largefile0 largefile1' on the
client (under the mountpoint from the NFS server), the client becomes
extremely laggy, NFS writes are slow, and I see very high CPU
utilization by bdflush and rpciod.
For example, writing a single 8G file with dd will give me about
20MB/sec (I get 60+ MB/sec locally on the server), and the client rarely
drops below 40% system CPU utilization.
I tried profiling the client (booting with profile=2), but the profile
traces do not make sense; a profile from a single write test where the
client did not at any time drop below 30% system time (and was frequently
at 40-50%) gives me something like:
raven:~# less profile3 | sort -nr | head
257922 total                     2.6394
254739 default_idle           5789.5227
   960 smp_call_function         4.0000
   888 __might_sleep             5.6923
   569 finish_task_switch        4.7417
   176 kmap_atomic               1.7600
   113 __wake_up                 1.8833
    74 kmap                      1.5417
    64 kunmap_atomic             5.3333
The difference between default_idle and total is 1.2% - but I never saw
system CPU utilization under 30%...
Besides, there's basically nothing in the profile that corresponds to
rpciod or bdflush (the two biggest CPU hogs in top during the test).
What do I do?
Performance sucks and the profiles do not make sense...
Any suggestions would be greatly appreciated,
Thank you!
--
/ jakob
on den 06.04.2005 Klokka 18:01 (+0200) skreiv Jakob Oestergaard:
> What do I do?
>
> Performance sucks and the profiles do not make sense...
>
> Any suggestions would be greatly appreciated,
A look at "nfsstat" might help, as might "netstat -s".
In particular, I suggest looking at the "retrans" counter in nfsstat.
When you say that TCP did not help, please note that if retrans is high,
then using TCP with a large value for timeo (for instance -otimeo=600)
is a good idea. It is IMHO a bug for the "mount" program to be setting
default timeout values of less than 30 seconds when using TCP.
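For example, something along these lines (server name and mount point are
just placeholders):

  nfsstat -rc        # client-side RPC stats, including the retrans counter
  mount -t nfs -o nfsvers=3,tcp,timeo=600 server:/export /mnt/nfs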
Cheers,
Trond
--
Trond Myklebust <[email protected]>
On Wed, Apr 06, 2005 at 06:01:23PM +0200, Jakob Oestergaard wrote:
>
> Problem; during simple tests such as a 'cp largefile0 largefile1' on the
> client (under the mountpoint from the NFS server), the client becomes
> extremely laggy, NFS writes are slow, and I see very high CPU
> utilization by bdflush and rpciod.
>
> For example, writing a single 8G file with dd will give me about
> 20MB/sec (I get 60+ MB/sec locally on the server), and the client rarely
> drops below 40% system CPU utilization.
How large is the client's RAM? What does the following command report
before and during the write?
egrep 'nfs_page|nfs_write_data' /proc/slabinfo
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
On Wed, Apr 06, 2005 at 05:28:56PM -0400, Trond Myklebust wrote:
...
> A look at "nfsstat" might help, as might "netstat -s".
>
> In particular, I suggest looking at the "retrans" counter in nfsstat.
When doing a 'cp largefile1 largefile2' on the client, I see approx. 10
retransmissions per second in nfsstat.
I don't really know if this is a lot...
I also see packets dropped in ifconfig - approx. 10 per second... I
wonder if these two are related.
Client has an intel e1000 card - I just set the RX ring buffer to the
max. of 4096 (up from the default of 256), but this doesn't seem to help
a lot (I see the 10 drops/sec with the large RX buffer).
I use NAPI - is there anything else I can do to make the card not drop
packets? I'm just assuming that this might at least be part of the
problem, but with a large RX ring and NAPI I don't know how much else I
can do to keep the box from dropping incoming data...
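For reference, the ring sizes and per-driver drop counters can be checked
with something like the following (assuming the card is eth0; the counter
names in the -S output are driver-specific):

  ethtool -g eth0              # show current/maximum ring sizes
  ethtool -G eth0 rx 4096      # set the RX ring size
  ethtool -S eth0              # driver statistics - look for rx drop/miss counters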
> When you say that TCP did not help, please note that if retrans is high,
> then using TCP with a large value for timeo (for instance -otimeo=600)
> is a good idea. It is IMHO a bug for the "mount" program to be setting
> default timeout values of less than 30 seconds when using TCP.
I can try that.
Thanks!
--
/ jakob
On Thu, Apr 07, 2005 at 09:19:06AM +1000, Greg Banks wrote:
...
> How large is the client's RAM?
2GB - (32 bit kernel because it's dual PIII, so I use highmem)
A few more details:
With standard VM settings, the client will be laggy during the copy, but
it will also have a load average around 10 (!) And really, the only
thing I do with it is one single 'cp' operation. The CPU hogs are
pdflush, rpciod/0 and rpciod/1.
I tweaked the VM a bit, put the following in /etc/sysctl.conf:
vm.dirty_writeback_centisecs=100
vm.dirty_expire_centisecs=200
The defaults are 500 and 3000 respectively...
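(These can be applied without a reboot with "sysctl -p /etc/sysctl.conf"
after editing the file, or tried on the fly with e.g.
"sysctl -w vm.dirty_writeback_centisecs=100".)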
This improved things a lot; the client is now "almost not very laggy",
and load stays in the saner 1-2 range.
Still, system CPU utilization is very high (still from rpciod and
pdflush - more rpciod and less pdflush though), and the file copying
performance over NFS is roughly half of what I get locally on the server
(8G file copy with 16MB/sec over NFS versus 32 MB/sec locally).
(I run with plenty of knfsd threads on the server, and generally the
server is not very loaded when the client is pounding it as much as it
can)
> What does the following command report
> before and during the write?
>
> egrep 'nfs_page|nfs_write_data' /proc/slabinfo
During the copy I typically see:
nfs_write_data 681 952 480 8 1 : tunables 54 27 8 : slabdata 119 119 108
nfs_page 15639 18300 64 61 1 : tunables 120 60 8 : slabdata 300 300 180
The "18300" above typically goes from 12000 to 25000...
After the copy I see:
nfs_write_data 36 48 480 8 1 : tunables 54 27 8 : slabdata 5 6 0
nfs_page 1 61 64 61 1 : tunables 120 60 8 : slabdata 1 1 0
--
/ jakob
On Thu, Apr 07, 2005 at 05:38:48PM +0200, Jakob Oestergaard wrote:
> On Thu, Apr 07, 2005 at 09:19:06AM +1000, Greg Banks wrote:
> ...
> > How large is the client's RAM?
>
> 2GB - (32 bit kernel because it's dual PIII, so I use highmem)
Ok, that's probably not enough to fully trigger some of the problems
I've seen on large-memory NFS clients.
> A few more details:
>
> With standard VM settings, the client will be laggy during the copy, but
> it will also have a load average around 10 (!) And really, the only
> thing I do with it is one single 'cp' operation. The CPU hogs are
> pdflush, rpciod/0 and rpciod/1.
NFS writes of single files much larger than client RAM still have
interesting issues.
> I tweaked the VM a bit, put the following in /etc/sysctl.conf:
> vm.dirty_writeback_centisecs=100
> vm.dirty_expire_centisecs=200
>
> The defaults are 500 and 3000 respectively...
Yes, you want more frequent and smaller writebacks. It may help to
reduce vm.dirty_ratio and possibly vm.dirty_background_ratio.
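For instance (these numbers are just a starting point to experiment with,
not tested values):

  vm.dirty_ratio=10
  vm.dirty_background_ratio=5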
> This improved things a lot; the client is now "almost not very laggy",
> and load stays in the saner 1-2 range.
>
> Still, system CPU utilization is very high (still from rpciod and
> pdflush - more rpciod and less pdflush though),
This is probably the rpciods and pdflush all trying to do things
at the same time and contending for the BKL.
> During the copy I typically see:
>
> nfs_write_data 681 952 480 8 1 : tunables 54 27 8 : slabdata 119 119 108
> nfs_page 15639 18300 64 61 1 : tunables 120 60 8 : slabdata 300 300 180
That's not so bad, it's only about 3% of the system's pages.
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
to den 07.04.2005 Klokka 17:38 (+0200) skreiv Jakob Oestergaard:
> I tweaked the VM a bit, put the following in /etc/sysctl.conf:
> vm.dirty_writeback_centisecs=100
> vm.dirty_expire_centisecs=200
>
> The defaults are 500 and 3000 respectively...
>
> This improved things a lot; the client is now "almost not very laggy",
> and load stays in the saner 1-2 range.
OK. That hints at what is causing the latencies on the server: I'll bet
it is the fact that the page reclaim code tries to be clever, and uses
NFSv3 STABLE writes in order to be able to free up the dirty pages
immediately. Could you try the following patch, and see if that makes a
difference too?
Cheers,
Trond
----
fs/nfs/write.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6.12-rc2/fs/nfs/write.c
===================================================================
--- linux-2.6.12-rc2.orig/fs/nfs/write.c
+++ linux-2.6.12-rc2/fs/nfs/write.c
@@ -305,7 +305,7 @@ do_it:
if (err >= 0) {
err = 0;
if (wbc->for_reclaim)
- nfs_flush_inode(inode, 0, 0, FLUSH_STABLE);
+ nfs_flush_inode(inode, 0, 0, 0);
}
} else {
err = nfs_writepage_sync(ctx, inode, page, 0,
--
Trond Myklebust <[email protected]>
On Thu, Apr 07, 2005 at 12:17:51PM -0400, Trond Myklebust wrote:
> to den 07.04.2005 Klokka 17:38 (+0200) skreiv Jakob Oestergaard:
>
> > I tweaked the VM a bit, put the following in /etc/sysctl.conf:
> > vm.dirty_writeback_centisecs=100
> > vm.dirty_expire_centisecs=200
> >
> > The defaults are 500 and 3000 respectively...
> >
> > This improved things a lot; the client is now "almost not very laggy",
> > and load stays in the saner 1-2 range.
>
> OK. That hints at what is causing the latencies on the server: I'll bet
> it is the fact that the page reclaim code tries to be clever, and uses
> NFSv3 STABLE writes in order to be able to free up the dirty pages
> immediately. Could you try the following patch, and see if that makes a
> difference too?
The patch alone without the tweaked VM settings doesn't cure the lag - I
think it's better than without the patch (I can actually type this mail
with a large copy running). With the tweaked VM settings too, it's
pretty good - still a little lag, but not enough to really make it
annoying.
Performance is pretty much the same as before (copying an 8GiB file with
15-16MiB/sec - about half the performance of what I get locally on the
file server).
Something that worries me: it seems that 2.4.25 is a lot faster as an NFS
client than 2.6.11.6, most notably on writes - see the following
tiobench results (2000 MiB files, tests with 1, 2 and 4 threads) up
against the same NFS server:
2.4.25: (dual athlon MP 1.4GHz, 1G RAM, Intel e1000)
File Block Num Seq Read Rand Read Seq Write Rand Write
Dir Size Size Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------- ------ ------- --- ----------- ----------- ----------- -----------
. 2000 4096 1 58.87 54.9% 5.615 5.03% 44.40 44.2% 4.534 8.41%
. 2000 4096 2 56.98 58.3% 6.926 6.64% 41.61 58.0% 4.462 10.8%
. 2000 4096 4 53.90 59.0% 7.764 9.44% 39.85 61.5% 4.256 10.8%
2.6.11.6: (dual PIII 1GHz, 2G RAM, Intel e1000)
File Block Num Seq Read Rand Read Seq Write Rand Write
Dir Size Size Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------- ------ ------- --- ----------- ----------- ----------- -----------
. 2000 4096 1 38.34 18.8% 19.61 6.77% 22.53 23.4% 6.947 15.6%
. 2000 4096 2 52.82 29.0% 24.42 9.37% 24.20 27.0% 7.755 16.7%
. 2000 4096 4 62.48 34.8% 33.65 17.0% 24.73 27.6% 8.027 15.4%
44MiB/sec for 2.4 versus 22MiB/sec for 2.6 - any suggestions as to how
this could be improved?
(note: the write performance doesn't change notably with VM tuning or
with the one-liner change that Trond suggested)
--
/ jakob
lau den 09.04.2005 Klokka 23:35 (+0200) skreiv Jakob Oestergaard:
> 2.6.11.6: (dual PIII 1GHz, 2G RAM, Intel e1000)
>
> File Block Num Seq Read Rand Read Seq Write Rand Write
> Dir Size Size Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
> ------- ------ ------- --- ----------- ----------- ----------- -----------
> . 2000 4096 1 38.34 18.8% 19.61 6.77% 22.53 23.4% 6.947 15.6%
> . 2000 4096 2 52.82 29.0% 24.42 9.37% 24.20 27.0% 7.755 16.7%
> . 2000 4096 4 62.48 34.8% 33.65 17.0% 24.73 27.6% 8.027 15.4%
>
>
> 44MiB/sec for 2.4 versus 22MiB/sec for 2.6 - any suggestions as to how
> this could be improved?
What happened to the retransmission rates when you changed to TCP?
Note that on TCP (besides bumping the value for timeo) I would also
recommend using a full 32k r/wsize instead of 4k (if the network is of
decent quality, I'd recommend 32k for UDP too).
The other tweak you can apply for TCP is to bump the value
for /proc/sys/sunrpc/tcp_slot_table_entries. That will allow you to have
several more RPC requests in flight (although that will also tie up more
threads on the server).
Don't forget that you need to unmount then mount again after making
these changes (-oremount won't suffice).
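For example (export and mount point are placeholders, and 64 is just an
example value for the slot table):

  echo 64 > /proc/sys/sunrpc/tcp_slot_table_entries
  umount /mnt/nfs
  mount -t nfs -o tcp,timeo=600,rsize=32768,wsize=32768 server:/export /mnt/nfs

Note that tcp_slot_table_entries needs to be set before the mount, since
the slot table is sized when the transport is set up.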
Cheers,
Trond
--
Trond Myklebust <[email protected]>
On Sat, Apr 09, 2005 at 05:52:32PM -0400, Trond Myklebust wrote:
> lau den 09.04.2005 Klokka 23:35 (+0200) skreiv Jakob Oestergaard:
>
> > 2.6.11.6: (dual PIII 1GHz, 2G RAM, Intel e1000)
> >
> > File Block Num Seq Read Rand Read Seq Write Rand Write
> > Dir Size Size Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
> > ------- ------ ------- --- ----------- ----------- ----------- -----------
> > . 2000 4096 1 38.34 18.8% 19.61 6.77% 22.53 23.4% 6.947 15.6%
> > . 2000 4096 2 52.82 29.0% 24.42 9.37% 24.20 27.0% 7.755 16.7%
> > . 2000 4096 4 62.48 34.8% 33.65 17.0% 24.73 27.6% 8.027 15.4%
> >
> >
> > 44MiB/sec for 2.4 versus 22MiB/sec for 2.6 - any suggestions as to how
> > this could be improved?
>
> What happened to the retransmission rates when you changed to TCP?
tcp with timeo=600 causes retransmits (as seen with nfsstat) to drop to
zero.
>
> Note that on TCP (besides bumping the value for timeo) I would also
> recommend using a full 32k r/wsize instead of 4k (if the network is of
> decent quality, I'd recommend 32k for UDP too).
32k seems to be the default for both UDP and TCP.
The network should be of decent quality - e1000 on client, tg3 on
server, both with short cables into a gigabit switch with plenty of
backplane headroom.
> The other tweak you can apply for TCP is to bump the value
> for /proc/sys/sunrpc/tcp_slot_table_entries. That will allow you to have
> several more RPC requests in flight (although that will also tie up more
> threads on the server).
Changing only to TCP gives me:
File Block Num Seq Read Rand Read Seq Write Rand Write
Dir Size Size Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------- ------ ------- --- ----------- ----------- ----------- -----------
. 2000 4096 1 47.04 65.2% 50.57 26.2% 24.24 29.7% 6.896 28.7%
. 2000 4096 2 55.77 66.1% 61.72 31.9% 24.13 33.0% 7.646 26.6%
. 2000 4096 4 61.94 68.9% 70.52 42.5% 25.65 35.6% 8.042 26.7%
Pretty much the same as before - with writes being suspiciously slow
(compared to good ole' 2.4.25)
With tcp_slot_table_entries bumped to 64 on the client (128 knfsd
threads on the server, same as in all tests), I see:
File Block Num Seq Read Rand Read Seq Write Rand Write
Dir Size Size Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------- ------ ------- --- ----------- ----------- ----------- -----------
. 2000 4096 1 60.50 67.6% 30.12 14.4% 22.54 30.1% 7.075 27.8%
. 2000 4096 2 59.87 69.0% 34.34 19.0% 24.09 35.2% 7.805 30.0%
. 2000 4096 4 62.27 69.8% 44.87 29.9% 23.07 34.3% 8.239 30.9%
So, reads start off better, it seems, but writes are still half speed of
2.4.25.
I should say that it is common to see a single rpciod thread hogging
100% CPU for 20-30 seconds - that looks suspicious to me. Other than
that, I can't really point my finger at anything in this setup.
Any suggestions Trond? I'd be happy to run some tests for you if you
have any idea how we can speed up those writes (or reads for that
matter, although I am fairly happy with those).
--
/ jakob
må den 11.04.2005 Klokka 09:48 (+0200) skreiv Jakob Oestergaard:
> tcp with timeo=600 causes retransmits (as seen with nfsstat) to drop to
> zero.
Good.
> File Block Num Seq Read Rand Read Seq Write Rand Write
> Dir Size Size Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
> ------- ------ ------- --- ----------- ----------- ----------- -----------
> . 2000 4096 1 60.50 67.6% 30.12 14.4% 22.54 30.1% 7.075 27.8%
> . 2000 4096 2 59.87 69.0% 34.34 19.0% 24.09 35.2% 7.805 30.0%
> . 2000 4096 4 62.27 69.8% 44.87 29.9% 23.07 34.3% 8.239 30.9%
>
> So, reads start off better, it seems, but writes are still half speed of
> 2.4.25.
>
> I should say that it is common to see a single rpciod thread hogging
> 100% CPU for 20-30 seconds - that looks suspicious to me, other than
> that, I can't really point my finger at anything in this setup.
That certainly shouldn't be the case (and isn't on any of my setups). Is
the behaviour identical on both the PIII and the Opteron systems?
As for the WRITE rates, could you send me a short tcpdump from the
"sequential write" section of the above test? Just use "tcpdump -s 90000
-w binary.dmp" for a couple of seconds. I'd like to check the
latencies, and just check that you are indeed sending unstable writes
with not too many commit or getattr calls.
Cheers
Trond
--
Trond Myklebust <[email protected]>
On Mon, Apr 11, 2005 at 08:35:39AM -0400, Trond Myklebust wrote:
...
> That certainly shouldn't be the case (and isn't on any of my setups). Is
> the behaviour identical on both the PIII and the Opteron systems?
The dual opteron is the nfs server
The dual athlon is the 2.4 nfs client
The dual PIII is the 2.6 nfs client
> As for the WRITE rates, could you send me a short tcpdump from the
> "sequential write" section of the above test? Just use "tcpdump -s 90000
> -w binary.dmp" just for a couple of seconds. I'd like to check the
> latencies, and just check that you are indeed sending unstable writes
> with not too many commit or getattr calls.
Certainly;
http://unthought.net/binary.dmp.bz2
I got an 'invalid snaplen' with the 90000 you suggested; the above dump
was done with 9000 - if you need another snaplen, please just let me know.
A little explanation for the IPs you see;
sparrow/10.0.1.20 - nfs server
raven/10.0.1.7 - 2.6 nfs client
osprey/10.0.1.13 - NIS/DNS server
Thanks,
--
/ jakob
må den 11.04.2005 Klokka 15:47 (+0200) skreiv Jakob Oestergaard:
> Certainly;
>
> http://unthought.net/binary.dmp.bz2
>
> I got an 'invalid snaplen' with the 90000 you suggested, the above dump
> is done with 9000 - if you need another snaplen please just let me know.
So, the RPC itself looks good, but it also looks as if after a while you
are running into some heavy retransmission problems with TCP too (at the
TCP level now, instead of at the RPC level). When you get into that
mode, it looks as if every 2nd or 3rd TCP segment being sent from the
client is being lost...
That can mean either that the server is dropping fragments, or that the
client is dropping the replies. Can you generate a similar tcpdump on
the server?
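(Something like "netstat -s | grep -i retrans" on the client, and the
error/drop counters in "netstat -s" and "ifconfig" on the server, may also
help show which side is losing them - but the tcpdump is the most direct
evidence.)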
Cheers,
Trond
--
Trond Myklebust <[email protected]>
On Mon, Apr 11, 2005 at 10:35:25AM -0400, Trond Myklebust wrote:
> må den 11.04.2005 Klokka 15:47 (+0200) skreiv Jakob Oestergaard:
>
> > Certainly;
> >
> > http://unthought.net/binary.dmp.bz2
> >
> > I got an 'invalid snaplen' with the 90000 you suggested, the above dump
> > is done with 9000 - if you need another snaplen please just let me know.
>
> So, the RPC itself looks good, but it also looks as if after a while you
> are running into some heavy retransmission problems with TCP too (at the
> TCP level now, instead of at the RPC level). When you get into that
> mode, it looks as if every 2nd or 3rd TCP segment being sent from the
> client is being lost...
Odd...
I'm really sorry for taking up your time if this ends up being just a
networking problem.
> That can mean either that the server is dropping fragments, or that the
> client is dropping the replies. Can you generate a similar tcpdump on
> the server?
Certainly; http://unthought.net/sparrow.dmp.bz2
--
/ jakob
må den 11.04.2005 Klokka 16:41 (+0200) skreiv Jakob Oestergaard:
> > That can mean either that the server is dropping fragments, or that the
> > client is dropping the replies. Can you generate a similar tcpdump on
> > the server?
>
> Certainly; http://unthought.net/sparrow.dmp.bz2
So, it looks to me as if "sparrow" is indeed dropping packets (missed
sequences). Is it running with NAPI enabled too?
Cheers,
Trond
--
Trond Myklebust <[email protected]>
On Mon, Apr 11, 2005 at 11:21:45AM -0400, Trond Myklebust wrote:
> må den 11.04.2005 Klokka 16:41 (+0200) skreiv Jakob Oestergaard:
>
> > > That can mean either that the server is dropping fragments, or that the
> > > client is dropping the replies. Can you generate a similar tcpdump on
> > > the server?
> >
> > Certainly; http://unthought.net/sparrow.dmp.bz2
>
> So, it looks to me as if "sparrow" is indeed dropping packets (missed
> sequences). Is it running with NAPI enabled too?
Yes, as far as I know - the Broadcom Tigon3 driver does not have the
option of enabling/disabling RX polling (if we agree that is what we're
talking about), but looking in tg3.c it seems that it *always*
unconditionally uses NAPI...
sparrow:~# ifconfig
eth0 Link encap:Ethernet HWaddr 00:09:3D:10:BB:1E
inet addr:10.0.1.20 Bcast:10.0.1.255 Mask:255.255.255.0
inet6 addr: fe80::209:3dff:fe10:bb1e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2304578 errors:0 dropped:0 overruns:0 frame:0
TX packets:2330829 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2381644307 (2.2 GiB) TX bytes:2191756317 (2.0 GiB)
Interrupt:169
No dropped packets... I wonder if the tg3 driver is being completely
honest about this...
Still, 2.4 manages to perform twice as fast against the same server.
And, the 2.6 client still has extremely heavy CPU usage (from rpciod
mainly, which doesn't show up in profiles)
The plot thickens...
Trond (or anyone else feeling they might have some insight they would
like to share on this one), I'll do anything you say (ok, *almost*
anything you say) - any ideas?
--
/ jakob
On Tue, 2005-04-12 at 01:42, Jakob Oestergaard wrote:
> Yes, as far as I know - the Broadcom Tigon3 driver does not have the
> option of enabling/disabling RX polling (if we agree that is what we're
> talking about), but looking in tg3.c it seems that it *always*
> unconditionally uses NAPI...
I've whined and moaned about this in the past, but for all its
faults NAPI on tg3 doesn't lose packets. It does cause a huge
increase in irq cpu time on multiple fast CPUs. What irq rate
are you seeing?
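A crude way to measure it is to watch the eth0 line in /proc/interrupts
over a second or so, e.g.

  while true; do grep eth0 /proc/interrupts; sleep 1; done

or keep an eye on the "in" column of "vmstat 1".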
I did once post a patch to make NAPI for tg3 selectable at
configure time.
http://marc.theaimsgroup.com/?l=linux-netdev&m=107183822710263&w=2
> No dropped packets... I wonder if the tg3 driver is being completely
> honest about this...
At one point it wasn't; since this patch, it is:
http://marc.theaimsgroup.com/?l=linux-netdev&m=108433829603319&w=2
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
On Tue, Apr 12, 2005 at 11:03:29AM +1000, Greg Banks wrote:
> On Tue, 2005-04-12 at 01:42, Jakob Oestergaard wrote:
> > Yes, as far as I know - the Broadcom Tigon3 driver does not have the
> > option of enabling/disabling RX polling (if we agree that is what we're
> > talking about), but looking in tg3.c it seems that it *always*
> > unconditionally uses NAPI...
>
> I've whined and moaned about this in the past, but for all its
> faults NAPI on tg3 doesn't lose packets. It does cause a huge
> increase in irq cpu time on multiple fast CPUs. What irq rate
> are you seeing?
Around 20,000 interrupts per second during the large write, on the IRQ
where eth0 is (this is not shared with anything else).
[sparrow:joe] $ cat /proc/interrupts
CPU0 CPU1
...
169: 3853488 412570512 IO-APIC-level eth0
...
But still, guys, it is the *same* server with tg3 that runs well with a
2.4 client but poorly with a 2.6 client.
Maybe I'm just staring myself blind at this, but I can't see how a
general problem on the server (such as packet loss, latency or whatever)
would cause no problems with a 2.4 client but major problems with a 2.6
client.
--
/ jakob
On Tue, Apr 12, 2005 at 11:28:43AM +0200, Jakob Oestergaard wrote:
...
>
> But still, guys, it is the *same* server with tg3 that runs well with a
> 2.4 client but poorly with a 2.6 client.
>
> Maybe I'm just staring myself blind at this, but I can't see how a
> general problem on the server (such as packet loss, latency or whatever)
> would cause no problems with a 2.4 client but major problems with a 2.6
> client.
Another data point;
I upgraded my mom's machine from an earlier 2.6 (don't remember which,
but I can find out) to 2.6.11.6.
It mounts a home directory from a 2.6.6 NFS server - the client and
server are on a hub'ed 100Mbit network.
On the earlier 2.6 client I/O performance was as one would expect on
hub'ed 100Mbit - meaning, not exactly stellar, but you'd get around 4-5
MB/sec and decent interactivity.
The typical workload here is storing or retrieving large TIFF files on
the client, while working with other things in KDE. So, if the
large-file NFS I/O causes NFS client stalls, it will be noticeable on the
desktop (probably when Konqueror or whatever is accessing configuration or
cache files).
With 2.6.11.6 the client is virtually unusable when large files are
transferred. A "df -h" will hang on the mounted filesystem for several
seconds, and I have my mom on the phone complaining that various windows
won't close and that her machine is too slow (*again*; it's no more than
half a year since she got the new P4) ;)
Now there are plenty of things to start optimizing: RPC over TCP, using a
switch or a crossover cable instead of the hub, etc.
However, what changed here was the client kernel going from an earlier
2.6 to 2.6.11.6.
A lot happened to the NFS client in 2.6.11 - I wonder if any of those
patches are worth trying to revert? I have several setups that suck
currently, so I'm willing to try a thing or two :)
I would try
---
<[email protected]>
RPC: Convert rpciod into a work queue for greater flexibility.
Signed-off-by: Trond Myklebust <[email protected]>
---
if no one has a better idea... But that's just a hunch based solely on
my observation of rpciod being a CPU hog on one of the earlier client
systems. I didn't observe large 'sy' times in vmstat on this client
while it hung on NFS though...
Any suggestions would be greatly appreciated,
--
/ jakob
ty den 19.04.2005 Klokka 21:45 (+0200) skreiv Jakob Oestergaard:
> It mounts a home directory from a 2.6.6 NFS server - the client and
> server are on a hub'ed 100Mbit network.
>
> On the earlier 2.6 client I/O performance was as one would expect on
> hub'ed 100Mbit - meaning, not exactly stellar, but you'd get around 4-5
> MB/sec and decent interactivity.
OK, hold it right there...
So, IIRC the problem was that you were seeing abominable retrans rates
on UDP and TCP, and you are using a 100Mbit hub rather than a switch?
What does the collision LED look like, when you see these performance
problems?
Also, does that hub support NICs that do autonegotiation? (I'll bet the
answer is "no").
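(Something like "mii-tool eth0" or "ethtool eth0", run on both the client
and the server, should show what speed/duplex each NIC actually settled
on.)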
Cheers,
Trond
--
Trond Myklebust <[email protected]>
On Tue, Apr 19, 2005 at 06:46:28PM -0400, Trond Myklebust wrote:
> ty den 19.04.2005 Klokka 21:45 (+0200) skreiv Jakob Oestergaard:
>
> > It mounts a home directory from a 2.6.6 NFS server - the client and
> > server are on a hub'ed 100Mbit network.
> >
> > On the earlier 2.6 client I/O performance was as one would expect on
> > hub'ed 100Mbit - meaning, not exactly stellar, but you'd get around 4-5
> > MB/sec and decent interactivity.
>
> OK, hold it right there...
>
...
> Also, does that hub support NICs that do autonegotiation? (I'll bet the
> answer is "no").
*blush*
Ok Trond, you got me there - I don't know why upgrading the client made
the problem so much more visible, but the *server* had negotiated
full duplex rather than half (the client negotiated half OK). Fixing
that on the server side made the client pleasant to work with again.
Mom's a happy camper now again ;)
Sorry for jumping the gun there...
To get back to the original problem;
I wonder if (as was discussed) the tg3 driver on my NFS server is
dropping packets, causing the 2.6.11 NFS client to misbehave... This
didn't make sense to me before (since earlier clients worked well), but
having just seen this other case where a broken server setup caused
2.6.11 clients to misbehave (where earlier clients were fine), maybe it
could be an explanation...
Will try either changing tg3 driver or putting in an e1000 on my NFS
server - I will let you know about the status on this when I know more.
Thanks all,
--
/ jakob
On Wed, Apr 20, 2005 at 03:57:58PM +0200, Jakob Oestergaard wrote:
...
> Will try either changing tg3 driver or putting in an e1000 on my NFS
> server - I will let you know about the status on this when I know more.
tg3 or e1000 on the NFS server doesn't make a noticeable difference.
Now, I tried booting the 2.6.11 NFS client in uniprocessor mode
(thinking the rpciod threads might be wasting their time contending for
a lock), and that turned out to be interesting.
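(For reference: an SMP kernel can be pinned to one CPU by booting with
"maxcpus=1" or "nosmp", but a kernel built without CONFIG_SMP also drops
the spinlock overhead entirely, so the two are not quite equivalent.)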
Performance on SMP NFS client:
File Block Num Seq Read Rand Read Seq Write Rand Write
Dir Size Size Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------- ------ ------- --- ----------- ----------- ----------- -----------
. 2000 4096 1 47.53 80.0% 5.013 2.79% 22.34 32.2% 6.510 14.9%
. 2000 4096 2 45.29 78.6% 8.068 5.44% 24.53 34.1% 7.042 14.9%
. 2000 4096 4 45.38 78.0% 11.02 7.95% 25.13 35.1% 7.525 18.0%
Performance on UP NFS client:
File Block Num Seq Read Rand Read Seq Write Rand Write
Dir Size Size Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------- ------ ------- --- ----------- ----------- ----------- -----------
. 2000 4096 1 57.11 54.7% 69.60 24.9% 35.09 14.2% 6.656 19.1%
. 2000 4096 2 60.11 58.8% 70.99 30.8% 33.82 14.1% 7.283 25.1%
. 2000 4096 4 67.89 59.8% 42.10 19.1% 29.86 12.7% 7.850 26.4%
So, by booting the NFS client in uniprocessor mode, I got a 50% write
performance boost and a 20% read performance boost, and the tests use
about half the CPU time.
Isn't this a little disturbing? :)
--
/ jakob
su den 24.04.2005 Klokka 09:15 (+0200) skreiv Jakob Oestergaard:
> Performance on SMP NFS client:
> File Block Num Seq Read Rand Read Seq Write Rand Write
> Dir Size Size Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
> ------- ------ ------- --- ----------- ----------- ----------- -----------
> . 2000 4096 1 47.53 80.0% 5.013 2.79% 22.34 32.2% 6.510 14.9%
> . 2000 4096 2 45.29 78.6% 8.068 5.44% 24.53 34.1% 7.042 14.9%
> . 2000 4096 4 45.38 78.0% 11.02 7.95% 25.13 35.1% 7.525 18.0%
>
> Performance on UP NFS client:
> File Block Num Seq Read Rand Read Seq Write Rand Write
> Dir Size Size Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
> ------- ------ ------- --- ----------- ----------- ----------- -----------
> . 2000 4096 1 57.11 54.7% 69.60 24.9% 35.09 14.2% 6.656 19.1%
> . 2000 4096 2 60.11 58.8% 70.99 30.8% 33.82 14.1% 7.283 25.1%
> . 2000 4096 4 67.89 59.8% 42.10 19.1% 29.86 12.7% 7.850 26.4%
>
> So, by booting the NFS client in uniprocessor mode, I got a 50% write
> performance boost, 20% read perforamance boost, and the tests use about
> half the CPU time.
>
> Isn't this a little disturbing? :)
Actually, the most telling difference here is with the random read rates,
which show up to a 1000% difference. I seriously doubt that has much to
do with lock contention (given that the sequential reads show 20%, as you
said).
Could you once again have a look at the retransmission rates (both UDP
and TCP), comparing the SMP and UP cases?
Cheers,
Trond
--
Trond Myklebust <[email protected]>
On Sun, Apr 24, 2005 at 11:09:58PM -0400, Trond Myklebust wrote:
...
> Actually, the most telling difference here is with the random read rates
> which shows up to 1000% difference. I seriously doubt that has much to
> do with lock contention (given that the sequential reads show 20% as you
> said).
>
> Could you once again have a look at the retransmission rates (both UDP
> and TCP), comparing the SMP and UP cases?
Ok Trond, I've spent the better part of today producing new numbers -
can't say I have anything conclusive, except that
1) 2.4 is a better NFS client for these benchmarks than 2.6, most
notably wrt. writes
2) Performance over NFS is roughly half of local disk performance
I'm going to just go with the performance I'm seeing as of now, but keep
an eye out for improvements - if you come up with performance related
patches you want tested, I'll be happy to give them a spin.
Numbers etc. follow; in case you're interested - hey, I wasted most of
today producing them, so I might as well send them to the list in case
anyone can use them for anything ;)
I ran the benchmarks again with UP/SMP, UDP/TCP - I get around 5-20
retransmissions per second during writes when using UDP mounts,
typically a few retransmissions per second (1-3) during reads when
using UDP mounts, and 0 (zero) retransmissions on TCP mounts. UP/SMP
makes no noticeable difference here.
I also tried booting a 2.6.10 (SMP) kernel on the client, and I have
rerun all the benchmarks with considerably larger files (it was stupid
to use 2G test files when both client and server have 2G memory - not
that it makes a huge difference though, but it could explain the
extremely high random read/write rates we saw earlier - these high
random rates have disappeared now that I use 4G files for testing - what
a stupid mistake to make...).
To sum up, I've taken the highest rate seen in each of the tests (not
caring whether the rate was seen with 1, 2 or 4 threads), and written it
all up in the little matrix below:
When re-testing, it is normal to see +/- 5% deviation in the rates.
I did some of the tests with both tg3 and e1000 on the server. The
server did at all times run 2.6.11.6. Numbers are in MiB/sec
server NIC: e1000 tg3
--------------- read write read write
2.6.11 up/udp 47 42
2.6.11 up/tcp 38 34
2.6.11 smp/udp 40 35
2.6.11 smp/tcp 34 31
2.6.10 smp/udp 45 39 44 40
2.6.10 smp/tcp 40 33 38 31
And just to make sure 2.4.25 is still alive and kicking: This test was
done on the other client machine (e1000 NIC, dual athlon):
server NIC: tg3
--------------- read write
2.4.25 smp/udp 45 52
Finally a local benchmark on the file server, just to see what we can
get taking NFS out of the equation:
----------------- read write
2.6.11 smp/local 76 65
Hope some of this is worth something to someone ;)
Thanks for all the help and feedback!
--
/ jakob