2007-08-15 17:24:52

by Romain Dolbeau

Subject: Latency problem with some clients but not others

Hello all,

I have a strange latency problem that affects some clients
but not others. The server is an x86_64 Debian machine.

I've created a test case with just two 64-bit clients. Both use
the same kernel and have the same packages installed. They both
mount the same filesystems at the same mountpoints through amd (the
maps are distributed through NIS).

The user is the same, with a single login through SSH.
Nothing is running (except kdm) on either client.
All machines are directly hooked to the same gigabit switch.
The network traffic was extremely low during the test.
Both clients were freshly rebooted.

One of the clients is a dual Xeon 5130 system, with an on-board
Intel NIC (module e1000). The other is a single Core 2 Duo 6320,
with an on-board ??? NIC (module r8169).

When doing a ./configure (lots of small r/w accesses) inside one
of the NFS-mounted filesystems, the first system is fairly fast,
while the other is much slower - each line of the configure script
takes up to a second to display a result. *But*, pure throughput is
fine - if I use dd to write or read a large file, the speed is what
I would expect from the wire.
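
To put a number on the "fast dd, slow ./configure" symptom, a small
probe that mimics many tiny file operations can be run on each client.
This is only a rough sketch: the function name, file names, file count
and payload size are all illustrative assumptions, not something taken
from the setup described above.

```python
import os
import time


def time_small_file_ops(directory, count=200, payload=b"x" * 64):
    """Create, write, read back, stat and delete `count` tiny files.

    Returns the mean wall-clock time per file, in seconds.
    All names and sizes here are hypothetical, chosen for illustration.
    """
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, f"latency-probe-{i}.tmp")
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the data out instead of caching it
        with open(path, "rb") as f:
            f.read()
        os.stat(path)
        os.remove(path)
    elapsed = time.perf_counter() - start
    return elapsed / count
```

Calling time_small_file_ops() with a directory on the NFS mount from a
fast and a slow client, and comparing the two per-file times, gives a
repeatable figure that is independent of any particular configure script.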

The problem is reproducible on all clients similar to the
second system, but I also have other clients (for instance
old 32-bit systems with 3c59x cards) that do not exhibit
the problem. In fact, it seems that all my 32-bit clients
are fast (well, as fast as they can be :-), and all my 64-bit
clients are slow, except two: the one above, and an old Pentium D
machine with an e100 card.

Any idea where I should look? My only clue is that all my slow
64-bit clients use the same driver (r8169), whereas the fast ones
use e1000 and e100. Could that be the source of the problem?
(I don't have a spare NIC to try.) Is there any known problem
with such cheap on-board NICs? How could I tell?

Thanks in advance for any help.

P.S. just in case it's significant...

* mount displays: "type nfs
(nodev,nosuid,nounmount,noatime,rsize=8192,wsize=8192,vers=3,proto=tcp)"
for all filesystems on all clients.
* kernel is current Debian stable (2.6.18-4) or testing (2.6.21-2),
same symptoms for both.
* I've tried both the included r8169 driver and the one from Realtek,
same symptoms for both.

--
Romain Dolbeau
<[email protected]>


-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2007-08-15 17:41:26

by Tom Tucker

Subject: Re: Latency problem with some clients but not others

Could be unrelated but...

I had a similar issue with a Supermicro motherboard that had a new
version of the Intel e1000 LOM chip. Updating the driver from the Intel
web site made the problem go away.




2007-08-15 17:58:07

by Eli Stair

Subject: Re: Latency problem with some clients but not others


You can set some significant performance variables affecting your Ethernet devices with ethtool. I personally turn off TSO (TCP segmentation offload), as I have seen it cause issues in production on e1000 chipsets from a number of different vendors, with a variety of drivers. I haven't seen a lot of data on the new NAPI-mode e1000 driver, but it could introduce latency issues with its "enhanced" interrupt coalescing and buffering. I'm not sure how you tune the internals on THAT driver, but on every other NIC you do so with the 'ethtool -C eth{n}' commands.
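
As a rough illustration of the ethtool settings mentioned above, the sketch below only assembles the command lines and, by default, prints them instead of running them. The interface name "eth0" and the rx-usecs value are assumptions made for the example, and a given driver may reject some of these options.

```python
import subprocess


def ethtool_commands(iface, tso=False, rx_usecs=None):
    """Build ethtool argv lists: show current settings, then change them.

    `iface`, `tso` and `rx_usecs` are illustrative parameters; pick values
    that match the machine being tested.
    """
    cmds = [
        ["ethtool", "-k", iface],  # show current offload settings
        ["ethtool", "-c", iface],  # show current coalescing settings
        ["ethtool", "-K", iface, "tso", "on" if tso else "off"],
    ]
    if rx_usecs is not None:
        # Interrupt coalescing delay; not every driver supports changing it.
        cmds.append(["ethtool", "-C", iface, "rx-usecs", str(rx_usecs)])
    return cmds


def run_commands(cmds, dry_run=True):
    """Print the commands, or execute them when dry_run is False (needs root)."""
    for argv in cmds:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
```

Running run_commands(ethtool_commands("eth0", rx_usecs=0)) prints the four commands so they can be reviewed before being applied one at a time, which makes it easier to tell which setting actually changed the behaviour.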

YMMV on those particular fronts, but I've had good luck fixing Intel gigabit issues by disabling offload functions, and improving bandwidth performance on a number of Broadcom chipsets with ethtool coalesce settings. I'd suggest you get some hard numbers for comparison before making any changes, using netperf or iperf. That way you can at least quantify it.
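
Since bulk throughput can look fine while per-request latency is poor, a request/response measurement is a useful complement to a bandwidth figure. The following is a minimal stand-in for such a test, not a replacement for netperf or iperf; the function names, message size and round count are assumptions for illustration, and the demo runs over loopback only.

```python
import socket
import threading
import time


def echo_server(listener):
    """Accept one connection and echo every byte back until the peer closes."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)


def measure_rtt(host, port, rounds=200, size=64):
    """Return the middle (upper median) round-trip time, in seconds."""
    payload = b"x" * size
    samples = []
    with socket.create_connection((host, port)) as sock:
        # Send small messages immediately instead of letting Nagle batch them.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(rounds):
            start = time.perf_counter()
            sock.sendall(payload)
            received = 0
            while received < size:
                chunk = sock.recv(size - received)
                if not chunk:
                    raise ConnectionError("peer closed the connection")
                received += len(chunk)
            samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]


def loopback_demo():
    """Start an echo server on 127.0.0.1 and measure the RTT against it."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(target=echo_server, args=(listener,), daemon=True)
    worker.start()
    rtt = measure_rtt("127.0.0.1", port)
    worker.join(timeout=5)
    listener.close()
    return rtt
```

To compare real machines, echo_server() would run on the NFS server and measure_rtt() on each client; a large gap between the e1000 and r8169 clients before any tuning would point at the NIC or its driver and not at NFS itself.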

If I entirely misunderstood your post, and you're having problems with the 8169 systems, not the Intel Xeon with e1000, you might want to seriously consider ditching the Realtek (8169) and testing with another card in that problematic system. Easiest and most direct way to test.


/eli



2007-08-16 07:24:32

by Romain Dolbeau

Subject: Re: Latency problem with some clients but not others

Eli Stair wrote:

> If I entirely misunderstood your post, and you're having problems with
> the 8169 systems, not the Intel Xeon with e1000, you might want to
> seriously consider ditching the Realtek (8169) and testing with another
> card in that problematic system. Easiest and most direct way to test.

I must not have been very clear, because the other answer
also suggested fixing the Intel, which works just fine ;-)

The problem does indeed seem to be the Realtek crap^H^H^Hard:
after replacing it with an old 3Com card brought by a colleague,
the problem just goes away.

I'll try fiddling with ethtool and if it fails, I'll just
have to buy a bunch of decent PCI cards :-(

Thanks to all for your help,

--
Romain Dolbeau
<[email protected]>

