2008-02-14 15:40:58

by Font Bella

Subject: Performance question

Hi,

some of our apps are experiencing slow NFS performance in our new cluster
compared with the old one. The NFS setups for both clusters are very
similar, and we are wondering what's going on. The details of both setups
are given below for reference.

The problem seems to occur with apps that do heavy i/o, creating, writing,
reading, and deleting many files. However, writing or reading a large file
(as measured with `time dd if=/dev/zero of=2gbfile bs=1024 count=2000`) is
not slow.

We have performed some tests with the disk benchmark 'dbench', which reports
i/o throughput of 60 MB/s in the old cluster versus about 6 MB/s in the
new one.
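
For reference, the runs looked roughly like this (a sketch; the exact
flags depend on the dbench version, and the mount point is just an
example):

  # 10 simulated clients hammering the NFS mount for 60 seconds
  dbench -D /mnt/user -t 60 10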

After noticing this problem, we tried the user-mode NFS server instead of
the kernel-mode server, and just installing the user-mode server improved
throughput to 12 MB/s, but that is still far from the good old 60 MB/s.

After going through the "Optimizing NFS performance" section of the
NFS-Howto and tweaking the rsize,wsize parameters (the optimum seems to be
2048, which seems kind of weird to me, especially compared to the 8192 used
in the old cluster), throughput increased to 21 MB/s, but that is still far
from the old 60 MB/s.

We are stuck at this point. Any help/comment/suggestion will be greatly
appreciated.
/P

**************************** OLD CLUSTER *****************************

SATA disks.

Filesystem: ext3.

* the version of nfs-utils you are using: I don't know. It's the most
recent version in Debian sarge (oldstable).

user-mode nfs server.

nfs version 2, as reported with rpcinfo.

* the version of the kernel and any non-stock applied kernels: 2.6.12
* the distribution of linux you are using: Debian sarge i386 on Intel Xeon
processors.
* the version(s) of other operating systems involved: no other OS.

It is also useful to know the networking configuration connecting the hosts:
Typical beowulf setup, with all servers connected to a switch, 1Gb network.

/etc/exports:

/srv/homes 192.168.1.0/255.255.255.0 (rw,no_root_squash)

/etc/fstab:

server:/srv/homes/user /mnt/user nfs rw,hard,intr,rsize=8192,wsize=8192 0 0

**************************** NEW CLUSTER *****************************

SAS 10k disks.

Filesystem: ext3 over LVM.

* the version of nfs-utils you are using: I don't know. It's the most
recent version in Debian etch (stable).

kernel-mode nfs server.

nfs version 2, as reported with rpcinfo.

* the version of the kernel and any non-stock applied kernels: 2.6.18-5-amd64
* the distribution of linux you are using: Debian etch AMD64 on Intel Xeon
processors.
* the version(s) of other operating systems involved: no other OS.

It is also useful to know the networking configuration connecting the hosts:
Typical beowulf setup, with all servers connected to a switch, 1Gb network.

/etc/exports:

/srv/homes 192.168.1.0/255.255.255.0 (no_root_squash)

mount options:

rsize=8192,wsize=8192


2008-02-14 16:34:36

by Marcelo Leal

Subject: Re: Performance question

Hello all,
There is a big difference between accessing the raw disks directly and
going through LVM, possibly with some kind of RAID, etc. I think you
should use NFS v3, and it's hard to believe the new cluster is using v2
unless you explicitly configured it that way.
A big difference between v2 and v3 is that v2 is always "async", which is
a performance boost. Are you sure the new environment is not on v3?
In the new stable version of nfs-utils, Debian defaults to "sync". I'm
used to "8192" transfer sizes, which gave the best performance in my
tests.
It would be nice if you could test another network service writing to
that server, like FTP or iSCSI.
Another question: are the disks "local" or SAN? Is there any concurrency?

ps.: v2 has a 2GB file size limit AFAIK.

Leal.



--
pOSix rules

2008-02-14 16:57:32

by Chuck Lever III

Subject: Re: Performance question

On Feb 14, 2008, at 11:27 AM, Marcelo Leal wrote:
> Hello all,
> There is a big difference between accessing the raw disks directly and
> going through LVM, possibly with some kind of RAID, etc. I think you
> should use NFS v3, and it's hard to believe the new cluster is using v2
> unless you explicitly configured it that way.
> A big difference between v2 and v3 is that v2 is always "async", which is
> a performance boost. Are you sure the new environment is not on v3?
> In the new stable version of nfs-utils, Debian defaults to "sync". I'm
> used to "8192" transfer sizes, which gave the best performance in my
> tests.

As Marcelo suggested, this could be nothing more than the change in
default export options (see exports(8) -- the description of the
sync/async option) between sarge and etch. This was a change made in the
nfs-utils package a while back to improve data integrity guarantees
during server instability.

You can test this easily by explicitly specifying sync or async in
your /etc/exports and rerunning your test.
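
For example, something like this on the server (a sketch -- adjust the
network to your own setup, and note that exports(5) wants no space
between the host and its option list):

   # /etc/exports -- explicit sync (the current default)
   /srv/homes 192.168.1.0/255.255.255.0(rw,sync,no_root_squash)
   # ...versus explicit async
   /srv/homes 192.168.1.0/255.255.255.0(rw,async,no_root_squash)

Apply the change with "exportfs -ra" and rerun the benchmark.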

It especially affects NFSv2, as all NFSv2 writes are FILE_SYNC (i.e.
they must be committed to permanent storage before the server
replies) -- the async export option breaks that guarantee to improve
performance. There is some further description in the NFS FAQ at
http://nfs.sourceforge.net/ .

The preferred way to get "async" write performance is to use NFSv3.


--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2008-02-15 15:37:12

by Font Bella

Subject: Re: Performance question

Dear all,

I finally got it to work, after much pain and testing. Here are my config
notes (just for the record).
Thanks Marcelo and Chuck!

NFS setup
=========

Documentation
-------------

* http://billharlan.com/pub/papers/NFS_for_clusters.html
* http://nfs.sourceforge.net/nfs-howto/ar01s05.html#nfsd_daemon_instances

Setup
-----

We use the nfs-kernel-server package, i.e. the kernel-space NFS server,
which is faster than nfs-user-server.

We use NFS version 3.

Configuration
-------------

Make sure you are using NFS version 3. This seems to be the default with
the nfs-kernel-server package. Check from the client side with::

cat /proc/mounts
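
The mount entry should show vers=3, e.g. (illustrative output; the exact
fields vary by kernel)::

    $ grep nfs /proc/mounts
    server:/srv/homes/user /mnt/user nfs rw,vers=3,rsize=8192,wsize=8192,hard,intr,proto=udp,addr=... 0 0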

Use UDP for packet transmission, i.e. use the option 'proto=udp' in your
/etc/fstab, /etc/auto.home (if using automounts), or in general in any
mount command. Check from the client side, again with 'cat /proc/mounts'.
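
A sketch of an fstab entry with this option (server name and paths are
just examples, mirroring our old setup)::

    server:/srv/homes/user /mnt/user nfs rw,hard,intr,proto=udp,rsize=8192,wsize=8192 0 0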

Make sure you have enough nfsd server threads. See whether your server is
receiving too many overlapping requests with::

$ grep th /proc/net/rpc/nfsd

Ours isn't; still, we increased the number of threads used by the server
to 32 by setting RPCNFSDCOUNT=32 in /etc/default/nfs-kernel-server (the
Debian configuration file for the startup scripts). Remember to restart
nfs-kernel-server for the change to take effect.
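
The 'th' line looks roughly like this (the numbers here are
illustrative)::

    th 32 0 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

Per the NFS-Howto, the first number is the thread count, the second is
the number of times all threads were needed at once, and the trailing
histogram shows for how long various fractions of the threads were busy;
large values in the last buckets mean more threads are needed. On
Debian, apply a new RPCNFSDCOUNT with::

    /etc/init.d/nfs-kernel-server restart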

On the server side, use the 'async' option in /etc/exports. This was a
crucial step in getting good performance.
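
You can check which options an export is actually using with exportfs
(the output format varies a bit between versions)::

    # exportfs -v
    /srv/homes      192.168.1.0/255.255.255.0(rw,async,wdelay,no_root_squash)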

Finally, try different values of rsize and wsize in your /etc/fstab,
/etc/auto.home (if using automounts), or in general in any mount command,
and check them from the client side, again with 'cat /proc/mounts'. Run
your favourite benchmark with different rsize,wsize values and look for
the optimum.
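
A minimal sketch of such a sweep, reusing the dbench run from before
(rsize/wsize cannot be changed on a live mount, hence the
umount/mount)::

    for s in 1024 2048 4096 8192 16384; do
        umount /mnt/user
        mount -t nfs -o rw,hard,intr,proto=udp,rsize=$s,wsize=$s \
            server:/srv/homes/user /mnt/user
        echo "rsize=wsize=$s:"
        dbench -D /mnt/user -t 60 10
    done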

ALL the steps above were necessary for me to get good performance, but
the last step was crucial, since I got very different performance
depending on the value of rsize/wsize.




2008-02-15 16:13:55

by Trond Myklebust

Subject: Re: Performance question


On Fri, 2008-02-15 at 16:37 +0100, Font Bella wrote:

> Finally, try different values of rsize and wsize in your /etc/fstab,
> /etc/auto.home (if using automounts), or in general in any mount command,
> and check them from the client side, again with 'cat /proc/mounts'. Run
> your favourite benchmark with different rsize,wsize values and look for
> the optimum.
>
> ALL the steps above were necessary for me to get good performance, but
> the last step was crucial, since I got very different performance
> depending on the value of rsize/wsize.

That very likely implies that you have problems with UDP packet loss.
Switch to TCP.

Trond


2008-02-15 16:18:52

by Chuck Lever III

Subject: Re: Performance question

On Feb 15, 2008, at 10:37 AM, Font Bella wrote:
> Dear all,
>
> I finally got it to work, after much pain and testing. Here are my config
> notes (just for the record).
> [...]
>
> ALL the steps above were necessary for me to get good performance, but
> the last step was crucial, since I got very different performance
> depending on the value of rsize/wsize.

I'm glad you were able to make progress. 32 server threads is
actually fairly conservative; you might consider 128 or more if you
have more than a few clients.

I want to make sure you understand the limitations and risks of using
UDP and the "async" export option, however.

1. "async" is no longer the default because it introduces a silent
data corruption risk. With NFSv3, data write operations are already
asynchronous, with a subsequent COMMIT, so that they are safe. The
client now knows when data has hit stable storage and can thus delete
its cached copy safely.

I urge you to read the NFS FAQ discussion on the "async" export
option and reconsider its use in production.

2. UDP is no longer the default because it also introduces a silent
data corruption risk, since the IP ID field (which UDP depends on for
reassembling datagrams larger than a single link-layer frame) is only
16 bits wide. If this field should wrap, datagram reassembly is
compromised. The UDP datagram checksum is weak enough that the
receiving end probably won't detect the reassembly errors.

In addition, UDP will likely perform poorly in situations involving
more than a few clients. Its congestion control is unable to handle
large amounts of concurrent network traffic, since it doesn't have a
packet ACK mechanism like TCP does. The fact that your performance was
best at such a small r/wsize (you mentioned 2048 in your earlier e-mail)
suggests you have a network environment that would benefit enormously
from using TCP.


So, our recommendation these days is to use the default "sync" export
setting, and use NFSv3 over TCP if at all possible. (The HOWTO may
be out of date in this regard). If you are not able to achieve good
performance results with these settings, you can e-mail the list
again and we can do further analysis.
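
For instance, a client fstab line along the lines of the one posted
earlier, switched to TCP (a sketch; names are examples):

   server:/srv/homes/user /mnt/user nfs rw,hard,intr,proto=tcp,rsize=8192,wsize=8192 0 0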




--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2008-02-18 09:39:48

by Font Bella

Subject: Re: Performance question

I tried the TCP and async options, but I still get poor performance in my
benchmarks (a dbench run with 10 clients). Below I have tabulated the
outcome of my tests, which shows that in my setting there is a huge
difference between sync and async, and between UDP and TCP. Any
comments/suggestions are warmly welcome.

I also tried 128 server threads, as Chuck suggested, but this doesn't
seem to affect performance. That makes sense, since we only have a dozen
clients.

About sync/async: I am not very concerned about corrupt data if the
cluster goes down. We do mostly computing, no crucial database
transactions or anything like that. Our users wouldn't mind some degree
of data corruption in case of a power failure, but speed is crucial.

Our network setting is just a dozen servers connected to a switch.
Everything (adapters/cables/switch) is 1 gigabit. We use ethernet bonding
to double the networking speed.

Here are the test results. I didn't measure SYNC+UDP, since SYNC+TCP
already gives very poor performance. Admittedly, my test is very simple,
and I should probably try something more complete, like IOzone. But the
dbench run seems to reproduce the bottleneck we've been observing in our
cluster.

Thanks,
/P


********************** ASYNC option in server ******************************

rsize,wsize     TCP          UDP

1024            24   MB/s    34 MB/s
2048            35   MB/s    49 MB/s
4096            37   MB/s    75 MB/s
8192            40.4 MB/s    35 MB/s
16386           40.2 MB/s    19 MB/s

********************** SYNC option in server ******************************

rsize,wsize     TCP          UDP

1024            6    MB/s    -- (not measured)
2048            7.44 MB/s    --
4096            7.33 MB/s    --
8192            7    MB/s    --
16386           7    MB/s    --

On Feb 15, 2008 5:13 PM, Trond Myklebust <[email protected]> wrote:
>
> That very likely implies that you have problems with UDP packet loss.
> Switch to TCP.
>
> Trond
>
>

2008-02-18 16:59:58

by Chuck Lever III

Subject: Re: Performance question

On Feb 18, 2008, at 4:39 AM, Font Bella wrote:
> I tried the TCP and async options, but I still get poor performance in my
> benchmarks (a dbench run with 10 clients). Below I have tabulated the
> outcome of my tests, which shows that in my setting there is a huge
> difference between sync and async, and between UDP and TCP. Any
> comments/suggestions are warmly welcome.
>
> I also tried 128 server threads, as Chuck suggested, but this doesn't
> seem to affect performance. That makes sense, since we only have a dozen
> clients.

Each Linux client mount point can generate up to 16 server requests
by default. A dozen clients each with a single mount point can
generate 192 concurrent requests. So 128 server threads is not as
outlandish as you might think.

In this case, you are likely hitting some other bottleneck before the
clients can utilize all the server threads.
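
(The per-mount limit is the RPC slot table size. On 2.6 kernels you can
inspect and raise it with sysctl -- illustrative output, showing the
module defaults:

   $ sysctl sunrpc.tcp_slot_table_entries sunrpc.udp_slot_table_entries
   sunrpc.tcp_slot_table_entries = 16
   sunrpc.udp_slot_table_entries = 16

New values apply only to mounts created after the change.)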

> About sync/async: I am not very concerned about corrupt data if the
> cluster goes down. We do mostly computing, no crucial database
> transactions or anything like that. Our users wouldn't mind some degree
> of data corruption in case of a power failure, but speed is crucial.

The data corruption is silent. If it weren't, you could simply
restore from a backup as soon as you recover from a server crash.
Silent corruption spreads into your backed up data, and starts
causing strange application errors, sometimes a long time after the
corruption first occurred.

> Our network setting is just a dozen servers connected to a switch.
> Everything (adapters/cables/switch) is 1 gigabit. We use ethernet bonding
> to double the networking speed.
>
> Here are the test results. I didn't measure SYNC+UDP, since SYNC+TCP
> already gives very poor performance. Admittedly, my test is very simple,
> and I should probably try something more complete, like IOzone. But the
> dbench run seems to reproduce the bottleneck we've been observing in our
> cluster.

I assume the dbench test is read and write only (little or no
metadata activity like file creation and deletion). How closely does
dbench reflect your production workload?

I see from your initial e-mail that your server file system is:

> SAS 10k disks.
>
> Filesystem: ext3 over LVM.

Have you tried testing over NFS with a file system that resides on a
single physical disk? If you have done a read-only test versus a
write-only test, how do the numbers compare? Have you tested a range of
write sizes, from small files up to files larger than the server's
memory?

> ********************** ASYNC option in server ******************************
>
> rsize,wsize     TCP          UDP
>
> 1024            24   MB/s    34 MB/s
> 2048            35   MB/s    49 MB/s
> 4096            37   MB/s    75 MB/s
> 8192            40.4 MB/s    35 MB/s
> 16386           40.2 MB/s    19 MB/s

As the size of the read and write requests increase, your UDP
throughput decreases markedly. This does indicate some packet loss,
so TCP is going to provide consistent performance and much lower risk
to data integrity as your network and client workloads increase.

You might try this test again and watch your clients' ethernet
bandwidth and RPC retransmit rate to see what I mean. At the 16386
setting, the UDP test may be pumping significantly more packets onto
the network, but it is getting only about 20 MB/s through. This will
certainly have some effect on other traffic on the network.
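
One way to watch the retransmit rate from a client is nfsstat, which
ships with nfs-utils; run it before and after a test and compare the
counters (the output below is illustrative):

   $ nfsstat -rc
   Client rpc stats:
   calls      retrans    authrefrsh
   123456     789        0

A retrans count that grows into a noticeable fraction of calls points
to packet loss.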

The first thing I check in these instances is that gigabit ethernet
flow control is enabled in both directions on all interfaces (both
host and switch).
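
On the Linux hosts, ethtool can show and set pause-frame settings (the
interface name is an example; the switch side has to be checked in the
switch's own management interface):

   # show current flow control settings
   ethtool -a eth0
   # enable flow control in both directions
   ethtool -A eth0 rx on tx on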

In addition, using larger r/wsize settings on your clients means the
server can perform disk reads and writes more efficiently, which will
help your server scale with increasing client workloads.

By examining your current network carefully, you might be able to
boost the performance of NFS over both UDP and TCP. With bonded
gigabit, you should be able to push network throughput past 200 MB/s
using a test like iperf, which doesn't touch the disks. Thus, at least
NFS reads from files already in the server's page cache ought to fly
in this configuration.
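
A minimal iperf run to measure raw TCP throughput between two hosts
(iperf 2.x syntax; the hostname is an example):

   # on the server
   iperf -s
   # on a client: four parallel streams for 30 seconds
   iperf -c server -P 4 -t 30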

> ********************** SYNC option in server ******************************
>
> rsize,wsize     TCP          UDP
>
> 1024            6    MB/s    -- (not measured)
> 2048            7.44 MB/s    --
> 4096            7.33 MB/s    --
> 8192            7    MB/s    --
> 16386           7    MB/s    --

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com