2008-02-06 10:39:57

by Jesper Krogh

Subject: NFS performance (Currently 2.6.20)


Hi.

I'm currently trying to optimize our NFS server. We're running in a
cluster setup with a single NFS server and some compute nodes pulling data
from it. Currently the dataset is less than 10GB so it fits in memory of
the NFS-server. (confirmed via vmstat 1).
Currently I'm getting around 500mbit (700 peak) off the server on a
gigabit link, and the server is CPU-bottlenecked when this happens. Clients
have iowait around 30-50%.

Is it reasonable to expect to be able to fill a gigabit link in this
scenario? (I'd like to put in a 10Gbit interface, but not while I have a
CPU bottleneck.)

Should I go for NFSv2 (the default if I don't change mount options),
NFSv3, or NFSv4?

The NFSv3 default mount options give around 1MB for rsize and wsize, but
the nfs man page suggests setting them "up to" around 32K.

I probably only need some pointers to the documentation.

Thanks.
--
Jesper Krogh



2008-02-06 14:44:07

by Gabriel Barazer

Subject: Re: NFS performance (Currently 2.6.20)

Hi,

On 02/06/2008 11:04:34 AM +0100, "Jesper Krogh" <[email protected]> wrote:
> Hi.
>
> I'm currently trying to optimize our NFS server. We're running in a
> cluster setup with a single NFS server and some compute nodes pulling data
> from it. Currently the dataset is less than 10GB so it fits in memory of
> the NFS-server. (confirmed via vmstat 1).
> Currently I'm getting around 500mbit (700 peak) off the server on a
> gigabit link, and the server is CPU-bottlenecked when this happens. Clients
> have iowait around 30-50%.

I have a similar setup, and I'm very curious how you can read an
"iowait" value from the clients: On my nodes (server 2.6.21.5/clients
2.6.23.14), the iowait counter is only incremented when dealing with
block devices, and since my nodes are diskless my iowait is near 0%.

Maybe I'm wrong, but when the NFS server lags, it is my system counter
that increases (with peaks at 30% system instead of 5-10%).

> Is it reasonable to expect to be able to fill a gigabit link in this
> scenario? (I'd like to put in a 10Gbit interface, but not while I have a
> CPU bottleneck.)

I'm sure this is possible, but it is very dependent on the kind of
traffic you have. If you only have data to pull (which theoretically
never invalidates the page cache on the server), and you use options
like 'noatime,nodiratime' to avoid NFS updating the access times, it
seems possible to me. But maybe your CPU is busy doing something other
than just handling NFS traffic. Maybe you should change your network
controller? I use the Intel Gigabit ones (integrated ESB2 with the e1000
driver) with rx-polling and Intel I/OAT enabled (DMA engine), and this
really helps by reducing interrupts when dealing with a lot of traffic.
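
For example, on the server the exported filesystem could be mounted along
these lines (just a sketch; the device, filesystem type and mount point
below are placeholders for whatever you actually use):

  # server-side /etc/fstab entry for the exported filesystem (hypothetical)
  /dev/sdb1   /export   ext3   defaults,noatime,nodiratime   0  2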

You will have to check your kernel config to see whether I/OAT is
enabled in the "DMA engines" section.

>
> Should I go for NFSv2 (the default if I don't change mount options),
> NFSv3, or NFSv4?

NFSv2/3 have nearly the same performance, and NFSv4 takes a slight
performance hit, probably because of its "earliness": it's too early to
work on performance while the features are not yet completely stable.

>
> The NFSv3 default mount options give around 1MB for rsize and wsize, but
> the nfs man page suggests setting them "up to" around 32K.

The values for the rsize and wsize mount options depend on the amount of
memory you have (on the server, AFAIK), and when you have >4GB the values
are not very realistic anymore. On my systems the defaults come out at
512KB for rsize/wsize and everything runs fine, but I'm sure there is
some work to be done to adjust the buffer sizes more precisely when
dealing with large amounts of memory (e.g. a 1MB buffer is nonsense).
The 32k value is a very old one, and the man page doesn't even explain
the memory-related rsize/wsize values.
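
If you want to pin them down explicitly instead of relying on what gets
negotiated, a client fstab entry could look something like this (the
server name, export path, mount point and sizes are only examples):

  # client-side /etc/fstab entry with explicit NFSv3 buffer sizes (example)
  nfsserver:/export   /data   nfs   vers=3,rsize=32768,wsize=32768,hard,intr   0  0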

>
> I probably only need some pointers to the documentation.

And the documentation probably needs a refresh, but things are
changing nearly every week here...

Gabriel

2008-02-06 15:18:24

by Trond Myklebust

Subject: Re: NFS performance (Currently 2.6.20)


On Wed, 2008-02-06 at 15:37 +0100, Gabriel Barazer wrote:

> >
> > Should I go for NFSv2 (the default if I don't change mount options),
> > NFSv3, or NFSv4?
>
> NFSv2/3 have nearly the same performance

Only if you shoot yourself in the foot by setting the 'async' flag
in /etc/exports. Don't do that...

Most people will want to use NFSv3 for performance reasons. Unlike NFSv2
with 'async', NFSv3 with the 'sync' export flag set actually does _safe_
server-side caching of writes.
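
In /etc/exports terms that means something like the following (the export
path and client range are only an example):

  # /etc/exports on the server -- 'sync' is the safe choice with NFSv3
  /export   192.168.0.0/24(rw,sync,no_subtree_check)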

Trond


2008-02-06 16:00:11

by Jesper Krogh

Subject: Re: NFS performance (Currently 2.6.20)

> Hi,
>> I'm currently trying to optimize our NFS server. We're running in a
>> cluster setup with a single NFS server and some compute nodes pulling
>> data from it. Currently the dataset is less than 10GB so it fits in
>> memory of the NFS-server. (confirmed via vmstat 1). Currently I'm
>> getting around 500mbit (700 peak) off the server on a gigabit link, and
>> the server is CPU-bottlenecked when this happens. Clients have iowait
>> around 30-50%.
>
> I have a similar setup, and I'm very curious how you can read an
> "iowait" value from the clients: On my nodes (server 2.6.21.5/clients
> 2.6.23.14), the iowait counter is only incremented when dealing with
> block devices, and since my nodes are diskless my iowait is near 0%.

Output in top is like this:
top - 16:51:01 up 119 days, 6:10, 1 user, load average: 2.09, 2.00, 1.41
Tasks: 74 total, 2 running, 72 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.2%us, 0.0%sy, 0.0%ni, 50.0%id, 49.8%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2060188k total, 2047488k used, 12700k free, 2988k buffers
Swap: 4200988k total, 42776k used, 4158212k free, 1985500k cached

>> Is it reasonable to expect to be able to fill a gigabit link in this
>> scenario? (I'd like to put in a 10Gbit interface, but not while I have a
>> CPU bottleneck.)
>
> I'm sure this is possible, but it is very dependent on the kind of
> traffic you have. If you only have data to pull (which theoretically
> never invalidates the page cache on the server), and you use options
> like 'noatime,nodiratime' to avoid NFS updating the access times, it
> seems possible to me. But maybe your CPU is busy doing something other
> than just handling NFS traffic. Maybe you should change your network
> controller? I use the Intel Gigabit ones (integrated ESB2 with the e1000
> driver) with rx-polling and Intel I/OAT enabled (DMA engine), and this
> really helps by reducing interrupts when dealing with a lot of traffic.

It is a Sun V20Z (dual Opteron); the NIC is:
02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704
Gigabit Ethernet (rev 03)

Jesper
--
Jesper Krogh


2008-02-06 18:25:00

by Gabriel Barazer

Subject: Re: NFS performance (Currently 2.6.20)

On 02/06/2008 4:18:16 PM +0100, Trond Myklebust
<[email protected]> wrote:
> On Wed, 2008-02-06 at 15:37 +0100, Gabriel Barazer wrote:
>
>>> Should I go for NFSv2 (the default if I don't change mount options),
>>> NFSv3, or NFSv4?
>> NFSv2/3 have nearly the same performance
>
> Only if you shoot yourself in the foot by setting the 'async' flag
> in /etc/exports. Don't do that...
>
> Most people will want to use NFSv3 for performance reasons. Unlike NFSv2
> with 'async', NFSv3 with the 'sync' export flag set actually does _safe_
> server-side caching of writes.
>

Oops (tm)! Fortunately I do mostly reads, but maybe the exports(5) man
page should be updated. According to the man page, I thought that
although writes aren't committed to the block devices, the server-side
cache is correctly synchronized (but lost if you pull the plug). Thanks
for the explanation. Having a battery-backed large write cache on the
server, is there a performance hit when switching from async to sync in
NFSv3?

Off-topic: maybe the warning issued when the 'sync' option is omitted at
export time should only be shown when the 'async' option is actually
used? We really want to warn people before too many feet are shot :-)

To Jesper: I found that using the 'nolock' flag at mount time on the
NFS clients improves performance, but obviously only if you don't need
write locks (and your setup seems to do only intensive reads).
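
i.e. something along these lines on the clients (the server name and
paths here are only placeholders):

  # read-mostly client mount without the NLM locking protocol (sketch)
  nfsserver:/export   /data   nfs   ro,vers=3,nolock   0  0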

Gabriel

2008-02-06 18:46:30

by Trond Myklebust

Subject: Re: NFS performance (Currently 2.6.20)


On Wed, 2008-02-06 at 19:24 +0100, Gabriel Barazer wrote:
> Oops (tm)! Fortunately I do mostly reads, but maybe the exports(5) man
> page should be updated. According to the man page, I thought that
> although writes aren't committed to the block devices, the server-side
> cache is correctly synchronized (but lost if you pull the plug).

...or if the server crashes for some reason.

> Thanks
> for the explanation. Having a battery-backed large write cache on the
> server, is there a performance hit when switching from async to sync in
> NFSv3?

The main performance hits occur on operations like create(), mkdir(),
rename() and unlink(), since they are required to be immediately synced to
disk.
IOW: there will be a noticeable overhead when writing lots of small
files.

For large files, the overhead should be minimal, since all writes can be
cached by the server until the close() operation.

Trond


2008-02-06 20:04:28

by Gabriel Barazer

Subject: Re: NFS performance (Currently 2.6.20)

On 02/06/2008 4:59:39 PM +0100, "Jesper Krogh" <[email protected]> wrote:

>> I have a similar setup, and I'm very curious how you can read an
>> "iowait" value from the clients: On my nodes (server 2.6.21.5/clients
>> 2.6.23.14), the iowait counter is only incremented when dealing with
>> block devices, and since my nodes are diskless my iowait is near 0%.
>
> Output in top is like this:
> top - 16:51:01 up 119 days, 6:10, 1 user, load average: 2.09, 2.00, 1.41
> Tasks: 74 total, 2 running, 72 sleeping, 0 stopped, 0 zombie
> Cpu(s): 0.2%us, 0.0%sy, 0.0%ni, 50.0%id, 49.8%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 2060188k total, 2047488k used, 12700k free, 2988k buffers
> Swap: 4200988k total, 42776k used, 4158212k free, 1985500k cached

You obviously have a block device on your nodes, so I suspect that
something is reading/writing to it. Looking at how much memory is used,
your system must be constantly swapping. This could explain why your
iowait is so high (if your swap space is a block device or a file on a
block device. You don't use swap over NFS, do you?)

> It is a Sun V20Z (dual Opteron); the NIC is:
> 02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704
> Gigabit Ethernet (rev 03)

I don't know if this adapter supports a DMA engine (there is no mention
on the Broadcom specs page). I've only seen such a technology with the
Intel I/O Acceleration Technology (I/OAT) implementation, which the
mainstream Linux kernel supports. But I have really seen the difference.
I suppose your controllers are integrated on the motherboard?
Another thing which could make a difference: maybe you could compile
your kernel with a lower timer frequency (CONFIG_HZ) such as 100Hz; this
results in fewer interrupts being processed and higher throughput.
(Very dirty explanation, I know.)
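
In the kernel configuration that would be something like this (assuming
you build your own kernel; found under "Processor type and features" ->
"Timer frequency"):

  # .config fragment for a 100Hz timer
  CONFIG_HZ_100=y
  # CONFIG_HZ_250 is not set
  # CONFIG_HZ_1000 is not set
  CONFIG_HZ=100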

Gabriel

2008-02-06 20:24:51

by Jesper Krogh

Subject: Re: NFS performance (Currently 2.6.20)

Gabriel Barazer wrote:
> On 02/06/2008 4:59:39 PM +0100, "Jesper Krogh" <[email protected]> wrote:
>
>>> I have a similar setup, and I'm very curious how you can read an
>>> "iowait" value from the clients: On my nodes (server 2.6.21.5/clients
>>> 2.6.23.14), the iowait counter is only incremented when dealing with
>>> block devices, and since my nodes are diskless my iowait is near 0%.
>>
>> Output in top is like this:
>> top - 16:51:01 up 119 days, 6:10, 1 user, load average: 2.09, 2.00, 1.41
>> Tasks: 74 total, 2 running, 72 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 0.2%us, 0.0%sy, 0.0%ni, 50.0%id, 49.8%wa, 0.0%hi, 0.0%si, 0.0%st
>> Mem: 2060188k total, 2047488k used, 12700k free, 2988k buffers
>> Swap: 4200988k total, 42776k used, 4158212k free, 1985500k cached
>
> You obviously have a block device on your nodes, so I suspect that
> something is reading/writing to it. Looking at how much memory is used,
> your system must be constantly swapping. This could explain why your
> iowait is so high (if your swap space is a block device or a file on a
> block device. You don't use swap over NFS, do you?)

No swap over NFS and no swapping at all.

A "vmstat 1" output of the above situation looks like:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff   cache   si   so    bi    bo    in    cs us sy id wa
 0  2  42768  11580   1368 1987336    0    0     0     0   638   366  1  0 50 48
 0  2  42768  13088   1368 1985924    0    0     0     0   695   367  2  1 50 47
 0  2  42768  13028   1368 1986112    0    0     0     0   345   129  0  0 50 50
 1  1  42768  12720   1364 1986328    0    0     0     0  1043   710  6  1 50 42
 0  1  42768  12648   1364 1987308    0    0     0     0   636   374  2  4 50 44
 0  2  42768  11608   1364 1988436    0    0     0     0   696   382  1  0 51 49

You can also see in the "top" report that barely any swap is used.

Jesper
--
Jesper