2010-02-24 02:41:05

by Fengguang Wu

Subject: [RFC] nfs: use 2*rsize readahead size

With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
readahead size 512k*15=7680k is larger than necessary for typical
clients.

On an e1000e--e1000e connection, I got the following numbers

readahead size   throughput
           16k    35.5 MB/s
           32k    54.3 MB/s
           64k    64.1 MB/s
          128k    70.5 MB/s
          256k    74.6 MB/s
rsize ==> 512k    77.4 MB/s
         1024k    85.5 MB/s
         2048k    86.8 MB/s
         4096k    87.9 MB/s
         8192k    89.0 MB/s
        16384k    87.7 MB/s

So it seems that readahead_size=2*rsize (i.e. keeping two RPC requests in
flight) is already enough to get near full NFS bandwidth.

The test script is:

#!/bin/sh

file=/mnt/sparse
BDI=0:15

for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
do
        echo 3 > /proc/sys/vm/drop_caches
        echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
        echo readahead_size=${rasize}k
        dd if=$file of=/dev/null bs=4k count=1024000
done
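
For reference, a sketch of how the BDI id used above (0:15 on this box)
can be looked up, assuming the NFS mount point is /mnt:

# grep " /mnt " /proc/self/mountinfo | awk '{print $3}'
0:15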

CC: Trond Myklebust <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
fs/nfs/client.c | 4 +++-
fs/nfs/internal.h | 8 --------
2 files changed, 3 insertions(+), 9 deletions(-)

--- linux.orig/fs/nfs/client.c 2010-02-23 11:15:44.000000000 +0800
+++ linux/fs/nfs/client.c 2010-02-24 10:16:00.000000000 +0800
@@ -889,7 +889,9 @@ static void nfs_server_set_fsinfo(struct
server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;

server->backing_dev_info.name = "nfs";
- server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+ server->backing_dev_info.ra_pages = max_t(unsigned long,
+ default_backing_dev_info.ra_pages,
+ 2 * server->rpages);
server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE;

if (server->wsize > max_rpc_payload)
--- linux.orig/fs/nfs/internal.h 2010-02-23 11:15:44.000000000 +0800
+++ linux/fs/nfs/internal.h 2010-02-23 13:26:00.000000000 +0800
@@ -10,14 +10,6 @@

struct nfs_string;

-/* Maximum number of readahead requests
- * FIXME: this should really be a sysctl so that users may tune it to suit
- * their needs. People that do NFS over a slow network, might for
- * instance want to reduce it to something closer to 1 for improved
- * interactive response.
- */
-#define NFS_MAX_READAHEAD (RPC_DEF_SLOT_TABLE - 1)
-
/*
* Determine if sessions are in use.
*/


2010-02-24 04:24:19

by Dave Chinner

Subject: Re: [RFC] nfs: use 2*rsize readahead size

On Wed, Feb 24, 2010 at 02:29:34PM +1100, Dave Chinner wrote:
> On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote:
> > With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
> > readahead size 512k*15=7680k is too large than necessary for typical
> > clients.
> >
> > On a e1000e--e1000e connection, I got the following numbers
> >
> > readahead size throughput
> > 16k 35.5 MB/s
> > 32k 54.3 MB/s
> > 64k 64.1 MB/s
> > 128k 70.5 MB/s
> > 256k 74.6 MB/s
> > rsize ==> 512k 77.4 MB/s
> > 1024k 85.5 MB/s
> > 2048k 86.8 MB/s
> > 4096k 87.9 MB/s
> > 8192k 89.0 MB/s
> > 16384k 87.7 MB/s
> >
> > So it seems that readahead_size=2*rsize (ie. keep two RPC requests in flight)
> > can already get near full NFS bandwidth.
> >
> > The test script is:
> >
> > #!/bin/sh
> >
> > file=/mnt/sparse
> > BDI=0:15
> >
> > for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
> > do
> > echo 3 > /proc/sys/vm/drop_caches
> > echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
> > echo readahead_size=${rasize}k
> > dd if=$file of=/dev/null bs=4k count=1024000
> > done
>
> That's doing a cached read out of the server cache, right? You
> might find the results are different if the server has to read the
> file from disk. I would expect reads from the server cache not
> to require much readahead as there is no IO latency on the server
> side for the readahead to hide....

FWIW, if you mount the client with "-o rsize=32k" or the server only
supports rsize <= 32k then this will probably hurt throughput a lot
because then readahead will be capped at 64k instead of 480k....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-02-24 07:39:45

by Dave Chinner

Subject: Re: [RFC] nfs: use 2*rsize readahead size

On Wed, Feb 24, 2010 at 02:12:47PM +0800, Wu Fengguang wrote:
> On Wed, Feb 24, 2010 at 01:22:15PM +0800, Dave Chinner wrote:
> > What I'm trying to say is that while I agree with your premise that
> > a 7.8MB readahead window is probably far larger than was ever
> > intended, I disagree with your methodology and environment for
> > selecting a better default value. The default readahead value needs
> > to work well in as many situations as possible, not just in perfect
> > 1:1 client/server environment.
>
> Good points. It's imprudent to change a default value based on one
> single benchmark. Need to collect more data, which may take time..

Agreed - better to spend time now to get it right...

> > > It sounds silly to have
> > >
> > > client_readahead_size > server_readahead_size
> >
> > I don't think it is - the client readahead has to take into account
> > the network latency as well as the server latency. e.g. a network
> > with a high bandwidth but high latency is going to need much more
> > client side readahead than a high bandwidth, low latency network to
> > get the same throughput. Hence it is not uncommon to see larger
> > readahead windows on network clients than for local disk access.
>
> Hmm I wonder if I can simulate a high-bandwidth high-latency network
> with e1000's RxIntDelay/TxIntDelay parameters..

I think netem is the blessed method of emulating different network
behaviours. There's a howto+faq for setting it up here:

http://www.linuxfoundation.org/collaborate/workgroups/networking/netem
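
A minimal sketch of such a setup (eth0 and the delay values are just
placeholders):

# add a fixed 10ms egress delay (run on both client and server)
/sbin/tc qdisc add dev eth0 root netem delay 10ms
# adjust the delay later without tearing the qdisc down
/sbin/tc qdisc change dev eth0 root netem delay 5ms
# remove it when done
/sbin/tc qdisc del dev eth0 root netem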

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-02-24 11:18:28

by Akshat Aranya

Subject: Re: [RFC] nfs: use 2*rsize readahead size

On Wed, Feb 24, 2010 at 12:22 AM, Dave Chinner <[email protected]> wrote:

>
>> It sounds silly to have
>>
>>         client_readahead_size > server_readahead_size
>
> I don't think it is - the client readahead has to take into account
> the network latency as well as the server latency. e.g. a network
> with a high bandwidth but high latency is going to need much more
> client side readahead than a high bandwidth, low latency network to
> get the same throughput. Hence it is not uncommon to see larger
> readahead windows on network clients than for local disk access.
>
> Also, the NFS server may not even be able to detect sequential IO
> patterns because of the combined access patterns from the clients,
> and so the only effective readahead might be what the clients
> issue....
>

In my experiments, I have observed that the server-side readahead
shuts off rather quickly even with a single client because the client
readahead causes multiple pending read RPCs on the server which are
then serviced in random order and the pattern observed by the
underlying file system is non-sequential. In our file system, we had
to override what the VFS thought was a random workload and continue to
do readahead anyway.

Cheers,
Akshat

2010-02-24 06:12:55

by Fengguang Wu

Subject: Re: [RFC] nfs: use 2*rsize readahead size

On Wed, Feb 24, 2010 at 01:22:15PM +0800, Dave Chinner wrote:
> On Wed, Feb 24, 2010 at 12:18:22PM +0800, Wu Fengguang wrote:
> > On Wed, Feb 24, 2010 at 11:29:34AM +0800, Dave Chinner wrote:
> > > On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote:
> > > > With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
> > > > readahead size 512k*15=7680k is too large than necessary for typical
> > > > clients.
> > > >
> > > > On a e1000e--e1000e connection, I got the following numbers
> > > >
> > > > readahead size throughput
> > > > 16k 35.5 MB/s
> > > > 32k 54.3 MB/s
> > > > 64k 64.1 MB/s
> > > > 128k 70.5 MB/s
> > > > 256k 74.6 MB/s
> > > > rsize ==> 512k 77.4 MB/s
> > > > 1024k 85.5 MB/s
> > > > 2048k 86.8 MB/s
> > > > 4096k 87.9 MB/s
> > > > 8192k 89.0 MB/s
> > > > 16384k 87.7 MB/s
> > > >
> > > > So it seems that readahead_size=2*rsize (ie. keep two RPC requests in flight)
> > > > can already get near full NFS bandwidth.
> > > >
> > > > The test script is:
> > > >
> > > > #!/bin/sh
> > > >
> > > > file=/mnt/sparse
> > > > BDI=0:15
> > > >
> > > > for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
> > > > do
> > > > echo 3 > /proc/sys/vm/drop_caches
> > > > echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
> > > > echo readahead_size=${rasize}k
> > > > dd if=$file of=/dev/null bs=4k count=1024000
> > > > done
> > >
> > > That's doing a cached read out of the server cache, right? You
> >
> > It does not involve disk IO at least. (The sparse file dataset is
> > larger than server cache.)
>
> It still results in effectively the same thing: very low, consistent
> IO latency.
>
> Effectively all the test results show is that on a clean, low
> latency, uncongested network an unloaded NFS server that has no IO
> latency, a client only requires one 512k readahead block to hide 90%
> of the server read request latency. I don't think this is a
> particularly good test to base a new default on, though.
>
> e.g. What is the result with a smaller rsize? When the server
> actually has to do disk IO? When multiple clients are reading at
> the same time so the server may not detect accesses as sequential
> and issue readahead? When another client is writing to the server at
> the same time as the read and causing significant read IO latency at
> the server?
>
> What I'm trying to say is that while I agree with your premise that
> a 7.8MB readahead window is probably far larger than was ever
> intended, I disagree with your methodology and environment for
> selecting a better default value. The default readahead value needs
> to work well in as many situations as possible, not just in perfect
> 1:1 client/server environment.

Good points. It's imprudent to change a default value based on one
single benchmark. Need to collect more data, which may take time..

> > > might find the results are different if the server has to read the
> > > file from disk. I would expect reads from the server cache not
> > > to require much readahead as there is no IO latency on the server
> > > side for the readahead to hide....
> >
> > Sure the result will be different when disk IO is involved.
> > In this case I would expect the server admin to setup the optimal
> > readahead size for the disk(s).
>
> The default should do the right thing when disk IO is involved, as

Agreed.

> almost no-one has an NFS server that doesn't do IO.... ;)

Sure.

> > It sounds silly to have
> >
> > client_readahead_size > server_readahead_size
>
> I don't think it is - the client readahead has to take into account
> the network latency as well as the server latency. e.g. a network
> with a high bandwidth but high latency is going to need much more
> client side readahead than a high bandwidth, low latency network to
> get the same throughput. Hence it is not uncommon to see larger
> readahead windows on network clients than for local disk access.

Hmm I wonder if I can simulate a high-bandwidth high-latency network
with e1000's RxIntDelay/TxIntDelay parameters..

> Also, the NFS server may not even be able to detect sequential IO
> patterns because of the combined access patterns from the clients,
> and so the only effective readahead might be what the clients
> issue....

Ah yes. Even though the upstream kernel can handle it well, one may be
running a pretty old kernel or another UNIX system on the server side.
If that happens, the default 512K won't behave too badly, but may well
be sub-optimal.

Thanks,
Fengguang

2010-02-26 07:49:19

by Fengguang Wu

Subject: [RFC] nfs: use 4*rsize readahead size

On Wed, Feb 24, 2010 at 03:39:40PM +0800, Dave Chinner wrote:
> On Wed, Feb 24, 2010 at 02:12:47PM +0800, Wu Fengguang wrote:
> > On Wed, Feb 24, 2010 at 01:22:15PM +0800, Dave Chinner wrote:
> > > What I'm trying to say is that while I agree with your premise that
> > > a 7.8MB readahead window is probably far larger than was ever
> > > intended, I disagree with your methodology and environment for
> > > selecting a better default value. The default readahead value needs
> > > to work well in as many situations as possible, not just in perfect
> > > 1:1 client/server environment.
> >
> > Good points. It's imprudent to change a default value based on one
> > single benchmark. Need to collect more data, which may take time..
>
> Agreed - better to spend time now to get it right...

I collected more data with large network latency as well as rsize=32k,
and updated the readahead size accordingly, to 4*rsize.

===
nfs: use 4*rsize readahead size

With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
readahead size 512k*15=7680k is larger than necessary for typical
clients.

On an e1000e--e1000e connection, I got the following numbers
(this reads sparse file from server and involves no disk IO)

readahead size      normal     1ms+1ms     5ms+5ms   10ms+10ms(*)
           16k   35.5 MB/s    4.8 MB/s    2.1 MB/s    1.2 MB/s
           32k   54.3 MB/s    6.7 MB/s    3.6 MB/s    2.3 MB/s
           64k   64.1 MB/s   12.6 MB/s    6.5 MB/s    4.7 MB/s
          128k   70.5 MB/s   20.1 MB/s   11.9 MB/s    8.7 MB/s
          256k   74.6 MB/s   38.6 MB/s   21.3 MB/s   15.0 MB/s
rsize ==> 512k   77.4 MB/s   59.4 MB/s   39.8 MB/s   25.5 MB/s
         1024k   85.5 MB/s   77.9 MB/s   65.7 MB/s   43.0 MB/s
         2048k   86.8 MB/s   81.5 MB/s   84.1 MB/s   59.7 MB/s
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         4096k   87.9 MB/s   77.4 MB/s   56.2 MB/s   59.2 MB/s
         8192k   89.0 MB/s   81.2 MB/s   78.0 MB/s   41.2 MB/s
        16384k   87.7 MB/s   85.8 MB/s   62.0 MB/s   56.5 MB/s

readahead size      normal     1ms+1ms     5ms+5ms   10ms+10ms(*)
           16k   37.2 MB/s    6.4 MB/s    2.1 MB/s    1.2 MB/s
rsize ==>  32k   56.6 MB/s    6.8 MB/s    3.6 MB/s    2.3 MB/s
           64k   66.1 MB/s   12.7 MB/s    6.6 MB/s    4.7 MB/s
          128k   69.3 MB/s   22.0 MB/s   12.2 MB/s    8.9 MB/s
          256k   69.6 MB/s   41.8 MB/s   20.7 MB/s   14.7 MB/s
          512k   71.3 MB/s   54.1 MB/s   25.0 MB/s   16.9 MB/s
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         1024k   71.5 MB/s   48.4 MB/s   26.0 MB/s   16.7 MB/s
         2048k   71.7 MB/s   53.2 MB/s   25.3 MB/s   17.6 MB/s
         4096k   71.5 MB/s   50.4 MB/s   25.7 MB/s   17.1 MB/s
         8192k   71.1 MB/s   52.3 MB/s   26.3 MB/s   16.9 MB/s
        16384k   70.2 MB/s   56.6 MB/s   27.0 MB/s   16.8 MB/s

(*) 10ms+10ms means to add delay on both client & server sides with
    # /sbin/tc qdisc change dev eth0 root netem delay 10ms
    The total >=20ms delay is so large for NFS that a simple `vi some.sh`
    command takes a dozen seconds. Note that the actual delay reported
    by ping is larger, e.g. for the 1ms+1ms case:
        rtt min/avg/max/mdev = 7.361/8.325/9.710/0.837 ms


So it seems that readahead_size=4*rsize (i.e. keeping 4 RPC requests in
flight) is able to get near full NFS bandwidth. Reducing the multiplier
from 15 to 4 not only makes the client side readahead size more sane
(2MB by default), but also reduces the disorder of the server side
RPC read requests, which yields better server side readahead behavior.

To avoid small readahead when the client mounts with "-o rsize=32k" or
the server only supports rsize <= 32k, we take the max of 4*rsize and
default_backing_dev_info.ra_pages. The latter defaults to 512K, and can
be explicitly changed by the user with the kernel parameter "readahead=" or
the runtime tunable "/sys/devices/virtual/bdi/default/read_ahead_kb" (which
takes effect for future NFS mounts).
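
For example, a possible way to raise that floor for subsequent mounts
(the bdi id 0:15 and server:/export are just placeholders):

# raise the default readahead to 1MB; takes effect for NFS mounts done later
echo 1024 > /sys/devices/virtual/bdi/default/read_ahead_kb
mount -t nfs server:/export /mnt
cat /sys/devices/virtual/bdi/0:15/read_ahead_kb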

The test script is:

#!/bin/sh

file=/mnt/sparse
BDI=0:15

for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
do
        echo 3 > /proc/sys/vm/drop_caches
        echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
        echo readahead_size=${rasize}k
        dd if=$file of=/dev/null bs=4k count=1024000
done

CC: Dave Chinner <[email protected]>
CC: Trond Myklebust <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
fs/nfs/client.c | 4 +++-
fs/nfs/internal.h | 8 --------
2 files changed, 3 insertions(+), 9 deletions(-)

--- linux.orig/fs/nfs/client.c 2010-02-26 10:10:46.000000000 +0800
+++ linux/fs/nfs/client.c 2010-02-26 11:07:22.000000000 +0800
@@ -889,7 +889,9 @@ static void nfs_server_set_fsinfo(struct
server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;

server->backing_dev_info.name = "nfs";
- server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+ server->backing_dev_info.ra_pages = max_t(unsigned long,
+ default_backing_dev_info.ra_pages,
+ 4 * server->rpages);
server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE;

if (server->wsize > max_rpc_payload)
--- linux.orig/fs/nfs/internal.h 2010-02-26 10:10:46.000000000 +0800
+++ linux/fs/nfs/internal.h 2010-02-26 11:07:07.000000000 +0800
@@ -10,14 +10,6 @@

struct nfs_string;

-/* Maximum number of readahead requests
- * FIXME: this should really be a sysctl so that users may tune it to suit
- * their needs. People that do NFS over a slow network, might for
- * instance want to reduce it to something closer to 1 for improved
- * interactive response.
- */
-#define NFS_MAX_READAHEAD (RPC_DEF_SLOT_TABLE - 1)
-
/*
* Determine if sessions are in use.
*/

2010-02-24 05:22:20

by Dave Chinner

Subject: Re: [RFC] nfs: use 2*rsize readahead size

On Wed, Feb 24, 2010 at 12:18:22PM +0800, Wu Fengguang wrote:
> On Wed, Feb 24, 2010 at 11:29:34AM +0800, Dave Chinner wrote:
> > On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote:
> > > With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
> > > readahead size 512k*15=7680k is too large than necessary for typical
> > > clients.
> > >
> > > On a e1000e--e1000e connection, I got the following numbers
> > >
> > > readahead size throughput
> > > 16k 35.5 MB/s
> > > 32k 54.3 MB/s
> > > 64k 64.1 MB/s
> > > 128k 70.5 MB/s
> > > 256k 74.6 MB/s
> > > rsize ==> 512k 77.4 MB/s
> > > 1024k 85.5 MB/s
> > > 2048k 86.8 MB/s
> > > 4096k 87.9 MB/s
> > > 8192k 89.0 MB/s
> > > 16384k 87.7 MB/s
> > >
> > > So it seems that readahead_size=2*rsize (ie. keep two RPC requests in flight)
> > > can already get near full NFS bandwidth.
> > >
> > > The test script is:
> > >
> > > #!/bin/sh
> > >
> > > file=/mnt/sparse
> > > BDI=0:15
> > >
> > > for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
> > > do
> > > echo 3 > /proc/sys/vm/drop_caches
> > > echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
> > > echo readahead_size=${rasize}k
> > > dd if=$file of=/dev/null bs=4k count=1024000
> > > done
> >
> > That's doing a cached read out of the server cache, right? You
>
> It does not involve disk IO at least. (The sparse file dataset is
> larger than server cache.)

It still results in effectively the same thing: very low, consistent
IO latency.

Effectively all the test results show is that on a clean, low
latency, uncongested network an unloaded NFS server that has no IO
latency, a client only requires one 512k readahead block to hide 90%
of the server read request latency. I don't think this is a
particularly good test to base a new default on, though.

e.g. What is the result with a smaller rsize? When the server
actually has to do disk IO? When multiple clients are reading at
the same time so the server may not detect accesses as sequential
and issue readahead? When another client is writing to the server at
the same time as the read and causing significant read IO latency at
the server?

What I'm trying to say is that while I agree with your premise that
a 7.8MB readahead window is probably far larger than was ever
intended, I disagree with your methodology and environment for
selecting a better default value. The default readahead value needs
to work well in as many situations as possible, not just in perfect
1:1 client/server environment.

> > might find the results are different if the server has to read the
> > file from disk. I would expect reads from the server cache not
> > to require much readahead as there is no IO latency on the server
> > side for the readahead to hide....
>
> Sure the result will be different when disk IO is involved.
> In this case I would expect the server admin to setup the optimal
> readahead size for the disk(s).

The default should do the right thing when disk IO is involved, as
almost no-one has an NFS server that doesn't do IO.... ;)

> It sounds silly to have
>
> client_readahead_size > server_readahead_size

I don't think it is - the client readahead has to take into account
the network latency as well as the server latency. e.g. a network
with a high bandwidth but high latency is going to need much more
client side readahead than a high bandwidth, low latency network to
get the same throughput. Hence it is not uncommon to see larger
readahead windows on network clients than for local disk access.

Also, the NFS server may not even be able to detect sequential IO
patterns because of the combined access patterns from the clients,
and so the only effective readahead might be what the clients
issue....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-02-24 03:29:39

by Dave Chinner

Subject: Re: [RFC] nfs: use 2*rsize readahead size

On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote:
> With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
> readahead size 512k*15=7680k is too large than necessary for typical
> clients.
>
> On a e1000e--e1000e connection, I got the following numbers
>
> readahead size throughput
> 16k 35.5 MB/s
> 32k 54.3 MB/s
> 64k 64.1 MB/s
> 128k 70.5 MB/s
> 256k 74.6 MB/s
> rsize ==> 512k 77.4 MB/s
> 1024k 85.5 MB/s
> 2048k 86.8 MB/s
> 4096k 87.9 MB/s
> 8192k 89.0 MB/s
> 16384k 87.7 MB/s
>
> So it seems that readahead_size=2*rsize (ie. keep two RPC requests in flight)
> can already get near full NFS bandwidth.
>
> The test script is:
>
> #!/bin/sh
>
> file=/mnt/sparse
> BDI=0:15
>
> for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
> do
> echo 3 > /proc/sys/vm/drop_caches
> echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
> echo readahead_size=${rasize}k
> dd if=$file of=/dev/null bs=4k count=1024000
> done

That's doing a cached read out of the server cache, right? You
might find the results are different if the server has to read the
file from disk. I would expect reads from the server cache not
to require much readahead as there is no IO latency on the server
side for the readahead to hide....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-02-24 04:18:26

by Fengguang Wu

Subject: Re: [RFC] nfs: use 2*rsize readahead size

On Wed, Feb 24, 2010 at 11:29:34AM +0800, Dave Chinner wrote:
> On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote:
> > With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
> > readahead size 512k*15=7680k is too large than necessary for typical
> > clients.
> >
> > On a e1000e--e1000e connection, I got the following numbers
> >
> > readahead size throughput
> > 16k 35.5 MB/s
> > 32k 54.3 MB/s
> > 64k 64.1 MB/s
> > 128k 70.5 MB/s
> > 256k 74.6 MB/s
> > rsize ==> 512k 77.4 MB/s
> > 1024k 85.5 MB/s
> > 2048k 86.8 MB/s
> > 4096k 87.9 MB/s
> > 8192k 89.0 MB/s
> > 16384k 87.7 MB/s
> >
> > So it seems that readahead_size=2*rsize (ie. keep two RPC requests in flight)
> > can already get near full NFS bandwidth.
> >
> > The test script is:
> >
> > #!/bin/sh
> >
> > file=/mnt/sparse
> > BDI=0:15
> >
> > for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
> > do
> > echo 3 > /proc/sys/vm/drop_caches
> > echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
> > echo readahead_size=${rasize}k
> > dd if=$file of=/dev/null bs=4k count=1024000
> > done
>
> That's doing a cached read out of the server cache, right? You

It does not involve disk IO at least. (The sparse file dataset is
larger than server cache.)

> might find the results are different if the server has to read the
> file from disk. I would expect reads from the server cache not
> to require much readahead as there is no IO latency on the server
> side for the readahead to hide....

Sure the result will be different when disk IO is involved.
In this case I would expect the server admin to setup the optimal
readahead size for the disk(s).

It sounds silly to have

client_readahead_size > server_readahead_size

Thanks,
Fengguang

2010-02-24 04:33:47

by Fengguang Wu

Subject: Re: [RFC] nfs: use 2*rsize readahead size

On Wed, Feb 24, 2010 at 12:24:14PM +0800, Dave Chinner wrote:
> On Wed, Feb 24, 2010 at 02:29:34PM +1100, Dave Chinner wrote:
> > On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote:
> > > With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
> > > readahead size 512k*15=7680k is too large than necessary for typical
> > > clients.
> > >
> > > On a e1000e--e1000e connection, I got the following numbers
> > >
> > > readahead size throughput
> > > 16k 35.5 MB/s
> > > 32k 54.3 MB/s
> > > 64k 64.1 MB/s
> > > 128k 70.5 MB/s
> > > 256k 74.6 MB/s
> > > rsize ==> 512k 77.4 MB/s
> > > 1024k 85.5 MB/s
> > > 2048k 86.8 MB/s
> > > 4096k 87.9 MB/s
> > > 8192k 89.0 MB/s
> > > 16384k 87.7 MB/s
> > >
> > > So it seems that readahead_size=2*rsize (ie. keep two RPC requests in flight)
> > > can already get near full NFS bandwidth.
> > >
> > > The test script is:
> > >
> > > #!/bin/sh
> > >
> > > file=/mnt/sparse
> > > BDI=0:15
> > >
> > > for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
> > > do
> > > echo 3 > /proc/sys/vm/drop_caches
> > > echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
> > > echo readahead_size=${rasize}k
> > > dd if=$file of=/dev/null bs=4k count=1024000
> > > done
> >
> > That's doing a cached read out of the server cache, right? You
> > might find the results are different if the server has to read the
> > file from disk. I would expect reads from the server cache not
> > to require much readahead as there is no IO latency on the server
> > side for the readahead to hide....
>
> FWIW, if you mount the client with "-o rsize=32k" or the server only
> supports rsize <= 32k then this will probably hurt throughput a lot
> because then readahead will be capped at 64k instead of 480k....

That's why I take the max of 2*rsize and system default readahead size
(which will be enlarged to 512K):

- server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+ server->backing_dev_info.ra_pages = max_t(unsigned long,
+ default_backing_dev_info.ra_pages,
+ 2 * server->rpages);

Thanks,
Fengguang

2010-02-24 04:43:58

by Fengguang Wu

Subject: Re: [RFC] nfs: use 2*rsize readahead size

On Wed, Feb 24, 2010 at 12:24:14PM +0800, Dave Chinner wrote:
> On Wed, Feb 24, 2010 at 02:29:34PM +1100, Dave Chinner wrote:
> > On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote:
> > > With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
> > > readahead size 512k*15=7680k is too large than necessary for typical
> > > clients.
> > >
> > > On a e1000e--e1000e connection, I got the following numbers
> > >
> > > readahead size throughput
> > > 16k 35.5 MB/s
> > > 32k 54.3 MB/s
> > > 64k 64.1 MB/s
> > > 128k 70.5 MB/s
> > > 256k 74.6 MB/s
> > > rsize ==> 512k 77.4 MB/s
> > > 1024k 85.5 MB/s
> > > 2048k 86.8 MB/s
> > > 4096k 87.9 MB/s
> > > 8192k 89.0 MB/s
> > > 16384k 87.7 MB/s
> > >
> > > So it seems that readahead_size=2*rsize (ie. keep two RPC requests in flight)
> > > can already get near full NFS bandwidth.
> > >
> > > The test script is:
> > >
> > > #!/bin/sh
> > >
> > > file=/mnt/sparse
> > > BDI=0:15
> > >
> > > for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
> > > do
> > > echo 3 > /proc/sys/vm/drop_caches
> > > echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
> > > echo readahead_size=${rasize}k
> > > dd if=$file of=/dev/null bs=4k count=1024000
> > > done
> >
> > That's doing a cached read out of the server cache, right? You
> > might find the results are different if the server has to read the
> > file from disk. I would expect reads from the server cache not
> > to require much readahead as there is no IO latency on the server
> > side for the readahead to hide....
>
> FWIW, if you mount the client with "-o rsize=32k" or the server only
> supports rsize <= 32k then this will probably hurt throughput a lot
> because then readahead will be capped at 64k instead of 480k....

I should have mentioned that in the changelog. Hope the updated one
helps.

Thanks,
Fengguang
---
nfs: use 2*rsize readahead size

With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
readahead size 512k*15=7680k is larger than necessary for typical
clients.

On an e1000e--e1000e connection, I got the following numbers
(this reads sparse file from server and involves no disk IO)

readahead size   throughput
           16k    35.5 MB/s
           32k    54.3 MB/s
           64k    64.1 MB/s
          128k    70.5 MB/s
          256k    74.6 MB/s
rsize ==> 512k    77.4 MB/s
         1024k    85.5 MB/s
         2048k    86.8 MB/s
         4096k    87.9 MB/s
         8192k    89.0 MB/s
        16384k    87.7 MB/s

So it seems that readahead_size=2*rsize (i.e. keeping two RPC requests in
flight) is already enough to get near full NFS bandwidth.

To avoid small readahead when the client mounts with "-o rsize=32k" or
the server only supports rsize <= 32k, we take the max of 2*rsize and
default_backing_dev_info.ra_pages. The latter defaults to 512K, and
will be auto-scaled down when system memory is less than 512M, or can
be explicitly changed by the user with the kernel parameter "readahead=".

The test script is:

#!/bin/sh

file=/mnt/sparse
BDI=0:15

for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
do
        echo 3 > /proc/sys/vm/drop_caches
        echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
        echo readahead_size=${rasize}k
        dd if=$file of=/dev/null bs=4k count=1024000
done

CC: Dave Chinner <[email protected]>
CC: Trond Myklebust <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
fs/nfs/client.c | 4 +++-
fs/nfs/internal.h | 8 --------
2 files changed, 3 insertions(+), 9 deletions(-)

--- linux.orig/fs/nfs/client.c 2010-02-23 11:15:44.000000000 +0800
+++ linux/fs/nfs/client.c 2010-02-24 10:16:00.000000000 +0800
@@ -889,7 +889,9 @@ static void nfs_server_set_fsinfo(struct
server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;

server->backing_dev_info.name = "nfs";
- server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+ server->backing_dev_info.ra_pages = max_t(unsigned long,
+ default_backing_dev_info.ra_pages,
+ 2 * server->rpages);
server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE;

if (server->wsize > max_rpc_payload)
--- linux.orig/fs/nfs/internal.h 2010-02-23 11:15:44.000000000 +0800
+++ linux/fs/nfs/internal.h 2010-02-23 13:26:00.000000000 +0800
@@ -10,14 +10,6 @@

struct nfs_string;

-/* Maximum number of readahead requests
- * FIXME: this should really be a sysctl so that users may tune it to suit
- * their needs. People that do NFS over a slow network, might for
- * instance want to reduce it to something closer to 1 for improved
- * interactive response.
- */
-#define NFS_MAX_READAHEAD (RPC_DEF_SLOT_TABLE - 1)
-
/*
* Determine if sessions are in use.
*/

2010-02-25 12:37:59

by Fengguang Wu

Subject: Re: [RFC] nfs: use 2*rsize readahead size

On Wed, Feb 24, 2010 at 07:18:26PM +0800, Akshat Aranya wrote:
> On Wed, Feb 24, 2010 at 12:22 AM, Dave Chinner <[email protected]> wrote:
>
> >
> >> It sounds silly to have
> >>
> >>         client_readahead_size > server_readahead_size
> >
> > I don't think it is  - the client readahead has to take into account
> > the network latency as well as the server latency. e.g. a network
> > with a high bandwidth but high latency is going to need much more
> > client side readahead than a high bandwidth, low latency network to
> > get the same throughput. Hence it is not uncommon to see larger
> > readahead windows on network clients than for local disk access.
> >
> > Also, the NFS server may not even be able to detect sequential IO
> > patterns because of the combined access patterns from the clients,
> > and so the only effective readahead might be what the clients
> > issue....
> >
>
> In my experiments, I have observed that the server-side readahead
> shuts off rather quickly even with a single client because the client
> readahead causes multiple pending read RPCs on the server which are
> then serviced in random order and the pattern observed by the
> underlying file system is non-sequential. In our file system, we had
> to override what the VFS thought was a random workload and continue to
> do readahead anyway.

What's the server side kernel version, plus client/server side
readahead size? I'd expect the context readahead to handle it well.

With the patchset in <http://lkml.org/lkml/2010/2/23/376>, you can
actually see the readahead details:

# echo 1 > /debug/tracing/events/readahead/enable
# cp test-file /dev/null
# cat /debug/tracing/trace # trimmed output
readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4
readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8
readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16
readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32
readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24
readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0

And I've actually verified the NFS case with the help of such traces
long ago. When client_readahead_size <= server_readahead_size, the
readahead requests may look a bit random at first, and then will
quickly turn into a perfect series of sequential context readaheads.

Thanks,
Fengguang

2010-02-24 05:24:13

by Dave Chinner

Subject: Re: [RFC] nfs: use 2*rsize readahead size

On Wed, Feb 24, 2010 at 12:43:56PM +0800, Wu Fengguang wrote:
> On Wed, Feb 24, 2010 at 12:24:14PM +0800, Dave Chinner wrote:
> > On Wed, Feb 24, 2010 at 02:29:34PM +1100, Dave Chinner wrote:
> > > That's doing a cached read out of the server cache, right? You
> > > might find the results are different if the server has to read the
> > > file from disk. I would expect reads from the server cache not
> > > to require much readahead as there is no IO latency on the server
> > > side for the readahead to hide....
> >
> > FWIW, if you mount the client with "-o rsize=32k" or the server only
> > supports rsize <= 32k then this will probably hurt throughput a lot
> > because then readahead will be capped at 64k instead of 480k....
>
> I should have mentioned that in changelog.. Hope the updated one
> helps.

Sorry, my fault for not reading the code correctly.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-03-02 17:34:10

by John Stoffel

Subject: Re: [RFC] nfs: use 4*rsize readahead size

>>>>> "Trond" == Trond Myklebust <[email protected]> writes:

Trond> On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote:
>> Dave,
>>
>> Here is one more test on a big ext4 disk file:
>>
>> 16k 39.7 MB/s
>> 32k 54.3 MB/s
>> 64k 63.6 MB/s
>> 128k 72.6 MB/s
>> 256k 71.7 MB/s
>> rsize ==> 512k 71.7 MB/s
>> 1024k 72.2 MB/s
>> 2048k 71.0 MB/s
>> 4096k 73.0 MB/s
>> 8192k 74.3 MB/s
>> 16384k 74.5 MB/s
>>
>> It shows that >=128k client side readahead is enough for single disk
>> case :) As for RAID configurations, I guess big server side readahead
>> should be enough.

Trond> There are lots of people who would like to use NFS on their
Trond> company WAN, where you typically have high bandwidths (up to
Trond> 10GigE), but often a high latency too (due to geographical
Trond> dispersion). My ping latency from here to a typical server in
Trond> NetApp's Bangalore office is ~ 312ms. I read your test results
Trond> with 10ms delays, but have you tested with higher than that?

If you have that high a latency, the low level TCP protocol is going
to kill your performance before you get to the NFS level. You really
need to open up the TCP window size at that point. And it only gets
worse as the bandwidth goes up too.
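
As a rough back-of-the-envelope illustration (using numbers from this
thread): the TCP window, and likewise the client readahead, has to cover
roughly the bandwidth-delay product to keep the pipe full.

# bandwidth-delay product, in KB, that has to be kept in flight
# 10 Gbit/s (~1250000 KB/s) at ~312 ms RTT:
echo $((1250000 * 312 / 1000))    # ~390000 KB, i.e. ~390 MB
# 1 Gbit/s (~125000 KB/s) at ~20 ms RTT (the netem test case):
echo $((125000 * 20 / 1000))      # ~2500 KB, i.e. ~2.5 MB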

There's no good solution, because while you can get good throughput at
points, latency is going to suffer no matter what.

John

2010-03-02 18:42:24

by Myklebust, Trond

Subject: Re: [RFC] nfs: use 4*rsize readahead size

On Tue, 2010-03-02 at 12:33 -0500, John Stoffel wrote:
> >>>>> "Trond" == Trond Myklebust <[email protected]> writes:
>
> Trond> On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote:
> >> Dave,
> >>
> >> Here is one more test on a big ext4 disk file:
> >>
> >> 16k 39.7 MB/s
> >> 32k 54.3 MB/s
> >> 64k 63.6 MB/s
> >> 128k 72.6 MB/s
> >> 256k 71.7 MB/s
> >> rsize ==> 512k 71.7 MB/s
> >> 1024k 72.2 MB/s
> >> 2048k 71.0 MB/s
> >> 4096k 73.0 MB/s
> >> 8192k 74.3 MB/s
> >> 16384k 74.5 MB/s
> >>
> >> It shows that >=128k client side readahead is enough for single disk
> >> case :) As for RAID configurations, I guess big server side readahead
> >> should be enough.
>
> Trond> There are lots of people who would like to use NFS on their
> Trond> company WAN, where you typically have high bandwidths (up to
> Trond> 10GigE), but often a high latency too (due to geographical
> Trond> dispersion). My ping latency from here to a typical server in
> Trond> NetApp's Bangalore office is ~ 312ms. I read your test results
> Trond> with 10ms delays, but have you tested with higher than that?
>
> If you have that high a latency, the low level TCP protocol is going
> to kill your performance before you get to the NFS level. You really
> need to open up the TCP window size at that point. And it only gets
> worse as the bandwidth goes up too.

Yes. You need to open the TCP window in addition to reading ahead
aggressively.

> There's no good solution, because while you can get good throughput at
> points, latency is going to suffer no matter what.

It depends upon your workload. Sequential read and write should still be
doable if you have aggressive readahead and open up for lots of parallel
write RPCs.

Cheers
Trond

2010-03-02 14:19:40

by Myklebust, Trond

Subject: Re: [RFC] nfs: use 4*rsize readahead size

On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote:
> Dave,
>
> Here is one more test on a big ext4 disk file:
>
> 16k 39.7 MB/s
> 32k 54.3 MB/s
> 64k 63.6 MB/s
> 128k 72.6 MB/s
> 256k 71.7 MB/s
> rsize ==> 512k 71.7 MB/s
> 1024k 72.2 MB/s
> 2048k 71.0 MB/s
> 4096k 73.0 MB/s
> 8192k 74.3 MB/s
> 16384k 74.5 MB/s
>
> It shows that >=128k client side readahead is enough for single disk
> case :) As for RAID configurations, I guess big server side readahead
> should be enough.

There are lots of people who would like to use NFS on their company WAN,
where you typically have high bandwidths (up to 10GigE), but often a
high latency too (due to geographical dispersion).
My ping latency from here to a typical server in NetApp's Bangalore
office is ~ 312ms. I read your test results with 10ms delays, but have
you tested with higher than that?

Cheers
Trond

2010-03-02 03:10:24

by Fengguang Wu

Subject: Re: [RFC] nfs: use 4*rsize readahead size

Dave,

Here is one more test on a big ext4 disk file:

           16k   39.7 MB/s
           32k   54.3 MB/s
           64k   63.6 MB/s
          128k   72.6 MB/s
          256k   71.7 MB/s
rsize ==> 512k   71.7 MB/s
         1024k   72.2 MB/s
         2048k   71.0 MB/s
         4096k   73.0 MB/s
         8192k   74.3 MB/s
        16384k   74.5 MB/s

It shows that >=128k client side readahead is enough for single disk
case :) As for RAID configurations, I guess big server side readahead
should be enough.

#!/bin/sh

file=/mnt/ext4_test/zero
BDI=0:24

for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
do
        echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
        echo readahead_size=${rasize}k
        fadvise $file 0 0 dontneed
        ssh p9 "fadvise $file 0 0 dontneed"
        dd if=$file of=/dev/null bs=4k count=402400
done
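
For the server side readahead mentioned above, a sketch of how it could
be checked and raised per exported disk (sdb is a placeholder for
whatever device backs the export):

# on the server: current readahead of the exported disk, in KB
cat /sys/block/sdb/queue/read_ahead_kb
# raise it, e.g. to 4MB
echo 4096 > /sys/block/sdb/queue/read_ahead_kb
# the same thing via blockdev, which takes 512-byte sectors
blockdev --setra 8192 /dev/sdb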

Thanks,
Fengguang

On Fri, Feb 26, 2010 at 03:49:16PM +0800, Wu Fengguang wrote:
> On Wed, Feb 24, 2010 at 03:39:40PM +0800, Dave Chinner wrote:
> > On Wed, Feb 24, 2010 at 02:12:47PM +0800, Wu Fengguang wrote:
> > > On Wed, Feb 24, 2010 at 01:22:15PM +0800, Dave Chinner wrote:
> > > > What I'm trying to say is that while I agree with your premise that
> > > > a 7.8MB readahead window is probably far larger than was ever
> > > > intended, I disagree with your methodology and environment for
> > > > selecting a better default value. The default readahead value needs
> > > > to work well in as many situations as possible, not just in perfect
> > > > 1:1 client/server environment.
> > >
> > > Good points. It's imprudent to change a default value based on one
> > > single benchmark. Need to collect more data, which may take time..
> >
> > Agreed - better to spend time now to get it right...
>
> I collected more data with large network latency as well as rsize=32k,
> and updates the readahead size accordingly to 4*rsize.
>
> ===
> nfs: use 2*rsize readahead size
>
> With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
> readahead size 512k*15=7680k is too large than necessary for typical
> clients.
>
> On a e1000e--e1000e connection, I got the following numbers
> (this reads sparse file from server and involves no disk IO)
>
> readahead size normal 1ms+1ms 5ms+5ms 10ms+10ms(*)
> 16k 35.5 MB/s 4.8 MB/s 2.1 MB/s 1.2 MB/s
> 32k 54.3 MB/s 6.7 MB/s 3.6 MB/s 2.3 MB/s
> 64k 64.1 MB/s 12.6 MB/s 6.5 MB/s 4.7 MB/s
> 128k 70.5 MB/s 20.1 MB/s 11.9 MB/s 8.7 MB/s
> 256k 74.6 MB/s 38.6 MB/s 21.3 MB/s 15.0 MB/s
> rsize ==> 512k 77.4 MB/s 59.4 MB/s 39.8 MB/s 25.5 MB/s
> 1024k 85.5 MB/s 77.9 MB/s 65.7 MB/s 43.0 MB/s
> 2048k 86.8 MB/s 81.5 MB/s 84.1 MB/s 59.7 MB/s
> 4096k 87.9 MB/s 77.4 MB/s 56.2 MB/s 59.2 MB/s
> 8192k 89.0 MB/s 81.2 MB/s 78.0 MB/s 41.2 MB/s
> 16384k 87.7 MB/s 85.8 MB/s 62.0 MB/s 56.5 MB/s
>
> readahead size normal 1ms+1ms 5ms+5ms 10ms+10ms(*)
> 16k 37.2 MB/s 6.4 MB/s 2.1 MB/s 1.2 MB/s
> rsize ==> 32k 56.6 MB/s 6.8 MB/s 3.6 MB/s 2.3 MB/s
> 64k 66.1 MB/s 12.7 MB/s 6.6 MB/s 4.7 MB/s
> 128k 69.3 MB/s 22.0 MB/s 12.2 MB/s 8.9 MB/s
> 256k 69.6 MB/s 41.8 MB/s 20.7 MB/s 14.7 MB/s
> 512k 71.3 MB/s 54.1 MB/s 25.0 MB/s 16.9 MB/s
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 1024k 71.5 MB/s 48.4 MB/s 26.0 MB/s 16.7 MB/s
> 2048k 71.7 MB/s 53.2 MB/s 25.3 MB/s 17.6 MB/s
> 4096k 71.5 MB/s 50.4 MB/s 25.7 MB/s 17.1 MB/s
> 8192k 71.1 MB/s 52.3 MB/s 26.3 MB/s 16.9 MB/s
> 16384k 70.2 MB/s 56.6 MB/s 27.0 MB/s 16.8 MB/s
>
> (*) 10ms+10ms means to add delay on both client & server sides with
> # /sbin/tc qdisc change dev eth0 root netem delay 10ms
> The total >=20ms delay is so large for NFS, that a simple `vi some.sh`
> command takes a dozen seconds. Note that the actual delay reported
> by ping is larger, eg. for the 1ms+1ms case:
> rtt min/avg/max/mdev = 7.361/8.325/9.710/0.837 ms
>
>
> So it seems that readahead_size=4*rsize (ie. keep 4 RPC requests in
> flight) is able to get near full NFS bandwidth. Reducing the mulriple
> from 15 to 4 not only makes the client side readahead size more sane
> (2MB by default), but also reduces the disorderness of the server side
> RPC read requests, which yeilds better server side readahead behavior.
>
> To avoid small readahead when the client mount with "-o rsize=32k" or
> the server only supports rsize <= 32k, we take the max of 2*rsize and
> default_backing_dev_info.ra_pages. The latter defaults to 512K, and can
> be explicitly changed by user with kernel parameter "readahead=" and
> runtime tunable "/sys/devices/virtual/bdi/default/read_ahead_kb" (which
> takes effective for future NFS mounts).
>
> The test script is:
>
> #!/bin/sh
>
> file=/mnt/sparse
> BDI=0:15
>
> for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
> do
> echo 3 > /proc/sys/vm/drop_caches
> echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
> echo readahead_size=${rasize}k
> dd if=$file of=/dev/null bs=4k count=1024000
> done
>
> CC: Dave Chinner <[email protected]>
> CC: Trond Myklebust <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
> ---
> fs/nfs/client.c | 4 +++-
> fs/nfs/internal.h | 8 --------
> 2 files changed, 3 insertions(+), 9 deletions(-)
>
> --- linux.orig/fs/nfs/client.c 2010-02-26 10:10:46.000000000 +0800
> +++ linux/fs/nfs/client.c 2010-02-26 11:07:22.000000000 +0800
> @@ -889,7 +889,9 @@ static void nfs_server_set_fsinfo(struct
> server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>
> server->backing_dev_info.name = "nfs";
> - server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
> + server->backing_dev_info.ra_pages = max_t(unsigned long,
> + default_backing_dev_info.ra_pages,
> + 4 * server->rpages);
> server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE;
>
> if (server->wsize > max_rpc_payload)
> --- linux.orig/fs/nfs/internal.h 2010-02-26 10:10:46.000000000 +0800
> +++ linux/fs/nfs/internal.h 2010-02-26 11:07:07.000000000 +0800
> @@ -10,14 +10,6 @@
>
> struct nfs_string;
>
> -/* Maximum number of readahead requests
> - * FIXME: this should really be a sysctl so that users may tune it to suit
> - * their needs. People that do NFS over a slow network, might for
> - * instance want to reduce it to something closer to 1 for improved
> - * interactive response.
> - */
> -#define NFS_MAX_READAHEAD (RPC_DEF_SLOT_TABLE - 1)
> -
> /*
> * Determine if sessions are in use.
> */

2010-03-03 03:27:28

by Fengguang Wu

Subject: Re: [RFC] nfs: use 4*rsize readahead size

On Wed, Mar 03, 2010 at 02:42:19AM +0800, Trond Myklebust wrote:
> On Tue, 2010-03-02 at 12:33 -0500, John Stoffel wrote:
> > >>>>> "Trond" == Trond Myklebust <[email protected]> writes:
> >
> > Trond> On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote:
> > >> Dave,
> > >>
> > >> Here is one more test on a big ext4 disk file:
> > >>
> > >> 16k 39.7 MB/s
> > >> 32k 54.3 MB/s
> > >> 64k 63.6 MB/s
> > >> 128k 72.6 MB/s
> > >> 256k 71.7 MB/s
> > >> rsize ==> 512k 71.7 MB/s
> > >> 1024k 72.2 MB/s
> > >> 2048k 71.0 MB/s
> > >> 4096k 73.0 MB/s
> > >> 8192k 74.3 MB/s
> > >> 16384k 74.5 MB/s
> > >>
> > >> It shows that >=128k client side readahead is enough for single disk
> > >> case :) As for RAID configurations, I guess big server side readahead
> > >> should be enough.
> >
> > Trond> There are lots of people who would like to use NFS on their
> > Trond> company WAN, where you typically have high bandwidths (up to
> > Trond> 10GigE), but often a high latency too (due to geographical
> > Trond> dispersion). My ping latency from here to a typical server in
> > Trond> NetApp's Bangalore office is ~ 312ms. I read your test results
> > Trond> with 10ms delays, but have you tested with higher than that?
> >
> > If you have that high a latency, the low level TCP protocol is going
> > to kill your performance before you get to the NFS level. You really
> > need to open up the TCP window size at that point. And it only gets
> > worse as the bandwidth goes up too.
>
> Yes. You need to open the TCP window in addition to reading ahead
> aggressively.

I only get ~10MB/s throughput with the following settings.

# huge NFS ra size
echo 89512 > /sys/devices/virtual/bdi/0:15/read_ahead_kb

# on both sides
/sbin/tc qdisc add dev eth0 root netem delay 200ms

net.core.rmem_max = 873800000
net.core.wmem_max = 655360000
net.ipv4.tcp_rmem = 8192 87380000 873800000
net.ipv4.tcp_wmem = 4096 65536000 655360000

Did I miss something?

Thanks,
Fengguang

2010-03-03 01:43:28

by Fengguang Wu

Subject: Re: [RFC] nfs: use 4*rsize readahead size

On Wed, Mar 03, 2010 at 04:14:33AM +0800, Bret Towe wrote:

> how do you determine which bdi to use? I skimmed thru
> the filesystem in /sys and didn't see anything that says which is what

MOUNTPOINT=" /mnt/ext4_test "
# grep "$MOUNTPOINT" /proc/$$/mountinfo|awk '{print $3}'
0:24

Thanks,
Fengguang

2010-03-02 20:14:33

by Bret Towe

Subject: Re: [RFC] nfs: use 4*rsize readahead size

On Mon, Mar 1, 2010 at 7:10 PM, Wu Fengguang <[email protected]> wrote:
> Dave,
>
> Here is one more test on a big ext4 disk file:
>
>            16k   39.7 MB/s
>            32k   54.3 MB/s
>            64k   63.6 MB/s
>           128k   72.6 MB/s
>           256k   71.7 MB/s
> rsize ==> 512k   71.7 MB/s
>          1024k   72.2 MB/s
>          2048k   71.0 MB/s
>          4096k   73.0 MB/s
>          8192k   74.3 MB/s
>         16384k   74.5 MB/s
>
> It shows that >=128k client side readahead is enough for single disk
> case :) As for RAID configurations, I guess big server side readahead
> should be enough.
>
> #!/bin/sh
>
> file=/mnt/ext4_test/zero
> BDI=0:24
>
> for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
> do
>         echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
>         echo readahead_size=${rasize}k
>         fadvise $file 0 0 dontneed
>         ssh p9 "fadvise $file 0 0 dontneed"
>         dd if=$file of=/dev/null bs=4k count=402400
> done

how do you determine which bdi to use? I skimmed thru
the filesystem in /sys and didn't see anything that says which is what

> Thanks,
> Fengguang
>

2010-04-14 21:22:28

by Dean

Subject: Re: [RFC] nfs: use 4*rsize readahead size

You cannot simply update Linux system TCP parameters and expect NFS to
work well performance-wise over the WAN. The NFS server does not use
the system TCP parameters. This is a long-standing issue. A patch was
originally added in 2.6.30 that enabled NFS to use Linux TCP buffer
autotuning, which would resolve the issue, but a regression was reported
(http://thread.gmane.org/gmane.linux.kernel/826598) and so they removed
the patch.

Maybe it's time to rethink allowing users to manually set Linux NFS
server TCP buffer sizes? Years have passed on this subject and people
are still waiting. Good performance over the WAN will require manually
setting TCP buffer sizes. As mentioned in the regression thread,
autotuning can reduce performance by up to 10%. Here is a patch
(slightly outdated) that creates 2 sysctls that allow users to manually
set NFS TCP buffer sizes. The first link also has a fair amount of
background information on the subject.
http://www.spinics.net/lists/linux-nfs/msg01338.html
http://www.spinics.net/lists/linux-nfs/msg01339.html

Dean


Wu Fengguang wrote:
> On Wed, Mar 03, 2010 at 02:42:19AM +0800, Trond Myklebust wrote:
>
>> On Tue, 2010-03-02 at 12:33 -0500, John Stoffel wrote:
>>
>>>>>>>> "Trond" == Trond Myklebust <[email protected]> writes:
>>>>>>>>
>>> Trond> On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote:
>>>
>>>>> Dave,
>>>>>
>>>>> Here is one more test on a big ext4 disk file:
>>>>>
>>>>> 16k 39.7 MB/s
>>>>> 32k 54.3 MB/s
>>>>> 64k 63.6 MB/s
>>>>> 128k 72.6 MB/s
>>>>> 256k 71.7 MB/s
>>>>> rsize ==> 512k 71.7 MB/s
>>>>> 1024k 72.2 MB/s
>>>>> 2048k 71.0 MB/s
>>>>> 4096k 73.0 MB/s
>>>>> 8192k 74.3 MB/s
>>>>> 16384k 74.5 MB/s
>>>>>
>>>>> It shows that >=128k client side readahead is enough for single disk
>>>>> case :) As for RAID configurations, I guess big server side readahead
>>>>> should be enough.
>>>>>
>>> Trond> There are lots of people who would like to use NFS on their
>>> Trond> company WAN, where you typically have high bandwidths (up to
>>> Trond> 10GigE), but often a high latency too (due to geographical
>>> Trond> dispersion). My ping latency from here to a typical server in
>>> Trond> NetApp's Bangalore office is ~ 312ms. I read your test results
>>> Trond> with 10ms delays, but have you tested with higher than that?
>>>
>>> If you have that high a latency, the low level TCP protocol is going
>>> to kill your performance before you get to the NFS level. You really
>>> need to open up the TCP window size at that point. And it only gets
>>> worse as the bandwidth goes up too.
>>>
>> Yes. You need to open the TCP window in addition to reading ahead
>> aggressively.
>>
>
> I only get ~10MB/s throughput with following settings.
>
> # huge NFS ra size
> echo 89512 > /sys/devices/virtual/bdi/0:15/read_ahead_kb
>
> # on both sides
> /sbin/tc qdisc add dev eth0 root netem delay 200ms
>
> net.core.rmem_max = 873800000
> net.core.wmem_max = 655360000
> net.ipv4.tcp_rmem = 8192 87380000 873800000
> net.ipv4.tcp_wmem = 4096 65536000 655360000
>
> Did I miss something?
>
> Thanks,
> Fengguang