Subject: Re: NFS read performance degradation after upgrading to kernel 5.4.*


Hello Team,
 
-We have had multiple customers complaining about NFS read performance degradation after they upgraded to kernel 5.4.*.
 
-After some deep-dive testing we found that the reason behind the regression is the patch "NFS: Optimise the default readahead size" [1], which has been merged into Linux kernels 5.4.* and above.
-Our customers are using AWS EC2 instances as clients mounting an EFS export (EFS is the AWS-managed NFSv4 service). I am sharing the results we got before and after the upgrade: the NFS server (EFS) should be able to sustain 250-300 MB/sec, which the clients do achieve without patch [1], while with the mentioned patch merged they get roughly a quarter of that speed, around 70 MB/sec, as seen below.
 
 ##########################################################################################


Before the upgrade:
# uname -r
4.14.225-168.357.amzn2.x86_64
[root@ip-172-31-28-135 ec2-user]# sync; echo 3 > /proc/sys/vm/drop_caches
[root@ip-172-31-28-135 ec2-user]# mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-6700f553.efs.eu-west-1.amazonaws.com:/ efs
[root@ip-172-31-28-135 ec2-user]# rsync --progress efs/test .
test
  8,589,934,592 100%  313.20MB/s    0:00:26 (xfr#1, to-chk=0/1)

##########################################################################################
 
After the upgrade using the same client & server:
#uname -r; sync; echo 3 > /proc/sys/vm/drop_caches; ./nfs-readahead show /home/ec2-user/efs/;rsync --progress efs/test .
5.4.0-1.132.64.amzn2.x86_64
/home/ec2-user/efs 0:40 /sys/class/bdi/0:40/read_ahead_kb = 128
test
  1,073,741,824 100%   68.61MB/s    0:00:14 (xfr#1, to-chk=0/1)
 
 
-We recommend [2] that EFS users mount with rsize=1048576 to get the best read performance from their EFS exports, since EC2-to-EFS traffic stays within the same AWS availability zone and therefore has low latency and up to 250-300 MB/sec of throughput. With the mentioned patch merged, however, customers can't achieve this throughput after the kernel upgrade, because the default NFS readahead has been decreased from 15*rsize = 15 MB to 128 KB, so clients have to manually raise the read_ahead_kb parameter from 128 to 15360 to get the same experience they had before the upgrade (see the sketch after this list).
-We know that the purpose of the mentioned patch was to decrease OS boot time (for netboot users) and application start-up times on congested and low-throughput networks, as mentioned in [3]; however, it also causes a regression for high-throughput, low-latency workloads, especially sequential read workflows.
-After further debugging we also found that the maximum readahead size is constant, i.e. there is no autotuning of this setting even when the client keeps filling the readahead window. This means any NFS client, especially one using the maximum rsize mount option, will have to manually tune its maximum NFS readahead size after the upgrade, which in my opinion is a regression from the older kernels' behaviour.
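
For reference, the manual workaround we share with customers looks roughly like the following (a sketch only; the mount point is from my test setup, the BDI id will differ per system, and 15360 KB corresponds to the old 15 * rsize default for rsize=1048576):

MNT=/home/ec2-user/efs
# field 3 of mountinfo is the mount's major:minor device id, e.g. 0:40 for this NFS mount
BDI=$(awk -v m="$MNT" '$5 == m {print $3}' /proc/self/mountinfo)
cat /sys/class/bdi/$BDI/read_ahead_kb              # 128 by default on 5.4+
echo 15360 > /sys/class/bdi/$BDI/read_ahead_kb     # restore the pre-5.4 15 MB window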
 
 #########################################################################################
 
After increasing the maximum NFS readahead size to 15 MB, it's clear that the readahead window expands as expected: it keeps doubling until it reaches 15 MB.
 
Mar 29 11:25:18 ip-172-31-17-191 kernel: init_ra_size 256
Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 256
Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 59
Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 512
Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 1024
Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 2048
Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840

#########################################################################################
 
With 128 KB as the maximum NFS readahead size, the readahead window only grows until it reaches the configured maximum readahead (128 KB).
 
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: init_ra_size 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 40
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 40
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: init_ra_size 4
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 4
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 64
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 59
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
 
-In my reproduction I used rsync, as shown above, and it always issues read syscalls requesting 256 KB per call:
15:47:10.780658 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023749>
15:47:10.805467 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023739>
15:47:10.830272 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023664>
15:47:10.854972 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023814>
15:47:10.879837 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023625>
15:47:10.904496 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023645>
15:47:10.929180 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.024072>
15:47:10.954308 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023669>
 
 
-Looking into the readahead source code, I can see that readahead uses heuristics to determine whether the access pattern is sequential or random and then adjusts the readahead window (the amount of data it will prefetch) accordingly; readahead also can't read beyond the end of the file. This theoretically means that a large maximum NFS readahead size (15 MB) shouldn't have much impact on performance even with a random I/O pattern or with a data set consisting of small files. The only major impact of a large NFS readahead size would be some network congestion or boot-up delay on hosts using congested or low-throughput networks, as illustrated in https://bugzilla.kernel.org/show_bug.cgi?id=204939 & https://lore.kernel.org/linux-nfs/[email protected]/T/.
-With patch https://www.spinics.net/lists/linux-nfs/msg75018.html applied, the packet captures show the client asking for either 128 KB or 256 KB in its NFS READ calls; it can't even reach the configured 1 MB rsize mount option. This is because ondemand_readahead(), which is responsible for moving and scaling the readahead window, has an if condition that was part of https://www.mail-archive.com/[email protected]/msg1274743.html. That patch modified readahead to issue the maximum of the user request size (rsync is doing 256 KB read requests) and the readahead max size (128 KB by default), capped to the max request size on the device side (1 MB in our case); the cap is there to avoid reading ahead too much if the application asks for a huge read. This is why, with a 128 KB readahead size and the application asking for 256 KB, we never exceed 256 KB: the patch intentionally avoids limiting the request to the maximum readahead size, but we are still limited by the minimum of the amount of data the application is reading (256 KB, as seen in the rsync strace output) and bdi->io_pages (256 pages = 1 MB, as configured via the rsize mount option). See the kernel snippet and the toy recomputation below.
 
-Output after adding some debugging to the kernel, showing the value of each variable in the mentioned "if" condition:
 
[  238.387788] req_size= 64 ------> 256 KB rsync read requests
[  238.387790] io pages= 256 -----> 1 MB, as supported by EFS and as configured via the rsize mount option
[  238.390487] max_pages before= 32 -----> 128 KB readahead size, which is the default
[  238.393177] max_pages after= 64 ----> raised to 256 KB because of the change mentioned in [4]: "max_pages = min(req_size, bdi->io_pages);"
 
https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L435
 
/*
* A minimal readahead algorithm for trivial sequential/random reads.
*/
static void ondemand_readahead(struct readahead_control *ractl,
        struct file_ra_state *ra, bool hit_readahead_marker,
        unsigned long req_size)
{
    struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host);
    unsigned long max_pages = ra->ra_pages;
    unsigned long add_pages;
    unsigned long index = readahead_index(ractl);
    pgoff_t prev_index;
 
    /*
     * If the request exceeds the readahead window, allow the read to
     * be up to the optimal hardware IO size
     */
    if (req_size > max_pages && bdi->io_pages > max_pages)
        max_pages = min(req_size, bdi->io_pages);
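
To tie the debug output above to this condition, here is a toy recomputation (my own illustration in shell, not kernel code) using those page counts (1 page = 4 KB):

ra_pages=32     # 128 KB default maximum readahead
req_size=64     # rsync's 256 KB read() calls
io_pages=256    # bdi->io_pages for a 1 MB rsize
max_pages=$ra_pages
if [ "$req_size" -gt "$max_pages" ] && [ "$io_pages" -gt "$max_pages" ]; then
    max_pages=$(( req_size < io_pages ? req_size : io_pages ))    # min(req_size, io_pages)
fi
echo "$(( max_pages * 4 )) KB"    # prints 256 KB, matching the READ sizes seen on the wire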
 
 
 ##################################################################################################
 
-With 128 KB as the default maximum readahead size, the packet capture from the client side shows the NFSv4 READ call counts (in bytes) moving between 128 KB and 256 KB.
 
73403         29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1072955392 Len: 131072
73404         29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1073086464 Len: 262144
73406         29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply (Call In 73403)[Unreassembled Packet]
73415         29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1073348608 Len: 131072
73416         29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply (Call In 73404)[Unreassembled Packet]
73428         29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1073479680 Len: 131072
73429         29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply (Call In 73415)[Unreassembled Packet]
73438         29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1073610752 Len: 131072
73439         29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply (Call In 73428)[Unreassembled Packet]
 
-nfsstat shows that 8183 NFSv4 READ calls were required to read a 1 GB file.
 
# nfsstat
Client rpc stats:
calls      retrans    authrefrsh
8204       0          8204   
 
Client nfs v4:
null         read         write        commit       open         open_conf   
1         0% 8183     99% 0         0% 0         0% 0         0% 0         0%
open_noat    open_dgrd    close        setattr      fsinfo       renew       
1         0% 0         0% 1         0% 0         0% 2         0% 0         0%
setclntid    confirm      lock         lockt        locku        access      
0         0% 0         0% 0         0% 0         0% 0         0% 1         0%
getattr      lookup       lookup_root  remove       rename       link        
4         0% 1         0% 1         0% 0         0% 0         0% 0         0%
symlink      create       pathconf     statfs       readlink     readdir     
0         0% 0         0% 1         0% 0         0% 0         0% 0         0%
server_caps  delegreturn  getacl       setacl       fs_locations rel_lkowner 
3         0% 0         0% 0         0% 0         0% 0         0% 0         0%
secinfo      exchange_id  create_ses   destroy_ses  sequence     get_lease_t 
0         0% 0         0% 2         0% 1         0% 0         0% 0         0%
reclaim_comp layoutget    getdevinfo   layoutcommit layoutreturn getdevlist  
0         0% 1         0% 0         0% 0         0% 0         0% 0         0%
(null)      
1 0%
 
 ###############################################################################################
 
-When using 15 MB as the maximum readahead size, the client sends 1 MB NFSv4 READ requests and is therefore able to read the same 1 GB file in 1024 NFS READ calls:
 
#uname -r; mount -t nfs4 -o nfsvers=4.1,rsize=1052672,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-6700f553.efs.eu-west-1.amazonaws.com:/ efs; ./nfs-readahead show /home/ec2-user/efs/
5.3.9
/home/ec2-user/efs 0:40 /sys/class/bdi/0:40/read_ahead_kb = 15360
#sync; echo 3 > /proc/sys/vm/drop_caches
#rsync --progress efs/test .
test
  1,073,741,824 100%  260.15MB/s    0:00:03 (xfr#1, to-chk=0/1)
[root@ip-172-31-17-42 ec2-user]# nfsstat
Client rpc stats:
calls      retrans    authrefrsh
1043       0          1043   
 
Client nfs v4:
null         read         write        commit       open         open_conf   
1         0% 1024     98% 0         0% 0         0% 0         0% 0         0%
open_noat    open_dgrd    close        setattr      fsinfo       renew       
1         0% 0         0% 1         0% 0         0% 2         0% 0         0%
setclntid    confirm      lock         lockt        locku        access      
0         0% 0         0% 0         0% 0         0% 0         0% 1         0%
getattr      lookup       lookup_root  remove       rename       link        
2         0% 1         0% 1         0% 0         0% 0         0% 0         0%
symlink      create       pathconf     statfs       readlink     readdir     
0         0% 0         0% 1         0% 0         0% 0         0% 0         0%
server_caps  delegreturn  getacl       setacl       fs_locations rel_lkowner 
3         0% 0         0% 0         0% 0         0% 0         0% 0         0%
secinfo      exchange_id  create_ses   destroy_ses  sequence     get_lease_t 
0         0% 0         0% 2         0% 1         0% 0         0% 0         0%
reclaim_comp layoutget    getdevinfo   layoutcommit layoutreturn getdevlist  
0         0% 1         0% 0         0% 0         0% 0         0% 0         0%
(null)      
1 0%
 
-The packet capture from the client side shows NFSv4 READ calls with a 1 MB read count (in bytes) when the maximum NFS readahead size is 15 MB.
 
2021-03-22 14:25:34.984731  9398 172.31.17.42 → 172.31.28.161 NFS 0.000375 V4 Call READ StateID: 0x3640 Offset: 94371840 Len: 1048576
2021-03-22 14:25:34.984805  9405 172.31.17.42 → 172.31.28.161 NFS 0.000074 V4 Call READ StateID: 0x3640 Offset: 95420416 Len: 1048576
2021-03-22 14:25:34.984902  9416 172.31.17.42 → 172.31.28.161 NFS 0.000097 V4 Call READ StateID: 0x3640 Offset: 96468992 Len: 1048576
2021-03-22 14:25:34.984941  9421 172.31.17.42 → 172.31.28.161 NFS 0.000039 V4 Call READ StateID: 0x3640 Offset: 97517568 Len: 1048576
 
 ###############################################################################################
 
 
-I think there are two options to mitigate this behaviour, which I am listing below:
A) Raising the default maximum NFS readahead size, because the current 128 KB default doesn't seem sufficient for high-throughput, low-latency workloads. I strongly believe the NFS rsize mount option should be used as a variable in deciding the maximum NFS readahead size, which was the case before [1], whereas now it is always 128 KB regardless of the rsize mount option in use (see the illustration after this list). Also, I think clients running in high-latency, low-throughput environments shouldn't use 1 MB as rsize in their mount options (i.e. they should use a smaller rsize), because it may make matters worse even with a low maximum NFS readahead size.
B) Adding some logic to readahead to provide some kind of autotuning (similar to TCP autotuning), where the maximum readahead size can dynamically increase if the client/reader is constantly filling the readahead window.
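
To make option A concrete, here is a small illustration (my own sketch, not taken from any patch) of how a readahead default derived from rsize, as with the old 15*rsize behaviour, would scale down automatically for clients that use a smaller rsize:

for rsize in 1048576 262144 131072; do
    echo "rsize=$rsize -> read_ahead_kb=$(( 15 * rsize / 1024 ))"
done
# rsize=1048576 -> read_ahead_kb=15360   (our EFS recommendation)
# rsize=262144  -> read_ahead_kb=3840
# rsize=131072  -> read_ahead_kb=1920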
 
 
Links:
[1] https://www.spinics.net/lists/linux-nfs/msg75018.html
[2] https://docs.aws.amazon.com/efs/latest/ug/mounting-fs-nfs-mount-settings.html
[3] https://bugzilla.kernel.org/show_bug.cgi?id=204939
[4] https://www.mail-archive.com/[email protected]/msg1274743.html
 
 
Thank you.
 
Hazem
 
 




Amazon Web Services EMEA SARL, 38 avenue John F. Kennedy, L-1855 Luxembourg, R.C.S. Luxembourg B186284

Amazon Web Services EMEA SARL, Irish Branch, One Burlington Plaza, Burlington Road, Dublin 4, Ireland, branch registration number 908705



Subject: Re: NFS read performance degradation after upgrading to kernel 5.4.*

Hi Trond,

I am wondering if we should consider raising the default maximum NFS read ahead size given the facts I mentioned in my previous e-mail.

Thank you.

Hazem



2021-03-31 13:10:26

by Trond Myklebust

Subject: Re: NFS read performance degradation after upgrading to kernel 5.4.*

On Wed, 2021-03-31 at 12:53 +0000, Mohamed Abuelfotoh, Hazem wrote:
> Hi Trond,
>
> I am wondering if we should consider raising the default maximum NFS
> read ahead size given the facts I mentioned in my previous e-mail.
>

We can't keep changing the default every time someone tries to measure
a new workload.

The change in 5.4 was also measurement based, and was due to poor
performance in cases where rsize/wsize is smaller and readahead was
overshooting.




--
Trond Myklebust
CTO, Hammerspace Inc
4984 El Camino Real, Suite 208
Los Altos, CA 94022

http://www.hammer.space

Subject: Re: NFS read performance degradation after upgrading to kernel 5.4.*

Ok, I get that, but based on the facts mentioned in https://bugzilla.kernel.org/show_bug.cgi?id=204939 it looks like the bad behaviour was mainly related to using the maximum 1 MB rsize while readahead was 15 MB (from the 15*rsize equation), whereas on congested, low-throughput, or high-latency networks the client shouldn't be using 1 MB as rsize in the first place. My point is that it is reasonable to calculate the maximum readahead size based on the rsize in use, while now it is a constant 128 KB regardless of the configured rsize. We have seen multiple customers reporting this regression after upgrading to kernel 5.4, and I am pretty sure we will get more Linux clients reporting it as more people move to kernel 5.4 and later.

Thank you.

Hazem



Links

[1] https://bugzilla.kernel.org/show_bug.cgi?id=204939

On 31/03/2021, 15:10, "Trond Myklebust" <[email protected]> wrote:

CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.



On Wed, 2021-03-31 at 12:53 +0000, Mohamed Abuelfotoh, Hazem wrote:
> Hi Trond,
>
> I am wondering if we should consider raising the default maximum NFS
> read ahead size given the facts I mentioned in my previous e-mail.
>

We can't keep changing the default every time someone tries to measure
a new workload.

The change in 5.4 was also measurement based, and was due to poor
performance in cases where rsize/wsize is smaller and readahead was
overshooting.



> Thank you.
>
> Hazem
>
> On 29/03/2021, 17:07, "Mohamed Abuelfotoh, Hazem" <
> [email protected]> wrote:
>
>
> Hello Team,
>
> -We have got multiple customers complaining about NFS read
> performance degradation after they upgraded to kernel 5.4.*
>
> -After doing some deep dive and testing we figured out that the
> reason behind the regression was patch NFS: Optimise the default
> readahead size[1] Which has been merged to Linux kernels 5.4.* and
> above.
> -Our customers are using AWS EC2 instances as client mounting EFS
> export (which is AWS managed NFSV4 service), I am sharing the results
> that we got before & after the upgrade given that the NFS server(EFS)
> should be able to achieve between 250-300MB/sec which the clients can
> achieve without patch[1] while getting quarter of this speed around
> 70MB/sec with the mentioned patch merged as seen below.
>
>
> #####################################################################
> #####################
>
>
> Before the upgrade:
> # uname -r
> 4.14.225-168.357.amzn2.x86_64
> [root@ip-172-31-28-135 ec2-user]# sync; echo 3 >
> /proc/sys/vm/drop_caches
> [root@ip-172-31-28-135 ec2-user]# mount -t nfs4 -o
> nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,nore
> svport fs-6700f553.efs.eu-west-1.amazonaws.com:/ efs
> [root@ip-172-31-28-135 ec2-user]# rsync --progress efs/test .
> test
> 8,589,934,592 100% 313.20MB/s 0:00:26 (xfr#1, to-chk=0/1)
>
>
> #####################################################################
> #####################
>
> After the upgrade using the same client & server:
> #uname -r; sync; echo 3 > /proc/sys/vm/drop_caches; ./nfs-
> readahead show /home/ec2-user/efs/;rsync --progress efs/test .
> 5.4.0-1.132.64.amzn2.x86_64
> /home/ec2-user/efs 0:40 /sys/class/bdi/0:40/read_ahead_kb = 128
> test
> 1,073,741,824 100% 68.61MB/s 0:00:14 (xfr#1, to-chk=0/1)
>
>
> -We are recommending[2] EFS users to use rsize=1048576 as mount
> option for getting the best read performance from their EFS exports
> given that EC2 to EFS traffic is residing in the same AWS
> availability zone hence it has low latency and up to 250-300MB/sec
> throughput however with the mentioned patch merged the customer can’t
> achieve this throughput after the kernel upgrade because the default
> NFS read ahead has been decreased from (15*rsize)=15 MB to 128KB so
> the clients have to manually raise the manually raise the
> read_ahead_kb parameter from 128 to 15360 to get the same experience
> they were getting before the upgrade.
> -We understand that the purpose of the mentioned patch was to
> decrease OS boot time (for netboot users) and application start-up
> times on congested, low-throughput networks, as mentioned in [3].
> However, it also causes a regression for high-throughput, low-latency
> workloads, especially sequential read workflows.
> -After further debugging we also found that the maximum readahead
> size is constant: there is no autotuning of this setting even when
> the client keeps filling the readahead window. This means any NFS
> client, especially one using the maximum rsize mount option, has to
> manually tune its maximum NFS readahead size after the upgrade,
> which in my opinion is a regression from the behaviour of older
> kernels.
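>
> As an example of the manual workaround (a sketch; the bdi id 0:40
> below is the one from our reproduction and will differ per mount):
>
> # mountpoint -d /home/ec2-user/efs ------> prints the bdi id, e.g. 0:40
> # cat /sys/class/bdi/0:40/read_ahead_kb ------> 128 by default on 5.4+
> # echo 15360 > /sys/class/bdi/0:40/read_ahead_kb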
>
>
> #####################################################################
> ####################
>
> After increasing the maximum NFS readahead size to 15MB it’s
> clear that the readahead window expands as expected and keeps
> doubling until it reaches 15MB.
>
> Mar 29 11:25:18 ip-172-31-17-191 kernel: init_ra_size 256
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 256
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 59
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 512
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 1024
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 2048
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
>
>
> #####################################################################
> ####################
>
> With 128KB as the maximum NFS readahead size, the readahead
> window grows until it reaches the configured maximum readahead
> (128KB).
>
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: init_ra_size 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 40
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 40
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: init_ra_size 4
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 4
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 59
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
>
> -In my reproduction I used rsync, as shown above, and it always
> issues read syscalls requesting 256KB per call:
> 15:47:10.780658 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...
> , 262144) = 262144 <0.023749>
> 15:47:10.805467 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...
> , 262144) = 262144 <0.023739>
> 15:47:10.830272 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...
> , 262144) = 262144 <0.023664>
> 15:47:10.854972 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...
> , 262144) = 262144 <0.023814>
> 15:47:10.879837 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...
> , 262144) = 262144 <0.023625>
> 15:47:10.904496 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...
> , 262144) = 262144 <0.023645>
> 15:47:10.929180 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...
> , 262144) = 262144 <0.024072>
> 15:47:10.954308 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...
> , 262144) = 262144 <0.023669>
>
>
> -Looking into the readahead source code, I can see that
> readahead uses heuristics to determine whether the access pattern is
> sequential or random and then adjusts the readahead window (the
> amount of data it will prefetch) accordingly. Readahead also cannot
> read beyond the end of the file. In theory this means that a large
> maximum NFS readahead size (15MB) shouldn't have much impact on
> performance even with a random I/O pattern or with a data set
> consisting of small files; the only major impact of a large NFS
> readahead size would be some network congestion or boot-up delay on
> hosts using congested or low-throughput networks, as illustrated in
> https://bugzilla.kernel.org/show_bug.cgi?id=204939 &
> https://lore.kernel.org/linux-nfs/[email protected]/T/
> .
> -With patch
> https://www.spinics.net/lists/linux-nfs/msg75018.html applied, the
> packet captures show the client asking for either 128KB or 256KB in
> its NFS READ calls; it can't even reach the 1MB configured via the
> rsize mount option. This is because ondemand_readahead, which is
> responsible for moving and scaling the readahead window, has an "if"
> condition that was introduced by
> https://www.mail-archive.com/[email protected]/msg1274743.html
> . That patch changed readahead to issue the maximum of the user
> request size (rsync issues 256KB read requests) and the readahead max
> size (128KB by default), capped to the maximum request size on the
> device side (1MB in our case). The cap exists to avoid reading ahead
> too much if the application asks for a huge read. This is why, with a
> 128KB readahead size and an application asking for 256KB, we never
> exceed 256KB: the patch intentionally avoids limiting the requested
> data to the maximum readahead size, but we are still limited by the
> minimum of the amount of data the application is reading (256KB, as
> seen in the rsync strace output) and bdi->io_pages (256 pages = 1MB,
> as configured via the rsize mount option).
>
> -Output after adding some debugging to the kernel, showing the
> value of each variable in the mentioned "if" condition:
>
> [ 238.387788] req_size= 64 ------> 256KB rsync read requests
> [ 238.387790] io pages= 256 -----> 1MB, as supported by EFS and as
> configured in the rsize mount option.
> [ 238.390487] max_pages before= 32 -----> 128KB readahead size,
> which is the default.
> [ 238.393177] max_pages after= 64 ----> raised to 256KB because of
> the change mentioned in [4]: "max_pages = min(req_size, bdi->io_pages);"
>
>
> https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L435
>
> /*
>  * A minimal readahead algorithm for trivial sequential/random reads.
>  */
> static void ondemand_readahead(struct readahead_control *ractl,
>                 struct file_ra_state *ra, bool hit_readahead_marker,
>                 unsigned long req_size)
> {
>         struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host);
>         unsigned long max_pages = ra->ra_pages;
>         unsigned long add_pages;
>         unsigned long index = readahead_index(ractl);
>         pgoff_t prev_index;
>
>         /*
>          * If the request exceeds the readahead window, allow the
>          * read to be up to the optimal hardware IO size
>          */
>         if (req_size > max_pages && bdi->io_pages > max_pages)
>                 max_pages = min(req_size, bdi->io_pages);
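>
> Plugging in the values from the debug output above: ra->ra_pages = 32
> (128KB), req_size = 64 (256KB) and bdi->io_pages = 256 (1MB), so the
> condition is true and max_pages = min(64, 256) = 64 pages = 256KB,
> which matches the READ sizes in the packet capture below.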
>
>
>
> #####################################################################
> #############################
>
> -With 128KB as the default maximum readahead size, the packet
> capture from the client side shows NFSv4 READ calls with a read
> count varying between 128KB and 256KB.
>
> 73403 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call
> READ StateID: 0xcec2 Offset: 1072955392 Len: 131072
> 73404 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call
> READ StateID: 0xcec2 Offset: 1073086464 Len: 262144
> 73406 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply
> (Call In 73403)[Unreassembled Packet]
> 73415 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call
> READ StateID: 0xcec2 Offset: 1073348608 Len: 131072
> 73416 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply
> (Call In 73404)[Unreassembled Packet]
> 73428 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call
> READ StateID: 0xcec2 Offset: 1073479680 Len: 131072
> 73429 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply
> (Call In 73415)[Unreassembled Packet]
> 73438 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call
> READ StateID: 0xcec2 Offset: 1073610752 Len: 131072
> 73439 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply
> (Call In 73428)[Unreassembled Packet]
>
> -nfsstat shows that 8183 NFSv4 READ calls were required to read
> the 1GB file.
>
> # nfsstat
> Client rpc stats:
> calls retrans authrefrsh
> 8204 0 8204
>
> Client nfs v4:
> null read write commit open
> open_conf
> 1 0% 8183 99% 0 0% 0 0% 0 0%
> 0 0%
> open_noat open_dgrd close setattr fsinfo
> renew
> 1 0% 0 0% 1 0% 0 0% 2 0%
> 0 0%
> setclntid confirm lock lockt locku
> access
> 0 0% 0 0% 0 0% 0 0% 0 0%
> 1 0%
> getattr lookup lookup_root remove rename
> link
> 4 0% 1 0% 1 0% 0 0% 0 0%
> 0 0%
> symlink create pathconf statfs readlink
> readdir
> 0 0% 0 0% 1 0% 0 0% 0 0%
> 0 0%
> server_caps delegreturn getacl setacl fs_locations
> rel_lkowner
> 3 0% 0 0% 0 0% 0 0% 0 0%
> 0 0%
> secinfo exchange_id create_ses destroy_ses sequence
> get_lease_t
> 0 0% 0 0% 2 0% 1 0% 0 0%
> 0 0%
> reclaim_comp layoutget getdevinfo layoutcommit layoutreturn
> getdevlist
> 0 0% 1 0% 0 0% 0 0% 0 0%
> 0 0%
> (null)
> 1 0%
>
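> As a quick sanity check: 1GB / 128KB = 8192, so ~8183 READ calls is
> about what we'd expect when most READs are capped at 128KB (with a
> few 256KB ones, as in the capture above).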
>
> #####################################################################
> ##########################
>
> -When using 15MB as the maximum readahead size, the client sends
> 1MB NFSv4 READ requests and is therefore able to read the same 1GB
> file in 1024 NFS READ calls.
>
> #uname -r; mount -t nfs4 -o
> nfsvers=4.1,rsize=1052672,wsize=1048576,hard,timeo=600,retrans=2,nore
> svport fs-6700f553.efs.eu-west-1.amazonaws.com:/ efs; ./nfs-readahead
> show /home/ec2-user/efs/
> 5.3.9
> /home/ec2-user/efs 0:40 /sys/class/bdi/0:40/read_ahead_kb = 15360
> #sync; echo 3 > /proc/sys/vm/drop_caches
> #rsync --progress efs/test .
> test
> 1,073,741,824 100% 260.15MB/s 0:00:03 (xfr#1, to-chk=0/1)
> [root@ip-172-31-17-42 ec2-user]# nfsstat
> Client rpc stats:
> calls retrans authrefrsh
> 1043 0 1043
>
> Client nfs v4:
> null read write commit open
> open_conf
> 1 0% 1024 98% 0 0% 0 0% 0 0%
> 0 0%
> open_noat open_dgrd close setattr fsinfo
> renew
> 1 0% 0 0% 1 0% 0 0% 2 0%
> 0 0%
> setclntid confirm lock lockt locku
> access
> 0 0% 0 0% 0 0% 0 0% 0 0%
> 1 0%
> getattr lookup lookup_root remove rename
> link
> 2 0% 1 0% 1 0% 0 0% 0 0%
> 0 0%
> symlink create pathconf statfs readlink
> readdir
> 0 0% 0 0% 1 0% 0 0% 0 0%
> 0 0%
> server_caps delegreturn getacl setacl fs_locations
> rel_lkowner
> 3 0% 0 0% 0 0% 0 0% 0 0%
> 0 0%
> secinfo exchange_id create_ses destroy_ses sequence
> get_lease_t
> 0 0% 0 0% 2 0% 1 0% 0 0%
> 0 0%
> reclaim_comp layoutget getdevinfo layoutcommit layoutreturn
> getdevlist
> 0 0% 1 0% 0 0% 0 0% 0 0%
> 0 0%
> (null)
> 1 0%
>
> -The packet capture from the client side shows NFSv4 READ calls
> with a 1MB read count when the maximum NFS readahead size is 15MB.
>
> 2021-03-22 14:25:34.984731 9398 172.31.17.42 → 172.31.28.161 NFS
> 0.000375 V4 Call READ StateID: 0x3640 Offset: 94371840 Len: 1048576
> 2021-03-22 14:25:34.984805 9405 172.31.17.42 → 172.31.28.161 NFS
> 0.000074 V4 Call READ StateID: 0x3640 Offset: 95420416 Len: 1048576
> 2021-03-22 14:25:34.984902 9416 172.31.17.42 → 172.31.28.161 NFS
> 0.000097 V4 Call READ StateID: 0x3640 Offset: 96468992 Len: 1048576
> 2021-03-22 14:25:34.984941 9421 172.31.17.42 → 172.31.28.161 NFS
> 0.000039 V4 Call READ StateID: 0x3640 Offset: 97517568 Len: 1048576
>
>
> #####################################################################
> ##########################
>
>
> -I think there are 2 options to mitigate this behaviour, listed
> below:
> A) Raising the default maximum NFS readahead size, because the
> current default of 128KB doesn't seem to be sufficient for
> high-throughput, low-latency workloads. I strongly believe the NFS
> rsize mount option should be used as a variable when deciding the
> maximum NFS readahead size, as was the case before [1], whereas now
> it is always 128KB regardless of the rsize mount option in use (see
> the rough sketch after this list). Also, I think clients running in
> high-latency, low-throughput environments shouldn't use 1MB as the
> rsize in their mount options (i.e. they should use a smaller rsize),
> because it may make things worse even with a low maximum NFS
> readahead size.
> B) Adding some logic to readahead to provide some kind of autotuning
> (similar to TCP autotuning), where the maximum readahead size can
> increase dynamically when the client/reader is constantly filling
> up/utilizing the readahead window.
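>
> As a rough sketch of option A, loosely modelled on the pre-[1]
> behaviour (the identifiers below are approximate and this is not a
> tested patch), the NFS client could again derive its default
> readahead from the negotiated rsize instead of the fixed 128KB:
>
> /* hypothetical sketch: scale the default readahead with rsize again */
> #define NFS_MAX_READAHEAD (RPC_DEF_SLOT_TABLE - 1) /* 15 */
>
> /* server->rpages is rsize in pages, so this gives 15 * rsize */
> sb->s_bdi->ra_pages = server->rpages * NFS_MAX_READAHEAD;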
>
>
> Links:
> [1] https://www.spinics.net/lists/linux-nfs/msg75018.html
> [2]
> https://docs.aws.amazon.com/efs/latest/ug/mounting-fs-nfs-mount-settings.html
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=204939
> [4]
> https://www.mail-archive.com/[email protected]/msg1274743.html
>
>
> Thank you.
>
> Hazem
>
>
>
>
>
>
>
>
>

--
Trond Myklebust
CTO, Hammerspace Inc
4984 El Camino Real, Suite 208
Los Altos, CA 94022

http://www.hammer.space





Amazon Web Services EMEA SARL, 38 avenue John F. Kennedy, L-1855 Luxembourg, R.C.S. Luxembourg B186284

Amazon Web Services EMEA SARL, Irish Branch, One Burlington Plaza, Burlington Road, Dublin 4, Ireland, branch registration number 908705


2021-03-31 14:38:21

by Trond Myklebust

Subject: Re: NFS read performance degradation after upgrading to kernel 5.4.*

On Wed, 2021-03-31 at 13:31 +0000, Mohamed Abuelfotoh, Hazem wrote:
> OK, I got that, but based on the facts mentioned in
> https://bugzilla.kernel.org/show_bug.cgi?id=204939 it looks like the
> bad behaviour was mainly related to using the maximum rsize of 1MB
> while readahead was 15MB, based on the 15*rsize formula; on
> congested/low-throughput/high-latency networks the client shouldn't
> use 1MB as the rsize in the first place. My point is that it's
> reasonable to calculate the maximum readahead size from the rsize in
> use, whereas now it's constant at 128KB regardless of the configured
> rsize. We have seen multiple customers reporting this regression
> after upgrading to kernel 5.4, and I am pretty sure we will see more
> Linux clients reporting it as more people move to kernel 5.4 and
> later.
>

The readahead algorithm is designed to look at how large an area your
application is trying to read, and it tries to optimise for the rsize
because we tell it that is the "optimal read block size".

If you are seeing a 128K window, then that is because you're hitting
the heuristic case, where there is no guidance from the application,
and the kernel is basically trying to keep a minimal pipeline filled
just in case this turns out to be a sequential read.

Your users can change that heuristic value using the per-deviceid
entries in /sys/class/bdi. Please see
https://www.suse.com/support/kb/doc/?id=000017019

They can also use udev with a generic rule like this
(from https://access.redhat.com/solutions/407263 ):

SUBSYSTEM=="bdi", ACTION=="add", PROGRAM="/bin/awk -v bdi=$kernel 'BEGIN{ret=1} {if ($4 == bdi) {ret=0}} END{exit ret}' /proc/fs/nfsfs/volumes", ATTR{read_ahead_kb}="1048576"

The default setting is 128K because that is a global default setting.
Yes we can change it, but that needs to be motivated by explaining what
makes NFS so different from all the other filesystems that use the same
setting. The fact that some users have better hardware than others
isn't sufficient (and no, the default rsize setting does not take that
into consideration either).


--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]