From:   Daire Byrne <daire@dneg.com>
Date:   Mon, 7 Feb 2022 18:57:50 +0000
Message-ID: <CAPt2mGMZh9=Vwcqjh0J4XoTu3stOnKwswdzApL4wCA_usOFV_g@mail.gmail.com>
Subject: NFSv4 versus NFSv3 parallel client op/s
To:     linux-nfs <linux-nfs@vger.kernel.org>

Hi,

As part of my ongoing investigations into high-latency WAN NFS
performance with only a single client (which then re-exports the
mount), I have been looking at the metadata performance differences
between NFSv3 and NFSv4.2.

High latency seems to be a particularly good way of highlighting the
parallelism/concurrency limitations of a single NFS client. So I took
a client 200ms away from the server and ran open() and stat() calls
against many files & directories from 200+ simultaneous threads to see
how many requests and operations we could keep in flight at once.
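
For anyone wanting to reproduce this kind of test, here is a minimal sketch of the stat() half of it in Python. It is not the tool I used; the function name and the default of 200 threads are illustrative assumptions only.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def stat_ops_per_second(paths, threads=200):
    """stat() every path from a pool of threads; return calls completed per second."""

    def stat_one(path):
        try:
            os.stat(path)  # blocks for a round trip when the attributes are not cached
        except OSError:
            pass  # a missing or unreadable entry still counts as an attempted call
        return 1

    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        done = sum(pool.map(stat_one, paths))
    elapsed = time.monotonic() - start
    return done / elapsed if elapsed > 0 else float("inf")
```

The list of paths is best prepared somewhere other than the client under test (on the server, say), since walking the mount first would warm the client's attribute cache and the stat() calls would then never reach the wire.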

The executive summary is that NFSv4 is around 10x worse than NFSv3: an
NFSv4 client clearly flatlines at around 180 ops/s at 200ms latency.
By comparison, an NFSv3 client can do around 1,500 ops/s
(access+lookup) with the same test.

On paper, NFSv4 is more compelling over the WAN as it should reduce
round trips with things like compound operations and delegations, but
that's only good if it can do lots of them simultaneously too.

Comparing the slot table/xport stats between the two protocols while
running the benchmark highlights the difference:

NFSv3
opts: rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,acregmin=3600,acregmax=3600,acdirmin=3600,acdirmax=3600,hard,nocto,noresvport,proto=tcp,nconnect=4,timeo=600,retrans=10,sec=sys,mountaddr=10.25.22.17,mountvers=3,mountport=20048,mountproto=udp,fsc,local_lock=none
xprt: tcp 0 1 2 0 0 85480 85380 0 6549783 0 102 166291 6296122
xprt: tcp 0 1 2 0 0 85827 85727 0 6575842 0 102 149914 6322130
xprt: tcp 0 1 2 0 0 85674 85574 0 6577487 0 102 131288 6320278
xprt: tcp 0 1 2 0 0 84943 84843 0 6505613 0 102 182313 6251396

NFSv4.2
opts: rw,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=3600,acregmax=3600,acdirmin=3600,acdirmax=3600,hard,nocto,noresvport,proto=tcp,nconnect=4,timeo=600,retrans=10,sec=sys,clientaddr=10.25.112.8,fsc,local_lock=none
xprt: tcp 0 0 2 0 0 301 301 0 1439 0 9 80 1058
xprt: tcp 0 0 2 0 0 294 294 0 1452 0 10 79 1085
xprt: tcp 0 0 2 0 0 292 292 0 1443 0 10 102 1055
xprt: tcp 0 0 2 0 0 287 286 0 1407 0 9 64 1067

So either we aren't putting things into the slot table quickly enough
for it to scale up, or it just isn't scaling for some other reason.
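
To make the xprt lines above easier to compare, here is a rough Python sketch that names the counters and sums the max slots value across the transports of a mount. The field order is my understanding of the statvers=1.1 layout as decoded by the nfs-utils mountstats tool, so treat the names as an assumption; the helper names are made up for illustration.

```python
# Assumed field order for "xprt: tcp" lines (statvers=1.1).
XPRT_TCP_FIELDS = (
    "srcport", "bind_count", "connect_count", "connect_time", "idle_time",
    "sends", "recvs", "bad_xids", "req_u", "bklog_u",
    "max_slots", "sending_u", "pending_u",
)


def parse_xprt_tcp(line):
    """Turn one 'xprt: tcp ...' line into a dict of named integer counters."""
    tokens = line.split()
    if tokens[:2] != ["xprt:", "tcp"]:
        raise ValueError("not an xprt tcp line: %r" % line)
    return dict(zip(XPRT_TCP_FIELDS, (int(tok) for tok in tokens[2:])))


def total_max_slots(lines):
    """Sum max_slots over every tcp transport line of one mount."""
    return sum(parse_xprt_tcp(line)["max_slots"] for line in lines)


# The xprt lines quoted in this mail.
NFSV3_XPRT = [
    "xprt: tcp 0 1 2 0 0 85480 85380 0 6549783 0 102 166291 6296122",
    "xprt: tcp 0 1 2 0 0 85827 85727 0 6575842 0 102 149914 6322130",
    "xprt: tcp 0 1 2 0 0 85674 85574 0 6577487 0 102 131288 6320278",
    "xprt: tcp 0 1 2 0 0 84943 84843 0 6505613 0 102 182313 6251396",
]
NFSV42_XPRT = [
    "xprt: tcp 0 0 2 0 0 301 301 0 1439 0 9 80 1058",
    "xprt: tcp 0 0 2 0 0 294 294 0 1452 0 10 79 1085",
    "xprt: tcp 0 0 2 0 0 292 292 0 1443 0 10 102 1055",
    "xprt: tcp 0 0 2 0 0 287 286 0 1407 0 9 64 1067",
]
```

With these inputs, total_max_slots(NFSV3_XPRT) gives 408 and total_max_slots(NFSV42_XPRT) gives 38.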

Does the max slots value (the 11th number on each xprt line) of 102
for NFSv3 versus 9-10 for NFSv4.2 account for the aggregate difference
of roughly 10x I see in benchmarking?
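
As a back-of-envelope check, Little's law says throughput is at most the number of requests in flight divided by the round-trip time. The sketch below applies that to the max slots values from the xprt lines, summed across the four nconnect transports. It assumes each operation costs exactly one round trip and that max slots is the high-water mark of slots in use per transport (my understanding of that counter); the function name is hypothetical.

```python
def ops_ceiling(slots_in_flight, rtt_ms, round_trips_per_op=1):
    """Upper bound on ops/s when each slot completes one round trip per RTT."""
    return slots_in_flight * 1000 / (rtt_ms * round_trips_per_op)


RTT_MS = 200  # client is 200ms away from the server

# max slots per transport, taken from the xprt lines in this mail
nfsv42_ceiling = ops_ceiling(9 + 10 + 10 + 9, RTT_MS)  # 190.0 ops/s
nfsv3_ceiling = ops_ceiling(4 * 102, RTT_MS)           # 2040.0 ops/s
```

The 190 ops/s bound sits very close to the ~180 ops/s where NFSv4.2 flatlines, and the ~1,500 ops/s seen with NFSv3 is comfortably under its 2,040 ops/s bound. That is consistent with the slot count being the limit for NFSv4.2, though it doesn't prove it.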

I tried increasing /sys/module/nfs/parameters/max_session_slots from
64 to 128 on the client (modprobe.conf & reboot), but it didn't seem
to make much difference. Maybe it's a server-side limit then, and the
lower of the two is being used:

fs/nfsd/state.h:
#define NFSD_SLOT_CACHE_SIZE            2048
/* Maximum number of NFSD_SLOT_CACHE_SIZE slots per session */
#define NFSD_CACHE_SIZE_SLOTS_PER_SESSION       32

I'm sure there are probably good reasons for these values (like
stopping a client from hogging the queue) but is this the reason I see
such a big difference in the performance of concurrency for a single
client over high latencies?

Why do I feel like in writing this all down, I have probably answered
my own question...

Cheers,

Daire