2022-02-09 06:21:50

by Daire Byrne

Subject: NFSv4 versus NFSv3 parallel client op/s

Hi,

As part of my ongoing investigations into high latency WAN NFS
performance with only a single client (for the purposes of then
re-exporting), I have been looking at the metadata performance
differences between NFSv3 and NFSv4.2.

High latency seems to be a particularly good way of highlighting the
parallel/concurrency performance limitations with a single NFS client.
So I took a client 200ms away from the server and ran things like
open() and stat() calls to many files & directories using simultaneous
threads (200+) to see how many requests and operations we could keep
in flight simultaneously.
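
For reference, the load generator was essentially of the shape
sketched below - a simplified stand-in rather than the actual
benchmark, with placeholder paths. Each thread walks unique names so
every stat() costs a wire round trip instead of being answered from
the client's dentry/attribute cache:

/* build: cc -O2 -pthread stat-storm.c */
#include <pthread.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define NTHREADS 200
#define SECONDS  30

static long counts[NTHREADS];
static volatile int stop;

static void *worker(void *arg)
{
        long id = (long)arg, n;
        char path[128];
        struct stat st;

        for (n = 0; !stop; n++) {
                /* unique name per call: even a miss costs a LOOKUP RPC */
                snprintf(path, sizeof(path), "/mnt/nfs/d%ld/f%ld", id, n);
                stat(path, &st);
                counts[id]++;
        }
        return NULL;
}

int main(void)
{
        pthread_t t[NTHREADS];
        long i, total = 0;

        for (i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], NULL, worker, (void *)i);
        sleep(SECONDS);
        stop = 1;
        for (i = 0; i < NTHREADS; i++) {
                pthread_join(t[i], NULL);
                total += counts[i];
        }
        printf("aggregate: %.0f ops/s\n", (double)total / SECONDS);
        return 0;
}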

The executive summary is that NFSv4 is around 10x worse than NFSv3,
and an NFSv4 client clearly flatlines at around 180 ops/s at 200ms
RTT. By comparison, an NFSv3 client can do around 1,500 ops/s
(access+lookup) with the same test.

On paper, NFSv4 is more compelling over the WAN as it should reduce
round trips with things like compound operations and delegations, but
that's only good if it can do lots of them simultaneously too.
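
(For example, the client can send an open as a single
PUTFH+OPEN+GETFH+GETATTR compound - one round trip where NFSv3 needs
separate LOOKUP and ACCESS exchanges.)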

Comparing the slot table/xprt stats between the two protocols while
running the benchmark highlights the difference:

NFSv3
opts: rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,acregmin=3600,acregmax=3600,acdirmin=3600,acdirmax=3600,hard,nocto,noresvport,proto=tcp,nconnect=4,timeo=600,retrans=10,sec=sys,mountaddr=10.25.22.17,mountvers=3,mountport=20048,mountproto=udp,fsc,local_lock=none
xprt: tcp 0 1 2 0 0 85480 85380 0 6549783 0 102 166291 6296122
xprt: tcp 0 1 2 0 0 85827 85727 0 6575842 0 102 149914 6322130
xprt: tcp 0 1 2 0 0 85674 85574 0 6577487 0 102 131288 6320278
xprt: tcp 0 1 2 0 0 84943 84843 0 6505613 0 102 182313 6251396

NFSv4.2
opts: rw,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=3600,acregmax=3600,acdirmin=3600,acdirmax=3600,hard,nocto,noresvport,proto=tcp,nconnect=4,timeo=600,retrans=10,sec=sys,clientaddr=10.25.112.8,fsc,local_lock=none
xprt: tcp 0 0 2 0 0 301 301 0 1439 0 9 80 1058
xprt: tcp 0 0 2 0 0 294 294 0 1452 0 10 79 1085
xprt: tcp 0 0 2 0 0 292 292 0 1443 0 10 102 1055
xprt: tcp 0 0 2 0 0 287 286 0 1407 0 9 64 1067

So either we aren't putting things into the slot table quickly enough
for it to scale up, or it just isn't scaling for some other reason.

The max slots of 102 for NFSv3 versus 9-10 for NFSv4.2 (the 11th
numeric field in the xprt lines above) probably accounts for the
aggregate 10x difference I see in benchmarking?

I tried increasing /sys/module/nfs/parameters/max_session_slots from
64 to 128 on the client (via modprobe.conf & a reboot) but it didn't
seem to make much difference. Maybe it's a server-side limit then,
and the lower of the two values is being used:

fs/nfsd/state.h:
#define NFSD_SLOT_CACHE_SIZE 2048
/* Maximum number of NFSD_SLOT_CACHE_SIZE slots per session */
#define NFSD_CACHE_SIZE_SLOTS_PER_SESSION 32

I'm sure there are probably good reasons for these values (like
stopping a client from hogging the queue) but is this the reason I see
such a big difference in the performance of concurrency for a single
client over high latencies?

Why do I feel like in writing this all down, I have probably answered
my own question...

Cheers,

Daire


2022-02-09 19:11:44

by Tom Talpey

Subject: Re: NFSv4 versus NFSv3 parallel client op/s

On 2/7/2022 1:57 PM, Daire Byrne wrote:
> [...]
> I'm sure there are probably good reasons for these values (like
> stopping a client from hogging the queue) but is this the reason I see
> such a big difference in the performance of concurrency for a single
> client over high latencies?

Daire, I'm interested in your results if you increase the server slot
limits. Remember that the "slot" is an NFSv4.1+ protocol element. In
NFSv3 and v4.0, there is no protocol-based flow control, so the max
outstanding RPC counts are effectively the smaller of the client's and
server's RPC task and/or thread limits, and of course the wire itself.
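
(For comparison: NFSv3 has no session slots at all, and the Linux
client's TCP slot table grows on demand - IIRC up to the
sunrpc.tcp_max_slot_table_entries sysctl, 65536 by default - so with
nconnect=4 the practical ceiling there is really the number of
requesting threads.)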

With a 200msec RTT and a single synchronous stream, you'll get 5
ops/sec per slot; times 32 slots, that's 160 ops/sec, pretty much the
180 you see. So I'd expect it to rise linearly as you scale both
ends' slot numbers.

> Why do I feel like in writing this all down, I have probably answered
> my own question...

:)

Tom.

2022-02-18 21:20:16

by Daire Byrne

Subject: Re: NFSv4 versus NFSv3 parallel client op/s

On Wed, 9 Feb 2022 at 17:38, Tom Talpey <[email protected]> wrote:
>
> [...]
>
> Daire, I'm interested in your results if you increase the server slot
> limits. Remember that the "slot" is an NFSv4.1+ protocol element. In
> NFSv3 and v4.0, there is no protocol-based flow control, so the max
> outstanding RPC counts are effectively the smaller of the client's and
> server's RPC task and/or thread limits, and of course the wire itself.
>
> With a 200msec RTT and a single synchronous stream, you'll get 5
> ops/sec per slot; times 32 slots, that's 160 ops/sec, pretty much the
> 180 you see. So I'd expect it to rise linearly as you scale both
> ends' slot numbers.

I finally got around to testing this again. I recompiled a server kernel with:

NFSD_CACHE_SIZE_SLOTS_PER_SESSION=256

I ran some more tests and, as predicted, this helps a lot. Because
the client's default max_session_slots is 64 (where the server's is
32), I saw double the concurrency straight away.

And then as I increased the client's max_session_slots (up to 256)
it kept on improving. I guess I would need to set the server and
client slots to around 512 to see the same concurrency performance
as NFSv3 at 200ms.
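
For anyone reproducing this: the client side is just the module
parameter, set before the nfs module loads (hence the reboot), along
these lines:

# /etc/modprobe.d/nfs-slots.conf
options nfs max_session_slots=256

The server side was the one-line NFSD_CACHE_SIZE_SLOTS_PER_SESSION
change in fs/nfsd/state.h above and a kernel rebuild.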

Which I guess leads on to some questions:
1) Why is NFSD_CACHE_SIZE_SLOTS_PER_SESSION not a tunable? We don't
really want to maintain our own kernel compiles on our RHEL8 servers.
2) Why is the default Linux client slot count 64 while the server's
is 32? You can tune the Linux client down but not up (when using a
Linux server).
3) What would be the recommended and safest way to give a few
high-latency clients increased slots and concurrency?

I'm thinking it would be better to have the server default be higher
and the Linux client default be 32 instead, to replicate the current
situation. But no doubt there are other storage filers that already
rely on the fact that the Linux client uses 64 (e.g. cloud Netapps
and the like).

It's probably just a lot less hassle to stick with NFSv3 for this
kind of high-latency, multi-process concurrency use case.

Daire

2022-02-19 00:53:27

by Chuck Lever III

Subject: Re: NFSv4 versus NFSv3 parallel client op/s



> On Feb 18, 2022, at 4:26 PM, Tom Talpey <[email protected]> wrote:
>
>
> On 2/18/2022 2:04 PM, Daire Byrne wrote:
>> [...]
>> I finally got around to testing this again. I recompiled a server kernel with:
>> NFSD_CACHE_SIZE_SLOTS_PER_SESSION=256
>> I ran some more tests and, as predicted, this helps a lot. Because
>> the client's default max_session_slots is 64 (where the server's is
>> 32), I saw double the concurrency straight away.
>
> Nice, thanks for the followup!
>
>> And then as I increased the client's max_session_slots (up to 256)
>> it kept on improving. I guess I would need to set the server and
>> client slots to around 512 to see the same concurrency performance
>> as NFSv3 at 200ms.
>> Which I guess leads on to some questions:
>> 1) Why is NFSD_CACHE_SIZE_SLOTS_PER_SESSION not a tunable? We don't
>> really want to maintain our own kernel compiles on our RHEL8 servers.
>
> I totally agree that it's reasonable to allow tuning. And, 32 is a
> woefully small maximum.

As denizens of this community know, I don't relish adding
tuning knobs when the setting can be abused or set improperly.
You'll have to convince me that we can't construct a reasonable
and safe internal heuristic that determines a good default slot
count value. (meaning: adjustable is OK, but I'd prefer it to
be a dynamic and automated setting, not one that needs to be
set via an administrative interface).


>> 2) Why is the default Linux client slot count 64 while the server's
>> is 32? You can tune the Linux client down but not up (when using a
>> Linux server).
>
> That's for Trond and Chuck I guess.

For the Linux NFS server, there is an enhancement request open
in this area:

https://bugzilla.linux-nfs.org/show_bug.cgi?id=375

If there are any relevant design notes or performance results,
that would be the place to put them.

IIRC the only downside to a large default slot count on the
server is that it can waste memory, and it is difficult to handle
the corner cases when the server is running on a small physical
host (or in a small container).
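
Back of the envelope with the constants quoted earlier: at an
NFSD_SLOT_CACHE_SIZE of 2048 bytes, the current 32-slot cap bounds
each session's reply cache at 64KB, while 256 slots would allow 512KB
per session - negligible for one client, but it multiplies across
every session on a busy or memory-constrained server.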


>> 3) What would be the recommended and safest way to give a few
>> high-latency clients increased slots and concurrency?
>
> So, slot counts are negotiable, and dynamic, between client and
> server in NFSv4.1+. But I don't believe that either the Linux client
> or server allows them to change after starting a session.
>
> IMO the best way is to write some code to manage slots both to increase
> on demand and decrease on non-use. But dynamic credit management is a
> devilishly hard thing to get right. It won't be trivial.
>
>> I'm thinking it would be better to have the server default be higher
>> and the Linux client default be 32 instead, to replicate the current
>> situation. But no doubt there are other storage filers that already
>> rely on the fact that the Linux client uses 64 (e.g. cloud Netapps
>> and the like).
>
> If that's true, it'd be a shame. The protocol allows any value. No
> constant number will ever be "best", or even correct.
>
>> It's probably just a lot less hassle to stick with NFSv3 for this
>> kind of high-latency, multi-process concurrency use case.
>
> That, too, would be a shame. It's worth the effort to find a better
> NFSv4.1 Linux solution.
>
> Tom.

--
Chuck Lever



2022-02-19 07:35:57

by Tom Talpey

Subject: Re: NFSv4 versus NFSv3 parallel client op/s


On 2/18/2022 2:04 PM, Daire Byrne wrote:
> [...]
>
> I finally got around to testing this again. I recompiled a server kernel with:
>
> NFSD_CACHE_SIZE_SLOTS_PER_SESSION=256
>
> I ran some more tests and, as predicted, this helps a lot. Because
> the client's default max_session_slots is 64 (where the server's is
> 32), I saw double the concurrency straight away.

Nice, thanks for the followup!

> And then as I increased the client's max_session_slots (up to 256)
> it kept on improving. I guess I would need to set the server and
> client slots to around 512 to see the same concurrency performance
> as NFSv3 at 200ms.
>
> Which I guess leads on to some questions:
> 1) Why is NFSD_CACHE_SIZE_SLOTS_PER_SESSION not a tunable? We don't
> really want to maintain our own kernel compiles on our RHEL8 servers.

I totally agree that it's reasonable to allow tuning. And, 32 is a
woefully small maximum.

> 2) Why is the default Linux client slot count 64 while the server's
> is 32? You can tune the Linux client down but not up (when using a
> Linux server).

That's for Trond and Chuck I guess.

> 3) What would be the recommended and safest way to give a few
> high-latency clients increased slots and concurrency?

So, slot counts are negotiable, and dynamic, between client and
server in NFSv4.1+. But I don't believe that either the Linux client
or server allows them to change after starting a session.

IMO the best way is to write some code to manage slots both to increase
on demand and decrease on non-use. But dynamic credit management is a
devilishly hard thing to get right. It won't be trivial.

> I'm thinking it would be better to have the server default be higher
> and the Linux client default be 32 instead, to replicate the current
> situation. But no doubt there are other storage filers that already
> rely on the fact that the Linux client uses 64 (e.g. cloud Netapps
> and the like).

If that's true, it'd be a shame. The protocol allows any value. No
constant number will ever be "best", or even correct.

> It's probably just a lot less hassle to stick with NFSv3 for this
> kind of high-latency, multi-process concurrency use case.

That, too, would be a shame. It's worth the effort to find a better
NFSv4.1 Linux solution.

Tom.

2022-02-20 03:57:46

by Trond Myklebust

Subject: Re: NFSv4 versus NFSv3 parallel client op/s

On Sat, 2022-02-19 at 11:43 +1100, NeilBrown wrote:
> [...]
>
> I wonder if I have a login there..
>

If you're having trouble setting one up, then let me know. I should be
able to help.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2022-02-21 08:53:16

by NeilBrown

Subject: Re: NFSv4 versus NFSv3 parallel client op/s

On Sat, 19 Feb 2022, Chuck Lever III wrote:
>
> > On Feb 18, 2022, at 4:26 PM, Tom Talpey <[email protected]> wrote:
> >
> >
> > On 2/18/2022 2:04 PM, Daire Byrne wrote:
> >>
> >> 2) Why is the default Linux client slot count 64 while the server's
> >> is 32? You can tune the Linux client down but not up (when using a
> >> Linux server).
> >
> > That's for Trond and Chuck I guess.
>
> For the Linux NFS server, there is an enhancement request open
> in this area:
>
> https://bugzilla.linux-nfs.org/show_bug.cgi?id=375
>
> If there are any relevant design notes or performance results,
> that would be the place to put them.

I wonder if I have a login there..

>
> IIRC the only downside to a large default slot count on the
> server is that it can waste memory, and it is difficult to handle
> the corner cases when the server is running on a small physical
> host (or in a small container).

I would have a small default slot count (one page of slots??), which
automatically grew when it reached some level - say 70% - providing the
required kmalloc succeeded (with __GFP_NORETRY or similar so that it
doesn't try too hard). It would register a "shrinker" so that it could
respond to memory pressure and scale back the slot count when memory is
tight.

Freeing slot memory would not be quick, as you might need to wait for
the client to stop using it, so allocating new memory should be
correspondingly sluggish.

Shouldn't be too hard.... Definitely don't want a tunable for this.
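
Something of this shape, say (a rough, untested sketch; the
slot-accounting helpers are invented for illustration):

static unsigned long nfsd_slot_shrink_count(struct shrinker *s,
                                            struct shrink_control *sc)
{
        /* how many session slots are idle and could be retired */
        return atomic_long_read(&nfsd_idle_slot_count);
}

static unsigned long nfsd_slot_shrink_scan(struct shrinker *s,
                                           struct shrink_control *sc)
{
        /*
         * Retire up to sc->nr_to_scan idle slots. Slots a client
         * still has in flight can only go away once it stops using
         * them, so this may free fewer than asked for.
         */
        return nfsd_retire_session_slots(sc->nr_to_scan);
}

static struct shrinker nfsd_slot_shrinker = {
        .count_objects = nfsd_slot_shrink_count,
        .scan_objects  = nfsd_slot_shrink_scan,
        .seeks         = DEFAULT_SEEKS,
};

/*
 * Registered once at startup: register_shrinker(&nfsd_slot_shrinker);
 * growth would use kmalloc(size, GFP_KERNEL | __GFP_NORETRY) so we
 * never dig deep into reclaim just to widen a session.
 */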

NeilBrown

2022-02-21 09:16:32

by NeilBrown

Subject: Re: NFSv4 versus NFSv3 parallel client op/s

On Sat, 19 Feb 2022, Trond Myklebust wrote:
> [...]
>
> If you're having trouble setting one up, then let me know. I should be
> able to help.

Thanks. I couldn't find any evidence in email history of ever having
one, so I tried creating one and it went completely smoothly. :-)

NeilBrown

2022-02-21 23:21:53

by Daire Byrne

Subject: Re: NFSv4 versus NFSv3 parallel client op/s

I certainly have nothing to contribute in terms of implementation
details (you all know best). I will happily test and provide numbers
though.

I guess it was just one of those (out of the box) corner cases where
NFSv3 > NFSv4.2 that I thought was worth highlighting.

For now, I probably have no option but to use NFSv3 for any
high-latency clients that require lots of concurrency, as future
changes to the server-side per-client slot count logic are unlikely
to make their way into this generation of Linux distros (e.g. RHEL8).

Daire

On Mon, 21 Feb 2022 at 07:59, NeilBrown <[email protected]> wrote:
>
> [...]
>
> Thanks. I couldn't find any evidence in email history of ever having
> one, so I tried creating one and it went completely smoothly. :-)
>
> NeilBrown