2024-01-13 15:09:32

by Jeffrey Layton

Subject: Re: kernel.org list issues... / was: Fwd: Turn NFSD_MAX_* into tuneables ? / was: Re: Increasing NFSD_MAX_OPS_PER_COMPOUND to 96

On Sat, 2024-01-13 at 15:47 +0100, Roland Mainz wrote:
> Hi!
>
> ----
>
> Jeff/Chuck: Who is in charge of the SPAM filter at kernel.org? I'm
> having problems sending ANY messages to the mailing list, and it is
> starting to frustrate me... ;-(
>
> ----
>
> Bye,
> Roland
>
> Forwarded Conversation
> Subject: Turn NFSD_MAX_* into tuneables ? / was: Re: Increasing
> NFSD_MAX_OPS_PER_COMPOUND to 96
> ------------------------
>
> From: Roland Mainz <[email protected]>
> Date: Sat, Jan 13, 2024 at 3:22 PM
> To: Linux NFS Mailing List <[email protected]>
>
>
> On Sat, Jan 13, 2024 at 1:19 AM Dan Shelton <[email protected]> wrote:
> > We've been experiencing significant nfsd performance problems with a
> > customer who has a deeply nested filesystem hierarchy, lots of
> > subdirs, some of them 60-80 dirs deep (!!), which leads to an
> > exponential slowdown in nfsd accesses.
> >
> > Some of the issues have been addressed by implementing a better
> > directory walker via multiple dir fds and openat() (instead of just
> > cwd+open()), but the nfsd side still was a pretty dramatic issue,
> > until we bumped #define NFSD_MAX_OPS_PER_COMPOUND in
> > linux-6.7/fs/nfsd/nfsd.h from 50 to 96. After that the nfsd side
> > performed MUCH better.
>
> More general question:
> Is it feasible to turn the values for NFSD_MAX_* (max_ops,
> max_req etc., i.e. everything that is negotiated in an NFSv4.1
> session) into tuneables, which are set at nfsd startup? It might help
> with Dan's scenario, benchmarking, client testing (e.g. my case, where
> I switched to nfs4j) and tuning...
>

(re-cc'ing the mailing list...)

We generally don't like to add knobs like this when we can get by with
just tuning a sane value for everyone. This particular value governs the
maximum number of operations per compound. I don't see any value in
keeping it artificially low.
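
For reference, this is just a compile-time constant in
fs/nfsd/nfsd.h (v6.7; the comment here is paraphrased):

---- snip ----
/* Maximum number of operations per session compound */
#define NFSD_MAX_OPS_PER_COMPOUND       50
---- snip ----

which is exactly what Dan bumped to 96.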

The only real argument against it that I can see is that it might make
it easier for a malicious or badly-designed client to DoS the server.
That's certainly something we should be wary of, but I don't expect that
increasing the max from 50 to ~100 will make a big difference there.
--
Jeff Layton <[email protected]>


2024-01-13 16:11:30

by Chuck Lever

Subject: Re: kernel.org list issues... / was: Fwd: Turn NFSD_MAX_* into tuneables ? / was: Re: Increasing NFSD_MAX_OPS_PER_COMPOUND to 96



> On Jan 13, 2024, at 10:09 AM, Jeff Layton <[email protected]> wrote:
>
> On Sat, 2024-01-13 at 15:47 +0100, Roland Mainz wrote:
>>
>> On Sat, Jan 13, 2024 at 1:19 AM Dan Shelton <[email protected]> wrote:
>>> We've been experiencing significant nfsd performance problems with a
>>> customer who has a deeply nested filesystem hierarchy, lots of
>>> subdirs, some of them 60-80 dirs deep (!!), which leads to an
>>> exponential slowdown in nfsd accesses.
>>>
>>> Some of the issues have been addressed by implementing a better
>>> directory walker via multiple dir fds and openat() (instead of just
>>> cwd+open()), but the nfsd side still was a pretty dramatic issue,
>>> until we bumped #define NFSD_MAX_OPS_PER_COMPOUND in
>>> linux-6.7/fs/nfsd/nfsd.h from 50 to 96. After that the nfsd side
>>> performed MUCH better.
>>
>> More general question:
>> Is it feasible to turn the values for NFSD_MAX_* (max_ops,
>> max_req etc., i.e. everything that is negotiated in an NFSv4.1
>> session) into tuneables, which are set at nfsd startup? It might help
>> with Dan's scenario, benchmarking, client testing (e.g. my case, where
>> I switched to nfs4j) and tuning...
>>
>
> (re-cc'ing the mailing list...)
>
> We generally don't like to add knobs like this when we can get by with
> just tuning a sane value for everyone. This particular value governs the
> maximum number of operations per compound. I don't see any value in
> keeping it artificially low.
>
> The only real argument against it that I can see is that it might make
> it easier for a malicious or badly-designed client to DoS the server.
> That's certainly something we should be wary of, but I don't expect that
> increasing the max from 50 to ~100 will make a big difference there.

The server allocates memory and other resources based on the
largest COMPOUND it expects.

If we crank the maximum number, it has an impact on server
resource utilization. In particular, those extra COMPOUND
slots will almost never be used except in a handful of corner
cases.

Plus, this becomes a race against applications and workloads
that try to consume past that limit. We bump it, they use
more and hit the new limit. We bump it, lather, rinse,
repeat.

Indeed, if we increase that value enough, it does become a
server DoS vector by tying up all available nfsd threads
trying to execute enormous COMPOUNDs.

Upshot is I'm not in favor of increasing the max-ops limit or
making it tunable, unless we have grossly misunderstood the
issue.


>> Solaris 11 is known to send COMPOUNDs that are too large
>> during mount, but the rest of the time these three client
>> implementations are not known to send large COMPOUNDs.
> Actually the FreeBSD client is the same as Solaris, in that it does the
> entire mount path in one compound. If you were to attempt a mount
> with more than 48 components, it would exceed 50 ops in the compound.
> I don't think it can exceed 50 ops any other way.
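
(Back-of-the-envelope, assuming the mount COMPOUND is roughly
PUTROOTFH + one LOOKUP per component + GETFH: an N-component
path needs about N+2 ops, so 49 components is 51 ops and
blows past the limit of 50.)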


I'd like to see the raw packet captures to confirm that our
speculation about the problem is indeed correct. Since this
limit is hit only when mounting (and not at all by Linux
clients), I don't yet see how that would "make NFSD slow".


>> I guess your clients are trying to do a long pathwalk in a single
>> COMPOUND?
>
> Is there a problem with that (assuming NFSv4.1 session limits are honored)?

Yes: very clearly the client will hit a rather artificial
path length limit. And the limit isn't based on the character
length of the path: the limit is hit much sooner with a path
that is constructed from a series of very short component
names, for instance.

Good client implementations keep the number of operations per
COMPOUND limited to a small number, and break up operations
like path walks to ensure that the protocol and server
implementation do not impose any kind of application-visible
constraint.


>> Is this the windows client?
>
> No, the ms-nfs41-client (see
> https://github.com/kofemann/ms-nfs41-client) uses a limit of |16|, but
> it is on our ToDo list to bump that to |128| (but honoring the limit
> set by the NFSv4.1 server during session negotiation) since it now
> supports very long paths ([1]) and this issue is a known performance
> bottleneck.


A better way to optimize this case is to walk the path once
and cache the terminal component's file handle. This is what
Linux does, and it sounds like Dan's directory walker
optimizations do effectively the same thing.
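
In pseudo-C, the idea is something like this (all of the helper
names here are made up for illustration):

---- snip ----
/* Walk a path one component at a time, consulting a client-side
 * lookup cache first; only cache misses go on the wire. */
struct nfs_fh *resolve_path(struct lookup_cache *lc,
                            struct nfs_fh *dir_fh, const char *path)
{
        const char *comp;

        while ((comp = next_component(&path)) != NULL) {
                struct nfs_fh *fh = cache_lookup(lc, dir_fh, comp);

                if (fh == NULL) {
                        fh = send_lookup(dir_fh, comp); /* one round trip */
                        cache_insert(lc, dir_fh, comp, fh);
                }
                dir_fh = fh;
        }
        return dir_fh;  /* the terminal component's file handle */
}
---- snip ----

Once the path is cached, reopening the same file costs zero
LOOKUP round trips.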


--
Chuck Lever


2024-01-13 21:14:25

by Jeffrey Layton

Subject: Re: kernel.org list issues... / was: Fwd: Turn NFSD_MAX_* into tuneables ? / was: Re: Increasing NFSD_MAX_OPS_PER_COMPOUND to 96

On Sat, 2024-01-13 at 16:10 +0000, Chuck Lever III wrote:
>
> > On Jan 13, 2024, at 10:09 AM, Jeff Layton <[email protected]> wrote:
> >
> > On Sat, 2024-01-13 at 15:47 +0100, Roland Mainz wrote:
> > >
> > > On Sat, Jan 13, 2024 at 1:19 AM Dan Shelton <[email protected]> wrote:
> > > > We've been experiencing significant nfsd performance problems with a
> > > > customer who has a deeply nested filesystem hierarchy, lots of
> > > > subdirs, some of them 60-80 dirs deep (!!), which leads to an
> > > > exponential slowdown in nfsd accesses.
> > > >
> > > > Some of the issues have been addressed by implementing a better
> > > > directory walker via multiple dir fds and openat() (instead of just
> > > > cwd+open()), but the nfsd side still was a pretty dramatic issue,
> > > > until we bumped #define NFSD_MAX_OPS_PER_COMPOUND in
> > > > linux-6.7/fs/nfsd/nfsd.h from 50 to 96. After that the nfsd side
> > > > performed MUCH better.
> > >
> > > More general question:
> > > Is it feasible to turn the values for NFSD_MAX_* (max_ops,
> > > max_req etc., i.e. everything that is negotiated in an NFSv4.1
> > > session) into tuneables, which are set at nfsd startup? It might help
> > > with Dan's scenario, benchmarking, client testing (e.g. my case, where
> > > I switched to nfs4j) and tuning...
> > >
> >
> > (re-cc'ing the mailing list...)
> >
> > We generally don't like to add knobs like this when we can get by with
> > just tuning a sane value for everyone. This particular value governs the
> > maximum number of operations per compound. I don't see any value in
> > keeping it artificially low.
> >
> > The only real argument against it that I can see is that it might make
> > it easier for a malicious or badly-designed client to DoS the server.
> > That's certainly something we should be wary of, but I don't expect that
> > increasing the max from 50 to ~100 will make a big difference there.
>
> The server allocates memory and other resources based on the
> largest COMPOUND it expects.
>
> If we crank the maximum number, it has an impact on server
> resource utilization. In particular, those extra COMPOUND
> slots will almost never be used except in a handful of corner
> cases.
>
> Plus, this becomes a race against applications and workloads
> that try to consume past that limit. We bump it, they use
> more and hit the new limit. We bump it, lather, rinse,
> repeat.
>
> Indeed, if we increase that value enough, it does become a
> server DoS vector by tying up all available nfsd threads
> trying to execute enormous COMPOUNDs.
>
> Upshot is I'm not in favor of increasing the max-ops limit or
> making it tunable, unless we have grossly misunderstood the
> issue.
>

Does it? The only thing that I could see that scales directly with that
value is the size of struct nfsd_genl_rqstp. That's just part of the new
netlink stats interface, so I don't see that as a show stopper. Am I
missing something else that scales directly with
NFSD_MAX_OPS_PER_COMPOUND?
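
For reference, that struct embeds an op-number array sized by the
constant, roughly this (abridged, and the field names are from
memory, so they may differ slightly):

---- snip ----
struct nfsd_genl_rqstp {
        ...
        /* NFSv4 compound ops */
        u32     rq_opcnt;
        u32     rq_opnum[NFSD_MAX_OPS_PER_COMPOUND];
};
---- snip ----

so bumping the constant grows that array by 4 bytes per extra op.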

>
> > > Solaris 11 is known to send COMPOUNDs that are too large
> > > during mount, but the rest of the time these three client
> > > implementations are not known to send large COMPOUNDs.
> > Actually the FreeBSD client is the same as Solaris, in that it does the
> > entire mount path in one compound. If you were to attempt a mount
> > with more than 48 components, it would exceed 50 ops in the compound.
> > I don't think it can exceed 50 ops any other way.
>
>
> I'd like to see the raw packet captures to confirm that our
> speculation about the problem is indeed correct. Since this
> limit is hit only when mounting (and not at all by Linux
> clients), I don't yet see how that would "make NFSD slow".
>

It seems quite plausible that keeping the max low causes the client to
have to do a deep pathwalk using multiple RPCs instead of one. That
seems like it could have performance implications.

> > > I guess your clients are trying to do a long pathwalk in a single
> > > COMPOUND?
> >
> > Is there a problem with that (assuming NFSv4.1 session limits are honored)?
>
> Yes: very clearly the client will hit a rather artificial
> path length limit. And the limit isn't based on the character
> length of the path: the limit is hit much sooner with a path
> that is constructed from a series of very short component
> names, for instance.
>
> Good client implementations keep the number of operations per
> COMPOUND limited to a small number, and break up operations
> like path walks to ensure that the protocol and server
> implementation do not impose any kind of application-visible
> constraint.
>
>

Sure, and good servers try their best to deal with whatever the clients
throw at them. I don't really see the value in limiting the number of
ops per compound. Are we really any better off having the client break
those up into multiple round trips? Why?
--
Jeff Layton <[email protected]>

2024-01-13 23:09:56

by Chuck Lever

Subject: Re: kernel.org list issues... / was: Fwd: Turn NFSD_MAX_* into tuneables ? / was: Re: Increasing NFSD_MAX_OPS_PER_COMPOUND to 96



> On Jan 13, 2024, at 4:14 PM, Jeff Layton <[email protected]> wrote:
>
> On Sat, 2024-01-13 at 16:10 +0000, Chuck Lever III wrote:
>>
>>> On Jan 13, 2024, at 10:09 AM, Jeff Layton <[email protected]> wrote:
>>
>>>> Solaris 11 is known to send COMPOUNDs that are too large
>>>> during mount, but the rest of the time these three client
>>>> implementations are not known to send large COMPOUNDs.
>>> Actually the FreeBSD client is the same as Solaris, in that it does the
>>> entire mount path in one compound. If you were to attempt a mount
>>> with more than 48 components, it would exceed 50 ops in the compound.
>>> I don't think it can exceed 50 ops any other way.
>>
>> I'd like to see the raw packet captures to confirm that our
>> speculation about the problem is indeed correct. Since this
>> limit is hit only when mounting (and not at all by Linux
>> clients), I don't yet see how that would "make NFSD slow".
>
> It seems quite plausible that keeping the max low causes the client to
> have to do a deep pathwalk using multiple RPCs instead of one. That
> seems like it could have performance implications.

That's a lot of "mights" and "coulds." Not saying you're
wrong, but this needs some evidentiary backup.

No one has yet demonstrated how this limit directly impacts
perceived NFS server performance in this case. There has
been only speculation about what the clients are doing and
how much breaking up round trips would slow them down.

Again, if path walking is happening only at mount time, I
don't understand why it would have /any/ workload
performance impact to do it in multiple steps versus one.
Do you? A lengthy path walk during mount is not in the
performance path.

That's my main concern, and it's specific to this problem
report. It's not a concern about the actual value of NFSD's
max-ops, large or small.

Let's see packet captures and performance numbers before
making code changes, please? I don't think that's an
unreasonable request. My guess is there is some (bogus)
error-handling logic gumming up the works that gets
side-stepped when max-ops is large enough to handle these
requests in one COMPOUND.


> I don't really see the value in limiting the number of
> ops per compound. Are we really any better off having the client break
> those up into multiple round trips?

Yes, clients are better off handling this properly.


> Why?


Clients don't have any control over the max-ops limit that
a server places on a session. They really cannot depend on
it being large.

In fact, servers that are resource-constrained are
permitted to reduce max-ops and the maximum session slot
count (CB_RECALL_SLOT is one mechanism to do this). That
is totally expected and valid server behavior. (No, NFSD
does not do this currently, but the protocol allows it).

Fix the clients once, and they will be able to handle all
these scenarios transparently and efficiently against any
server, old or new.


--
Chuck Lever


2024-01-14 17:51:14

by Cedric Blancher

Subject: Re: kernel.org list issues... / was: Fwd: Turn NFSD_MAX_* into tuneables ? / was: Re: Increasing NFSD_MAX_OPS_PER_COMPOUND to 96

On Sat, 13 Jan 2024 at 17:11, Chuck Lever III <[email protected]> wrote:
>
>
>
> > On Jan 13, 2024, at 10:09 AM, Jeff Layton <[email protected]> wrote:
> >
> > On Sat, 2024-01-13 at 15:47 +0100, Roland Mainz wrote:
> >>
> >> On Sat, Jan 13, 2024 at 1:19 AM Dan Shelton <[email protected]> wrote:
> > Is there a problem with that (assuming NFSv4.1 session limits are honored)?
>
> Yes: very clearly the client will hit a rather artificial
> path length limit. And the limit isn't based on the character
> length of the path: the limit is hit much sooner with a path
> that is constructed from a series of very short component
> names, for instance.
>
> Good client implementations keep the number of operations per
> COMPOUND limited to a small number, and break up operations
> like path walks to ensure that the protocol and server
> implementation do not impose any kind of application-visible
> constraint.

This is not a "good client implementation"; it is bad design to force
single operations to be split into smaller pieces.

This has drastic implications, and all are BAD:
- increased latency, by adding more round trips to complete a single
vfs operation. Right now the NFSv4 Linux server implementation already
has enough issues with bad latency
- increased volume of network traffic
- decreased throughput
- worker threads have less to do per compound, but the number of
compounds goes up. There are no additional server threads to
compensate, and the fixed per-compound overhead simply multiplies
with the additional requests

This is basically what ruined X11 in the long run. The protocol split
everything into little requests, but over 20 years the networks did
not scale with the increase in CPU power, making X11 less and less
capable over the network. No one added more complex and powerful X11
requests, dooming X11 performance over the network.

So the Linux NFSv4 implementation is now doing the same, but at least
the protocol has knobs to scale it better.

Ced
--
Cedric Blancher <[email protected]>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur

2024-01-14 20:23:36

by Chuck Lever

Subject: Re: kernel.org list issues... / was: Fwd: Turn NFSD_MAX_* into tuneables ? / was: Re: Increasing NFSD_MAX_OPS_PER_COMPOUND to 96


> On Jan 14, 2024, at 12:50 PM, Cedric Blancher <[email protected]> wrote:
>
> On Sat, 13 Jan 2024 at 17:11, Chuck Lever III <[email protected]> wrote:
>>
>>
>>> On Jan 13, 2024, at 10:09 AM, Jeff Layton <[email protected]> wrote:
>>>
>>> On Sat, 2024-01-13 at 15:47 +0100, Roland Mainz wrote:
>>>>
>>>> On Sat, Jan 13, 2024 at 1:19 AM Dan Shelton <[email protected]> wrote:
>>> Is there a problem with that (assuming NFSv4.1 session limits are honored)?
>>
>> Yes: very clearly the client will hit a rather artificial
>> path length limit. And the limit isn't based on the character
>> length of the path: the limit is hit much sooner with a path
>> that is constructed from a series of very short component
>> names, for instance.
>>
>> Good client implementations keep the number of operations per
>> COMPOUND limited to a small number, and break up operations
>> like path walks to ensure that the protocol and server
>> implementation do not impose any kind of application-visible
>> constraint.
>
> This is not a "good client implementation"; it is bad design to force
> single operations to be split into smaller pieces.

NFSv4 client implementers have had 20+ years to find
ways to innovate using complex COMPOUNDs, and have yet
to do so. I am not forcing any design constraint on
NFSv4 clients -- clients already work this way, because
their VFS layers have already broken up the operations
before the NFS client layer even sees them.

You can blame the design of VFS for that. It really
isn't the result of NFSv4's COMPOUND architecture.

Now, for Dan's issue:

The mean size of NFSv4 COMPOUNDs observed in packet
captures is less than 10 ops. A 50-operation max-ops
limit has zero effect on the vast majority of on-the-wire
operations from these clients. Doubling that limit will
have no impact on these operations.

We already know that Solaris and FreeBSD send large
COMPOUNDs at mount time. And in particular, Solaris
and FreeBSD clients do not walk path names as part of
OPEN, READ, or WRITE operations, since both have very
capable directory name caches. So I honestly feel that
the path name walk thing is a red herring for Dan's
issue.

If the workloads involve complex readv() and writev()
system calls, these client implementations /might/ be
building complex COMPOUNDs to handle those calls in
a single RPC. We need to see packet captures to
understand what's going on.

That is why IMO it's unwise to increase upstream's
NFSD_MAX_OPS_PER_COMPOUND value without a proper
root-cause analysis. So far I have not seen any
convincing hard data that suggests that increasing
max-ops is doing anything but masking a deeper
problem.

For Roland's client, as I said, NFSv4.1 clients
have to stay within the bounds of the server's
max-ops and clients have no control of that. NFSD
might be changed to provide a larger max-ops, but
you guys have no control over other server
implementations. The better approach is to manage
what you do have control over.


--
Chuck Lever


2024-01-18 01:58:15

by Roland Mainz

Subject: RFE: Linux nfsd's |ca_maxoperations| should be at *least* |64| ... / was: Re: kernel.org list issues... / was: Fwd: Turn NFSD_MAX_* into tuneables ? / was: Re: Increasing NFSD_MAX_OPS_PER_COMPOUND to 96

On Sat, Jan 13, 2024 at 5:10 PM Chuck Lever III <[email protected]> wrote:
> > On Jan 13, 2024, at 10:09 AM, Jeff Layton <[email protected]> wrote:
> > On Sat, 2024-01-13 at 15:47 +0100, Roland Mainz wrote:
> >> On Sat, Jan 13, 2024 at 1:19 AM Dan Shelton <[email protected]> wrote:
[snip]
> >> Is this the windows client?
> > No, the ms-nfs41-client (see
> > https://github.com/kofemann/ms-nfs41-client) uses a limit of |16|, but
> > it is on our ToDo list to bump that to |128| (but honoring the limit
> > set by the NFSv4.1 server during session negotiation) since it now
> > supports very long paths ([1]) and this issue is a known performance
> > bottleneck.
>
> A better way to optimize this case is to walk the path once
> and cache the terminal component's file handle. This is what
> Linux does, and it sounds like Dan's directory walker
> optimizations do effectively the same thing.

That assumes that no process does random access into deep subdirs. In
that case the performance is absolutely terrible, unless you devote
lots of memory to a giant cache (which is not feasible due to cache
expiration limits, unless someone (please!) finally implements
directory delegations).

This also ignores the use case of WANs (wide-area networks) and WLANs
with their typically high latency and even higher amounts of network
packet loss&&retransmits, where the splitting of the requests comes
with a HUGE latency penalty (you can reproduce this with network
tools: just export a large tmpfs on the server, add a packet delay of
400ms between client and server, use a path like
"a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z/0/1/2/3/4/5/6/7/8/9",
and compile gcc).
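
(The 400ms delay can be injected with standard tools, e.g. on the
server something like "tc qdisc add dev eth0 root netem delay 400ms",
assuming eth0 is the interface the export is served on.)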

And in the real world the Linux nfsd |ca_maxoperations| default of
|16| is absolutely CRIPPLING.
For example, in the ms-nfs41-client we need 4 ops for the initial
setup of a file lookup, and then 3 per path component. That means
that a default of 16 just fits (16-4)/3=4 path elements.
Unfortunately the statistical average is not 4 - it's 11 (measured
over five weeks with 81 clients in our company).
Technically, in this scenario, a default of at least 11*3+4=37 would
be MUCH better.

That's why I think nfsd's |ca_maxoperations| should be at *least* |64|.
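
As a sketch of that arithmetic (the 4 and 3 op counts above are our
client's implementation detail, nothing mandated by the protocol):

---- snip ----
/* ops a single-COMPOUND lookup needs in ms-nfs41-client (sketch) */
static unsigned int maxops_needed(unsigned int path_depth)
{
        return 4 + 3 * path_depth; /* 4 setup ops + 3 per component */
}

/*
 * maxops_needed(4)  == 16 -- a limit of 16 just fits 4 components
 * maxops_needed(11) == 37 -- our measured average path depth
 */
---- snip ----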

----

Bye,
Roland
--
__ . . __
(o.\ \/ /.o) [email protected]
\__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
/O /==\ O\ TEL +49 641 3992797
(;O/ \/ \O;)

2024-01-18 09:44:59

by Martin Wege

Subject: Re: RFE: Linux nfsd's |ca_maxoperations| should be at *least* |64| ... / was: Re: kernel.org list issues... / was: Fwd: Turn NFSD_MAX_* into tuneables ? / was: Re: Increasing NFSD_MAX_OPS_PER_COMPOUND to 96

On Thu, Jan 18, 2024 at 2:57 AM Roland Mainz <[email protected]> wrote:
>
> On Sat, Jan 13, 2024 at 5:10 PM Chuck Lever III <[email protected]> wrote:
> > > On Jan 13, 2024, at 10:09 AM, Jeff Layton <[email protected]> wrote:
> > > On Sat, 2024-01-13 at 15:47 +0100, Roland Mainz wrote:
> > >> On Sat, Jan 13, 2024 at 1:19 AM Dan Shelton <[email protected]> wrote:
> [snip]
> > >> Is this the windows client?
> > > No, the ms-nfs41-client (see
> > > https://github.com/kofemann/ms-nfs41-client) uses a limit of |16|, but
> > > it is on our ToDo list to bump that to |128| (but honoring the limit
> > > set by the NFSv4.1 server during session negotiation) since it now
> > > supports very long paths ([1]) and this issue is a known performance
> > > bottleneck.
> >
> > A better way to optimize this case is to walk the path once
> > and cache the terminal component's file handle. This is what
> > Linux does, and it sounds like Dan's directory walker
> > optimizations do effectively the same thing.
>
> That assumes that no process does random access into deep subdirs. In
> that case the performance is absolutely terrible, unless you devote
> lots of memory to a giant cache (which is not feasible due to cache
> expiration limits, unless someone (please!) finally implements
> directory delegations).
>
> This also ignores the use case of WANs (wide-area networks) and WLANs
> with their typically high latency and even higher amounts of network
> packet loss&&retransmits, where the splitting of the requests comes
> with a HUGE latency penalty (you can reproduce this with network
> tools: just export a large tmpfs on the server, add a packet delay of
> 400ms between client and server, use a path like
> "a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z/0/1/2/3/4/5/6/7/8/9",
> and compile gcc).
>
> And in the real world the Linux nfsd |ca_maxoperations| default of
> |16| is absolutely CRIPPLING.
> For example, in the ms-nfs41-client we need 4 ops for the initial
> setup of a file lookup, and then 3 per path component. That means
> that a default of 16 just fits (16-4)/3=4 path elements.
> Unfortunately the statistical average is not 4 - it's 11 (measured
> over five weeks with 81 clients in our company).
> Technically, in this scenario, a default of at least 11*3+4=37 would
> be MUCH better.
>
> That's why I think nfsd's |ca_maxoperations| should be at *least* |64|.
>

+1

I even consider the default value of 16 a bug, given the circumstances.

Thanks,
Martin

2024-01-18 14:54:04

by Chuck Lever

Subject: Re: RFE: Linux nfsd's |ca_maxoperations| should be at *least* |64| ... / was: Re: kernel.org list issues... / was: Fwd: Turn NFSD_MAX_* into tuneables ? / was: Re: Increasing NFSD_MAX_OPS_PER_COMPOUND to 96


> On Jan 18, 2024, at 4:44 AM, Martin Wege <[email protected]> wrote:
>
> On Thu, Jan 18, 2024 at 2:57 AM Roland Mainz <[email protected]> wrote:
>>
>> On Sat, Jan 13, 2024 at 5:10 PM Chuck Lever III <[email protected]> wrote:
>>>> On Jan 13, 2024, at 10:09 AM, Jeff Layton <[email protected]> wrote:
>>>> On Sat, 2024-01-13 at 15:47 +0100, Roland Mainz wrote:
>>>>> On Sat, Jan 13, 2024 at 1:19 AM Dan Shelton <[email protected]> wrote:
>> [snip]
>>>>> Is this the windows client?
>>>> No, the ms-nfs41-client (see
>>>> https://github.com/kofemann/ms-nfs41-client) uses a limit of |16|, but
>>>> it is on our ToDo list to bump that to |128| (but honoring the limit
>>>> set by the NFSv4.1 server during session negotiation) since it now
>>>> supports very long paths ([1]) and this issue is a known performance
>>>> bottleneck.
>>>
>>> A better way to optimize this case is to walk the path once
>>> and cache the terminal component's file handle. This is what
>>> Linux does, and it sounds like Dan's directory walker
>>> optimizations do effectively the same thing.
>>
>> That assumes that no process does random access into deep subdirs. In
>> that case the performance is absolutely terrible, unless you devote
>> lots of memory to a giant cache (which is not feasible due to cache
>> expiration limits, unless someone (please!) finally implements
>> directory delegations).

Do you mean not feasible for your client? Lookup caches
have been part of operating systems for decades. Solaris,
FreeBSD, and Linux all have one. Does the Windows kernel
have one that ms-nfs41-client can use?


>> This also ignores the use case of WANs (wide-area networks) and WLANs
>> with their typically high latency and even higher amounts of network
>> packet loss&&retransmits, where the splitting of the requests comes
>> with a HUGE latency penalty (you can reproduce this with network
>> tools: just export a large tmpfs on the server, add a packet delay of
>> 400ms between client and server, use a path like
>> "a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z/0/1/2/3/4/5/6/7/8/9",
>> and compile gcc).

The most frequently implemented solution to this problem
is a lookup cache. Operating systems use it for local
on-disk filesystems as well as for NFS.

In the local filesystem case:

Think about how long each path resolution would take if
the operating system had to consult on-disk information
for every component in the pathname.

In the NFS case:

The fastest round trip is no round trip. Keep a local
cache and path resolution will be fast no matter what
the network latency is.

Note that the NFS server is going to use a lookup cache
to make large path resolution COMPOUNDs go fast. It
would be even faster (from the application's point of
view) if that cache were local to the client.

Sending a full path in a single COMPOUND is one way to
handle path resolution, but it has so many limitations
that it's really not the mechanism of choice.


>> And in the real world the Linux nfsd |ca_maxoperations| default of
>> |16| is absolutely CRIPPLING.
>> For example, in the ms-nfs41-client we need 4 ops for the initial
>> setup of a file lookup, and then 3 per path component. That means
>> that a default of 16 just fits (16-4)/3=4 path elements.
>> Unfortunately the statistical average is not 4 - it's 11 (measured
>> over five weeks with 81 clients in our company).
>> Technically, in this scenario, a default of at least 11*3+4=37 would
>> be MUCH better.
>>
>> That's why I think nfsd's |ca_maxoperations| should be at *least* |64|.
>
> +1
>
> I consider the default value of 16 even a bug, given the circumstances.

This is not an NFSD bug. Read to the bottom to see where
the real problem is.

Here are the CREATE_SESSION arguments from a Linux client:

csa_fore_chan_attrs
hdr pad size: 0
max req size: 1049620
max resp size: 1049480
max resp size cached: 7584
max ops: 8
max reqs: 64
csa_back_chan_attrs
hdr pad size: 0
max req size: 4096
max resp size: 4096
max resp size cached: 0
max ops: 2
max reqs: 16

The ca_maxoperations field contains 8.

The response from NFSD looks like this:

csr_fore_chan_attrs
hdr pad size: 0
max req size: 1049620
max resp size: 1049480
max resp size cached: 2128
max ops: 8
max reqs: 30
csr_back_chan_attrs
hdr pad size: 0
max req size: 4096
max resp size: 4096
max resp size cached: 0
max ops: 2
max reqs: 16

The ca_maxoperations field again contains 8.

Here's what RFC 8881 Section 18.36.3 says:

> ca_maxoperations:
> The maximum number of operations the replier will accept
> in a COMPOUND or CB_COMPOUND. For the backchannel, the
> server MUST NOT change the value the client offers. For
> the fore channel, the server MAY change the requested
> value. After the session is created, if a requester sends
> a COMPOUND or CB_COMPOUND with more operations than
> ca_maxoperations, the replier MUST return
> NFS4ERR_TOO_MANY_OPS.

The BCP 14 "MAY" here means that servers can return the same
value, but clients have to expect that a server might return
something different.

Further, the spec does not permit an NFS server to respond to
a COMPOUND with more than the client's ca_maxoperations in
any way other than to return NFS4ERR_TOO_MANY_OPS. So it
cannot return a larger ca_maxoperations than the client sent.

NFSD returns the minimum of the client's max-ops and its own
NFSD_MAX_OPS_PER_COMPOUND value, which is 50. Thus NFSD will
return the same value as the client, unless the client asks
for more than 50.
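
In code, the clamp amounts to something like this (a sketch of the
fore-channel attribute check in fs/nfsd/nfs4state.c):

---- snip ----
        /* in check_forechannel_attrs(), effectively: */
        ca->maxops = min_t(u32, ca->maxops, NFSD_MAX_OPS_PER_COMPOUND);
---- snip ----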

So, the only reason NFSD returns 16 to your client is because
your client sets a value of 16 in its CREATE_SESSION Call. If
your client sent a larger value (like, 11*3+4), then NFSD will
respect that limit instead.

The spec is very clear about how this needs to work, and
NFSD is 100% compliant to the spec here. It's the client that
has to request a larger limit.


--
Chuck Lever


2024-03-16 11:55:47

by Roland Mainz

Subject: |ca_maxoperations| - tuneable ? / was: Re: RFE: Linux nfsd's |ca_maxoperations| should be at *least* |64| ... / was: Re: kernel.org list issues... / was: Fwd: Turn NFSD_MAX_* into tuneables ? / was: Re: Increasing NFSD_MAX_OPS_PER_COMPOUND to 96

On Thu, Jan 18, 2024 at 3:52 PM Chuck Lever III <[email protected]> wrote:
> > On Jan 18, 2024, at 4:44 AM, Martin Wege <[email protected]> wrote:
> > On Thu, Jan 18, 2024 at 2:57 AM Roland Mainz <[email protected]> wrote:
> >> On Sat, Jan 13, 2024 at 5:10 PM Chuck Lever III <[email protected]> wrote:
> >>>> On Jan 13, 2024, at 10:09 AM, Jeff Layton <[email protected]> wrote:
> >>>> On Sat, 2024-01-13 at 15:47 +0100, Roland Mainz wrote:
> >>>>> On Sat, Jan 13, 2024 at 1:19 AM Dan Shelton <[email protected]> wrote:
[snip]
> >> That assumes that no process does random access into deep subdirs. In
> >> that case the performance is absolutely terrible, unless you devote
> >> lots of memory to a giant cache (which is not feasible due to cache
> >> expiration limits, unless someone (please!) finally implements
> >> directory delegations).
>
> Do you mean not feasible for your client? Lookup caches
> have been part of operating systems for decades. Solaris,
> FreeBSD, and Linux all have one. Does the Windows kernel
> have one that ms-nfs41-client can use?

The ms-nfs41-client has its own cache.
Technically Windows has another, but that is in the kernel and
difficult to connect to the NFS client daemon without performance
issues.

[snip]
> Sending a full path in a single COMPOUND is one way to
> handle path resolution, but it has so many limitations
> that it's really not the mechanism of choice.

Which limitations?

The reason why I am looking to stuff more info into a request:
- VPN has very high latency, so splitting requests hurts performance *BADLY*.
I've been slapped about path/dir lookup performance now many times,
and while there is more than one issue (Cygwin looks for "file" and
"file.lnk"&co for each file + our readdir implementation needs lots of
work) the biggest issue is that we split requests up because they usually
do not fit.
- The Windows API is async+multithreaded, which means that requests
do not always come in the logical/expected/useful order, which leads
to cache issues.
Seriously, this issue is so bad that it is worth a research paper.
- Real-world paths on Windows are LONG with many subdirs, even worse
when projects and organisations change, shift, reorganise, move,
merge, split, get outsourced etc. over *DECADES*. Plus non-IT-users
have zero awareness about "path limits", and sometimes dump whole
sentences into directory names (e.g. "customer XYZ. can be ignored he
terminated the business relationship on 26 May 2001. please do not
delete dir" <----- xxx@@!!!! ).
That issue haunts us in other ways too, e.g. in the ms-nfs41-client
project I had to extend the maximum supported path length multiple
times to support this craziness; right now we support 4096-byte paths
([1]), with the longest known path being 1772 bytes, and others reported
even more.
And this is not an issue specific to my current employer; I've seen
this in customer installations when I was at SUN (including long
debates about Solaris's 1024-byte limit) and RedHat too.

[1]=Windows opened the next Pandora's box by removing the MAX_PATH
limit a while ago, e.g. see
https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry
- and even before that there was the "\\?\" prefix.

[snip]
> > ca_maxoperations:
> > The maximum number of operations the replier will accept
> > in a COMPOUND or CB_COMPOUND. For the backchannel, the
> > server MUST NOT change the value the client offers. For
> > the fore channel, the server MAY change the requested
> > value. After the session is created, if a requester sends
> > a COMPOUND or CB_COMPOUND with more operations than
> > ca_maxoperations, the replier MUST return
> > NFS4ERR_TOO_MANY_OPS.
>
> The BCP 14 "MAY" here means that servers can return the same
> value, but clients have to expect that a server might return
> something different.
>
> Further, the spec does not permit an NFS server to respond to
> a COMPOUND with more than the client's ca_maxoperations in
> any way other than to return NFS4ERR_TOO_MANY_OPS. So it
> cannot return a larger ca_maxoperations than the client sent.
>
> NFSD returns the minimum of the client's max-ops and its own
> NFSD_MAX_OPS_PER_COMPOUND value, which is 50. Thus NFSD will
> return the same value as the client, unless the client asks
> for more than 50.

I finally (yay - Saturday) had a look at this issue and
collected&&processed statistics.
With a Linux 6.6.20-rt25 kernel nfsd I get this in the ms-nfs41-client:
---- snip ----
1010: requested: req.csa_fore_chan_attrs.(ca_maxoperations=16384,
ca_maxrequests=128)
1010: response: session->fore_chan_attrs->(ca_maxoperations=50,
ca_maxrequests=66)
---- snip ----

So - if I understand it correctly - the negotiation works correctly,
and we get |ca_maxoperations=50| and |ca_maxrequests=66|.

But... this value is too small, at least for what we do on Windows.
I've collected samples (84 machines, a wide range of users, MS Office,
ERP, CAD, etc.) and 71% of all server lookup calls had to be split
(Linux 6.6 LTS kernel nfsd) for |ca_maxoperations==50|, 39% for
|ca_maxoperations==64| and <1% for |ca_maxoperations==80|.

Question is... should the values for |ca_*| be tuneables, or should we
just increase the limit to |80| ([1])?

[1]=I can provide the patch, with sufficient curses about Windows
*USERS* included...

----

Bye,
Roland
--
__ . . __
(o.\ \/ /.o) [email protected]
\__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
/O /==\ O\ TEL +49 641 3992797
(;O/ \/ \O;)

2024-03-16 13:16:36

by Roland Mainz

Subject: Re: |ca_maxoperations| - tuneable ? / was: Re: RFE: Linux nfsd's |ca_maxoperations| should be at *least* |64| ... / was: Re: kernel.org list issues... / was: Fwd: Turn NFSD_MAX_* into tuneables ? / was: Re: Increasing NFSD_MAX_OPS_PER_COMPOUND to 96

On Sat, Mar 16, 2024 at 12:55 PM Roland Mainz <[email protected]> wrote:
> On Thu, Jan 18, 2024 at 3:52 PM Chuck Lever III <[email protected]> wrote:
> > > On Jan 18, 2024, at 4:44 AM, Martin Wege <[email protected]> wrote:
> > > On Thu, Jan 18, 2024 at 2:57 AM Roland Mainz <[email protected]> wrote:
> > >> On Sat, Jan 13, 2024 at 5:10 PM Chuck Lever III <[email protected]> wrote:
> > >>>> On Jan 13, 2024, at 10:09 AM, Jeff Layton <[email protected]> wrote:
> > >>>> On Sat, 2024-01-13 at 15:47 +0100, Roland Mainz wrote:
> > >>>>> On Sat, Jan 13, 2024 at 1:19 AM Dan Shelton <[email protected]> wrote:
[snip]
> Question is... should the values for |ca_*| be tuneables, or should we
> just increase the limit to |80| ([1])?

Actually a tuneable (defaulting to |80|) would be good, since we
have to test against different configs/kernels anyway, and a tuneable
would make that easier...

----

Bye,
Roland
--
__ . . __
(o.\ \/ /.o) [email protected]
\__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
/O /==\ O\ TEL +49 641 3992797
(;O/ \/ \O;)

2024-03-16 16:35:55

by Chuck Lever

Subject: Re: |ca_maxoperations| - tuneable ? / was: Re: RFE: Linux nfsd's |ca_maxoperations| should be at *least* |64| ... / was: Re: kernel.org list issues... / was: Fwd: Turn NFSD_MAX_* into tuneables ? / was: Re: Increasing NFSD_MAX_OPS_PER_COMPOUND to 96



> On Mar 16, 2024, at 7:55 AM, Roland Mainz <[email protected]> wrote:
>
> On Thu, Jan 18, 2024 at 3:52 PM Chuck Lever III <[email protected]> wrote:
>>> On Jan 18, 2024, at 4:44 AM, Martin Wege <[email protected]> wrote:
>>> On Thu, Jan 18, 2024 at 2:57 AM Roland Mainz <[email protected]> wrote:
>>>> On Sat, Jan 13, 2024 at 5:10 PM Chuck Lever III <[email protected]> wrote:
>>>>>> On Jan 13, 2024, at 10:09 AM, Jeff Layton <[email protected]> wrote:
>>>>>> On Sat, 2024-01-13 at 15:47 +0100, Roland Mainz wrote:
>>>>>>> On Sat, Jan 13, 2024 at 1:19 AM Dan Shelton <[email protected]> wrote:
> [snip]
>>>> That assumes that no process does random access into deep subdirs. In
>>>> that case the performance is absolutely terrible, unless you devote
>>>> lots of memory to a giant cache (which is not feasible due to cache
>>>> expiration limits, unless someone (please!) finally implements
>>>> directory delegations).
>>
>> Do you mean not feasible for your client? Lookup caches
>> have been part of operating systems for decades. Solaris,
>> FreeBSD, and Linux all have one. Does the Windows kernel
>> have one that ms-nfs41-client can use?
>
> The ms-nfs41-client has its own cache.
> Technically Windows has another, but that is in the kernel and
> difficult to connect to the NFS client daemon without performance
> issues.
>
> [snip]
>> Sending a full path in a single COMPOUND is one way to
>> handle path resolution, but it has so many limitations
>> that it's really not the mechanism of choice.
>
> Which limitations?

The most important limitation is the maximum size of
a forward channel RPC Call and Reply:

count4 ca_maxrequestsize;
count4 ca_maxresponsesize;

You can't put more COMPOUND operations in a single RPC
than will fit within these limits.
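
(For scale: with the 1,049,620-byte max request size shown earlier
in this thread, a 50-op COMPOUND of small ops is nowhere near the
byte limit, so today it is the op count, not the request size,
that binds.)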


> The reason why I am looking to stuff more info into a request:
> - VPN has very high latency, so splitting requests hurts performance *BADLY*.

Sure, if your client serializes the requests as you
describe below, adding network transit latency is
going to be a problem. I recommend that your client
not rely on the server and network to guarantee
request processing order. It should instead enforce
its own ordering requirements.


> I've been slapped about path/dir lookup performance now many times,
> and while there is more than one issue (Cygwin looks for "file" and
> "file.lnk"&co for each file + our readdir implementation needs lots of
> work) the biggest issue is that we split requests up because they usually
> do not fit.

High latency is a well-understood problem. You are
better off caching lookup results
on your client to reduce the amount of slow
interaction a client has with the server. This is
the way every other NFS client works.


> - The Windows API is async+multithreaded, which means that requests
> do not always come in the logical/expected/useful order, which leads
> to cache issues.
> Seriously, this issue is so bad that it is worth a research paper.

Your client really should serialize itself and not
rely on the server for ordering. If the client has a
serialization requirement, it needs to enforce that
itself. Any modern I/O system is going to be "fire
and forget" -- it will then wait and handle the
replies in whatever order they arrive. Your client
caches should do the same.


> - Real-world paths on Windows are LONG with many subdirs, even worse
> when projects and organisations change, shift, reorganise, move,
> merge, split, get outsourced etc. over *DECADES*. Plus non-IT-users
> have zero awareness about "path limits", and sometimes dump whole
> sentences into directory names (e.g. "customer XYZ. can be ignored he
> terminated the business relationship on 26 May 2001. please do not
> delete dir" <----- xxx@@!!!! ).
> That issue haunts us in other ways too, e.g. in the ms-nfs41-client
> project I had to extend the maximum supported path length multiple
> times to support this craziness, right now we support 4096 byte paths
> ([1]), with the longest known path being 1772, and others reported
> even more.

Again, your client really needs to handle this
scalably by breaking the path up one component at
a time and caching the directory hierarchy
locally. It's not going to work by bumping up
these limits over time because you will always
hit some limit in the protocol.


> And this is not an issue specific to my current employer; I've seen
> this in customer installations when I was at SUN (including long
> debates about Solaris's 1024-byte limit) and RedHat too.

POSIX-based filesystems have hard limits on path
length in bytes. That's not going to
change just because these file systems are
exported via NFS.


> [1]=Windows opened the next Pandora's box by removing the MAX_PATH
> limit a while ago, e.g. see
> https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry
> - and even before that there was the "\\?\" prefix.
>
> [snip]
>>> ca_maxoperations:
>>> The maximum number of operations the replier will accept
>>> in a COMPOUND or CB_COMPOUND. For the backchannel, the
>>> server MUST NOT change the value the client offers. For
>>> the fore channel, the server MAY change the requested
>>> value. After the session is created, if a requester sends
>>> a COMPOUND or CB_COMPOUND with more operations than
>>> ca_maxoperations, the replier MUST return
>>> NFS4ERR_TOO_MANY_OPS.
>>
>> The BCP 14 "MAY" here means that servers can return the same
>> value, but clients have to expect that a server might return
>> something different.
>>
>> Further, the spec does not permit an NFS server to respond to
>> a COMPOUND with more than the client's ca_maxoperations in
>> any way other than to return NFS4ERR_TOO_MANY_OPS. So it
>> cannot return a larger ca_maxoperations than the client sent.
>>
>> NFSD returns the minimum of the client's max-ops and its own
>> NFSD_MAX_OPS_PER_COMPOUND value, which is 50. Thus NFSD will
>> return the same value as the client, unless the client asks
>> for more than 50.
>
> I finally (yay - Saturday) had a look at this issue and
> collected&&processed statistics.
> With a Linux 6.6.20-rt25 kernel nfsd I get this in the ms-nfs41-client:
> ---- snip ----
> 1010: requested: req.csa_fore_chan_attrs.(ca_maxoperations=16384,
> ca_maxrequests=128)
> 1010: response: session->fore_chan_attrs->(ca_maxoperations=50,
> ca_maxrequests=66)
> ---- snip ----
>
> So - if I understand it correctly - the negotiation works correctly,
> and we get |ca_maxoperations=50| and |ca_maxrequests=66|.

> But... this value is too small, at least for what we do on Windows.
> I've collected samples (84 machines, a wide range of users, MS Office,
> ERP, CAD, etc.) and 71% of all server lookup calls had to be split
> (Linux 6.6 LTS kernel nfsd) for |ca_maxoperations==50|, 39% for
> |ca_maxoperations==64| and <1% for |ca_maxoperations==80|.

I can't imagine 80 being sufficient for more than
a year or two, given the other things you've
mentioned in this thread.

Have you considered adding a local NFS caching
server between your local Windows clients and
the network-distant NFS servers where the data
is stored?


> Question is... should the values for |ca_*| be tuneables, or should we
> just increase the limit to |80| ([1])?

A server tunable will never completely address
this issue, and everyone will ask what's the
right value for this tunable? Where's the
documentation? Why can't I have another tunable
just for my favorite issue? So for me, yet
another server tunable is off the table.

Jeff suggested a plan to remove the max-operations
limit, and rely on ca_maxrequestsize instead,
which is a more solid limit though it would allow
more operations per COMPOUND.
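
Very roughly, the idea would be something like this (a sketch, not
an actual patch):

---- snip ----
        /* sketch: stop clamping on op count and honor the
         * client's requested maxops; the negotiated
         * ca_maxrequestsize still bounds the COMPOUND, and an
         * oversized request gets NFS4ERR_REQ_TOO_BIG as before */
        ca->maxops = min_t(u32, ca->maxops, U16_MAX);
---- snip ----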

But it sounds like you'll hit that limit too
rather quickly until your client caches lookups
properly.

TL;DR: relying on the ability to resolve a full
pathname in a single NFSv4 COMPOUND is a mistaken
and limited design and is already biting you. You
should address this root cause instead of
plastering over the real problem.

Yes, COMPOUND was added to NFSv4 as a possible
way to manage network latency, but in hindsight
I think the NFS community now recognizes that
there are more effective strategies to deal with
network latency than creating more and more
complicated COMPOUND operations. Client-side
caching, for instance, is a much better choice.


--
Chuck Lever


2024-03-19 06:27:08

by Cedric Blancher

Subject: Re: |ca_maxoperations| - tuneable ? / was: Re: RFE: Linux nfsd's |ca_maxoperations| should be at *least* |64| ... / was: Re: kernel.org list issues... / was: Fwd: Turn NFSD_MAX_* into tuneables ? / was: Re: Increasing NFSD_MAX_OPS_PER_COMPOUND to 96

On Sat, 16 Mar 2024 at 17:35, Chuck Lever III <[email protected]> wrote:
>
>
>
> > On Mar 16, 2024, at 7:55 AM, Roland Mainz <[email protected]> wrote:
> >
> > On Thu, Jan 18, 2024 at 3:52 PM Chuck Lever III <[email protected]> wrote:
> >>> On Jan 18, 2024, at 4:44 AM, Martin Wege <[email protected]> wrote:
> >>> On Thu, Jan 18, 2024 at 2:57 AM Roland Mainz <[email protected]> wrote:
> >>>> On Sat, Jan 13, 2024 at 5:10 PM Chuck Lever III <[email protected]> wrote:
> >>>>>> On Jan 13, 2024, at 10:09 AM, Jeff Layton <[email protected]> wrote:
> >>>>>> On Sat, 2024-01-13 at 15:47 +0100, Roland Mainz wrote:
> >>>>>>> On Sat, Jan 13, 2024 at 1:19 AM Dan Shelton <[email protected]> wrote:
> > [snip]
> >>>> That assumes that no process does random access into deep subdirs. In
> >>>> that case the performance is absolutely terrible, unless you devote
> >>>> lots of memory to a giant cache (which is not feasible due to cache
> >>>> expiration limits, unless someone (please!) finally implements
> >>>> directory delegations).
> >>
> >> Do you mean not feasible for your client? Lookup caches
> >> have been part of operating systems for decades. Solaris,
> >> FreeBSD, and Linux all have one. Does the Windows kernel
> >> have one that mfs-nfs41-client can use?
> >
> > The ms-nfs41-client has its own cache.
> > Technically Windows has another, but that is in the kernel and
> > difficult to connect to the NFS client daemon without performance
> > issues.
> >
> > [snip]
> >> Sending a full path in a single COMPOUND is one way to
> >> handle path resolution, but it has so many limitations
> >> that it's really not the mechanism of choice.
>
> Yes, COMPOUND was added to NFSv4 as a possible
> way to manage network latency, but in hindsight
> I think the NFS community now recognizes that
> there are more effective strategies to deal with
> network latency than creating more and more
> complicated COMPOUND operations. Client-side
> caching, for instance, is a much better choice.

I have a severe hiccup now after reading THAT comment. Every
generation of IT engineers makes the same damn mistakes, and it takes
them ~10 years to realise their mistakes.

So here is the comment - before my first coffee - from someone with a
grey beard, who is old enough to have dealt with Minitel, the first
UNIX, and the first RFS, NFS, AFS, DFS:
Mistake 1: Caching will solve it all. DFS (the follow-up to AFS) tried
that to an absurd extent, and failed badly: too complex, too buggy, and
too CPU- and memory-intensive. Granted, the bugs were fixed over time,
but by then the reputation was ruined.
Mistake 2: Caching is always possible. Mounting /var/mail with
actimeo=0 is the classic example; HPC is another popular one.
Mistake 3: The cache memory is unlimited. We had that one with
Solaris's builtin name cache, and then ZFS. Memory is limited, and
just making the caches 2x, 8x, 32x bigger doesn't give you any
benefit, because of cache expiration/timeouts. Of course you can try to
keep the cache "hot", or try delegations, or move data ownership to
another server closer to the client. See DFS above. Did not work.
Google also "law of diminishing returns"
Mistake 4: The network has unlimited bandwidth, so we can keep the
local cache updated/hot, or abuse it otherwise. Unlike our dreams in
the 1990s that we would have 100Gb/s InfiniBand networks in our laptops
by 2020, the real-world laptop in 2024 maxes out at 1000BASE-T, and most
rural offices still have 100BASE-T.
Mistake 5: The main memory is unlimited. That ignores the fact that
SUN promised us that NFSv4 would not require more memory than NFSv3.
NFSv4 still has to serve the embedded/IoT use case, either for data,
or for diskless boot from NFS(v4). Those machines cannot waste 512MB
on your dream cache with their 8MB main memory, which is also not
going to work because of "Mistake 3". The law of diminishing returns
sends you your greetings.

So complex COMPOUND operations are not that bad, but they are also not
the perfect solution for everything. Likewise, giant client-side
caches are not the perfect solution for everything, nor are they
feasible in all scenarios. Oh, delete the "all" and replace it with
"most".

Ced
--
Cedric Blancher <[email protected]>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur