hi,
my name is guy keren, and the company i work at is looking at
implementing an NFS 4.1 server for our existing storage product.
during the design, we encountered some issues with high-availability and
persistent sessions handling by the linux NFS client, and i would like
to understand a few things about the linux NFS client - i read all
relevant material on http://www.linux-nfs.org, and spent a while reading the
relevant recovery code in the nfs4.1 client kernel sources, but i am
missing some things (a pointer to the relevant part in the recovery code
will be appreciated as well):
1. suppose there is a persistent session that got disconnected (because
of a server restart, for example). i see that the client is re-sending
all the in-flight commands as part of
the recovery. however, suppose that one of the commands was a
compound command containing 2 requests, and the reply to the first of
them was NFS4_OK, and to the 2nd it was NFS4ERR_DELAY - will the
client's code know that after it finishes recovery of the session - then
when it creates a new session, it needs to re-send the 2nd request in
this compound command? the broader question is about a compound with N
commands, where the first X have an NFS4_OK reply and the last N-X have
NFS4ERR_DELAY - will the client re-send a new compound with the last N-X
commands after establishing a new session?
2. if there is a non-persistent session, on which the client sent a
non-idempotent request (e.g. rename of a file into a different
directory), and the server restarted before the client received the
response - will the client just blindly re-send the same request again
after establishing a new session, or will it take some measures to
attempt to understand whether the command was already executed? i.e. if
the server already executed the rename, then re-sending it will return a
failure to locate the source file handle (because it moved to a new
directory). does the linux NFS client attempt to recover from this, or
will it simply return an error to the application layer?
3. what NFS server with persistent sessions is used (or was used) when
testing the persistent sessions support in the linux NFS client? the
linux NFS server, as far as i understood, cannot support persistent
sessions (due to lack of assured persistent memory).
thanks in advance,
--guy keren
Vast Data
On Sat, Oct 10, 2020 at 11:39:30PM +0300, guy keren wrote:
> during the design, we encountered some issues with high-availability
> and persistent sessions handling by the linux NFS client, and i
> would like to understand a few things about the linux NFS client - i
> read all relevant material on http://www.linux-nfs.org, and spent a while
> reading the relevant recovery code in the nfs4.1 client kernel
> sources, but i am missing some things (a pointer to the relevant
> part in the recovery code will be appreciated as well):
>
>
> 1. suppose there is a persistent session that got disconnected
> (because of a server restart, for example). i see that the client is
> re-sending all the in-flight commands as part of the recovery.
> however, suppose that one of the commands was a
> compound command containing 2 requests, and the reply to the first
> of them was NFS4_OK, and to the 2nd it was NFS4ERR_DELAY - will the
> client's code know that after it finishes recovery of the session -
> then when it creates a new session, it needs to re-send the 2nd
> request in this compound command?
If the client received the reply, it shouldn't have to resend the
compound at all.
If the client didn't see the reply, it will resend the whole compound.
Its behavior won't be affected by how the compound failed, since it
can't know that.
> the broader question is about a
> compound with N commands, where the first X have an NFS4_OK reply
> and the last N-X have NFS4ERR_DELAY
The server always stops processing a compound at the first failure, so
N-X is always <=1.
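
To make the stop-at-first-failure rule concrete, here is a minimal sketch in Python. It is illustrative only and is not how any real server is written; `process_compound` and the lambda handlers are hypothetical names. The status values are the ones RFC 5661 assigns to NFS4_OK and NFS4ERR_DELAY.

```python
# Toy model of COMPOUND evaluation: ops run in order and the server
# stops at the first op whose status is not NFS4_OK.
NFS4_OK = 0
NFS4ERR_DELAY = 10008


def process_compound(ops):
    """Run (name, handler) pairs in order; stop after the first failure.

    Returns the per-op results that would appear in the reply.
    """
    results = []
    for name, handler in ops:
        status = handler()
        results.append((name, status))
        if status != NFS4_OK:
            break  # later ops are never evaluated
    return results


# Hypothetical compound: the OPEN hits a delegation conflict.
ops = [
    ("PUTFH", lambda: NFS4_OK),
    ("OPEN", lambda: NFS4ERR_DELAY),
    ("GETATTR", lambda: NFS4_OK),  # never reached
]
results = process_compound(ops)
print(results)
```

The reply carries results only for the ops up to and including the failing one, so at most one op in a reply has a non-OK status.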
> - will the client re-send a new
> compound with the last N-X commands after establishing a new
> session?
A resend by definition is a resend of exactly the same compound. The
client won't break it into pieces in that way.
(And typical compounds can't be broken up that way anyway--often earlier
ops in the compound are things like PUTFH's that supply required
information to later ops.)
> 2. if there is a non-persistent session, on which the client sent a
> non-idempotent request (e.g. rename of a file into a different
> directory), and the server restarted before the client received the
> response - will the client just blindly re-send the same request
> again after establishing a new session, or will it take some
> measures to attempt to understand whether the command was already
> executed? i.e. if the server already executed the rename, then
> re-sending it will return a failure to locate the source file handle
> (because it moved to a new directory).
In a rename of A/X to B/Y, the source filehandle refers to the directory
"A", so that filehandle will still work. You might get a NFS4ERR_NOENT
if there's nothing at A/X any more, and you could guess that meant the
rename succeeded. But it could equally well be that your rename was
never executed, and it's somebody else's rename or unlink that caused
A/X to no longer exist. Similarly, the rename of A/X might have executed but
another operation might have immediately created something else at A/X.
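
The ambiguity can be shown with a toy namespace. This is a sketch under simplifying assumptions: the server's namespace is a plain dict keyed by (directory, name), and `rename` and `unlink` are hypothetical helpers, not real client or server functions.

```python
# Three different histories, all followed by the same blind retry of
# RENAME A/X -> B/Y. The retry's status alone cannot tell them apart.
NFS4_OK = 0
NFS4ERR_NOENT = 2


def rename(ns, src, dst):
    if src not in ns:
        return NFS4ERR_NOENT
    ns[dst] = ns.pop(src)
    return NFS4_OK


def unlink(ns, path):
    if path not in ns:
        return NFS4ERR_NOENT
    del ns[path]
    return NFS4_OK


SRC, DST = ("A", "X"), ("B", "Y")

# History 1: our rename executed, the reply was lost, we retry.
ns1 = {SRC: "file1"}
rename(ns1, SRC, DST)
retry1 = rename(ns1, SRC, DST)

# History 2: our rename never executed; another client unlinked A/X.
ns2 = {SRC: "file1"}
unlink(ns2, SRC)
retry2 = rename(ns2, SRC, DST)

# History 3: our rename executed, then another client created a new A/X.
ns3 = {SRC: "file1"}
rename(ns3, SRC, DST)
ns3[SRC] = "file2"
retry3 = rename(ns3, SRC, DST)

print(retry1, retry2, retry3)
```

Histories 1 and 2 both return NFS4ERR_NOENT to the retry, while history 3 returns NFS4_OK but moves the wrong file over B/Y.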
> does the linux NFS client
> attempt to recover from this, or will it simply return an error to
> the application layer?
I suspect that's all any client does. You can imagine all sorts of
complicated heuristics, but none of them will be 100% right. Persistent
sessions is what you really need to fix this kind of bug.
> 3. what NFS server with persistent sessions is used (or was used)
> when testing the persistent sessions support in the linux NFS
> client? the linux NFS server, as far as i understood, cannot support
> persistent sessions (due to lack of assured persistent memory).
I don't think any special hardware is necessary. Or if it is, we could
just disable the feature in the absence of that hardware. Mainly what
we need is some cooperation from the filesystem--some way to ID
particular operations so the server can ask the filesystem if a
particular operation was committed to disk. I talked to the XFS
developers about it informally and they seemed open to the idea, but
they need some sort of explanation of the requirements and I haven't
gotten around to it....
--b.
hi Bruce,
thanks for the response. this opens up a few questions about things i
thought i understood initially, so i did a re-read of parts of the NFS
4.1 RFC (RFC 5661), and i would like to clarify some things further.
see answers below:
On Wed, Oct 14, 2020 at 10:27 PM J. Bruce Fields wrote:
>
> On Sat, Oct 10, 2020 at 11:39:30PM +0300, guy keren wrote:
> > during the design, we encountered some issues with high-availability
> > and persistent sessions handling by the linux NFS client, and i
> > would like to understand a few things about the linux NFS client - i
> > read all relevant material on http://www.linux-nfs.org, and spent a while
> > reading the relevant recovery code in the nfs4.1 client kernel
> > sources, but i am missing some things (a pointer to the relevant
> > part in the recovery code will be appreciated as well):
> >
> >
> > 1. suppose there is a persistent session that got disconnected
> > (because of a server restart, for example). i see that the client is
> > re-sending all the in-flight commands as part of the recovery.
> > however, suppose that one of the commands was a
> > compound command containing 2 requests, and the reply to the first
> > of them was NFS4_OK, and to the 2nd it was NFS4ERR_DELAY - will the
> > client's code know that after it finishes recovery of the session -
> > then when it creates a new session, it needs to re-send the 2nd
> > request in this compound command?
>
> If the client received the reply, it shouldn't have to resend the
> compound at all.
>
> If the client didn't see the reply, it will resend the whole compound.
> Its behavior won't be affected by how the compound failed, since it
> can't know that.
according to what you wrote here, an NFS4ERR_DELAY response is
something that needs to be sent at the level of the entire compound
request - i.e. the server is not allowed to send a compound response
where the first few requests have a status of NFS4_OK, while the last
have a status of NFS4ERR_DELAY. i tried looking exactly where the spec
specifies the possibility of the server sending an NFS4ERR_DELAY, and
one example is on delegation recall. i am quoting from a paragraph
from section 10.2 of the spec:
===================
On recall, the client holding the delegation needs to flush modified
state (such as modified data) to the server and return the
delegation. The conflicting request will not be acted on until the
recall is complete. The recall is considered complete when the
client returns the delegation or the server times its wait for the
delegation to be returned and revokes the delegation as a result of
the timeout. In the interim, the server will either delay responding
to conflicting requests or respond to them with NFS4ERR_DELAY.
Following the resolution of the recall, the server has the
information necessary to grant or deny the second client's request.
===========================
according to what you say, if the OPEN request is in the middle of the
compound request, and is preceded by state-modifying requests (e.g.
creation of other files, writes into other open handles, renames,
etc.), then the server must avoid processing them until it recalled
the delegation to the file (i.e. it must process the entire command to
make sure it doesn't need to send an NFS4ERR_DELAY response due to any
of the requests inside it, before it starts processing, and it must
also lock the state of all files involved in the request, to avoid
another client acquiring a delegation on any of the files in the
request that have an OPEN request in the same compound). alternatively,
it must not send an NFS4ERR_DELAY response, and instead just keep the
request pending until the delegation recall was completed.
do i understand you correctly here?
>
> > the broader question is about a
> > compound with N commands, where the first X have an NFS4_OK reply
> > and the last N-X have NFS4ERR_DELAY
>
> The server always stops processing a compound at the first failure, so
> N-X is always <=1.
granted.
>
> > - will the client re-send a new
> > compound with the last N-X commands after establishing a new
> > session?
>
> A resend by definition is a resend of exactly the same compound. The
> client won't break it into pieces in that way.
>
> (And typical compounds can't be broken up that way anyway--often earlier
> ops in the compound are things like PUTFH's that supply required
> information to later ops.)
i would assume that the same mechanism used to create the compound
request in the first place (adding the PUTFH in front, etc.) could be
used during a re-building of a smaller compound request - provided
that the client knows which requests from the compound were already
completed - and which were not.
but i understand that there's no such mechanism today on the linux NFS
client kernel - which is what i initially asked - so that clarifies
things.
>
> > 2. if there is a non-persistent session, on which the client sent a
> > non-idempotent request (e.g. rename of a file into a different
> > directory), and the server restarted before the client received the
> > response - will the client just blindly re-send the same request
> > again after establishing a new session, or will it take some
> > measures to attempt to understand whether the command was already
> > executed? i.e. if the server already executed the rename, then
> > re-sending it will return a failure to locate the source file handle
> > (because it moved to a new directory).
>
> In a rename of A/X to B/Y, the source filehandle refers to the directory
> "A", so that filehandle will still work. You might get a NFS4ERR_NOENT
> if there's nothing at A/X any more, and you could guess that meant the
> rename succeeded. But it could equally well be that your rename was
> never executed, and it's somebody else's rename or unlink that caused
> A/X to no longer exist. Similarly, the rename of A/X might have executed but
> another operation might have immediately created something else at A/X.
i see. understood.
>
> > does the linux NFS client
> > attempt to recover from this, or will it simply return an error to
> > the application layer?
>
> I suspect that's all any client does. You can imagine all sorts of
> complicated heuristics, but none of them will be 100% right. Persistent
> sessions is what you really need to fix this kind of bug.
what about a situation in which instead of a server restart event, the
client just disconnected before receiving a rename response, and
re-connected with the same session to the same server? in that case,
i presume that the Linux NFS client will re-send the compound request,
and get the results from the server's Duplicate-Request cache, without
returning errors to the application. correct?
>
> > 3. what NFS server with persistent sessions is used (or was used)
> > when testing the persistent sessions support in the linux NFS
> > client? the linux NFS server, as far as i understood, cannot support
> > persistent sessions (due to lack of assured persistent memory).
>
> I don't think any special hardware is necessary. Or if it is, we could
> just disable the feature in the absence of that hardware. Mainly what
> we need is some cooperation from the filesystem--some way to ID
> particular operations so the server can ask the filesystem if a
> particular operation was committed to disk. I talked to the XFS
> developers about it informally and they seemed open to the idea, but
> they need some sort of explanation of the requirements and I haven't
> gotten around to it....
you might also need the file system to be aware of delegations at some
level, in order to break delegations held by NFS4 clients, when a
local application attempts to open a file in a conflicting manner.
and this doesn't answer the original question: how was the "persistent
sessions" support in the linux NFS 4.1 client tested?
when i tried to find an NFS 4.1 server that supports "persistent
sessions" i first went to NetApp - and doing a "node takeover"
operation on it revealed that the session is unknown on the 2nd node -
making it practically irrelevant for such scenarios (unless there is
some way to change the behaviour of this feature to behave more like
SMB3 CA volumes).
>
> --b.
on an aside - i see that you are also the maintainer of the pynfs test
suite. would you be interested in patches fixing its install
operation, and if yes - should we send them to this mailing list, or
directly to you? i failed to find a mailing list dedicated to pynfs
development.
thanks,
--guy keren
Vast Data
On Sat, Oct 17, 2020 at 11:40:09PM +0300, Guy Keren wrote:
> according to what you wrote here, an NFS4ERR_DELAY response is
> something that needs to be sent at the level of the entire compound
> request - i.e. the server is not allowed to send a compound response
> where the first few requests have a status of NFS4_OK, while the last
> have a status of NFS4ERR_DELAY.
Oh, no, it's absolutely fine for a server to do that.
Sorry, you mentioned persistent sessions, so I assumed somehow this was
about retries after crashes or reboots, where the client may not have
received the reply and doesn't know whether it executed.
> according to what you say, if the OPEN request is in the middle of the
> compound request, and is preceded by state-modifying requests (e.g.
> creation of other files, writes into other open handles, renames,
> etc.), then the server must avoid processing them until it recalled
> the delegation to the file (i.e. it must process the entire command to
> make sure it doesn't need to send an NFS4ERR_DELAY response due to any
> of the requests inside it, before it starts processing, and it must
> also lock the state of all files involved in the request, to avoid
> another client acquiring a delegation on any of the files in the
> request that have an OPEN request in the same compound). alternatively,
> it must not send an NFS4ERR_DELAY response, and instead just keep the
> request pending until the delegation recall was completed.
No, sorry for the confusion, you're correct, if the client had a bunch
of non-idempotent ops all in one compound, and got a DELAY partway
through, then, yes, it would have to deal with retrying only the part
that didn't execute.
I don't know of any client that actually does that, for what it's worth.
The Linux client, for example, doesn't send any compounds that I can
think of that have more than one nonidempotent op.
> i would assume that the same mechanism used to create the compound
> request in the first place (adding the PUTFH in front, etc.) could be
> used during a re-building of a smaller compound request - provided
> that the client knows which requests from the compound were already
> completed - and which were not.
>
> but i understand that there's no such mechanism today on the linux NFS
> client kernel - which is what i initially asked - so that clarifies
> things.
Right, in theory you could imagine clients doing very general things
with compounds. In practice I don't know of any that do.
(Not that that allows a spec-compliant server to assume they won't.)
> what about a situation in which instead of a server restart event, the
> client just disconnected before receiving a rename response, and
> re-connected with the same session to the same server? in that case,
> i presume that the Linux NFS client will re-send the compound request,
> and get the results from the server's Duplicate-Request cache, without
> returning errors to the application. correct?
Right, assuming the client managed to hang on to its lease.
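
For readers following along, the replay behaviour can be sketched as a per-slot reply cache. This is a simplified model, not the Linux server's implementation: `Session`, `Slot` and `sequence` are hypothetical names, statuses are plain strings, and lease handling is left out entirely.

```python
# Toy session reply cache: a retransmission with the same slot and
# sequence ID gets the cached reply and the op is not executed again.
class Slot:
    def __init__(self):
        self.seqid = 0            # last sequence ID accepted on this slot
        self.cached_reply = None  # reply stored for that sequence ID


class Session:
    def __init__(self, nslots):
        self.slots = [Slot() for _ in range(nslots)]

    def sequence(self, slot_id, seqid, execute):
        slot = self.slots[slot_id]
        if seqid == slot.seqid and slot.cached_reply is not None:
            return slot.cached_reply  # retransmission: replay, do not re-run
        if seqid != slot.seqid + 1:
            return "NFS4ERR_SEQ_MISORDERED"
        reply = execute()             # new request: run it exactly once
        slot.seqid = seqid
        slot.cached_reply = reply
        return reply


calls = []


def do_rename():
    calls.append("RENAME")
    return "NFS4_OK"


session = Session(nslots=1)
first = session.sequence(0, 1, do_rename)  # reply lost on the wire
retry = session.sequence(0, 1, do_rename)  # same slot and seqid after reconnect
print(first, retry, calls)
```

The rename runs once and the retry sees the original result, which is why the application gets no error in the reconnect case.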
> and this doesn't answer the original question: how was the "persistent
> sessions" support in the linux NFS 4.1 client tested?
I don't know, sorry.
> on an aside - i see that you are also the maintainer of the pynfs test
> suite. would you be interested in patches fixing its install
> operation, and if yes - should we send them to this mailing list, or
> directly to you? i failed to find a mailing list dedicated to pynfs
> development.
Just send them to me, cc'd to this list. Thanks!
--b.
On Sun, Oct 18, 2020 at 12:14 AM J. Bruce Fields wrote:
>
> On Sat, Oct 17, 2020 at 11:40:09PM +0300, Guy Keren wrote:
> > according to what you wrote here, an NFS4ERR_DELAY response is
> > something that needs to be sent at the level of the entire compound
> > request - i.e. the server is not allowed to send a compound response
> > where the first few requests have a status of NFS4_OK, while the last
> > have a status of NFS4ERR_DELAY.
>
> Oh, no, it's absolutely fine for a server to do that.
>
> Sorry, you mentioned persistent sessions, so I assumed somehow this was
> about retries after crashes or reboots, where the client may not have
> received the reply and doesn't know whether it executed.
>
> > according to what you say, if the OPEN request is in the middle of the
> > compound request, and is preceded by state-modifying requests (e.g.
> > creation of other files, writes into other open handles, renames,
> > etc.), then the server must avoid processing them until it recalled
> > the delegation to the file (i.e. it must process the entire command to
> > make sure it doesn't need to send an NFS4ERR_DELAY response due to any
> > of the requests inside it, before it starts processing, and it must
> > also lock the state of all files involved in the request, to avoid
> > another client acquiring a delegation on any of the files in the
> > request that have an OPEN request in the same compound). alternatively,
> > it must not send an NFS4ERR_DELAY response, and instead just keep the
> > request pending until the delegation recall was completed.
>
> No, sorry for the confusion, you're correct, if the client had a bunch
> of non-idempotent ops all in one compound, and got a DELAY partway
> through, then, yes, it would have to deal with retrying only the part
> that didn't execute.
actually, it is my understanding that, with persistent sessions, the
client has no way to distinguish between a temporary network
connection loss, and a server restart, if the server stores the client
state (client_id and all stateids) in persistent store.
so suppose that the client sent two 'Open' requests in one compound.
the server finished processing the first, but then had a delegation on
the 2nd one, so it is supposed to return an NFS4_OK to the first Open
and an NFS4ERR_DELAY for the 2nd open (and this is also the compound
response that the server will store in its Duplicate Request Cache).
if the server had a temporary network disconnection, or had a server
restart, then when the client re-connects and re-sends this compound
request, it receives the response from the server's Duplicate Request
Cache (with OK for the first open and DELAY for the 2nd). then, i
presume that the client needs to accept that the first Open already
succeeded, and when creating a new session, re-send only the 2nd Open
request. does this make sense?
>
> I don't know of any client that actually does that, for what it's worth.
> The Linux client, for example, doesn't send any compounds that I can
> think of that have more than one nonidempotent op.
does it mean that the linux NFS 4.1 client will also never send two
Write requests in the same compound? and never send an Open request
which might create a file, with a Write request in the same compound?
because, although these are not non-idempotent requests, it could be
that one of them was executed while the next one was not (at least
according to the spec, the server might return NFS4ERR_DELAY for all
of the NFS4.1 Request types)?
>
> > i would assume that the same mechanism used to create the compound
> > request in the first place (adding the PUTFH in front, etc.) could be
> > used during a re-building of a smaller compound request - provided
> > that the client knows which requests from the compound were already
> > completed - and which were not.
> >
> > but i understand that there's no such mechanism today on the linux NFS
> > client kernel - which is what i initially asked - so that clarifies
> > things.
>
> Right, in theory you could imagine clients doing very general things
> with compounds. In practice I don't know of any that do.
>
> (Not that that allows a spec-compliant server to assume they won't.)
>
> > what about a situation in which instead of a server restart event, the
> > client just disconnected before receiving a rename response, and
> > re-connected with the same session to the same server? in that case,
> > i presume that the Linux NFS client will re-send the compound request,
> > and get the results from the server's Duplicate-Request cache, without
> > returning errors to the application. correct?
>
> Right, assuming the client managed to hang on to its lease.
right. which will be the case if the server doesn't revoke state
immediately upon lease expiration, and no other client performed
conflicting requests.
>
> > and this doesn't answer the original question: how was the "persistent
> > sessions" support in the linux NFS 4.1 client tested?
>
> I don't know, sorry.
ok, thanks.
>
> > on an aside - i see that you are also the maintainer of the pynfs test
> > suite. would you be interested in patches fixing its install
> > operation, and if yes - should we send them to this mailing list, or
> > directly to you? i failed to find a mailing list dedicated to pynfs
> > development.
>
> Just send them to me, cc'd to this list. Thanks!
ok. we'll clean-up what we have and send it within a few days. thanks.
>
> --b.
--guy keren
Vast Data
On Sun, Oct 18, 2020 at 02:18:55PM +0300, Guy Keren wrote:
> so suppose that the client sent two 'Open' requests in one compound.
> the server finished processing the first, but then had a delegation on
> the 2nd one, so it is supposed to return an NFS4_OK to the first Open
> and an NFS4ERR_DELAY for the 2nd open (and this is also the compound
> response that the server will store in its Duplicate Request Cache).
> if the server had a temporary network disconnection, or had a server
> restart, then when the client re-connects and re-sends this compound
> request, it receives the response from the server's Duplicate Request
> Cache (with OK for the first open and DELAY for the 2nd). then, i
> presume that the client needs to accept that the first Open already
> succeeded, and when creating a new session, re-send only the 2nd Open
> request. does this make sense?
Sounds right.
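
A rough sketch of what such a client-side step could look like, in Python. No existing client is known to do this, so everything here is hypothetical: the helper names, the (filehandle, op, name) request tuples, and the simplification that the reply carries one status per logical request rather than one per op.

```python
# Hypothetical client logic: given a cached reply with a partial DELAY,
# keep the requests that already succeeded and rebuild a smaller
# compound from the rest, re-adding the PUTFH each one needs.
NFS4_OK = 0
NFS4ERR_DELAY = 10008


def remaining_requests(requests, statuses):
    """Drop the leading requests whose status was NFS4_OK."""
    done = 0
    for status in statuses:
        if status != NFS4_OK:
            break
        done += 1
    return requests[done:]


def build_compound(requests):
    """Expand each (dir_fh, op, name) request into PUTFH + op."""
    ops = []
    for dir_fh, op, name in requests:
        ops.append(("PUTFH", dir_fh))
        ops.append((op, name))
    return ops


requests = [("fh_dir", "OPEN", "a.txt"), ("fh_dir", "OPEN", "b.txt")]
statuses = [NFS4_OK, NFS4ERR_DELAY]  # reply replayed from the server's cache

retry_compound = build_compound(remaining_requests(requests, statuses))
print(retry_compound)
```

Only the second OPEN is sent again, as a new request rather than a retransmission, so the first OPEN is not repeated.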
> > I don't know of any client that actually does that, for what it's worth.
> > The Linux client, for example, doesn't send any compounds that I can
> > think of that have more than one nonidempotent op.
>
> does it mean that the linux NFS 4.1 client will also never send two
> Write requests in the same compound? and never send an Open request
> which might create a file, with a Write request in the same compound?
"Will never" might be a little strong--maybe there'll be a reason to do
it some day. A server should be prepared to handle it. But the client
doesn't currently do either of those things.
--b.