2015-08-31 16:19:45

by Andrew W Elble

[permalink] [raw]
Subject: [PATCH] nfsd: deal with DELEGRETURN racing with CB_RECALL

We have observed the server sending recalls for delegation stateids
that have already been successfully returned. Change
nfsd4_cb_recall_done() to return success if the client has returned
the delegation. While this does not completely eliminate the sending
of recalls for delegations that have already been returned, this
does prevent unnecessarily declaring the callback path to be down.

Reported-by: Eric Meddaugh <[email protected]>
Signed-off-by: Andrew Elble <[email protected]>
---
fs/nfsd/nfs4state.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index de45c2de1668..ac235a7470e3 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3509,6 +3509,9 @@ static int nfsd4_cb_recall_done(struct nfsd4_callback *cb,
{
struct nfs4_delegation *dp = cb_to_delegation(cb);

+ if (dp->dl_stid.sc_type == NFS4_CLOSED_DELEG_STID)
+ return 1;
+
switch (task->tk_status) {
case 0:
return 1;
--
2.4.6



2015-09-01 13:18:50

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] nfsd: deal with DELEGRETURN racing with CB_RECALL

On Mon, 31 Aug 2015 12:06:41 -0400
Andrew Elble <[email protected]> wrote:

> We have observed the server sending recalls for delegation stateids
> that have already been successfully returned. Change
> nfsd4_cb_recall_done() to return success if the client has returned
> the delegation. While this does not completely eliminate the sending
> of recalls for delegations that have already been returned, this
> does prevent unnecessarily declaring the callback path to be down.
>
> Reported-by: Eric Meddaugh <[email protected]>
> Signed-off-by: Andrew Elble <[email protected]>
> ---
> fs/nfsd/nfs4state.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index de45c2de1668..ac235a7470e3 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -3509,6 +3509,9 @@ static int nfsd4_cb_recall_done(struct nfsd4_callback *cb,
> {
> struct nfs4_delegation *dp = cb_to_delegation(cb);
>
> + if (dp->dl_stid.sc_type == NFS4_CLOSED_DELEG_STID)
> + return 1;
> +
> switch (task->tk_status) {
> case 0:
> return 1;

Yeah, there will always be some potential for a CB_RECALL/DELEGRETURN
race since the former is driven by the server and the latter by the
client.

What error are you usually getting back from the client when you see
this happen? NFS4ERR_BAD_STATEID? If so, then maybe we should confine
this check to that error case?

OTOH, maybe there's no harm in just ignoring the error if the
delegation is already returned...

Acked-by: Jeff Layton <[email protected]>

2015-09-01 13:55:29

by Andrew W Elble

[permalink] [raw]
Subject: Re: [PATCH] nfsd: deal with DELEGRETURN racing with CB_RECALL


> Yeah, there will always be some potential for a CB_RECALL/DELEGRETURN
> race since the former is driven by the server and the latter by the
> client.

I believe there are some client differences to account for as well. i.e. when
does the client "unhash" the delegation? Before the DELEGRETURN is sent,
or after the response comes back?

> What error are you usually getting back from the client when you see
> this happen? NFS4ERR_BAD_STATEID? If so, then maybe we should confine
> this check to that error case?

Usually EBADHANDLE. I actually had a version of this patch where it was
confined to those cases. My rationale in changing it was that continuing
to solicit the return of the delegation when we have it is pointless.

i.e. RFC5661 20.2.4:

"...The recall is not complete until the delegation is returned using a
DELEGRETURN operation."

Thanks,

Andy

--
Andrew W. Elble
[email protected]
Infrastructure Engineer, Communications Technical Lead
Rochester Institute of Technology
PGP: BFAD 8461 4CCF DC95 DA2C B0EB 965B 082E 863E C912

2015-09-01 14:48:29

by Andrew W Elble

[permalink] [raw]
Subject: Re: [PATCH] nfsd: deal with DELEGRETURN racing with CB_RECALL


I should probably state what I'm currently chasing:

1.) Somehow, delegations wind up on the cl_revoked list at the server.
2.) The server continually asserts SEQ4_STATUS_RECALLABLE_STATE_REVOKED.
3.) The client for some reason doesn't have anything to give the server
to satisfy it.

I speculate that this patch will assist in making this go away - I'm
just not 100% sure of all the conditions that result in making #1 possible.
(Enough recalls backed up on the callback slot on the same filehandle
stacking timeouts/retries and resultant path down errors to exceed lease
time?). Unfortunately this problem seems to always occur at ~1AM - and
obtaining a network capture of the problem in action has been elusive.

Thanks,

Andy

--
Andrew W. Elble
[email protected]
Infrastructure Engineer, Communications Technical Lead
Rochester Institute of Technology
PGP: BFAD 8461 4CCF DC95 DA2C B0EB 965B 082E 863E C912

2015-09-01 18:04:15

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] nfsd: deal with DELEGRETURN racing with CB_RECALL

On Tue, Sep 01, 2015 at 09:18:42AM -0400, Jeff Layton wrote:
> On Mon, 31 Aug 2015 12:06:41 -0400
> Andrew Elble <[email protected]> wrote:
>
> > We have observed the server sending recalls for delegation stateids
> > that have already been successfully returned. Change
> > nfsd4_cb_recall_done() to return success if the client has returned
> > the delegation. While this does not completely eliminate the sending
> > of recalls for delegations that have already been returned, this
> > does prevent unnecessarily declaring the callback path to be down.
> >
> > Reported-by: Eric Meddaugh <[email protected]>
> > Signed-off-by: Andrew Elble <[email protected]>
> > ---
> > fs/nfsd/nfs4state.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index de45c2de1668..ac235a7470e3 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> > @@ -3509,6 +3509,9 @@ static int nfsd4_cb_recall_done(struct nfsd4_callback *cb,
> > {
> > struct nfs4_delegation *dp = cb_to_delegation(cb);
> >
> > + if (dp->dl_stid.sc_type == NFS4_CLOSED_DELEG_STID)
> > + return 1;
> > +
> > switch (task->tk_status) {
> > case 0:
> > return 1;
>
> Yeah, there will always be some potential for a CB_RECALL/DELEGRETURN
> race since the former is driven by the server and the latter by the
> client.
>
> What error are you usually getting back from the client when you see
> this happen? NFS4ERR_BAD_STATEID? If so, then maybe we should confine
> this check to that error case?
>
> OTOH, maybe there's no harm in just ignoring the error if the
> delegation is already returned...
>
> Acked-by: Jeff Layton <[email protected]>

Agreed, applying--thanks.--b.

2015-09-01 18:08:49

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] nfsd: deal with DELEGRETURN racing with CB_RECALL

On Tue, Sep 01, 2015 at 10:48:27AM -0400, Andrew W Elble wrote:
>
> I should probably state what I'm currently chasing:
>
> 1.) Somehow, delegations wind up on the cl_revoked list at the server.
> 2.) The server continually asserts SEQ4_STATUS_RECALLABLE_STATE_REVOKED.
> 3.) The client for some reason doesn't have anything to give the server
> to satisfy it.
>
> I speculate that this patch will assist in making this go away - I'm
> just not 100% sure of all the conditions that result in making #1 possible.
> (Enough recalls backed up on the callback slot on the same filehandle
> stacking timeouts/retries and resultant path down errors to exceed lease
> time?).

Even in the worst case where the client never gets the recall, in theory
it should take the RECALLABLE_STATE_REVOKED as a sign that it's missed
one and return its delegations. Is the client attempting any such
recovery when it sees that flag?

> Unfortunately this problem seems to always occur at ~1AM - and
> obtaining a network capture of the problem in action has been elusive.

Anyway, sounds interesting, let us know what you find....

--b.