2015-10-20 14:22:20

by Andrew W Elble

[permalink] [raw]
Subject: [PATCH RFC v2] nfsd: don't revoke delegations that a client has stated it doesn't have

Assuming a client has lost a delegation: If the server goes to recall
the delegation, an attempt is made to recall it twice separated by
a delay of 2 seconds. Both times, the client will state that it
doesn't have the delegation via -EBADHANDLE or -NFS4ERR_BAD_STATEID.

1.) Any race between a delegation grant and a recall has been
presumably avoided by the delay and second attempt.
2.) The client doesn't have the delegation.
3.) The backchannel is responsive.

After these two attempts fail, the laundromat will eventually revoke
them and add these delegations to cl_revoked. This results in another
attempt to get the client to return the delegation via
TEST/FREE STATEID. This will also fail with no means
of resolution, and will cause the server and client to loop
indefinitely, as the client has nothing to give the server to satisfy it.

The changes here are to establish a safe way to recover by:

If the client has responded with -EBADHANDLE or -NFS4ERR_BAD_STATEID:
1.) Not failing the backchannel after two attempts at a recall.
2.) At the time revocation would normally occur: destroying the
delegation on the server side.

Signed-off-by: Andrew Elble <[email protected]>
---
fs/nfsd/nfs4state.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index da21df673ed9..340ff365df4d 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3578,7 +3578,7 @@ static int nfsd4_cb_recall_done(struct nfsd4_callback *cb,
rpc_delay(task, 2 * HZ);
return 0;
}
- /*FALLTHRU*/
+ return 1;
default:
return -1;
}
@@ -4451,7 +4451,17 @@ nfs4_laundromat(struct nfsd_net *nn)
dp = list_first_entry(&reaplist, struct nfs4_delegation,
dl_recall_lru);
list_del_init(&dp->dl_recall_lru);
- revoke_delegation(dp);
+ if ((dp->dl_recall.cb_status != -EBADHANDLE) &&
+ (dp->dl_recall.cb_status != -NFS4ERR_BAD_STATEID)) {
+ revoke_delegation(dp);
+ } else {
+ dprintk("nfsd: client: %.*s is losing delegations",
+ (int)dp->dl_recall.cb_clp->cl_name.len,
+ dp->dl_recall.cb_clp->cl_name.data);
+ put_clnt_odstate(dp->dl_clnt_odstate);
+ nfs4_put_deleg_lease(dp->dl_stid.sc_file);
+ nfs4_put_stid(&dp->dl_stid);
+ }
}

spin_lock(&nn->client_lock);
--
2.4.6



2015-10-20 17:29:50

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH RFC v2] nfsd: don't revoke delegations that a client has stated it doesn't have

On Tue, Oct 20, 2015 at 10:21:51AM -0400, Andrew Elble wrote:
> Assuming a client has lost a delegation:

Are clients really allowed to just lose a delegation? (Have you seen
such a case, other than the duplicate-delegation case which you already
fixed?)

> If the server goes to recall
> the delegation, an attempt is made to recall it twice separated by
> a delay of 2 seconds. Both times, the client will state that it
> doesn't have the delegation via -EBADHANDLE or -NFS4ERR_BAD_STATEID.
>
> 1.) Any race between a delegation grant and a recall has been
> presumably avoided by the delay and second attempt.

If something happened to the forechannel connection, then I believe it
could take longer than 2 seconds to time out and recover.

So I'm inclined to instead fix any bugs that result in servers and
client disagreeing about whether there's a delegation.

Another thing we could do here is finally implement the server-side
support for referring triples (I think the client's done?):

http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues#Referring_triples
https://tools.ietf.org/html/rfc5661#section-2.10.6.3

That would eliminate the need for the recall retries.

Though that would still leave open the question of how to handle those
errors on a recall. We still not be able to conclude that it's safe for
the server to destroy the delegation.

--b.


> 2.) The client doesn't have the delegation.
> 3.) The backchannel is responsive.
>
> After these two attempts fail, the laundromat will eventually revoke
> them and add these delegations to cl_revoked. This results in another
> attempt to get the client to return the delegation via
> TEST/FREE STATEID. This will also fail with no means
> of resolution, and will cause the server and client to loop
> indefinitely, as the client has nothing to give the server to satisfy it.
>
> The changes here are to establish a safe way to recover by:
>
> If the client has responded with -EBADHANDLE or -NFS4ERR_BAD_STATEID:
> 1.) Not failing the backchannel after two attempts at a recall.
> 2.) At the time revocation would normally occur: destroying the
> delegation on the server side.
>
> Signed-off-by: Andrew Elble <[email protected]>
> ---
> fs/nfsd/nfs4state.c | 14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index da21df673ed9..340ff365df4d 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -3578,7 +3578,7 @@ static int nfsd4_cb_recall_done(struct nfsd4_callback *cb,
> rpc_delay(task, 2 * HZ);
> return 0;
> }
> - /*FALLTHRU*/
> + return 1;
> default:
> return -1;
> }
> @@ -4451,7 +4451,17 @@ nfs4_laundromat(struct nfsd_net *nn)
> dp = list_first_entry(&reaplist, struct nfs4_delegation,
> dl_recall_lru);
> list_del_init(&dp->dl_recall_lru);
> - revoke_delegation(dp);
> + if ((dp->dl_recall.cb_status != -EBADHANDLE) &&
> + (dp->dl_recall.cb_status != -NFS4ERR_BAD_STATEID)) {
> + revoke_delegation(dp);
> + } else {
> + dprintk("nfsd: client: %.*s is losing delegations",
> + (int)dp->dl_recall.cb_clp->cl_name.len,
> + dp->dl_recall.cb_clp->cl_name.data);
> + put_clnt_odstate(dp->dl_clnt_odstate);
> + nfs4_put_deleg_lease(dp->dl_stid.sc_file);
> + nfs4_put_stid(&dp->dl_stid);
> + }
> }
>
> spin_lock(&nn->client_lock);
> --
> 2.4.6
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2015-10-20 18:34:17

by Andrew W Elble

[permalink] [raw]
Subject: Re: [PATCH RFC v2] nfsd: don't revoke delegations that a client has stated it doesn't have


> Are clients really allowed to just lose a delegation? (Have you seen
> such a case, other than the duplicate-delegation case which you already
> fixed?)

In short, yes, we're still seeing it. We have also been seeing increasing
stability from the work that has been done (which also increases the
time between replication).

The reason for v2 was I got coverage on the destruction path in
testing and discovered my mistake in v1. This one is extremely
frustrating to chase down (I've been 14+ hours deep in
packet captures to try and find the allocation to the recall - I
keep running out of disk space).

The operational reason to avoid this is because a lost delegation
effectively kills the mount when the state managers are looping. I have
a patch that alters fault injection to recreate the scenario.

>> 1.) Any race between a delegation grant and a recall has been
>> presumably avoided by the delay and second attempt.
>
> If something happened to the forechannel connection, then I believe it
> could take longer than 2 seconds to time out and recover.

Hrm, yes I see.

> So I'm inclined to instead fix any bugs that result in servers and
> client disagreeing about whether there's a delegation.

Not abandoning trying to track it down. It's quite clearly going to take
a while.

> Another thing we could do here is finally implement the server-side
> support for referring triples (I think the client's done?):
>
> http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues#Referring_triples
> https://tools.ietf.org/html/rfc5661#section-2.10.6.3
>
> That would eliminate the need for the recall retries.
>
> Though that would still leave open the question of how to handle those
> errors on a recall. We still not be able to conclude that it's safe for
> the server to destroy the delegation.

Would this be more appropriately "fixed" by supporting DELEGPURGE in a
limited fashion to clear out cl_revoked? (I'm not quite sure
that's a valid interpretation of RFC5561) to clear out any "lost"
delegations?

Thanks,

Andy

--
Andrew W. Elble
[email protected]
Infrastructure Engineer, Communications Technical Lead
Rochester Institute of Technology
PGP: BFAD 8461 4CCF DC95 DA2C B0EB 965B 082E 863E C912

2015-10-20 21:10:20

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH RFC v2] nfsd: don't revoke delegations that a client has stated it doesn't have

On Tue, Oct 20, 2015 at 02:34:15PM -0400, Andrew W Elble wrote:
>
> > Are clients really allowed to just lose a delegation? (Have you seen
> > such a case, other than the duplicate-delegation case which you already
> > fixed?)
>
> In short, yes, we're still seeing it. We have also been seeing increasing
> stability from the work that has been done (which also increases the
> time between replication).
>
> The reason for v2 was I got coverage on the destruction path in
> testing and discovered my mistake in v1. This one is extremely
> frustrating to chase down (I've been 14+ hours deep in
> packet captures to try and find the allocation to the recall - I
> keep running out of disk space).

Ugh. Maybe patching in some well-chosen printk's could help.

Or doing some filtering as you capture. You could capture just
operations that might have a delegation stateid--OPEN, DELEGRETURN,
CB_RECALL. And probably SEQUENCE replies with any flags set.

> > Another thing we could do here is finally implement the server-side
> > support for referring triples (I think the client's done?):
> >
> > http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues#Referring_triples
> > https://tools.ietf.org/html/rfc5661#section-2.10.6.3
> >
> > That would eliminate the need for the recall retries.
> >
> > Though that would still leave open the question of how to handle those
> > errors on a recall. We still not be able to conclude that it's safe for
> > the server to destroy the delegation.
>
> Would this be more appropriately "fixed" by supporting DELEGPURGE in a
> limited fashion to clear out cl_revoked? (I'm not quite sure
> that's a valid interpretation of RFC5561) to clear out any "lost"
> delegations?

I haven't thought it through, but that's clearly not what's intended
for, so I'm pessimistic. Note also we'd need to implement it first.

There may be other ways the client could better recover, but we'd still
rather avoid such situations in the first place. The recovery logic is
already complicated enough, it would be worse if it also needed to
handle a lot of cases that could only occur due to outright bugs.

--b.