On 8/26/23 10:30 AM, Trond Myklebust wrote:
> Yes. For instance the Linux knfsd server will drop requests if the GSS
> sequence number lies outside the window (no, I don't know why it
> doesn't just return RPCSEC_GSS_CTXPROBLEM). It will also happily drop
> deferred requests (i.e. requests waiting for a reply to an upcall) once
> they start piling up. Finally, if the knfsd reply cache says that an
> earlier transmission of the NFSv3 RPC request is still being processed,
> then the new request gets dropped.
>
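For reference, the sequence-window drop described above can be sketched in
user space roughly as follows. This only illustrates the RFC 2203 window
rule (requests below the window, or repeated inside it, are silently
discarded); it is not the knfsd code. The names seq_window and
seq_window_accept and the SEQ_WIN value are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SEQ_WIN 128u	/* illustrative window size, an assumption */

struct seq_window {
	uint32_t max;		/* highest sequence number seen so far */
	bool seen[SEQ_WIN];	/* seen[s % SEQ_WIN] for s in (max - SEQ_WIN, max] */
};

/* Returns true if the request would be processed, false if it is dropped. */
static bool seq_window_accept(struct seq_window *w, uint32_t seq)
{
	if (seq > w->max) {
		/* The window slides forward; forget slots that fall out of it. */
		if (seq - w->max >= SEQ_WIN) {
			memset(w->seen, 0, sizeof(w->seen));
		} else {
			for (uint32_t s = w->max + 1; s < seq; s++)
				w->seen[s % SEQ_WIN] = false;
		}
		w->max = seq;
		w->seen[seq % SEQ_WIN] = true;
		return true;
	}
	if (w->max - seq >= SEQ_WIN)
		return false;	/* below the window: silently dropped */
	if (w->seen[seq % SEQ_WIN])
		return false;	/* already seen inside the window: dropped */
	w->seen[seq % SEQ_WIN] = true;
	return true;
}
```

From the client's side a false return here looks the same as a lost
packet: no reply ever arrives for that request.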
>> Wouldn't the rpc code behave the same as v4 and set up a new
>> connection before doing the retrans?
>> At least in our experimentation, if we leave the connection down for
>> more than 63 seconds we can see from the rpc traces that that is what
>> is happening. Once there is a new connection, the old message is
>> ignored and processing continues with the new set of requests /
>> responses.
>>
>>>
>>> The right thing to do is to just fix up rpc_decode_header() to retry
>>> instead of firing off an error in this case.
>> So you are thinking that rpc_decode_header just returns EAGAIN if the
>> checksum fails?
>> What happens if the GSS context actually goes bad (times out, etc.)?
>> Wouldn't that also result in the client getting stuck just doing
>> re-sends over and over?
>
> If the GSS context goes bad, then the server is supposed to return
> either RPCSEC_GSS_CREDPROBLEM (if the server no longer has context for
> that handle) or RPCSEC_GSS_CTXPROBLEM (context is stale due to ticket
> expiry, etc).
The test environment got reset, so it took a day or so to try this.
I changed the return value when rpcauth_checkverf fails to be an
EKEYREJECTED error instead of EAGAIN.
That sends the decode failure down the path below, rather than just
retransmitting the same XID with a new GSS sequence number / checksum:
	case -EKEYREJECTED:
		task->tk_action = call_reserve;
		rpc_check_timeout(task);
		rpcauth_invalcred(task);
		/* Ensure we obtain a new XID if we retry! */
		xprt_release(task);
This does appear to work, and it addresses the failure that we are able
to introduce with iptables down; iptables up.
But the question is: is this a valid thing to do when the GSS checksum
fails?
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index ad3e9a40b061..d0bcb6c6b3df 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -2645,7 +2645,8 @@ rpc_decode_header(struct rpc_task *task, struct xdr_stream *xdr)
 out_verifier:
 	trace_rpc_bad_verifier(task);
-	goto out_garbage;
+	return -EKEYREJECTED;
+	//goto out_garbage;
 out_msg_denied:
 	error = -EACCES;
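
To make the difference between the two retry paths concrete, here is a
small user-space model of it. This is a sketch only, not kernel code:
model_task, model_handle_decode, model_reserve and the step names are all
hypothetical, and it only captures what is described above (EAGAIN keeps
the XID and re-encodes with a new GSS sequence number, EKEYREJECTED
invalidates the cred and releases the slot so the retry gets a new XID).

```c
#include <errno.h>
#include <stdint.h>

#ifndef EKEYREJECTED
#define EKEYREJECTED 129	/* Linux value, assumed for non-Linux builds */
#endif

enum model_step { STEP_DONE, STEP_ENCODE, STEP_RESERVE };

struct model_task {
	uint32_t xid;		/* 0 means no slot / XID is held */
	uint32_t gss_seq;	/* GSS sequence number of the last encode */
	int cred_valid;		/* cleared when the cred is invalidated */
	enum model_step next;	/* what the task would do next */
};

static uint32_t model_next_xid = 1000;	/* arbitrary starting XID */

/* Models reserving a slot: a task without an XID is given a fresh one. */
static void model_reserve(struct model_task *t)
{
	if (t->xid == 0)
		t->xid = model_next_xid++;
}

/* Models the dispatch on the result of decoding the reply header. */
static void model_handle_decode(struct model_task *t, int status)
{
	switch (status) {
	case 0:
		t->next = STEP_DONE;
		break;
	case -EAGAIN:
		/* Same XID; re-encoding picks up a new GSS sequence number. */
		t->gss_seq++;
		t->next = STEP_ENCODE;
		break;
	case -EKEYREJECTED:
		/* Drop the cred and the slot so the retry starts over. */
		t->cred_valid = 0;
		t->xid = 0;
		t->next = STEP_RESERVE;
		break;
	}
}
```

In this model, a server that is still holding the old XID in its reply
cache never sees that XID again on the EKEYREJECTED path, whereas on the
EAGAIN path it keeps seeing it.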
>
> If it just times out, then surely the replay cache should ensure that
> it gets processed quickly after a retry, no?
>
>>
>> I'm really not that up to speed on subtleties of NFS kerberos.
>>
>> Oh, note this isn't even krb5p, just krb5 mounts (not that that
>> should matter all that much).
>>
>> --Russell Cattelan
>