From: Olga Kornievskaia <[email protected]>
In the case where we have received a successful reply to an RPC request,
but while processing the reply the client in rpc_decode_header() finds
an expired context, the code ends up propagating the error to the caller
instead of getting a new context and retrying the request.
To give more details, in rpc_decode_header() we call rpcauth_checkverf()
will call into the gss and internally will at some point call
gss_validate() which has a check if the current’s context lifetime
expired, and it would fail. The reason for the failure gets ‘scrubbed’
and translated to EACCES so when we get back to rpc_decode_header() we
just go to “out_verifier” which for that error would get converted to
“out_garbage” (ie it’s treated as garballed reply) and the next
action is call_encode. Which (1) doesn’t reencode or re-send (not to
mention no upcall happens because context expires as that reason just
not known) and it again fails in the same decoding process. After
re-trying it 3 times the error is propagated back to the caller
(ie nfs4_write_done_cb() in the case a failing write).
To fix this, instead we need to look to the case where the server
decides that context has expired and replies with an RPC auth error.
In that case, the rpc_decode_header() goes to "out_msg_denied" in that
we return EKEYREJECTED which in call_decode() is sent to “call_reserve”
which triggers an upcalls and a re-try of the operation.
The proposed fix is in case of a failed rpc_decode_header() to check
if credentials were set to be invalid and use that as a proxy for
deciding that context has expired and then treat is same way as
receiving an auth error.
Signed-off-by: Olga Kornievskaia <[email protected]>
--- This looks like a day-0 problem and should probably be back
ported to earlier kernels.
---
net/sunrpc/clnt.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 28f3749f6dc6..f19ad55017c9 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -2698,8 +2698,19 @@ rpc_decode_header(struct rpc_task *task, struct xdr_stream *xdr)
goto out_msg_denied;
error = rpcauth_checkverf(task, xdr);
- if (error)
+ if (error) {
+ struct rpc_cred *cred = task->tk_rqstp->rq_cred;
+
+ if (!test_bit(RPCAUTH_CRED_UPTODATE, &cred->cr_flags)) {
+ rpcauth_invalcred(task);
+ if (!task->tk_cred_retry)
+ goto out_err;
+ task->tk_cred_retry--;
+ trace_rpc__stale_creds(task);
+ return -EKEYREJECTED;
+ }
goto out_verifier;
+ }
p = xdr_inline_decode(xdr, sizeof(*p));
if (!p)
--
2.39.1
On 18 Apr 2024, at 10:11, Olga Kornievskaia wrote:
> From: Olga Kornievskaia <[email protected]>
>
> In the case where we have received a successful reply to an RPC request,
> but while processing the reply the client in rpc_decode_header() finds
> an expired context, the code ends up propagating the error to the caller
> instead of getting a new context and retrying the request.
>
> To give more details, in rpc_decode_header() we call rpcauth_checkverf()
> will call into the gss and internally will at some point call
> gss_validate() which has a check if the current’s context lifetime
> expired, and it would fail. The reason for the failure gets ‘scrubbed’
> and translated to EACCES so when we get back to rpc_decode_header() we
> just go to “out_verifier” which for that error would get converted to
> “out_garbage” (ie it’s treated as garballed reply) and the next
> action is call_encode. Which (1) doesn’t reencode or re-send (not to
> mention no upcall happens because context expires as that reason just
> not known) and it again fails in the same decoding process. After
> re-trying it 3 times the error is propagated back to the caller
> (ie nfs4_write_done_cb() in the case a failing write).
>
> To fix this, instead we need to look to the case where the server
> decides that context has expired and replies with an RPC auth error.
> In that case, the rpc_decode_header() goes to "out_msg_denied" in that
> we return EKEYREJECTED which in call_decode() is sent to “call_reserve”
> which triggers an upcalls and a re-try of the operation.
>
> The proposed fix is in case of a failed rpc_decode_header() to check
> if credentials were set to be invalid and use that as a proxy for
> deciding that context has expired and then treat is same way as
> receiving an auth error.
>
> Signed-off-by: Olga Kornievskaia <[email protected]>
>
> --- This looks like a day-0 problem and should probably be back
> ported to earlier kernels.
Looks good to me.
Reviewed-by: Benjamin Coddington <[email protected]>
Ben