Hey Trond,
Since "72691a269f0b SUNRPC: Don't reuse bvec on retransmission of the
request" I can sometimes pop the WARN_ON() in call_decode(), usually on
generic/642.
I think there's a kworker halfway through
xprt_complete_rqst() ->
xprt_request_dequeue_receive_locked()
between these two linse:
xdr_free_bvec(&req->rq_rcv_buf);
req->rq_private_buf.bvec = NULL;
And at the same time the task is doing the WARN_ON(memcmp()) in
call_decode.
I'm not sure of a good fix - perhaps we can fixup the other paths that
require the NULL check in xdr_alloc_bvec() so we don't have to set bvec =
NULL.
Any thoughts?
Ben
On Wed, 2022-08-31 at 16:20 -0400, Benjamin Coddington wrote:
> Hey Trond,
>
> Since "72691a269f0b SUNRPC: Don't reuse bvec on retransmission of the
> request" I can sometimes pop the WARN_ON() in call_decode(), usually
> on
> generic/642.
>
> I think there's a kworker halfway through
>
> xprt_complete_rqst() ->
> xprt_request_dequeue_receive_locked()
>
> between these two linse:
> xdr_free_bvec(&req->rq_rcv_buf);
> req->rq_private_buf.bvec = NULL;
>
> And at the same time the task is doing the WARN_ON(memcmp()) in
> call_decode.
>
> I'm not sure of a good fix - perhaps we can fixup the other paths
> that
> require the NULL check in xdr_alloc_bvec() so we don't have to set
> bvec =
> NULL.
>
> Any thoughts?
>
How about this?
8<----------------------------------------
From 7523908b87edf245994b213108e2ce451f63529f Mon Sep 17 00:00:00 2001
From: Trond Myklebust <[email protected]>
Date: Wed, 31 Aug 2022 17:28:13 -0400
Subject: [PATCH] SUNRPC: Fix call completion races with call_decode()
We need to make sure that the req->rq_private_buf is completely up to
date before we make req->rq_reply_bytes_recvd visible to the
call_decode() routine in order to avoid triggering the WARN_ON().
Reported-by: Benjamin Coddington <[email protected]>
Fixes: 72691a269f0b ("SUNRPC: Don't reuse bvec on retransmission of the request")
Signed-off-by: Trond Myklebust <[email protected]>
---
net/sunrpc/xprt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index d71eec494826..79265ee25d05 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1221,12 +1221,12 @@ void xprt_complete_rqst(struct rpc_task *task, int copied)
xprt->stat.recvs++;
+ xprt_request_dequeue_receive_locked(task);
req->rq_private_buf.len = copied;
/* Ensure all writes are done before we update */
/* req->rq_reply_bytes_recvd */
smp_wmb();
req->rq_reply_bytes_recvd = copied;
- xprt_request_dequeue_receive_locked(task);
rpc_wake_up_queued_task(&xprt->pending, task);
}
EXPORT_SYMBOL_GPL(xprt_complete_rqst);
--
2.37.2
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On 31 Aug 2022, at 17:34, Trond Myklebust wrote:
> On Wed, 2022-08-31 at 16:20 -0400, Benjamin Coddington wrote:
>> Hey Trond,
>>
>> Since "72691a269f0b SUNRPC: Don't reuse bvec on retransmission of the
>> request" I can sometimes pop the WARN_ON() in call_decode(), usually
>> on
>> generic/642.
>>
>> I think there's a kworker halfway through
>>
>> xprt_complete_rqst() ->
>> xprt_request_dequeue_receive_locked()
>>
>> between these two linse:
>> xdr_free_bvec(&req->rq_rcv_buf);
>> req->rq_private_buf.bvec = NULL;
>>
>> And at the same time the task is doing the WARN_ON(memcmp()) in
>> call_decode.
>>
>> I'm not sure of a good fix - perhaps we can fixup the other paths
>> that
>> require the NULL check in xdr_alloc_bvec() so we don't have to set
>> bvec =
>> NULL.
>>
>> Any thoughts?
>>
>
> How about this?
I think that will fix it. I'll test it overnight and see what happens.
Ben
On 31 Aug 2022, at 18:44, Benjamin Coddington wrote:
> On 31 Aug 2022, at 17:34, Trond Myklebust wrote:
>
>> On Wed, 2022-08-31 at 16:20 -0400, Benjamin Coddington wrote:
>>> Hey Trond,
>>>
>>> Since "72691a269f0b SUNRPC: Don't reuse bvec on retransmission of the
>>> request" I can sometimes pop the WARN_ON() in call_decode(), usually
>>> on
>>> generic/642.
>>>
>>> I think there's a kworker halfway through
>>>
>>> xprt_complete_rqst() ->
>>> xprt_request_dequeue_receive_locked()
>>>
>>> between these two linse:
>>> xdr_free_bvec(&req->rq_rcv_buf);
>>> req->rq_private_buf.bvec = NULL;
>>>
>>> And at the same time the task is doing the WARN_ON(memcmp()) in
>>> call_decode.
>>>
>>> I'm not sure of a good fix - perhaps we can fixup the other paths
>>> that
>>> require the NULL check in xdr_alloc_bvec() so we don't have to set
>>> bvec =
>>> NULL.
>>>
>>> Any thoughts?
>>>
>>
>> How about this?
>
> I think that will fix it. I'll test it overnight and see what happens.
On a second look, I'm not sure this will fix it, because I don't think its a
memory-barrier problem, rather its that setting the two buffer's bvec = NULL
is not atomic.
Meanwhile, I am testing your patch.
Ben