2022-08-31 20:40:17

by Benjamin Coddington

[permalink] [raw]
Subject: client call_decode WARN_ON memcmp race since 72691a269f0b

Hey Trond,

Since "72691a269f0b SUNRPC: Don't reuse bvec on retransmission of the
request" I can sometimes pop the WARN_ON() in call_decode(), usually on
generic/642.

I think there's a kworker halfway through

xprt_complete_rqst() ->
xprt_request_dequeue_receive_locked()

between these two linse:
xdr_free_bvec(&req->rq_rcv_buf);
req->rq_private_buf.bvec = NULL;

And at the same time the task is doing the WARN_ON(memcmp()) in
call_decode.

I'm not sure of a good fix - perhaps we can fixup the other paths that
require the NULL check in xdr_alloc_bvec() so we don't have to set bvec =
NULL.

Any thoughts?

Ben


2022-08-31 21:38:32

by Trond Myklebust

[permalink] [raw]
Subject: Re: client call_decode WARN_ON memcmp race since 72691a269f0b

On Wed, 2022-08-31 at 16:20 -0400, Benjamin Coddington wrote:
> Hey Trond,
>
> Since "72691a269f0b SUNRPC: Don't reuse bvec on retransmission of the
> request" I can sometimes pop the WARN_ON() in call_decode(), usually
> on
> generic/642.
>
> I think there's a kworker halfway through
>
> xprt_complete_rqst() ->
>        xprt_request_dequeue_receive_locked()
>
> between these two linse:
>          xdr_free_bvec(&req->rq_rcv_buf);
>          req->rq_private_buf.bvec = NULL;
>
> And at the same time the task is doing the WARN_ON(memcmp()) in
> call_decode.
>
> I'm not sure of a good fix - perhaps we can fixup the other paths
> that
> require the NULL check in xdr_alloc_bvec() so we don't have to set
> bvec =
> NULL.
>
> Any thoughts?
>

How about this?
8<----------------------------------------
From 7523908b87edf245994b213108e2ce451f63529f Mon Sep 17 00:00:00 2001
From: Trond Myklebust <[email protected]>
Date: Wed, 31 Aug 2022 17:28:13 -0400
Subject: [PATCH] SUNRPC: Fix call completion races with call_decode()

We need to make sure that the req->rq_private_buf is completely up to
date before we make req->rq_reply_bytes_recvd visible to the
call_decode() routine in order to avoid triggering the WARN_ON().

Reported-by: Benjamin Coddington <[email protected]>
Fixes: 72691a269f0b ("SUNRPC: Don't reuse bvec on retransmission of the request")
Signed-off-by: Trond Myklebust <[email protected]>
---
net/sunrpc/xprt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index d71eec494826..79265ee25d05 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1221,12 +1221,12 @@ void xprt_complete_rqst(struct rpc_task *task, int copied)

xprt->stat.recvs++;

+ xprt_request_dequeue_receive_locked(task);
req->rq_private_buf.len = copied;
/* Ensure all writes are done before we update */
/* req->rq_reply_bytes_recvd */
smp_wmb();
req->rq_reply_bytes_recvd = copied;
- xprt_request_dequeue_receive_locked(task);
rpc_wake_up_queued_task(&xprt->pending, task);
}
EXPORT_SYMBOL_GPL(xprt_complete_rqst);
--
2.37.2
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2022-08-31 22:48:22

by Benjamin Coddington

[permalink] [raw]
Subject: Re: client call_decode WARN_ON memcmp race since 72691a269f0b

On 31 Aug 2022, at 17:34, Trond Myklebust wrote:

> On Wed, 2022-08-31 at 16:20 -0400, Benjamin Coddington wrote:
>> Hey Trond,
>>
>> Since "72691a269f0b SUNRPC: Don't reuse bvec on retransmission of the
>> request" I can sometimes pop the WARN_ON() in call_decode(), usually
>> on
>> generic/642.
>>
>> I think there's a kworker halfway through
>>
>> xprt_complete_rqst() ->
>>        xprt_request_dequeue_receive_locked()
>>
>> between these two linse:
>>          xdr_free_bvec(&req->rq_rcv_buf);
>>          req->rq_private_buf.bvec = NULL;
>>
>> And at the same time the task is doing the WARN_ON(memcmp()) in
>> call_decode.
>>
>> I'm not sure of a good fix - perhaps we can fixup the other paths
>> that
>> require the NULL check in xdr_alloc_bvec() so we don't have to set
>> bvec =
>> NULL.
>>
>> Any thoughts?
>>
>
> How about this?

I think that will fix it. I'll test it overnight and see what happens.

Ben

2022-08-31 23:14:04

by Benjamin Coddington

[permalink] [raw]
Subject: Re: client call_decode WARN_ON memcmp race since 72691a269f0b

On 31 Aug 2022, at 18:44, Benjamin Coddington wrote:

> On 31 Aug 2022, at 17:34, Trond Myklebust wrote:
>
>> On Wed, 2022-08-31 at 16:20 -0400, Benjamin Coddington wrote:
>>> Hey Trond,
>>>
>>> Since "72691a269f0b SUNRPC: Don't reuse bvec on retransmission of the
>>> request" I can sometimes pop the WARN_ON() in call_decode(), usually
>>> on
>>> generic/642.
>>>
>>> I think there's a kworker halfway through
>>>
>>> xprt_complete_rqst() ->
>>>        xprt_request_dequeue_receive_locked()
>>>
>>> between these two linse:
>>>          xdr_free_bvec(&req->rq_rcv_buf);
>>>          req->rq_private_buf.bvec = NULL;
>>>
>>> And at the same time the task is doing the WARN_ON(memcmp()) in
>>> call_decode.
>>>
>>> I'm not sure of a good fix - perhaps we can fixup the other paths
>>> that
>>> require the NULL check in xdr_alloc_bvec() so we don't have to set
>>> bvec =
>>> NULL.
>>>
>>> Any thoughts?
>>>
>>
>> How about this?
>
> I think that will fix it. I'll test it overnight and see what happens.

On a second look, I'm not sure this will fix it, because I don't think its a
memory-barrier problem, rather its that setting the two buffer's bvec = NULL
is not atomic.

Meanwhile, I am testing your patch.

Ben