2016-10-23 18:21:16

by J. Bruce Fields

Subject: upstream server crash

I'm getting an intermittent crash in the nfs server as of
68778945e46f143ed7974b427a8065f69a4ce944 "SUNRPC: Separate buffer
pointers for RPC Call and Reply messages".

I haven't tried to understand that commit or why it would be a problem yet, I
don't see an obvious connection--I can take a closer look Monday.

Could even be that I just landed on this commit by chance, the problem is a
little hard to reproduce so I don't completely trust my testing.

--b.

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff816937d2>] __memcpy+0x12/0x20
PGD 0
Oops: 0002 [#1] PREEMPT SMP
Modules linked in: rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc
CPU: 0 PID: 4437 Comm: nfsd Not tainted 4.9.0-rc1-00075-gae0340c #766
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
task: ffff88006d810d40 task.stack: ffffc90000644000
RIP: 0010:[<ffffffff816937d2>] [<ffffffff816937d2>] __memcpy+0x12/0x20
RSP: 0018:ffffc90000647d60 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff88007b5ca000 RCX: 000000000000000a
RDX: 0000000000000004 RSI: ffff88007bab7000 RDI: 0000000000000000
RBP: ffffc90000647db8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880078535000
R13: ffff880035d02000 R14: ffff88007b4775b0 R15: ffff88007b477000
FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000000787de000 CR4: 00000000000406f0
Stack:
ffffffffa00191ab ffff88006d810d40 ffff880001000000 ffff88007b435a00
ffff880078535378 0000000000000004 ffff880078535000 0000000078535000
ffff88007b5ca000 0000000000000000 ffffffffa0028626 ffffc90000647e30
Call Trace:
[<ffffffffa00191ab>] ? svc_tcp_recvfrom+0x6eb/0x820 [sunrpc]
[<ffffffffa0028626>] ? svc_recv+0x1e6/0xf00 [sunrpc]
[<ffffffffa0029240>] svc_recv+0xe00/0xf00 [sunrpc]
[<ffffffffa00b57ff>] nfsd+0x16f/0x280 [nfsd]
[<ffffffffa00b5695>] ? nfsd+0x5/0x280 [nfsd]
[<ffffffffa00b5690>] ? nfsd_destroy+0x190/0x190 [nfsd]
[<ffffffff810a6c00>] kthread+0xf0/0x110
[<ffffffff810a6b10>] ? kthread_park+0x60/0x60
[<ffffffff81b39607>] ret_from_fork+0x27/0x40
Code: c3 e8 53 fb ff ff 48 8b 43 60 48 2b 43 50 88 43 4e 5b 5d eb ea 90 90 90 90 66 66 90 66 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3
RIP [<ffffffff816937d2>] __memcpy+0x12/0x20
RSP <ffffc90000647d60>
CR2: 0000000000000000
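
(Reading the oops: on x86-64, __memcpy takes the destination in %rdi and
the source in %rsi. Here RDI is zero while RSI holds a plausible kernel
address, and error code 0002 indicates a write fault, so this is a copy
*to* a NULL destination rather than a read through a bad source pointer.)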



2016-10-23 20:04:55

by Chuck Lever

Subject: Re: upstream server crash


> On Oct 23, 2016, at 2:21 PM, J. Bruce Fields <[email protected]> wrote:
>
> I'm getting an intermittent crash in the nfs server as of
> 68778945e46f143ed7974b427a8065f69a4ce944 "SUNRPC: Separate buffer
> pointers for RPC Call and Reply messages".
>
> I haven't tried to understand that commit or why it would be a problem yet, I
> don't see an obvious connection--I can take a closer look Monday.

I don't see anything in the backtrace that connects to the
client. However, if the client was involved indirectly (say,
with NFSv3 NLM callbacks, maybe?) it could have overwritten
memory that was in use by the server.

What happens if you enable SLAB debugging?


> Could even be that I just landed on this commit by chance, the problem is a
> little hard to reproduce so I don't completely trust my testing.

Can you describe what your testing does?


> --b.
>
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: [<ffffffff816937d2>] __memcpy+0x12/0x20
> PGD 0
> Oops: 0002 [#1] PREEMPT SMP
> Modules linked in: rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc
> CPU: 0 PID: 4437 Comm: nfsd Not tainted 4.9.0-rc1-00075-gae0340c #766
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
> task: ffff88006d810d40 task.stack: ffffc90000644000
> RIP: 0010:[<ffffffff816937d2>] [<ffffffff816937d2>] __memcpy+0x12/0x20
> RSP: 0018:ffffc90000647d60 EFLAGS: 00010202
> RAX: 0000000000000000 RBX: ffff88007b5ca000 RCX: 000000000000000a
> RDX: 0000000000000004 RSI: ffff88007bab7000 RDI: 0000000000000000
> RBP: ffffc90000647db8 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: ffff880078535000
> R13: ffff880035d02000 R14: ffff88007b4775b0 R15: ffff88007b477000
> FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 00000000787de000 CR4: 00000000000406f0
> Stack:
> ffffffffa00191ab ffff88006d810d40 ffff880001000000 ffff88007b435a00
> ffff880078535378 0000000000000004 ffff880078535000 0000000078535000
> ffff88007b5ca000 0000000000000000 ffffffffa0028626 ffffc90000647e30
> Call Trace:
> [<ffffffffa00191ab>] ? svc_tcp_recvfrom+0x6eb/0x820 [sunrpc]
> [<ffffffffa0028626>] ? svc_recv+0x1e6/0xf00 [sunrpc]
> [<ffffffffa0029240>] svc_recv+0xe00/0xf00 [sunrpc]
> [<ffffffffa00b57ff>] nfsd+0x16f/0x280 [nfsd]
> [<ffffffffa00b5695>] ? nfsd+0x5/0x280 [nfsd]
> [<ffffffffa00b5690>] ? nfsd_destroy+0x190/0x190 [nfsd]
> [<ffffffff810a6c00>] kthread+0xf0/0x110
> [<ffffffff810a6b10>] ? kthread_park+0x60/0x60
> [<ffffffff81b39607>] ret_from_fork+0x27/0x40
> Code: c3 e8 53 fb ff ff 48 8b 43 60 48 2b 43 50 88 43 4e 5b 5d eb ea 90 90 90 90 66 66 90 66 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3
> RIP [<ffffffff816937d2>] __memcpy+0x12/0x20
> RSP <ffffc90000647d60>
> CR2: 0000000000000000
>

--
Chuck Lever




2016-10-23 20:14:59

by J. Bruce Fields

Subject: Re: upstream server crash

On Sun, Oct 23, 2016 at 04:04:47PM -0400, Chuck Lever wrote:
>
> > On Oct 23, 2016, at 2:21 PM, J. Bruce Fields <[email protected]> wrote:
> >
> > I'm getting an intermittent crash in the nfs server as of
> > 68778945e46f143ed7974b427a8065f69a4ce944 "SUNRPC: Separate buffer
> > pointers for RPC Call and Reply messages".
> >
> > I haven't tried to understand that commit or why it would be a problem yet, I
> > don't see an obvious connection--I can take a closer look Monday.
>
> I don't see anything in the backtrace that connects to the
> client. However, if the client was involved indirectly (say,
> with NFSv3 NLM callbacks, maybe?) it could have overwritten
> memory that was in use by the server.

The crashes are consistently happening only over 4.1, so backchannel
code seems more likely. (There were also v3 tests run earlier on the
same machine.)

> What happens if you enable SLAB debugging?

I have CONFIG_DEBUG_SLAB set. I haven't seen any warnings.

> > Could even be that I just landed on this commit by chance, the problem is a
> > little hard to reproduce so I don't completely trust my testing.
>
> Can you describe what your testing does?

I've seen this both in some posix locking tests and in what I believe is
an xfs fsstress run. I haven't looked more closely than that.

--b.

>
>
> > --b.
> >
> > BUG: unable to handle kernel NULL pointer dereference at (null)
> > IP: [<ffffffff816937d2>] __memcpy+0x12/0x20
> > PGD 0
> > Oops: 0002 [#1] PREEMPT SMP
> > Modules linked in: rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc
> > CPU: 0 PID: 4437 Comm: nfsd Not tainted 4.9.0-rc1-00075-gae0340c #766
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
> > task: ffff88006d810d40 task.stack: ffffc90000644000
> > RIP: 0010:[<ffffffff816937d2>] [<ffffffff816937d2>] __memcpy+0x12/0x20
> > RSP: 0018:ffffc90000647d60 EFLAGS: 00010202
> > RAX: 0000000000000000 RBX: ffff88007b5ca000 RCX: 000000000000000a
> > RDX: 0000000000000004 RSI: ffff88007bab7000 RDI: 0000000000000000
> > RBP: ffffc90000647db8 R08: 0000000000000001 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000000 R12: ffff880078535000
> > R13: ffff880035d02000 R14: ffff88007b4775b0 R15: ffff88007b477000
> > FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000000000000 CR3: 00000000787de000 CR4: 00000000000406f0
> > Stack:
> > ffffffffa00191ab ffff88006d810d40 ffff880001000000 ffff88007b435a00
> > ffff880078535378 0000000000000004 ffff880078535000 0000000078535000
> > ffff88007b5ca000 0000000000000000 ffffffffa0028626 ffffc90000647e30
> > Call Trace:
> > [<ffffffffa00191ab>] ? svc_tcp_recvfrom+0x6eb/0x820 [sunrpc]
> > [<ffffffffa0028626>] ? svc_recv+0x1e6/0xf00 [sunrpc]
> > [<ffffffffa0029240>] svc_recv+0xe00/0xf00 [sunrpc]
> > [<ffffffffa00b57ff>] nfsd+0x16f/0x280 [nfsd]
> > [<ffffffffa00b5695>] ? nfsd+0x5/0x280 [nfsd]
> > [<ffffffffa00b5690>] ? nfsd_destroy+0x190/0x190 [nfsd]
> > [<ffffffff810a6c00>] kthread+0xf0/0x110
> > [<ffffffff810a6b10>] ? kthread_park+0x60/0x60
> > [<ffffffff81b39607>] ret_from_fork+0x27/0x40
> > Code: c3 e8 53 fb ff ff 48 8b 43 60 48 2b 43 50 88 43 4e 5b 5d eb ea 90 90 90 90 66 66 90 66 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3
> > RIP [<ffffffff816937d2>] __memcpy+0x12/0x20
> > RSP <ffffc90000647d60>
> > CR2: 0000000000000000
> >
>
> --
> Chuck Lever
>
>

2016-10-24 03:15:23

by Eryu Guan

Subject: Re: upstream server crash

On Sun, Oct 23, 2016 at 02:21:15PM -0400, J. Bruce Fields wrote:
> I'm getting an intermittent crash in the nfs server as of
> 68778945e46f143ed7974b427a8065f69a4ce944 "SUNRPC: Separate buffer
> pointers for RPC Call and Reply messages".
>
> I haven't tried to understand that commit or why it would be a problem yet, I
> don't see an obvious connection--I can take a closer look Monday.
>
> Could even be that I just landed on this commit by chance, the problem is a
> little hard to reproduce so I don't completely trust my testing.

I've hit the same crash on a 4.9-rc1 kernel, and it's reproduced for me
reliably by running the xfstests generic/013 case on a loopback-mounted
NFSv4.1 (or NFSv4.2) mount; XFS is the underlying exported fs. For more
details, please see

http://marc.info/?l=linux-nfs&m=147714320129362&w=2

Thanks,
Eryu

2016-10-24 13:31:12

by Jeff Layton

Subject: Re: upstream server crash

On Mon, 2016-10-24 at 11:15 +0800, Eryu Guan wrote:
> On Sun, Oct 23, 2016 at 02:21:15PM -0400, J. Bruce Fields wrote:
> >
> > I'm getting an intermittent crash in the nfs server as of
> > 68778945e46f143ed7974b427a8065f69a4ce944 "SUNRPC: Separate buffer
> > pointers for RPC Call and Reply messages".
> >
> > I haven't tried to understand that commit or why it would be a problem yet, I
> > don't see an obvious connection--I can take a closer look Monday.
> >
> > Could even be that I just landed on this commit by chance, the problem is a
> > little hard to reproduce so I don't completely trust my testing.
>
> I've hit the same crash on 4.9-rc1 kernel, and it's reproduced for me
> reliably by running xfstests generic/013 case, on a loopback mounted
> NFSv4.1 (or NFSv4.2), XFS is the underlying exported fs. More details
> please see
>
> http://marc.info/?l=linux-nfs&m=147714320129362&w=2
>

Looks like you landed at the same commit as Bruce, so that's probably
legit. That commit is very small though. The only real change that
doesn't affect the new field is this:


@@ -1766,7 +1766,7 @@ rpc_xdr_encode(struct rpc_task *task)
                     req->rq_buffer,
                     req->rq_callsize);
        xdr_buf_init(&req->rq_rcv_buf,
-                    (char *)req->rq_buffer + req->rq_callsize,
+                    req->rq_rbuffer,
                     req->rq_rcvsize);


So I'm guessing this is breaking the callback channel somehow?
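
To make that concrete: before the commit, the Reply buffer was simply the
tail of the single Call allocation; after it, rpc_xdr_encode() trusts a
separate rq_rbuffer pointer that every buffer allocator is now expected to
fill in. A rough sketch of the new contract -- paraphrased, not the exact
upstream code:

    /* Old layout: one allocation, Reply addressed by offset */
    req->rq_buffer = buf;
    /* Reply area computed at encode time: buf + rq_callsize */

    /* New contract: the allocator must also publish the Reply pointer */
    req->rq_buffer  = buf;
    req->rq_rbuffer = (char *)buf + req->rq_callsize;

    /* rpc_xdr_encode() then does: */
    xdr_buf_init(&req->rq_rcv_buf, req->rq_rbuffer, req->rq_rcvsize);

An allocation path that sets rq_buffer but never sets rq_rbuffer leaves
rq_rcv_buf.head[0].iov_base == NULL, and the first copy of reply data into
that buffer is a memcpy() to NULL -- which is exactly the shape of the
oops above.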

--
Jeff Layton <[email protected]>

2016-10-24 13:52:05

by Chuck Lever

Subject: Re: upstream server crash


> On Oct 24, 2016, at 9:31 AM, Jeff Layton <[email protected]> wrote:
>
> On Mon, 2016-10-24 at 11:15 +0800, Eryu Guan wrote:
>> On Sun, Oct 23, 2016 at 02:21:15PM -0400, J. Bruce Fields wrote:
>>>
>>> I'm getting an intermittent crash in the nfs server as of
>>> 68778945e46f143ed7974b427a8065f69a4ce944 "SUNRPC: Separate buffer
>>> pointers for RPC Call and Reply messages".
>>>
>>> I haven't tried to understand that commit or why it would be a problem yet, I
>>> don't see an obvious connection--I can take a closer look Monday.
>>>
>>> Could even be that I just landed on this commit by chance, the problem is a
>>> little hard to reproduce so I don't completely trust my testing.
>>
>> I've hit the same crash on 4.9-rc1 kernel, and it's reproduced for me
>> reliably by running xfstests generic/013 case, on a loopback mounted
>> NFSv4.1 (or NFSv4.2), XFS is the underlying exported fs. More details
>> please see
>>
>> http://marc.info/?l=linux-nfs&m=147714320129362&w=2
>>
>
> Looks like you landed at the same commit as Bruce, so that's probably
> legit. That commit is very small though. The only real change that
> doesn't affect the new field is this:
>
>
> @@ -1766,7 +1766,7 @@ rpc_xdr_encode(struct rpc_task *task)
> req->rq_buffer,
> req->rq_callsize);
> xdr_buf_init(&req->rq_rcv_buf,
> - (char *)req->rq_buffer + req->rq_callsize,
> + req->rq_rbuffer,
> req->rq_rcvsize);
>
>
> So I'm guessing this is breaking the callback channel somehow?

Could be the TCP backchannel code is using rq_buffer in a different
way than RDMA backchannel or the forward channel code.


--
Chuck Lever




2016-10-24 15:19:41

by Jeff Layton

Subject: Re: upstream server crash

On Mon, 2016-10-24 at 09:51 -0400, Chuck Lever wrote:
> >
> > On Oct 24, 2016, at 9:31 AM, Jeff Layton <[email protected]> wrote:
> >
> > On Mon, 2016-10-24 at 11:15 +0800, Eryu Guan wrote:
> > >
> > > On Sun, Oct 23, 2016 at 02:21:15PM -0400, J. Bruce Fields wrote:
> > > >
> > > >
> > > > I'm getting an intermittent crash in the nfs server as of
> > > > 68778945e46f143ed7974b427a8065f69a4ce944 "SUNRPC: Separate buffer
> > > > pointers for RPC Call and Reply messages".
> > > >
> > > > I haven't tried to understand that commit or why it would be a problem yet, I
> > > > don't see an obvious connection--I can take a closer look Monday.
> > > >
> > > > Could even be that I just landed on this commit by chance, the problem is a
> > > > little hard to reproduce so I don't completely trust my testing.
> > >
> > > I've hit the same crash on 4.9-rc1 kernel, and it's reproduced for me
> > > reliably by running xfstests generic/013 case, on a loopback mounted
> > > NFSv4.1 (or NFSv4.2), XFS is the underlying exported fs. More details
> > > please see
> > >
> > > http://marc.info/?l=linux-nfs&m=147714320129362&w=2
> > >
> >
> > Looks like you landed at the same commit as Bruce, so that's probably
> > legit. That commit is very small though. The only real change that
> > doesn't affect the new field is this:
> >
> >
> > @@ -1766,7 +1766,7 @@ rpc_xdr_encode(struct rpc_task *task)
> > req->rq_buffer,
> > req->rq_callsize);
> > xdr_buf_init(&req->rq_rcv_buf,
> > - (char *)req->rq_buffer + req->rq_callsize,
> > + req->rq_rbuffer,
> > req->rq_rcvsize);
> >
> >
> > So I'm guessing this is breaking the callback channel somehow?
>
> Could be the TCP backchannel code is using rq_buffer in a different
> way than RDMA backchannel or the forward channel code.
>

Well, it basically allocates a page per rpc_rqst and then maps that.

One thing I notice is that this patch ensures that rq_rbuffer gets set
up in rpc_malloc and xprt_rdma_allocate, but it looks like
xprt_alloc_bc_req didn't get the same treatment.

I suspect that that may be the problem...
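
For reference, the backchannel preallocation path in question looked
roughly like this at the time (an abbreviated sketch reconstructed from
the diff context below, not the verbatim source); note that nothing here
ever assigns rq_rbuffer:

    struct rpc_rqst *xprt_alloc_bc_req(struct rpc_xprt *xprt, gfp_t gfp_flags)
    {
            struct rpc_rqst *req;

            req = kzalloc(sizeof(*req), gfp_flags); /* rq_rbuffer starts out NULL */
            if (req == NULL)
                    return NULL;

            req->rq_xprt = xprt;

            /* Preallocate one XDR receive buffer (a single page) */
            if (xprt_alloc_xdr_buf(&req->rq_rcv_buf, gfp_flags) < 0)
                    goto out_free;
            req->rq_rcv_buf.len = PAGE_SIZE;

            /* ... the send buffer is preallocated the same way ... */
            return req;
    out_free:
            xprt_free_allocation(req);
            return NULL;
    }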

--
Jeff Layton <[email protected]>

2016-10-24 15:24:42

by Jeff Layton

Subject: Re: upstream server crash

On Mon, 2016-10-24 at 11:19 -0400, Jeff Layton wrote:
> On Mon, 2016-10-24 at 09:51 -0400, Chuck Lever wrote:
> >
> > >
> > >
> > > On Oct 24, 2016, at 9:31 AM, Jeff Layton <[email protected]> wrote:
> > >
> > > On Mon, 2016-10-24 at 11:15 +0800, Eryu Guan wrote:
> > > >
> > > >
> > > > On Sun, Oct 23, 2016 at 02:21:15PM -0400, J. Bruce Fields wrote:
> > > > >
> > > > >
> > > > >
> > > > > I'm getting an intermittent crash in the nfs server as of
> > > > > 68778945e46f143ed7974b427a8065f69a4ce944 "SUNRPC: Separate buffer
> > > > > pointers for RPC Call and Reply messages".
> > > > >
> > > > > I haven't tried to understand that commit or why it would be a problem yet, I
> > > > > don't see an obvious connection--I can take a closer look Monday.
> > > > >
> > > > > Could even be that I just landed on this commit by chance, the problem is a
> > > > > little hard to reproduce so I don't completely trust my testing.
> > > >
> > > > I've hit the same crash on 4.9-rc1 kernel, and it's reproduced for me
> > > > reliably by running xfstests generic/013 case, on a loopback mounted
> > > > NFSv4.1 (or NFSv4.2), XFS is the underlying exported fs. More details
> > > > please see
> > > >
> > > > http://marc.info/?l=linux-nfs&m=147714320129362&w=2
> > > >
> > >
> > > Looks like you landed at the same commit as Bruce, so that's probably
> > > legit. That commit is very small though. The only real change that
> > > doesn't affect the new field is this:
> > >
> > >
> > > @@ -1766,7 +1766,7 @@ rpc_xdr_encode(struct rpc_task *task)
> > > req->rq_buffer,
> > > req->rq_callsize);
> > > xdr_buf_init(&req->rq_rcv_buf,
> > > - (char *)req->rq_buffer + req->rq_callsize,
> > > + req->rq_rbuffer,
> > > req->rq_rcvsize);
> > >
> > >
> > > So I'm guessing this is breaking the callback channel somehow?
> >
> > Could be the TCP backchannel code is using rq_buffer in a different
> > way than RDMA backchannel or the forward channel code.
> >
>
> Well, it basically allocates a page per rpc_rqst and then maps that.
>
> One thing I notice is that this patch ensures that rq_rbuffer gets set
> up in rpc_malloc and xprt_rdma_allocate, but it looks like
> xprt_alloc_bc_req didn't get the same treatment.
>
> I suspect that that may be the problem...
>
In fact, maybe we just need this here? (untested and probably
whitespace damaged):

diff --git a/net/sunrpc/backchannel_rqst.c b/net/sunrpc/backchannel_rqst.c
index ac701c28f44f..c561aa8ce05b 100644
--- a/net/sunrpc/backchannel_rqst.c
+++ b/net/sunrpc/backchannel_rqst.c
@@ -100,6 +100,7 @@ struct rpc_rqst *xprt_alloc_bc_req(struct rpc_xprt *xprt, gfp_t gfp_flags)
 		goto out_free;
 	}
 	req->rq_rcv_buf.len = PAGE_SIZE;
+	req->rq_rbuffer = req->rq_rcv_buf.head[0].iov_base;
 
 	/* Preallocate one XDR send buffer */
 	if (xprt_alloc_xdr_buf(&req->rq_snd_buf, gfp_flags) < 0) {

2016-10-24 15:55:40

by Chuck Lever

Subject: Re: upstream server crash


> On Oct 24, 2016, at 11:24 AM, Jeff Layton <[email protected]> wrote:
>
> On Mon, 2016-10-24 at 11:19 -0400, Jeff Layton wrote:
>> On Mon, 2016-10-24 at 09:51 -0400, Chuck Lever wrote:
>>>
>>>>
>>>>
>>>> On Oct 24, 2016, at 9:31 AM, Jeff Layton <[email protected]> wrote:
>>>>
>>>> On Mon, 2016-10-24 at 11:15 +0800, Eryu Guan wrote:
>>>>>
>>>>>
>>>>> On Sun, Oct 23, 2016 at 02:21:15PM -0400, J. Bruce Fields wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> I'm getting an intermittent crash in the nfs server as of
>>>>>> 68778945e46f143ed7974b427a8065f69a4ce944 "SUNRPC: Separate buffer
>>>>>> pointers for RPC Call and Reply messages".
>>>>>>
>>>>>> I haven't tried to understand that commit or why it would be a problem yet, I
>>>>>> don't see an obvious connection--I can take a closer look Monday.
>>>>>>
>>>>>> Could even be that I just landed on this commit by chance, the problem is a
>>>>>> little hard to reproduce so I don't completely trust my testing.
>>>>>
>>>>> I've hit the same crash on 4.9-rc1 kernel, and it's reproduced for me
>>>>> reliably by running xfstests generic/013 case, on a loopback mounted
>>>>> NFSv4.1 (or NFSv4.2), XFS is the underlying exported fs. More details
>>>>> please see
>>>>>
>>>>> http://marc.info/?l=linux-nfs&m=147714320129362&w=2
>>>>>
>>>>
>>>> Looks like you landed at the same commit as Bruce, so that's probably
>>>> legit. That commit is very small though. The only real change that
>>>> doesn't affect the new field is this:
>>>>
>>>>
>>>> @@ -1766,7 +1766,7 @@ rpc_xdr_encode(struct rpc_task *task)
>>>> req->rq_buffer,
>>>> req->rq_callsize);
>>>> xdr_buf_init(&req->rq_rcv_buf,
>>>> - (char *)req->rq_buffer + req->rq_callsize,
>>>> + req->rq_rbuffer,
>>>> req->rq_rcvsize);
>>>>
>>>>
>>>> So I'm guessing this is breaking the callback channel somehow?
>>>
>>> Could be the TCP backchannel code is using rq_buffer in a different
>>> way than RDMA backchannel or the forward channel code.
>>>
>>
>> Well, it basically allocates a page per rpc_rqst and then maps that.
>>
>> One thing I notice is that this patch ensures that rq_rbuffer gets set
>> up in rpc_malloc and xprt_rdma_allocate, but it looks like
>> xprt_alloc_bc_req didn't get the same treatment.
>>
>> I suspect that that may be the problem...
>>
> In fact, maybe we just need this here? (untested and probably
> whitespace damaged):
>
> diff --git a/net/sunrpc/backchannel_rqst.c b/net/sunrpc/backchannel_rqst.c
> index ac701c28f44f..c561aa8ce05b 100644
> --- a/net/sunrpc/backchannel_rqst.c
> +++ b/net/sunrpc/backchannel_rqst.c
> @@ -100,6 +100,7 @@ struct rpc_rqst *xprt_alloc_bc_req(struct rpc_xprt *xprt, gfp_t gfp_flags)
> goto out_free;
> }
> req->rq_rcv_buf.len = PAGE_SIZE;
> + req->rq_rbuffer = req->rq_rcv_buf.head[0].iov_base;

That looks plausible! Basically that is needed after xdr_buf_init()
is done for a backchannel rpc_rqst's receive buffer.

net/sunrpc/xprtrdma/backchannel.c might need a similar change. I saw
crashes with generic/013 at bake-a-thon last week, but as the iommu
was involved with those, I've been looking in a different place. Will
give this a try.


> /* Preallocate one XDR send buffer */
> if (xprt_alloc_xdr_buf(&req->rq_snd_buf, gfp_flags) < 0) {

--
Chuck Lever




2016-10-24 18:08:59

by J. Bruce Fields

Subject: Re: upstream server crash

On Mon, Oct 24, 2016 at 11:24:40AM -0400, Jeff Layton wrote:
> On Mon, 2016-10-24 at 11:19 -0400, Jeff Layton wrote:
> > On Mon, 2016-10-24 at 09:51 -0400, Chuck Lever wrote:
> > >
> > > >
> > > >
> > > > On Oct 24, 2016, at 9:31 AM, Jeff Layton <[email protected]> wrote:
> > > >
> > > > On Mon, 2016-10-24 at 11:15 +0800, Eryu Guan wrote:
> > > > >
> > > > >
> > > > > On Sun, Oct 23, 2016 at 02:21:15PM -0400, J. Bruce Fields wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > I'm getting an intermittent crash in the nfs server as of
> > > > > > 68778945e46f143ed7974b427a8065f69a4ce944 "SUNRPC: Separate buffer
> > > > > > pointers for RPC Call and Reply messages".
> > > > > >
> > > > > > I haven't tried to understand that commit or why it would be a problem yet, I
> > > > > > don't see an obvious connection--I can take a closer look Monday.
> > > > > >
> > > > > > Could even be that I just landed on this commit by chance, the problem is a
> > > > > > little hard to reproduce so I don't completely trust my testing.
> > > > >
> > > > > I've hit the same crash on 4.9-rc1 kernel, and it's reproduced for me
> > > > > reliably by running xfstests generic/013 case, on a loopback mounted
> > > > > NFSv4.1 (or NFSv4.2), XFS is the underlying exported fs. More details
> > > > > please see
> > > > >
> > > > > http://marc.info/?l=linux-nfs&m=147714320129362&w=2
> > > > >
> > > >
> > > > Looks like you landed at the same commit as Bruce, so that's probably
> > > > legit. That commit is very small though. The only real change that
> > > > doesn't affect the new field is this:
> > > >
> > > >
> > > > @@ -1766,7 +1766,7 @@ rpc_xdr_encode(struct rpc_task *task)
> > > > req->rq_buffer,
> > > > req->rq_callsize);
> > > > xdr_buf_init(&req->rq_rcv_buf,
> > > > - (char *)req->rq_buffer + req->rq_callsize,
> > > > + req->rq_rbuffer,
> > > > req->rq_rcvsize);
> > > >
> > > >
> > > > So I'm guessing this is breaking the callback channel somehow?
> > >
> > > Could be the TCP backchannel code is using rq_buffer in a different
> > > way than RDMA backchannel or the forward channel code.
> > >
> >
> > Well, it basically allocates a page per rpc_rqst and then maps that.
> >
> > One thing I notice is that this patch ensures that rq_rbuffer gets set
> > up in rpc_malloc and xprt_rdma_allocate, but it looks like
> > xprt_alloc_bc_req didn't get the same treatment.
> >
> > I suspect that that may be the problem...
> >
> In fact, maybe we just need this here? (untested and probably
> whitespace damaged):

No change in results for me.

--b.
>
> diff --git a/net/sunrpc/backchannel_rqst.c b/net/sunrpc/backchannel_rqst.c
> index ac701c28f44f..c561aa8ce05b 100644
> --- a/net/sunrpc/backchannel_rqst.c
> +++ b/net/sunrpc/backchannel_rqst.c
> @@ -100,6 +100,7 @@ struct rpc_rqst *xprt_alloc_bc_req(struct rpc_xprt *xprt, gfp_t gfp_flags)
> goto out_free;
> }
> req->rq_rcv_buf.len = PAGE_SIZE;
> + req->rq_rbuffer = req->rq_rcv_buf.head[0].iov_base;
>
> /* Preallocate one XDR send buffer */
> if (xprt_alloc_xdr_buf(&req->rq_snd_buf, gfp_flags) < 0) {

2016-10-24 19:17:39

by Jeff Layton

Subject: Re: upstream server crash

On Mon, 2016-10-24 at 14:08 -0400, J. Bruce Fields wrote:
> On Mon, Oct 24, 2016 at 11:24:40AM -0400, Jeff Layton wrote:
> >
> > On Mon, 2016-10-24 at 11:19 -0400, Jeff Layton wrote:
> > >
> > > On Mon, 2016-10-24 at 09:51 -0400, Chuck Lever wrote:
> > > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > > On Oct 24, 2016, at 9:31 AM, Jeff Layton <[email protected]> wrote:
> > > > >
> > > > > On Mon, 2016-10-24 at 11:15 +0800, Eryu Guan wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sun, Oct 23, 2016 at 02:21:15PM -0400, J. Bruce Fields wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > I'm getting an intermittent crash in the nfs server as of
> > > > > > > 68778945e46f143ed7974b427a8065f69a4ce944 "SUNRPC: Separate buffer
> > > > > > > pointers for RPC Call and Reply messages".
> > > > > > >
> > > > > > > I haven't tried to understand that commit or why it would be a problem yet, I
> > > > > > > don't see an obvious connection--I can take a closer look Monday.
> > > > > > >
> > > > > > > Could even be that I just landed on this commit by chance, the problem is a
> > > > > > > little hard to reproduce so I don't completely trust my testing.
> > > > > >
> > > > > > I've hit the same crash on 4.9-rc1 kernel, and it's reproduced for me
> > > > > > reliably by running xfstests generic/013 case, on a loopback mounted
> > > > > > NFSv4.1 (or NFSv4.2), XFS is the underlying exported fs. More details
> > > > > > please see
> > > > > >
> > > > > > http://marc.info/?l=linux-nfs&m=147714320129362&w=2
> > > > > >
> > > > >
> > > > > Looks like you landed at the same commit as Bruce, so that's probably
> > > > > legit. That commit is very small though. The only real change that
> > > > > doesn't affect the new field is this:
> > > > >
> > > > >
> > > > > @@ -1766,7 +1766,7 @@ rpc_xdr_encode(struct rpc_task *task)
> > > > > req->rq_buffer,
> > > > > req->rq_callsize);
> > > > > xdr_buf_init(&req->rq_rcv_buf,
> > > > > - (char *)req->rq_buffer + req->rq_callsize,
> > > > > + req->rq_rbuffer,
> > > > > req->rq_rcvsize);
> > > > >
> > > > >
> > > > > So I'm guessing this is breaking the callback channel somehow?
> > > >
> > > > Could be the TCP backchannel code is using rq_buffer in a different
> > > > way than RDMA backchannel or the forward channel code.
> > > >
> > >
> > > Well, it basically allocates a page per rpc_rqst and then maps that.
> > >
> > > One thing I notice is that this patch ensures that rq_rbuffer gets set
> > > up in rpc_malloc and xprt_rdma_allocate, but it looks like
> > > xprt_alloc_bc_req didn't get the same treatment.
> > >
> > > I suspect that that may be the problem...
> > >
> > In fact, maybe we just need this here? (untested and probably
> > whitespace damaged):
>
> No change in results for me.
>
> --b.
> >
> >
> > diff --git a/net/sunrpc/backchannel_rqst.c b/net/sunrpc/backchannel_rqst.c
> > index ac701c28f44f..c561aa8ce05b 100644
> > --- a/net/sunrpc/backchannel_rqst.c
> > +++ b/net/sunrpc/backchannel_rqst.c
> > @@ -100,6 +100,7 @@ struct rpc_rqst *xprt_alloc_bc_req(struct rpc_xprt *xprt, gfp_t gfp_flags)
> > goto out_free;
> > }
> > req->rq_rcv_buf.len = PAGE_SIZE;
> > + req->rq_rbuffer = req->rq_rcv_buf.head[0].iov_base;
> >
> > /* Preallocate one XDR send buffer */
> > if (xprt_alloc_xdr_buf(&req->rq_snd_buf, gfp_flags) < 0) {

Ahh ok, I think I see.

We probably also need to set rq_rbuffer in bc_malloc and
xprt_rdma_bc_allocate.

My guess is that we're ending up in rpc_xdr_encode with a NULL
rq_rbuffer pointer, so the right fix would seem to be to ensure that it
is properly set whenever rq_buffer is set.

So I think this may be what we want, actually. I'll plan to test it out
but may not get to it before tomorrow.
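
The propagation is mechanical once rq_rbuffer is NULL: xdr_buf_init()
(include/linux/sunrpc/xdr.h) is essentially

    static inline void
    xdr_buf_init(struct xdr_buf *buf, void *start, size_t len)
    {
            memset(buf, 0, sizeof(*buf));

            buf->head[0].iov_base = start;
            buf->head[0].iov_len = len;
            buf->buflen = len;
    }

so rpc_xdr_encode() installs a NULL head iovec in rq_rcv_buf, and the
first memcpy() of reply data into it faults on a NULL destination --
matching the RDI=0 __memcpy oops at the top of the thread.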

--
Jeff Layton <[email protected]>


Attachments:
0001-sunrpc-fix-some-missing-rq_rbuffer-assignments.patch (1.26 kB)

2016-10-24 20:40:25

by J. Bruce Fields

Subject: Re: upstream server crash

On Mon, Oct 24, 2016 at 03:17:34PM -0400, Jeff Layton wrote:
> On Mon, 2016-10-24 at 14:08 -0400, J. Bruce Fields wrote:
> > On Mon, Oct 24, 2016 at 11:24:40AM -0400, Jeff Layton wrote:
> > >
> > > On Mon, 2016-10-24 at 11:19 -0400, Jeff Layton wrote:
> > > >
> > > > On Mon, 2016-10-24 at 09:51 -0400, Chuck Lever wrote:
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Oct 24, 2016, at 9:31 AM, Jeff Layton <[email protected]> wrote:
> > > > > >
> > > > > > On Mon, 2016-10-24 at 11:15 +0800, Eryu Guan wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Oct 23, 2016 at 02:21:15PM -0400, J. Bruce Fields wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > I'm getting an intermittent crash in the nfs server as of
> > > > > > > > 68778945e46f143ed7974b427a8065f69a4ce944 "SUNRPC: Separate buffer
> > > > > > > > pointers for RPC Call and Reply messages".
> > > > > > > >
> > > > > > > > I haven't tried to understand that commit or why it would be a problem yet, I
> > > > > > > > don't see an obvious connection--I can take a closer look Monday.
> > > > > > > >
> > > > > > > > Could even be that I just landed on this commit by chance, the problem is a
> > > > > > > > little hard to reproduce so I don't completely trust my testing.
> > > > > > >
> > > > > > > I've hit the same crash on 4.9-rc1 kernel, and it's reproduced for me
> > > > > > > reliably by running xfstests generic/013 case, on a loopback mounted
> > > > > > > NFSv4.1 (or NFSv4.2), XFS is the underlying exported fs. More details
> > > > > > > please see
> > > > > > >
> > > > > > > http://marc.info/?l=linux-nfs&m=147714320129362&w=2
> > > > > > >
> > > > > >
> > > > > > Looks like you landed at the same commit as Bruce, so that's probably
> > > > > > legit. That commit is very small though. The only real change that
> > > > > > doesn't affect the new field is this:
> > > > > >
> > > > > >
> > > > > > @@ -1766,7 +1766,7 @@ rpc_xdr_encode(struct rpc_task *task)
> > > > > > req->rq_buffer,
> > > > > > req->rq_callsize);
> > > > > > xdr_buf_init(&req->rq_rcv_buf,
> > > > > > - (char *)req->rq_buffer + req->rq_callsize,
> > > > > > + req->rq_rbuffer,
> > > > > > req->rq_rcvsize);
> > > > > >
> > > > > >
> > > > > > So I'm guessing this is breaking the callback channel somehow?
> > > > >
> > > > > Could be the TCP backchannel code is using rq_buffer in a different
> > > > > way than RDMA backchannel or the forward channel code.
> > > > >
> > > >
> > > > Well, it basically allocates a page per rpc_rqst and then maps that.
> > > >
> > > > One thing I notice is that this patch ensures that rq_rbuffer gets set
> > > > up in rpc_malloc and xprt_rdma_allocate, but it looks like
> > > > xprt_alloc_bc_req didn't get the same treatment.
> > > >
> > > > I suspect that that may be the problem...
> > > >
> > > In fact, maybe we just need this here? (untested and probably
> > > whitespace damaged):
> >
> > No change in results for me.
> >
> > --b.
> > >
> > >
> > > diff --git a/net/sunrpc/backchannel_rqst.c b/net/sunrpc/backchannel_rqst.c
> > > index ac701c28f44f..c561aa8ce05b 100644
> > > --- a/net/sunrpc/backchannel_rqst.c
> > > +++ b/net/sunrpc/backchannel_rqst.c
> > > @@ -100,6 +100,7 @@ struct rpc_rqst *xprt_alloc_bc_req(struct rpc_xprt *xprt, gfp_t gfp_flags)
> > > goto out_free;
> > > }
> > > req->rq_rcv_buf.len = PAGE_SIZE;
> > > + req->rq_rbuffer = req->rq_rcv_buf.head[0].iov_base;
> > >
> > > /* Preallocate one XDR send buffer */
> > > if (xprt_alloc_xdr_buf(&req->rq_snd_buf, gfp_flags) < 0) {
>
> Ahh ok, I think I see.
>
> We probably also need to set rq_rbuffer in bc_malloc and
> xprt_rdma_bc_allocate.
>
> My guess is that we're ending up in rpc_xdr_encode with a NULL
> rq_rbuffer pointer, so the right fix would seem to be to ensure that it
> is properly set whenever rq_buffer is set.
>
> So I think this may be what we want, actually. I'll plan to test it out
> but may not get to it before tomorrow.

It passes here.

--b.

>
> --
> Jeff Layton <[email protected]>

> From ef2a391bc4d8f6b729aacee7cde8d9baf86767c3 Mon Sep 17 00:00:00 2001
> From: Jeff Layton <[email protected]>
> Date: Mon, 24 Oct 2016 15:13:40 -0400
> Subject: [PATCH] sunrpc: fix some missing rq_rbuffer assignments
>
> I think we basically need to set rq_rbuffer whenever rq_buffer is set.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> net/sunrpc/xprtrdma/svc_rdma_backchannel.c | 1 +
> net/sunrpc/xprtsock.c | 1 +
> 2 files changed, 2 insertions(+)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> index 2d8545c34095..fc4535ead7c2 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> @@ -182,6 +182,7 @@ xprt_rdma_bc_allocate(struct rpc_task *task)
>  		return -ENOMEM;
>  
>  	rqst->rq_buffer = page_address(page);
> +	rqst->rq_rbuffer = (char *)rqst->rq_buffer + rqst->rq_callsize;
>  	return 0;
>  }
>
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index 0137af1c0916..e01c825bc683 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -2563,6 +2563,7 @@ static int bc_malloc(struct rpc_task *task)
>  	buf->len = PAGE_SIZE;
>  
>  	rqst->rq_buffer = buf->data;
> +	rqst->rq_rbuffer = (char *)rqst->rq_buffer + rqst->rq_callsize;
>  	return 0;
>  }
>
> --
> 2.7.4
>


2016-10-24 21:39:01

by Chuck Lever

Subject: Re: upstream server crash


> On Oct 24, 2016, at 4:40 PM, J. Bruce Fields <[email protected]> wrote:
>
> On Mon, Oct 24, 2016 at 03:17:34PM -0400, Jeff Layton wrote:
>> On Mon, 2016-10-24 at 14:08 -0400, J. Bruce Fields wrote:
>>> On Mon, Oct 24, 2016 at 11:24:40AM -0400, Jeff Layton wrote:
>>>>
>>>> On Mon, 2016-10-24 at 11:19 -0400, Jeff Layton wrote:
>>>>>
>>>>> On Mon, 2016-10-24 at 09:51 -0400, Chuck Lever wrote:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Oct 24, 2016, at 9:31 AM, Jeff Layton <[email protected]> wrote:
>>>>>>>
>>>>>>> On Mon, 2016-10-24 at 11:15 +0800, Eryu Guan wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Oct 23, 2016 at 02:21:15PM -0400, J. Bruce Fields wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm getting an intermittent crash in the nfs server as of
>>>>>>>>> 68778945e46f143ed7974b427a8065f69a4ce944 "SUNRPC: Separate buffer
>>>>>>>>> pointers for RPC Call and Reply messages".
>>>>>>>>>
>>>>>>>>> I haven't tried to understand that commit or why it would be a problem yet, I
>>>>>>>>> don't see an obvious connection--I can take a closer look Monday.
>>>>>>>>>
>>>>>>>>> Could even be that I just landed on this commit by chance, the problem is a
>>>>>>>>> little hard to reproduce so I don't completely trust my testing.
>>>>>>>>
>>>>>>>> I've hit the same crash on 4.9-rc1 kernel, and it's reproduced for me
>>>>>>>> reliably by running xfstests generic/013 case, on a loopback mounted
>>>>>>>> NFSv4.1 (or NFSv4.2), XFS is the underlying exported fs. More details
>>>>>>>> please see
>>>>>>>>
>>>>>>>> http://marc.info/?l=linux-nfs&m=147714320129362&w=2
>>>>>>>>
>>>>>>>
>>>>>>> Looks like you landed at the same commit as Bruce, so that's probably
>>>>>>> legit. That commit is very small though. The only real change that
>>>>>>> doesn't affect the new field is this:
>>>>>>>
>>>>>>>
>>>>>>> @@ -1766,7 +1766,7 @@ rpc_xdr_encode(struct rpc_task *task)
>>>>>>> req->rq_buffer,
>>>>>>> req->rq_callsize);
>>>>>>> xdr_buf_init(&req->rq_rcv_buf,
>>>>>>> - (char *)req->rq_buffer + req->rq_callsize,
>>>>>>> + req->rq_rbuffer,
>>>>>>> req->rq_rcvsize);
>>>>>>>
>>>>>>>
>>>>>>> So I'm guessing this is breaking the callback channel somehow?
>>>>>>
>>>>>> Could be the TCP backchannel code is using rq_buffer in a different
>>>>>> way than RDMA backchannel or the forward channel code.
>>>>>>
>>>>>
>>>>> Well, it basically allocates a page per rpc_rqst and then maps that.
>>>>>
>>>>> One thing I notice is that this patch ensures that rq_rbuffer gets set
>>>>> up in rpc_malloc and xprt_rdma_allocate, but it looks like
>>>>> xprt_alloc_bc_req didn't get the same treatment.
>>>>>
>>>>> I suspect that that may be the problem...
>>>>>
>>>> In fact, maybe we just need this here? (untested and probably
>>>> whitespace damaged):
>>>
>>> No change in results for me.
>>>
>>> --b.
>>>>
>>>>
>>>> diff --git a/net/sunrpc/backchannel_rqst.c b/net/sunrpc/backchannel_rqst.c
>>>> index ac701c28f44f..c561aa8ce05b 100644
>>>> --- a/net/sunrpc/backchannel_rqst.c
>>>> +++ b/net/sunrpc/backchannel_rqst.c
>>>> @@ -100,6 +100,7 @@ struct rpc_rqst *xprt_alloc_bc_req(struct rpc_xprt *xprt, gfp_t gfp_flags)
>>>> goto out_free;
>>>> }
>>>> req->rq_rcv_buf.len = PAGE_SIZE;
>>>> + req->rq_rbuffer = req->rq_rcv_buf.head[0].iov_base;
>>>>
>>>> /* Preallocate one XDR send buffer */
>>>> if (xprt_alloc_xdr_buf(&req->rq_snd_buf, gfp_flags) < 0) {
>>
>> Ahh ok, I think I see.
>>
>> We probably also need to set rq_rbuffer in bc_malloc and
>> xprt_rdma_bc_allocate.
>>
>> My guess is that we're ending up in rpc_xdr_encode with a NULL
>> rq_rbuffer pointer, so the right fix would seem to be to ensure that it
>> is properly set whenever rq_buffer is set.
>>
>> So I think this may be what we want, actually. I'll plan to test it out
>> but may not get to it before tomorrow.
>
> It passes here.

Without Jeff's patch, my server locks up during generic/013 with NFS/RDMA
and NFSv4.1. With it, I get all the way to generic/089, and then encounter
this:

Oct 24 17:31:11 klimt kernel: general protection fault: 0000 [#1] SMP
Oct 24 17:31:11 klimt kernel: Modules linked in: cts rpcsec_gss_krb5 sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm btrfs irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support aesni_intel lrw gf128mul glue_helper ablk_helper cryptd xor raid6_pq rpcrdma pcspkr i2c_i801 lpc_ich ib_ipoib i2c_smbus mfd_core mei_me sg mei rdma_ucm shpchp ioatdma wmi ib_ucm ipmi_si ipmi_msghandler ib_uverbs ib_umad rdma_cm ib_cm iw_cm acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c mlx4_ib ib_core mlx4_en sd_mod sr_mod cdrom ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm mlx4_core ahci libahci igb crc32c_intel ptp libata pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
Oct 24 17:31:11 klimt kernel: CPU: 3 PID: 1681 Comm: nfsd Not tainted 4.9.0-rc2-00006-g0e8f985 #2
Oct 24 17:31:11 klimt kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015
Oct 24 17:31:11 klimt kernel: task: ffff88083feb15c0 task.stack: ffffc900058a0000
Oct 24 17:31:11 klimt kernel: RIP: 0010:[<ffffffff811ed1f0>] [<ffffffff811ed1f0>] __kmalloc_track_caller+0x1b0/0x220
Oct 24 17:31:11 klimt kernel: RSP: 0018:ffffc900058a3b88 EFLAGS: 00010286
Oct 24 17:31:11 klimt kernel: RAX: 0000000000000000 RBX: 00000000024000c0 RCX: 0000000000293b6b
Oct 24 17:31:11 klimt kernel: RDX: 0000000000293b6a RSI: 0000000000000000 RDI: 0000000000000002
Oct 24 17:31:11 klimt kernel: RBP: ffffc900058a3bc8 R08: 000000000001c560 R09: ffff88085f003a40
Oct 24 17:31:11 klimt kernel: R10: ffff88085f003a40 R11: ffff88083fc9b3c0 R12: 00000000024000c0
Oct 24 17:31:11 klimt kernel: R13: 0000000000000018 R14: ffff88081993d600 R15: c0d435750e5f8620
Oct 24 17:31:11 klimt kernel: FS: 0000000000000000(0000) GS:ffff88087fcc0000(0000) knlGS:0000000000000000
Oct 24 17:31:11 klimt kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 24 17:31:11 klimt kernel: CR2: 00007fca9725a298 CR3: 0000000001c06000 CR4: 00000000001406e0
Oct 24 17:31:11 klimt kernel: Stack:
Oct 24 17:31:11 klimt kernel: ffff88085f003a40 ffff88085f003a40 ffffffff812c227b 0000000000000018
Oct 24 17:31:11 klimt kernel: ffff880852ee6b00 ffff88085583a000 ffff88081993d600 ffff880839a95840
Oct 24 17:31:11 klimt kernel: ffffc900058a3be8 ffffffff811a65d0 ffff88081993d600 00000000024000c0
Oct 24 17:31:11 klimt kernel: Call Trace:
Oct 24 17:31:11 klimt kernel: [<ffffffff812c227b>] ? selinux_cred_prepare+0x1b/0x30
Oct 24 17:31:11 klimt kernel: [<ffffffff811a65d0>] kmemdup+0x20/0x50
Oct 24 17:31:11 klimt kernel: [<ffffffff812c227b>] selinux_cred_prepare+0x1b/0x30
Oct 24 17:31:11 klimt kernel: [<ffffffff812bd3f9>] security_prepare_creds+0x39/0x60
Oct 24 17:31:11 klimt kernel: [<ffffffff810a486f>] prepare_creds+0x12f/0x150
Oct 24 17:31:11 klimt kernel: [<ffffffffa043ee6c>] nfsd_setuser+0x8c/0x250 [nfsd]
Oct 24 17:31:11 klimt kernel: [<ffffffffa0438077>] nfsd_setuser_and_check_port+0x77/0xa0 [nfsd]
Oct 24 17:31:11 klimt kernel: [<ffffffff812bd3f9>] ? security_prepare_creds+0x39/0x60
Oct 24 17:31:11 klimt kernel: [<ffffffffa04a33f6>] ? write_bytes_to_xdr_buf+0xa6/0xd0 [sunrpc]
Oct 24 17:31:11 klimt kernel: [<ffffffffa0438dcc>] fh_verify+0x5cc/0x610 [nfsd]
Oct 24 17:31:11 klimt kernel: [<ffffffffa044659b>] nfsd4_getattr+0x1b/0x80 [nfsd]
Oct 24 17:31:11 klimt kernel: [<ffffffffa04481fd>] nfsd4_proc_compound+0x40d/0x690 [nfsd]
Oct 24 17:31:11 klimt kernel: [<ffffffffa04352b4>] nfsd_dispatch+0xd4/0x1d0 [nfsd]
Oct 24 17:31:11 klimt kernel: [<ffffffffa049b249>] svc_process_common+0x3d9/0x700 [sunrpc]
Oct 24 17:31:11 klimt kernel: [<ffffffffa049c0c1>] svc_process+0xf1/0x1d0 [sunrpc]
Oct 24 17:31:11 klimt kernel: [<ffffffffa0434d4f>] nfsd+0xff/0x160 [nfsd]
Oct 24 17:31:11 klimt kernel: [<ffffffffa0434c50>] ? nfsd_destroy+0x60/0x60 [nfsd]
Oct 24 17:31:11 klimt kernel: [<ffffffff810a2235>] kthread+0xe5/0xf0
Oct 24 17:31:11 klimt kernel: [<ffffffff810a2150>] ? kthread_stop+0x120/0x120
Oct 24 17:31:11 klimt kernel: [<ffffffff816aa795>] ret_from_fork+0x25/0x30
Oct 24 17:31:11 klimt kernel: Code: 85 d2 75 dd eb d3 48 89 c3 66 90 eb 6e 81 e3 00 00 10 00 0f 84 e4 fe ff ff 66 90 e9 e7 fe ff ff 49 63 42 20 48 8d 4a 01 4d 8b 02 <49> 8b 1c 07 4c 89 f8 65 49 0f c7 08 0f 94 c0 84 c0 0f 85 3b ff
Oct 24 17:31:11 klimt kernel: RIP [<ffffffff811ed1f0>] __kmalloc_track_caller+0x1b0/0x220
Oct 24 17:31:11 klimt kernel: RSP <ffffc900058a3b88>
Oct 24 17:31:11 klimt kernel: ---[ end trace af2b44f9db7f5a36 ]---

This appears to be unrelated. Maybe.

Bruce, do you have CONFIG_VMAP_STACK set? I took the default, just
wondering if that was an especially safe thing to try.


> --b.
>
>>
>> --
>> Jeff Layton <[email protected]>
>
>> From ef2a391bc4d8f6b729aacee7cde8d9baf86767c3 Mon Sep 17 00:00:00 2001
>> From: Jeff Layton <[email protected]>
>> Date: Mon, 24 Oct 2016 15:13:40 -0400
>> Subject: [PATCH] sunrpc: fix some missing rq_rbuffer assignments
>>
>> I think we basically need to set rq_rbuffer whenever rq_buffer is set.
>>
>> Signed-off-by: Jeff Layton <[email protected]>
>> ---
>> net/sunrpc/xprtrdma/svc_rdma_backchannel.c | 1 +
>> net/sunrpc/xprtsock.c | 1 +
>> 2 files changed, 2 insertions(+)
>>
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
>> index 2d8545c34095..fc4535ead7c2 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
>> @@ -182,6 +182,7 @@ xprt_rdma_bc_allocate(struct rpc_task *task)
>> return -ENOMEM;
>>
>> rqst->rq_buffer = page_address(page);
>> + rqst->rq_rbuffer = (char *)rqst->rq_buffer + rqst->rq_callsize;
>> return 0;
>> }
>>
>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>> index 0137af1c0916..e01c825bc683 100644
>> --- a/net/sunrpc/xprtsock.c
>> +++ b/net/sunrpc/xprtsock.c
>> @@ -2563,6 +2563,7 @@ static int bc_malloc(struct rpc_task *task)
>> buf->len = PAGE_SIZE;
>>
>> rqst->rq_buffer = buf->data;
>> + rqst->rq_rbuffer = (char *)rqst->rq_buffer + rqst->rq_callsize;
>> return 0;
>> }
>>
>> --
>> 2.7.4

--
Chuck Lever




2016-10-25 00:57:04

by Jeff Layton

Subject: Re: upstream server crash

On Mon, 2016-10-24 at 17:38 -0400, Chuck Lever wrote:
> >
> > On Oct 24, 2016, at 4:40 PM, J. Bruce Fields <[email protected]> wrote:
> >
> > On Mon, Oct 24, 2016 at 03:17:34PM -0400, Jeff Layton wrote:
> > >
> > > On Mon, 2016-10-24 at 14:08 -0400, J. Bruce Fields wrote:
> > > >
> > > > On Mon, Oct 24, 2016 at 11:24:40AM -0400, Jeff Layton wrote:
> > > > >
> > > > >
> > > > > On Mon, 2016-10-24 at 11:19 -0400, Jeff Layton wrote:
> > > > > >
> > > > > >
> > > > > > On Mon, 2016-10-24 at 09:51 -0400, Chuck Lever wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Oct 24, 2016, at 9:31 AM, Jeff Layton <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Mon, 2016-10-24 at 11:15 +0800, Eryu Guan wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Sun, Oct 23, 2016 at 02:21:15PM -0400, J. Bruce Fields wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I'm getting an intermittent crash in the nfs server as of
> > > > > > > > > > 68778945e46f143ed7974b427a8065f69a4ce944 "SUNRPC: Separate buffer
> > > > > > > > > > pointers for RPC Call and Reply messages".
> > > > > > > > > >
> > > > > > > > > > I haven't tried to understand that commit or why it would be a problem yet, I
> > > > > > > > > > don't see an obvious connection--I can take a closer look Monday.
> > > > > > > > > >
> > > > > > > > > > Could even be that I just landed on this commit by chance, the problem is a
> > > > > > > > > > little hard to reproduce so I don't completely trust my testing.
> > > > > > > > >
> > > > > > > > > I've hit the same crash on 4.9-rc1 kernel, and it's reproduced for me
> > > > > > > > > reliably by running xfstests generic/013 case, on a loopback mounted
> > > > > > > > > NFSv4.1 (or NFSv4.2), XFS is the underlying exported fs. More details
> > > > > > > > > please see
> > > > > > > > >
> > > > > > > > > http://marc.info/?l=linux-nfs&m=147714320129362&w=2
> > > > > > > > >
> > > > > > > >
> > > > > > > > Looks like you landed at the same commit as Bruce, so that's probably
> > > > > > > > legit. That commit is very small though. The only real change that
> > > > > > > > doesn't affect the new field is this:
> > > > > > > >
> > > > > > > >
> > > > > > > > @@ -1766,7 +1766,7 @@ rpc_xdr_encode(struct rpc_task *task)
> > > > > > > > req->rq_buffer,
> > > > > > > > req->rq_callsize);
> > > > > > > > xdr_buf_init(&req->rq_rcv_buf,
> > > > > > > > - (char *)req->rq_buffer + req->rq_callsize,
> > > > > > > > + req->rq_rbuffer,
> > > > > > > > req->rq_rcvsize);
> > > > > > > >
> > > > > > > >
> > > > > > > > So I'm guessing this is breaking the callback channel somehow?
> > > > > > >
> > > > > > > Could be the TCP backchannel code is using rq_buffer in a different
> > > > > > > way than RDMA backchannel or the forward channel code.
> > > > > > >
> > > > > >
> > > > > > Well, it basically allocates a page per rpc_rqst and then maps that.
> > > > > >
> > > > > > One thing I notice is that this patch ensures that rq_rbuffer gets set
> > > > > > up in rpc_malloc and xprt_rdma_allocate, but it looks like
> > > > > > xprt_alloc_bc_req didn't get the same treatment.
> > > > > >
> > > > > > I suspect that that may be the problem...
> > > > > >
> > > > > In fact, maybe we just need this here? (untested and probably
> > > > > whitespace damaged):
> > > >
> > > > No change in results for me.
> > > >
> > > > --b.
> > > > >
> > > > >
> > > > >
> > > > > diff --git a/net/sunrpc/backchannel_rqst.c b/net/sunrpc/backchannel_rqst.c
> > > > > index ac701c28f44f..c561aa8ce05b 100644
> > > > > --- a/net/sunrpc/backchannel_rqst.c
> > > > > +++ b/net/sunrpc/backchannel_rqst.c
> > > > > @@ -100,6 +100,7 @@ struct rpc_rqst *xprt_alloc_bc_req(struct rpc_xprt *xprt, gfp_t gfp_flags)
> > > > > goto out_free;
> > > > > }
> > > > > req->rq_rcv_buf.len = PAGE_SIZE;
> > > > > + req->rq_rbuffer = req->rq_rcv_buf.head[0].iov_base;
> > > > >
> > > > > /* Preallocate one XDR send buffer */
> > > > > if (xprt_alloc_xdr_buf(&req->rq_snd_buf, gfp_flags) < 0) {
> > >
> > > Ahh ok, I think I see.
> > >
> > > We probably also need to set rq_rbuffer in bc_malloc and
> > > xprt_rdma_bc_allocate.
> > >
> > > My guess is that we're ending up in rpc_xdr_encode with a NULL
> > > rq_rbuffer pointer, so the right fix would seem to be to ensure that it
> > > is properly set whenever rq_buffer is set.
> > >
> > > So I think this may be what we want, actually. I'll plan to test it out
> > > but may not get to it before tomorrow.
> >
> > It passes here.
>
> Without Jeff's patch, my server locks up during generic/013 with NFS/RDMA
> and NFSv4.1. With it, I get all the way to generic/089, and then encounter
> this:
>

Thanks for testing. I just sent an "official" patch submission with the
same patch.


> Oct 24 17:31:11 klimt kernel: general protection fault: 0000 [#1] SMP
> Oct 24 17:31:11 klimt kernel: Modules linked in: cts rpcsec_gss_krb5 sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm btrfs irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support aesni_intel lrw gf128mul glue_helper ablk_helper cryptd xor raid6_pq rpcrdma pcspkr i2c_i801 lpc_ich ib_ipoib i2c_smbus mfd_core mei_me sg mei rdma_ucm shpchp ioatdma wmi ib_ucm ipmi_si ipmi_msghandler ib_uverbs ib_umad rdma_cm ib_cm iw_cm acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c mlx4_ib ib_core mlx4_en sd_mod sr_mod cdrom ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm mlx4_core ahci libahci igb crc32c_intel ptp libata pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
> Oct 24 17:31:11 klimt kernel: CPU: 3 PID: 1681 Comm: nfsd Not tainted 4.9.0-rc2-00006-g0e8f985 #2
> Oct 24 17:31:11 klimt kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015
> Oct 24 17:31:11 klimt kernel: task: ffff88083feb15c0 task.stack: ffffc900058a0000
> Oct 24 17:31:11 klimt kernel: RIP: 0010:[<ffffffff811ed1f0>] [<ffffffff811ed1f0>] __kmalloc_track_caller+0x1b0/0x220
> Oct 24 17:31:11 klimt kernel: RSP: 0018:ffffc900058a3b88 EFLAGS: 00010286
> Oct 24 17:31:11 klimt kernel: RAX: 0000000000000000 RBX: 00000000024000c0 RCX: 0000000000293b6b
> Oct 24 17:31:11 klimt kernel: RDX: 0000000000293b6a RSI: 0000000000000000 RDI: 0000000000000002
> Oct 24 17:31:11 klimt kernel: RBP: ffffc900058a3bc8 R08: 000000000001c560 R09: ffff88085f003a40
> Oct 24 17:31:11 klimt kernel: R10: ffff88085f003a40 R11: ffff88083fc9b3c0 R12: 00000000024000c0
> Oct 24 17:31:11 klimt kernel: R13: 0000000000000018 R14: ffff88081993d600 R15: c0d435750e5f8620
> Oct 24 17:31:11 klimt kernel: FS: 0000000000000000(0000) GS:ffff88087fcc0000(0000) knlGS:0000000000000000
> Oct 24 17:31:11 klimt kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Oct 24 17:31:11 klimt kernel: CR2: 00007fca9725a298 CR3: 0000000001c06000 CR4: 00000000001406e0
> Oct 24 17:31:11 klimt kernel: Stack:
> Oct 24 17:31:11 klimt kernel: ffff88085f003a40 ffff88085f003a40 ffffffff812c227b 0000000000000018
> Oct 24 17:31:11 klimt kernel: ffff880852ee6b00 ffff88085583a000 ffff88081993d600 ffff880839a95840
> Oct 24 17:31:11 klimt kernel: ffffc900058a3be8 ffffffff811a65d0 ffff88081993d600 00000000024000c0
> Oct 24 17:31:11 klimt kernel: Call Trace:
> Oct 24 17:31:11 klimt kernel: [<ffffffff812c227b>] ? selinux_cred_prepare+0x1b/0x30
> Oct 24 17:31:11 klimt kernel: [<ffffffff811a65d0>] kmemdup+0x20/0x50
> Oct 24 17:31:11 klimt kernel: [<ffffffff812c227b>] selinux_cred_prepare+0x1b/0x30
> Oct 24 17:31:11 klimt kernel: [<ffffffff812bd3f9>] security_prepare_creds+0x39/0x60
> Oct 24 17:31:11 klimt kernel: [<ffffffff810a486f>] prepare_creds+0x12f/0x150
> Oct 24 17:31:11 klimt kernel: [<ffffffffa043ee6c>] nfsd_setuser+0x8c/0x250 [nfsd]
> Oct 24 17:31:11 klimt kernel: [<ffffffffa0438077>] nfsd_setuser_and_check_port+0x77/0xa0 [nfsd]
> Oct 24 17:31:11 klimt kernel: [<ffffffff812bd3f9>] ? security_prepare_creds+0x39/0x60
> Oct 24 17:31:11 klimt kernel: [<ffffffffa04a33f6>] ? write_bytes_to_xdr_buf+0xa6/0xd0 [sunrpc]
> Oct 24 17:31:11 klimt kernel: [<ffffffffa0438dcc>] fh_verify+0x5cc/0x610 [nfsd]
> Oct 24 17:31:11 klimt kernel: [<ffffffffa044659b>] nfsd4_getattr+0x1b/0x80 [nfsd]
> Oct 24 17:31:11 klimt kernel: [<ffffffffa04481fd>] nfsd4_proc_compound+0x40d/0x690 [nfsd]
> Oct 24 17:31:11 klimt kernel: [<ffffffffa04352b4>] nfsd_dispatch+0xd4/0x1d0 [nfsd]
> Oct 24 17:31:11 klimt kernel: [<ffffffffa049b249>] svc_process_common+0x3d9/0x700 [sunrpc]
> Oct 24 17:31:11 klimt kernel: [<ffffffffa049c0c1>] svc_process+0xf1/0x1d0 [sunrpc]
> Oct 24 17:31:11 klimt kernel: [<ffffffffa0434d4f>] nfsd+0xff/0x160 [nfsd]
> Oct 24 17:31:11 klimt kernel: [<ffffffffa0434c50>] ? nfsd_destroy+0x60/0x60 [nfsd]
> Oct 24 17:31:11 klimt kernel: [<ffffffff810a2235>] kthread+0xe5/0xf0
> Oct 24 17:31:11 klimt kernel: [<ffffffff810a2150>] ? kthread_stop+0x120/0x120
> Oct 24 17:31:11 klimt kernel: [<ffffffff816aa795>] ret_from_fork+0x25/0x30
> Oct 24 17:31:11 klimt kernel: Code: 85 d2 75 dd eb d3 48 89 c3 66 90 eb 6e 81 e3 00 00 10 00 0f 84 e4 fe ff ff 66 90 e9 e7 fe ff ff 49 63 42 20 48 8d 4a 01 4d 8b 02 <49> 8b 1c 07 4c 89 f8 65 49 0f c7 08 0f 94 c0 84 c0 0f 85 3b ff
> Oct 24 17:31:11 klimt kernel: RIP [<ffffffff811ed1f0>] __kmalloc_track_caller+0x1b0/0x220
> Oct 24 17:31:11 klimt kernel: RSP <ffffc900058a3b88>
> Oct 24 17:31:11 klimt kernel: ---[ end trace af2b44f9db7f5a36 ]---
>
> This appears to be unrelated. Maybe.
>
> Bruce, do you have CONFIG_VMAP_STACK set? I took the default, just
> wondering if that was an especially safe thing to try.
>
>

Yeah, looks unrelated to me too. It died down in the kmalloc code, so
I'd have to wonder if there is some slab corruption or something?

Chuck, can you do something like this on the kernel where you saw this
panic?

    $ gdb vmlinux
    gdb> list *(__kmalloc_track_caller+0x1b0)

It'd be good to see where this crashed.
--
Jeff Layton <[email protected]>

2016-10-25 01:00:49

by Chuck Lever

Subject: Re: upstream server crash


> On Oct 24, 2016, at 8:57 PM, Jeff Layton <[email protected]> wrote:
>
> [...]
>
>
> Yeah, looks unrelated to me too. It died down in the kmalloc code, so
> I'd have to wonder if there is some slab corruption or something?
>
> Chuck, can you do something like this on the kernel where you saw this
> panic?
>
> $ gdb vmlinux
> gdb> list *(__kmalloc_track_caller+0x1b0)
>
> It'd be good to see where this crashed.

[cel@manet 4.9.0-rc2-00012-gf3e49aa]$ gdb vmlinux
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/lib/modules/4.9.0-rc2-00012-gf3e49aa/vmlinux...done.
(gdb) list *(__kmalloc_track_caller+0x1b0)
0xffffffff811ecee0 is in __kmalloc_track_caller (/home/cel/src/linux/linux-2.6/mm/slub.c:241).
236 * Core slab cache functions
237 *******************************************************************/
238
239 static inline void *get_freepointer(struct kmem_cache *s, void *object)
240 {
241 return *(void **)(object + s->offset);
242 }
243
244 static void prefetch_freepointer(const struct kmem_cache *s, void *object)
245 {
(gdb)

--
Chuck Lever




2016-10-25 01:46:08

by Jeff Layton

[permalink] [raw]
Subject: Re: upstream server crash

On Mon, 2016-10-24 at 21:00 -0400, Chuck Lever wrote:
> > [...]
> >
> > Yeah, looks unrelated to me too. It died down in the kmalloc code, so
> > I'd have to wonder if there is some slab corruption or something?
> >
> > Chuck, can you do something like this on the kernel where you saw this
> > panic?
> >
> > $ gdb vmlinux
> > gdb> list *(__kmalloc_track_caller+0x1b0)
> >
> > It'd be good to see where this crashed.
>
> [cel@manet 4.9.0-rc2-00012-gf3e49aa]$ gdb vmlinux
> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
> Copyright (C) 2013 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /usr/lib/modules/4.9.0-rc2-00012-gf3e49aa/vmlinux...done.
> (gdb) list *(__kmalloc_track_caller+0x1b0)
> 0xffffffff811ecee0 is in __kmalloc_track_caller (/home/cel/src/linux/linux-2.6/mm/slub.c:241).
> 236 * Core slab cache functions
> 237 *******************************************************************/
> 238
> 239 static inline void *get_freepointer(struct kmem_cache *s, void *object)
> 240 {
> 241 return *(void **)(object + s->offset);
> 242 }
> 243
> 244 static void prefetch_freepointer(const struct kmem_cache *s, void *object)
> 245 {
> (gdb)
>
> --
> Chuck Lever
>
>
>

Thanks -- I'm guessing that means that "s" is bogus there, so
yeah...slab corruption.

I hit some panics in the same function about a year or so ago when I
was developing some fsnotify patches, and it turned out to be a double-
free. Wonder what's sharing that slab?
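
For what it's worth, there is a way to answer that: with SLUB, merged
caches are visible as aliases under /sys/kernel/slab, and the slabinfo
tool in the kernel tree can list them. A sketch, assuming SLUB and
sysfs are available:

    $ cd tools/vm && make slabinfo
    $ ./slabinfo -a

Booting with slub_nomerge would also rule out cross-cache sharing
entirely.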

--
Jeff Layton <[email protected]>

2016-10-25 02:02:17

by Chuck Lever

[permalink] [raw]
Subject: Re: upstream server crash


> On Oct 24, 2016, at 9:46 PM, Jeff Layton <[email protected]> wrote:
>
> [...]
>
> Thanks -- I'm guessing that means that "s" is bogus there, so
> yeah...slab corruption.
>
> I hit some panics in the same function about a year or so ago when I
> was developing some fsnotify patches, and it turned out to be double-
> free. Wonder what's sharing that slab?

Could be anything: that's called from kmemdup(), so it's going to be
one of the generic slabs.
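
One way to confirm that theory, assuming SLUB, would be to boot with
full slab debugging so a stray write to a freed object is caught at
the offending free or the next allocation, with the slab named:

    slub_debug=FZPU

(or build with CONFIG_SLUB_DEBUG_ON=y, which should be roughly
equivalent). Poisoning in particular turns a silent use-after-free
into an explicit "Poison overwritten" report.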

I've disabled CONFIG_VMAP_STACK and SELinux, and so far everything is
stable with NFSv4.1 on RDMA and v4.9-rc2.


--
Chuck Lever




2016-10-28 01:20:55

by Chuck Lever

[permalink] [raw]
Subject: Re: upstream server crash


> On Oct 24, 2016, at 10:02 PM, Chuck Lever <[email protected]> wrote:
>
>>
>> On Oct 24, 2016, at 9:46 PM, Jeff Layton <[email protected]> wrote:
>> [...]
>>
>> Thanks -- I'm guessing that means that "s" is bogus there, so
>> yeah...slab corruption.
>>
>> I hit some panics in the same function about a year or so ago when I
>> was developing some fsnotify patches, and it turned out to be double-
>> free. Wonder what's sharing that slab?
>
> Could be anything: that's called from kmemdup(), so it's going to be
> one of the generic slabs.
>
> I've disabled CONFIG_VMAP_STACK and SELinux, and so far everything is
> stable with NFSv4.1 on RDMA and v4.9-rc2.

Just hit this on the server while running xfstests generic/089 on
NFSv4.0 / RDMA. Still v4.9-rc2 with a few NFS/RDMA patches, but
no kernel debugging enabled yet.

Oct 27 21:08:42 klimt kernel: general protection fault: 0000 [#1] SMP
Oct 27 21:08:42 klimt kernel: Modules linked in: cts rpcsec_gss_krb5 sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp btrfs kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel xor lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support raid6_pq pcspkr lpc_ich i2c_i801 mfd_core i2c_smbus mei_me mei rpcrdma sg ipmi_si shpchp ioatdma wmi ipmi_msghandler ib_ipoib acpi_pad acpi_power_meter rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c mlx4_ib ib_core mlx4_en sr_mod cdrom sd_mod ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci drm igb libahci libata mlx4_core ptp crc32c_intel pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
Oct 27 21:08:42 klimt kernel: CPU: 3 PID: 1649 Comm: nfsd Not tainted 4.9.0-rc2-00004-ga75a35c #3
Oct 27 21:08:42 klimt kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015
Oct 27 21:08:42 klimt kernel: task: ffff880841474140 task.stack: ffff880841798000
Oct 27 21:08:42 klimt kernel: RIP: 0010:[<ffffffff811e9a99>] [<ffffffff811e9a99>] kmem_cache_alloc+0x149/0x1b0
Oct 27 21:08:42 klimt kernel: RSP: 0018:ffff88084179bc98 EFLAGS: 00010282
Oct 27 21:08:42 klimt kernel: RAX: 0000000000000000 RBX: 00000000024000c0 RCX: 00000000095755fa
Oct 27 21:08:42 klimt kernel: RDX: 00000000095755f9 RSI: 00000000024000c0 RDI: ffff88085f007400
Oct 27 21:08:42 klimt kernel: RBP: ffff88084179bcc8 R08: 000000000001ce30 R09: ffff8808416a1070
Oct 27 21:08:42 klimt kernel: R10: 0000000000000003 R11: ffff8808416a0220 R12: 00000000024000c0
Oct 27 21:08:42 klimt kernel: R13: e748f37c723b66c0 R14: ffff88085f007400 R15: ffff88085f007400
Oct 27 21:08:42 klimt kernel: FS: 0000000000000000(0000) GS:ffff88087fcc0000(0000) knlGS:0000000000000000
Oct 27 21:08:42 klimt kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 27 21:08:42 klimt kernel: CR2: 00007f6822890000 CR3: 0000000001c06000 CR4: 00000000001406e0
Oct 27 21:08:42 klimt kernel: Stack:
Oct 27 21:08:42 klimt kernel: ffffffff810a4456 0000000011270000 ffff880841474140 ffff880841484000
Oct 27 21:08:42 klimt kernel: 0000000000000000 ffff88084cbc4a00 ffff88084179bce8 ffffffff810a4456
Oct 27 21:08:42 klimt kernel: 0000000011270000 ffff8808416a1068 ffff88084179bd58 ffffffffa04c09ed
Oct 27 21:08:42 klimt kernel: Call Trace:
Oct 27 21:08:42 klimt kernel: [<ffffffff810a4456>] ? prepare_creds+0x26/0x150
Oct 27 21:08:42 klimt kernel: [<ffffffff810a4456>] prepare_creds+0x26/0x150
Oct 27 21:08:42 klimt kernel: [<ffffffffa04c09ed>] fh_verify+0x1ed/0x610 [nfsd]
Oct 27 21:08:42 klimt kernel: [<ffffffffa04ce6f9>] nfsd4_putfh+0x49/0x50 [nfsd]
Oct 27 21:08:42 klimt kernel: [<ffffffffa04d01fd>] nfsd4_proc_compound+0x40d/0x690 [nfsd]
Oct 27 21:08:42 klimt kernel: [<ffffffffa04bd2b4>] nfsd_dispatch+0xd4/0x1d0 [nfsd]
Oct 27 21:08:42 klimt kernel: [<ffffffffa0430249>] svc_process_common+0x3d9/0x700 [sunrpc]
Oct 27 21:08:42 klimt kernel: [<ffffffffa04310c1>] svc_process+0xf1/0x1d0 [sunrpc]
Oct 27 21:08:42 klimt kernel: [<ffffffffa04bcd4f>] nfsd+0xff/0x160 [nfsd]
Oct 27 21:08:42 klimt kernel: [<ffffffffa04bcc50>] ? nfsd_destroy+0x60/0x60 [nfsd]
Oct 27 21:08:42 klimt kernel: [<ffffffff810a1f25>] kthread+0xe5/0xf0
Oct 27 21:08:42 klimt kernel: [<ffffffff810a1e40>] ? kthread_stop+0x120/0x120
Oct 27 21:08:42 klimt kernel: [<ffffffff816aa795>] ret_from_fork+0x25/0x30
Oct 27 21:08:42 klimt kernel: Code: d0 41 ff d2 4d 8b 55 00 4d 85 d2 75 dc eb d1 81 e3 00 00 10 00 0f 84 0a ff ff ff e9 0f ff ff ff 49 63 47 20 48 8d 4a 01 4d 8b 07 <49> 8b 5c 05 00 4c 89 e8 65 49 0f c7 08 0f 94 c0 84 c0 0f 85 45
Oct 27 21:08:42 klimt kernel: RIP [<ffffffff811e9a99>] kmem_cache_alloc+0x149/0x1b0
Oct 27 21:08:42 klimt kernel: RSP <ffff88084179bc98>
Oct 27 21:08:42 klimt kernel: ---[ end trace 0bf398a5b035df79 ]---

Looks rather similar:

(gdb) list *(kmem_cache_alloc+0x149)
0xffffffff811e9a99 is in kmem_cache_alloc (/home/cel/src/linux/linux-2.6/mm/slub.c:241).
236 * Core slab cache functions
237 *******************************************************************/
238
239 static inline void *get_freepointer(struct kmem_cache *s, void *object)
240 {
241 return *(void **)(object + s->offset);
242 }
243
244 static void prefetch_freepointer(const struct kmem_cache *s, void *object)
245 {
(gdb)
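
(Worth noting: in both oopses the faulting instruction is that
freelist load, and the register that appears to hold the object
pointer, R15 in the earlier trace and R13 here, contains a
non-canonical value. That is what you'd expect if the free pointer
stored inside a freed object had been overwritten by a stray write.
Conceptually, this is the load that blows up:

    /* SLUB keeps the next-free pointer inside each free object at
     * s->offset; if something scribbles on a freed object, the next
     * allocation from that cache loads garbage here and faults */
    void *next = *(void **)(object + s->offset);

so the corruption itself likely happened earlier, in whatever wrote
to the freed object.)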


--
Chuck Lever




2016-10-28 20:50:57

by J. Bruce Fields

[permalink] [raw]
Subject: Re: upstream server crash

On Thu, Oct 27, 2016 at 09:20:41PM -0400, Chuck Lever wrote:
> Just hit this on the server while running xfstests generic/089 on
> NFSv4.0 / RDMA. Still v4.9-rc2 with a few NFS/RDMA patches, but
> no kernel debugging enabled yet.

Weird, I wouldn't even know where to start. It's not even obvious that
it's an NFS or RDMA bug at all.
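
If it keeps reproducing, building the server kernel with

    CONFIG_KASAN=y

(or booting with slub_debug=FZPU, along the lines of the slab
debugging suggested earlier in the thread) might at least catch the
corrupting access and name the slab involved.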

--b.

>
> [...]

2016-10-28 21:45:11

by Chuck Lever

[permalink] [raw]
Subject: Re: upstream server crash


> On Oct 24, 2016, at 3:17 PM, Jeff Layton <[email protected]> wrote:
>
> [...]
>
> Ahh ok, I think I see.
>
> We probably also need to set rq_rbuffer in bc_malloc and
> xprt_rdma_bc_allocate.
>
> My guess is that we're ending up in rpc_xdr_encode with a NULL
> rq_rbuffer pointer, so the right fix would seem to be to ensure that it
> is properly set whenever rq_buffer is set.
>
> So I think this may be what we want, actually. I'll plan to test it out
> but may not get to it before tomorrow.
>
> --
> Jeff Layton <[email protected]><0001-sunrpc-fix-some-missing-rq_rbuffer-assignments.patch>

This may not be working as well as I thought (at least for NFS/RDMA).
xprt_rdma_bc_send_request releases the page allocated by
xprt_rdma_bc_allocate before the reply arrives. call_decode
then tries to dereference rq_rbuffer, but that's now a pointer
to freed memory.
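
If that's what's happening, one possible direction (an untested
sketch; it assumes rq_rbuffer points into the page set up by
xprt_rdma_bc_allocate) would be to stop freeing the buffer in the
send path and reclaim it from the ->buf_free callback instead, after
call_decode is done with it:

    /* untested sketch: leave the send buffer alone in
     * xprt_rdma_bc_send_request, and release it only when the
     * backchannel RPC is torn down */
    static void xprt_rdma_bc_free(struct rpc_task *task)
    {
    	struct rpc_rqst *rqst = task->tk_rqstp;

    	free_page((unsigned long)rqst->rq_buffer);
    }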


--
Chuck Lever