2016-10-12 17:44:10

by Chuck Lever III

Subject: nfsd: managing pages under network I/O

I'm studying the way that the ->recvfrom and ->sendto calls work
for RPC-over-RDMA.

The ->sendto path moves pages out of the svc_rqst before posting
I/O (RDMA Write and Send). Once the work is posted, ->sendto
returns, and it looks like the svc_rqst is released at that point. The
subsequent completion of the Send then releases those moved pages.

I'm wondering if the transport can be simplified: instead of
moving pages around, ->sendto could just wait until the Write and
Send activity is complete, then return. The upper layer then
releases everything.
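
To make that concrete, here is a rough sketch of the waiting approach.
Every name below (example_send_ctxt and friends) is invented for
illustration; this is not existing code:

#include <linux/completion.h>
#include <rdma/ib_verbs.h>

/* Invented for illustration; not the existing svc_rdma context. */
struct example_send_ctxt {
        struct ib_cqe           cqe;
        struct completion       done;   /* init_completion() before posting */
};

/* Send completion handler: wake the thread sleeping in ->sendto. */
static void example_send_done(struct ib_cq *cq, struct ib_wc *wc)
{
        struct example_send_ctxt *ctxt =
                container_of(wc->wr_cqe, struct example_send_ctxt, cqe);

        complete(&ctxt->done);
}

/* Called from ->sendto after the Write and Send WRs are posted with
 * ctxt->cqe.done == example_send_done. */
static int example_wait_for_send(struct example_send_ctxt *ctxt)
{
        wait_for_completion(&ctxt->done);
        /* The pages never left the svc_rqst, so the generic code can
         * release them exactly as it does for TCP. */
        return 0;
}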

Another option would be for ->sendto to return a value that means
the transport will release the svc_rqst and pages.

Or, the svc_rqst could be reference counted.
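
As a very rough sketch of that last idea (again, every name below is
invented for illustration):

#include <linux/kref.h>

/* Invented wrapper; struct svc_rqst has no refcount today. */
struct example_rqst_ref {
        struct kref     ref;
        void            (*release)(struct example_rqst_ref *r);
};

static void example_rqst_ref_free(struct kref *kref)
{
        struct example_rqst_ref *r =
                container_of(kref, struct example_rqst_ref, ref);

        r->release(r);  /* free the pages and the svc_rqst */
}

/* The upper layer holds one reference across the call; each posted
 * Write/Send holds another. The last put does the release. */
static void example_rqst_ref_put(struct example_rqst_ref *r)
{
        kref_put(&r->ref, example_rqst_ref_free);
}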

Anyone have thoughts about this?

--
Chuck Lever





2016-10-13 06:36:33

by Christoph Hellwig

Subject: Re: nfsd: managing pages under network I/O

On Wed, Oct 12, 2016 at 01:42:26PM -0400, Chuck Lever wrote:
> I'm studying the way that the ->recvfrom and ->sendto calls work
> for RPC-over-RDMA.
>
> The ->sendto path moves pages out of the svc_rqst before posting
> I/O (RDMA Write and Send). Once the work is posted, ->sendto
> returns, and it looks like the svc_rqst is released at that point. The
> subsequent completion of the Send then releases those moved pages.
>
> I'm wondering if the transport can be simplified: instead of
> moving pages around, ->sendto could just wait until the Write and
> Send activity is complete, then return. The upper layer then
> releases everything.

I'd prefer not to block for no reason at all.

> Another option would be for ->sendto to return a value that means
> the transport will release the svc_rqst and pages.

Or just let the transport always release it. We only have two
different implementations of the relevant ops anyway.

2016-10-13 13:36:55

by Chuck Lever III

Subject: Re: nfsd: managing pages under network I/O


> On Oct 13, 2016, at 2:35 AM, Christoph Hellwig <[email protected]> wrote:
>
> On Wed, Oct 12, 2016 at 01:42:26PM -0400, Chuck Lever wrote:
>> I'm studying the way that the ->recvfrom and ->sendto calls work
>> for RPC-over-RDMA.
>>
>> The ->sendto path moves pages out of the svc_rqst before posting
>> I/O (RDMA Write and Send). Once the work is posted, ->sendto
>> returns, and it looks like the svc_rqst is released at that point. The
>> subsequent completion of the Send then releases those moved pages.
>>
>> I'm wondering if the transport can be simplified: instead of
>> moving pages around, ->sendto could just wait until the Write and
>> Send activity is complete, then return. The upper layer then
>> releases everything.
>
> I'd prefer not to block for no reason at all.

The reason to block is to wait for I/O to complete. Can you
elaborate: for example, are you leery of an extra context
switch?

Note that in the Read/recvfrom case, RDMA Reads are posted,
and then svc_rdma_recvfrom returns "deferred." When the Reads
complete, the completion handler enqueues the work on the
svc_xprt and svc_rdma_recvfrom is called again. Can someone
explain why svc_rdma_recvfrom doesn't just wait for the
RDMA Reads to complete?

What is the purpose of pulling the pages out of the svc_rqst
while waiting for the RDMA Reads?
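
For reference, the flow I'm describing looks roughly like this; the
example_* helpers are stand-ins for the real functions, condensed to
show just the shape of the path:

#include <linux/types.h>
#include <linux/sunrpc/svc.h>
#include <linux/sunrpc/svc_xprt.h>

/* Stand-in helpers, condensed from the real svc_rdma_recvfrom path. */
extern bool example_read_completed(struct svc_xprt *xprt);
extern int example_finish_deferred_recv(struct svc_rqst *rqstp);
extern void example_post_rdma_reads(struct svc_rqst *rqstp);

static int example_recvfrom(struct svc_rqst *rqstp, struct svc_xprt *xprt)
{
        /* Second pass: a Read completion re-enqueued this xprt, so pull
         * the parked pages back into the svc_rqst and build the request. */
        if (example_read_completed(xprt))
                return example_finish_deferred_recv(rqstp);

        /* First pass: post the RDMA Reads, move the receive pages out of
         * the svc_rqst so they outlive this call, and return "deferred". */
        example_post_rdma_reads(rqstp);
        return 0;
}

/* Read completion handler: it cannot touch the original svc_rqst, so it
 * just queues the transport; svc_rdma_recvfrom gets called again. */
static void example_read_done(struct svc_xprt *xprt)
{
        svc_xprt_enqueue(xprt);
}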


>> Another option would be for ->sendto to return a value that means
>> the transport will release the svc_rqst and pages.
>
> Or just let the transport always release it. We only have two
> different implementations of the relevant ops anyway.

If each transport is responsible for releasing svc_rqst and
pages, code is duplicated. Both transports need to do this
releasing, so it should be done in common code (the svc RPC
code).

I'd prefer not to have to touch the TCP transport in order
to improve the RDMA transport implementation. However, some
clean-up and refactoring in this area may be unavoidable.


--
Chuck Lever




2016-10-13 14:34:10

by Chuck Lever III

Subject: Re: nfsd: managing pages under network I/O


> On Oct 13, 2016, at 9:36 AM, Chuck Lever <[email protected]> wrote:
>
>>
>> On Oct 13, 2016, at 2:35 AM, Christoph Hellwig <[email protected]> wrote:
>>
>> On Wed, Oct 12, 2016 at 01:42:26PM -0400, Chuck Lever wrote:
>>> I'm studying the way that the ->recvfrom and ->sendto calls work
>>> for RPC-over-RDMA.
>>>
>>> The ->sendto path moves pages out of the svc_rqst before posting
>>> I/O (RDMA Write and Send). Once the work is posted, ->sendto
>>> returns, and it looks like the svc_rqst is released at that point. The
>>> subsequent completion of the Send then releases those moved pages.
>>>
>>> I'm wondering if the transport can be simplified: instead of
>>> moving pages around, ->sendto could just wait until the Write and
>>> Send activity is complete, then return. The upper layer then
>>> releases everything.
>>
>> I'd prefer not to block for no reason at all.
>
> The reason to block is to wait for I/O to complete. Can you
> elaborate: for example, are you leery of an extra context
> switch?
>
> Note that in the Read/recvfrom case, RDMA Reads are posted,
> and then svc_rdma_recvfrom returns "deferred." When the Reads
> complete, the completion handler enqueues the work on the
> svc_xprt and svc_rdma_recvfrom is called again. Can someone
> explain why svc_rdma_recvfrom doesn't just wait for the
> RDMA Reads to complete?
>
> What is the purpose of pulling the pages out of the svc_rqst
> while waiting for the RDMA Reads?
>
>
>>> Another option would be for ->sendto to return a value that means
>>> the transport will release the svc_rqst and pages.
>>
>> Or just let the transport always release it. We only have two
>> different implementations of the relevant ops anyway.
>
> If each transport is responsible for releasing svc_rqst and
> pages, code is duplicated. Both transports need to do this
> releasing, so it should be done in common code (the svc RPC
> code).
>
> I'd prefer not to have to touch the TCP transport in order
> to improve the RDMA transport implementation. However, some
> clean-up and refactoring in this area may be unavoidable.

svc_send does this:

938         mutex_lock(&xprt->xpt_mutex);

943         len = xprt->xpt_ops->xpo_sendto(rqstp);
944         mutex_unlock(&xprt->xpt_mutex);
945         rpc_wake_up(&xprt->xpt_bc_pending);
946         svc_xprt_release(rqstp);

Thus waiting in sendto is indeed not good: it would hold up
other replies.

But I think you're suggesting that svc_xprt_release() should
be invoked by the transport, and not by svc_send(). That
could work if there isn't a hidden dependency on the ordering
of lines 944-946.

The socket transport appears to be able to complete send
processing synchronously. Invoking svc_xprt_release() inside
the mutex critical section would probably have a negative
performance impact.

We could add another xpo callout here. For sockets it would
invoke svc_xprt_release, and for RDMA it would be a no-op.
RDMA would then be responsible for invoking svc_xprt_release
in its send completion handler.
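
Roughly like this, where xpo_release_reply is a name I'm inventing for
the example and the body of svc_send() is abbreviated:

/* Sketch only: xpo_release_reply is an invented op, not in
 * struct svc_xprt_ops today. */
int svc_send(struct svc_rqst *rqstp)
{
        struct svc_xprt *xprt = rqstp->rq_xprt;
        int len;

        /* ... dead-transport checks and XDR length calculation elided ... */

        mutex_lock(&xprt->xpt_mutex);
        len = xprt->xpt_ops->xpo_sendto(rqstp);
        mutex_unlock(&xprt->xpt_mutex);
        rpc_wake_up(&xprt->xpt_bc_pending);

        /* New callout: sockets would call svc_xprt_release(rqstp) here;
         * RDMA would make this a no-op and call svc_xprt_release() from
         * its Send completion handler once the pages are truly done. */
        xprt->xpt_ops->xpo_release_reply(rqstp);

        return len;
}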


--
Chuck Lever




2016-10-19 17:26:13

by J. Bruce Fields

Subject: Re: nfsd: managing pages under network I/O

On Wed, Oct 12, 2016 at 01:42:26PM -0400, Chuck Lever wrote:
> I'm studying the way that the ->recvfrom and ->sendto calls work
> for RPC-over-RDMA.
>
> The ->sendto path moves pages out of the svc_rqst before posting
> I/O (RDMA Write and Send). Once the work is posted, ->sendto
> returns, and it looks like the svc_rqst is released at that point. The
> subsequent completion of the Send then releases those moved pages.
>
> I'm wondering if the transport can be simplified: instead of
> moving pages around, ->sendto could just wait until the Write and
> Send activity is complete, then return. The upper layer then
> releases everything.

I don't understand what problem you're trying to fix. Is "moving pages
around" really especially complicated or expensive?

--b.

>
> Another option would be for ->sendto to return a value that means
> the transport will release the svc_rqst and pages.
>
> Or, the svc_rqst could be reference counted.
>
> Anyone have thoughts about this?
>
> --
> Chuck Lever
>
>
>

2016-10-19 18:28:45

by Chuck Lever III

Subject: Re: nfsd: managing pages under network I/O


> On Oct 19, 2016, at 1:26 PM, [email protected] wrote:
>
> On Wed, Oct 12, 2016 at 01:42:26PM -0400, Chuck Lever wrote:
>> I'm studying the way that the ->recvfrom and ->sendto calls work
>> for RPC-over-RDMA.
>>
>> The ->sendto path moves pages out of the svc_rqst before posting
>> I/O (RDMA Write and Send). Once the work is posted, ->sendto
>> returns, and it looks like the svc_rqst is released at that point. The
>> subsequent completion of the Send then releases those moved pages.
>>
>> I'm wondering if the transport can be simplified: instead of
>> moving pages around, ->sendto could just wait until the Write and
>> Send activity is complete, then return. The upper layer then
>> releases everything.
>
> I don't understand what problem you're trying to fix. Is "moving pages
> around" really especially complicated or expensive?

Moving pages is complicated and undocumented, and adds host CPU and
memory costs (loads and stores). There is also a whole lot of page
allocator traffic in this code, and that will be enabling and
disabling interrupts and bottom-halves with some frequency.

The less involvement by the host CPU the better. The current RDMA
transport code documents the RPC-over-RDMA protocol to some extent,
but the need for moving the pages is left to the reader's imagination.

2016-10-19 19:16:10

by J. Bruce Fields

Subject: Re: nfsd: managing pages under network I/O

On Wed, Oct 19, 2016 at 02:28:40PM -0400, Chuck Lever wrote:
>
> > On Oct 19, 2016, at 1:26 PM, [email protected] wrote:
> >
> > On Wed, Oct 12, 2016 at 01:42:26PM -0400, Chuck Lever wrote:
> >> I'm studying the way that the ->recvfrom and ->sendto calls work
> >> for RPC-over-RDMA.
> >>
> >> The ->sendto path moves pages out of the svc_rqst before posting
> >> I/O (RDMA Write and Send). Once the work is posted, ->sendto
> >> returns, and it looks like the svc_rqst is released at that point. The
> >> subsequent completion of the Send then releases those moved pages.
> >>
> >> I'm wondering if the transport can be simplified: instead of
> >> moving pages around, ->sendto could just wait until the Write and
> >> Send activity is complete, then return. The upper layer then
> >> releases everything.
> >
> > I don't understand what problem you're trying to fix. Is "moving pages
> > around" really especially complicated or expensive?
>
> Moving pages is complicated and undocumented, and adds host CPU and
> memory costs (loads and stores). There is also a whole lot of page
> allocator traffic in this code, and that will be enabling and
> disabling interrupts and bottom-halves with some frequency.

I thought "moving pages around" here is basically just this, from
isvc_rdma_sendto.c:send_reply():

pages = rqstp->rq_next_page - rqstp->rq_respages;
for (page_no = 0; page_no < pages; page_no++) {
ctxt->pages[page_no+1] = rqstp->rq_respages[page_no];
...
rqstp->rq_respages[page_no] = NULL;
...
}

So we're just copying an array of page pointers from one place to
another, and zeroing out the source array.

For a short reply that could be 1 page or even none. In the worst case
(a 1MB read result) that could be 256 8-byte pointers, so 2K.

Am I missing something? Has that up-to-2K operation been a problem?

--b.

>
> The less involvement by the host CPU the better. The current RDMA
> transport code documents the RPC-over-RDMA protocol to some extent,
> but the need for moving the pages is left to the reader's imagination.
>
> From my code audit:
>
> The svc_rqst is essentially a static piece of memory, and the calling
> NFSD kthread will re-use it immediately when svc_sendto() returns. It
> cannot be manipulated asynchronously. The recvfrom calls and the
> sendto call for one RPC request can each use a different svc_rqst,
> for example. So if the transport needs to use pages for longer than the
> length of one svc_sendto call (for example, to wait for RDMA Write to
> complete) then it has to move those pages out of the svc_rqst before
> it returns. (Let me know if I've got this wrong).
>
> I'm falling back to looking at ways to clean up and document the
> movement of these pages so it is better understood why it is done.
> There also could be some opportunity to share page movement helpers
> between the RDMA read and write path (and perhaps the TCP transport
> too).
>
> HTH.
>
>
> > --b.
> >
> >>
> >> Another option would be for ->sendto to return a value that means
> >> the transport will release the svc_rqst and pages.
> >>
> >> Or, the svc_rqst could be reference counted.
> >>
> >> Anyone have thoughts about this?
>
>
> --
> Chuck Lever
>
>

2016-10-19 19:21:07

by Chuck Lever III

Subject: Re: nfsd: managing pages under network I/O


> On Oct 19, 2016, at 3:16 PM, J. Bruce Fields <[email protected]> wrote:
>
> On Wed, Oct 19, 2016 at 02:28:40PM -0400, Chuck Lever wrote:
>>
>>> On Oct 19, 2016, at 1:26 PM, [email protected] wrote:
>>>
>>> On Wed, Oct 12, 2016 at 01:42:26PM -0400, Chuck Lever wrote:
>>>> I'm studying the way that the ->recvfrom and ->sendto calls work
>>>> for RPC-over-RDMA.
>>>>
>>>> The ->sendto path moves pages out of the svc_rqst before posting
>>>> I/O (RDMA Write and Send). Once the work is posted, ->sendto
>>>> returns, and it looks like the svc_rqst is released at that point. The
>>>> subsequent completion of the Send then releases those moved pages.
>>>>
>>>> I'm wondering if the transport can be simplified: instead of
>>>> moving pages around, ->sendto could just wait until the Write and
>>>> Send activity is complete, then return. The upper layer then
>>>> releases everything.
>>>
>>> I don't understand what problem you're trying to fix. Is "moving pages
>>> around" really especially complicated or expensive?
>>
>> Moving pages is complicated and undocumented, and adds host CPU and
>> memory costs (loads and stores). There is also a whole lot of page
>> allocator traffic in this code, and that will be enabling and
>> disabling interrupts and bottom-halves with some frequency.
>
> I thought "moving pages around" here is basically just this, from
> svc_rdma_sendto.c:send_reply():
>
>         pages = rqstp->rq_next_page - rqstp->rq_respages;
>         for (page_no = 0; page_no < pages; page_no++) {
>                 ctxt->pages[page_no+1] = rqstp->rq_respages[page_no];
>                 ...
>                 rqstp->rq_respages[page_no] = NULL;
>                 ...
>         }
>
> So we're just copying an array of page pointers from one place to
> another, and zeroing out the source array.
>
> For a short reply that could be 1 page or even none. In the worst case
> (a 1MB read result) that could be 256 8-byte pointers, so 2K.
>
> Am I missing something? Has that up-to-2K operation been a problem?

Not a problem, but not optimal either. Like I said, I can put together
some helpers so that this code is not duplicated, and the call-sites
would be a little easier to understand.
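
Something along these lines, for instance; the helper name and exact
signature here are only for illustration:

#include <linux/sunrpc/svc.h>
#include <linux/sunrpc/svc_rdma.h>

/* Illustrative helper: transfer ownership of the response pages from
 * the svc_rqst to the RDMA op context so they survive until the Send
 * completion runs. pages[0] is left for the RDMA header, as in the
 * existing send_reply() code. */
static int example_move_respages(struct svc_rdma_op_ctxt *ctxt,
                                 struct svc_rqst *rqstp)
{
        int pages, page_no;

        pages = rqstp->rq_next_page - rqstp->rq_respages;
        for (page_no = 0; page_no < pages; page_no++) {
                ctxt->pages[page_no + 1] = rqstp->rq_respages[page_no];
                rqstp->rq_respages[page_no] = NULL;
        }
        rqstp->rq_next_page = rqstp->rq_respages + 1;

        return pages;   /* caller updates ctxt->count and the rest */
}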


> --b.
>
>>
>> The less involvement by the host CPU the better. The current RDMA
>> transport code documents the RPC-over-RDMA protocol to some extent,
>> but the need for moving the pages is left to the reader's imagination.
>>
>> From my code audit:
>>
>> The svc_rqst is essentially a static piece of memory, and the calling
>> NFSD kthread will re-use it immediately when svc_sendto() returns. It
>> cannot be manipulated asynchronously. The recvfrom calls and the
>> sendto call for one RPC request can each use a different svc_rqst,
>> for example. So if the transport needs to use pages for longer than the
>> length of one svc_sendto call (for example, to wait for RDMA Write to
>> complete) then it has to move those pages out of the svc_rqst before
>> it returns. (Let me know if I've got this wrong).
>>
>> I'm falling back to looking at ways to clean up and document the
>> movement of these pages so it is better understood why it is done.
>> There also could be some opportunity to share page movement helpers
>> between the RDMA read and write path (and perhaps the TCP transport
>> too).
>>
>> HTH.
>>
>>
>>> --b.
>>>
>>>>
>>>> Another option would be for ->sendto to return a value that means
>>>> the transport will release the svc_rqst and pages.
>>>>
>>>> Or, the svc_rqst could be reference counted.
>>>>
>>>> Anyone have thoughts about this?
>>
>>
>> --
>> Chuck Lever
>>
>>

--
Chuck Lever




2016-10-19 19:26:03

by J. Bruce Fields

Subject: Re: nfsd: managing pages under network I/O

On Wed, Oct 19, 2016 at 03:21:02PM -0400, Chuck Lever wrote:
> > On Oct 19, 2016, at 3:16 PM, J. Bruce Fields <[email protected]> wrote:
> > I thought "moving pages around" here is basically just this, from
> > svc_rdma_sendto.c:send_reply():
> >
> >         pages = rqstp->rq_next_page - rqstp->rq_respages;
> >         for (page_no = 0; page_no < pages; page_no++) {
> >                 ctxt->pages[page_no+1] = rqstp->rq_respages[page_no];
> >                 ...
> >                 rqstp->rq_respages[page_no] = NULL;
> >                 ...
> >         }
> >
> > So we're just copying an array of page pointers from one place to
> > another, and zeroing out the source array.
> >
> > For a short reply that could be 1 page or even none. In the worst case
> > (a 1MB read result) that could be 256 8-byte pointers, so 2K.
> >
> > Am I missing something? Has that up-to-2K operation been a problem?
>
> Not a problem, but not optimal either. Like I said, I can put together
> some helpers so that this code is not duplicated, and the call-sites
> would be a little easier to understand.

Sure, I've no objection if you find some cleanup that makes sense.

--b.