Hi Neil-
I'd like to reduce the amount of page allocation that NFSD does,
and was wondering about the release and reset of pages in
svc_xprt_release(). This logic was added when the socket transport
was converted to use kernel_sendpage() back in 2002. Do you
remember why releasing the result pages is necessary?
--
Chuck Lever
On Fri, Jan 29 2021, Chuck Lever wrote:
> Hi Neil-
>
> I'd like to reduce the amount of page allocation that NFSD does,
> and was wondering about the release and reset of pages in
> svc_xprt_release(). This logic was added when the socket transport
> was converted to use kernel_sendpage() back in 2002. Do you
> remember why releasing the result pages is necessary?
>
Hi Chuck,
as I recall, kernel_sendpage() (or sock->ops->sendpage() as it was
then) takes a reference to the page and will hold that reference until
the content has been sent and ACKed. nfsd has no way to know when the
ACK comes, so cannot know when the page can be re-used, so it must
release the page and allocate a new one.
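The pattern we are stuck with is roughly this -- just a sketch to show
the shape of it, not the real svcsock code, with 'sock', 'rqstp', 'i',
'offset', 'len' and 'err' assumed to be in scope and error handling
elided:

    /* zero-copy transmit: the TCP stack takes its own reference
     * on the page */
    err = kernel_sendpage(sock, rqstp->rq_pages[i], offset, len, 0);

    /* nfsd never learns when the peer's ACK arrives, so it must
     * give the page away entirely ... */
    put_page(rqstp->rq_pages[i]);

    /* ... and start the next request with a freshly allocated page */
    rqstp->rq_pages[i] = alloc_page(GFP_KERNEL);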
This is the price we pay for zero-copy, and I acknowledge that it is a
real price. I wouldn't be surprised if the trade-offs between
zero-copy and single-copy change over time, and between different
hardware.
NeilBrown
> On Jan 29, 2021, at 5:43 PM, NeilBrown <[email protected]> wrote:
>
> On Fri, Jan 29 2021, Chuck Lever wrote:
>
>> Hi Neil-
>>
>> I'd like to reduce the amount of page allocation that NFSD does,
>> and was wondering about the release and reset of pages in
>> svc_xprt_release(). This logic was added when the socket transport
>> was converted to use kernel_sendpage() back in 2002. Do you
>> remember why releasing the result pages is necessary?
>>
>
> Hi Chuck,
> as I recall, kernel_sendpage() (or sock->ops->sendpage() as it was
> then) takes a reference to the page and will hold that reference until
> the content has been sent and ACKed. nfsd has no way to know when the
> ACK comes, so cannot know when the page can be re-used, so it must
> release the page and allocate a new one.
>
> This is the price we pay for zero-copy, and I acknowledge that it is a
> real price. I wouldn't be surprised if the trade-offs between
> zero-copy and single-copy change over time, and between different
> hardware.
Very interesting, thanks for the history! Two observations:
- I thought that without MSG_DONTWAIT, the sendpage operation would be
totally synchronous -- the caller would not be unblocked until the
network layer was done with retransmissions. But that's likely a
mistaken assumption on my part. That could be why sendmsg is so much
slower than sendpage in this particular application.
- IIUC, nfsd_splice_read() replaces anonymous pages in rq_pages with
actual page cache pages. Those of course cannot be used to construct
subsequent RPC Replies, so that introduces a second release requirement.
So I have a way to make the first case unnecessary for RPC/RDMA. It
has a reliable Send completion mechanism. Sounds like releasing is
still necessary for TCP, though; maybe that could be done in the
xpo_release_rqst callback.
As far as nfsd_splice_read(), I had thought of moving those pages to
a separate array which would always be released. That would need to
deal with the transport requirements above.
If nothing else, I would like to add mention of these requirements
somewhere in the code too.
What's your opinion?
--
Chuck Lever
On Fri, Jan 29 2021, Chuck Lever wrote:
>> On Jan 29, 2021, at 5:43 PM, NeilBrown <[email protected]> wrote:
>>
>> On Fri, Jan 29 2021, Chuck Lever wrote:
>>
>>> Hi Neil-
>>>
>>> I'd like to reduce the amount of page allocation that NFSD does,
>>> and was wondering about the release and reset of pages in
>>> svc_xprt_release(). This logic was added when the socket transport
>>> was converted to use kernel_sendpage() back in 2002. Do you
>>> remember why releasing the result pages is necessary?
>>>
>>
>> Hi Chuck,
>> as I recall, kernel_sendpage() (or sock->ops->sendpage() as it was
>> then) takes a reference to the page and will hold that reference until
>> the content has been sent and ACKed. nfsd has no way to know when the
>> ACK comes, so cannot know when the page can be re-used, so it must
>> release the page and allocate a new one.
>>
>> This is the price we pay for zero-copy, and I acknowledge that it is a
>> real price. I wouldn't be surprised if the trade-offs between
>> zero-copy and single-copy change over time, and between different
>> hardware.
>
> Very interesting, thanks for the history! Two observations:
>
> - I thought that without MSG_DONTWAIT, the sendpage operation would be
> totally synchronous -- the caller would not be unblocked until the
> network layer was done with retransmissions. But that's likely a
> mistaken assumption on my part. That could be why sendmsg is so much
> slower than sendpage in this particular application.
>
On the "send" side, I think MSG_DONTWAIT is primarily about memory
allocation. sendmsg() can only return when the message is queued. If
it needs to allocate memory (or wait for space in a restricted queue),
then MSG_DONTWAIT says "fail instead". It certainly doesn't wait for
successful xmit and ack.
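i.e. something like this (an illustrative fragment only; 'sock', 'buf'
and 'len' are assumed to be in scope):

    struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
    struct kvec vec = { .iov_base = buf, .iov_len = len };
    int ret;

    /* Returns as soon as the data has been queued on the socket.
     * With MSG_DONTWAIT it returns -EAGAIN instead of sleeping for
     * memory or socket buffer space; it never waits for the data
     * to be transmitted and ACKed. */
    ret = kernel_sendmsg(sock, &msg, &vec, 1, len);
    if (ret == -EAGAIN) {
        /* no memory or buffer space right now -- the caller must
         * requeue and try again later */
    }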
On the "recv" side it is quite different of course.
> - IIUC, nfsd_splice_read() replaces anonymous pages in rq_pages with
> actual page cache pages. Those of course cannot be used to construct
> subsequent RPC Replies, so that introduces a second release requirement.
Yep. I wonder if those pages are protected against concurrent updates
.. so that a computed checksum will remain accurate.
>
> So I have a way to make the first case unnecessary for RPC/RDMA. It
> has a reliable Send completion mechanism. Sounds like releasing is
> still necessary for TCP, though; maybe that could be done in the
> xpo_release_rqst callback.
It isn't clear to me what particular cost you are trying to reduce. Is
handing a page back from RDMA to nfsd cheaper than nfsd calling
alloc_page(), or do you hope to keep batches of pages together to avoid
multi-page overheads, or is this about cache-hot pages, or ???
>
> As far as nfsd_splice_read(), I had thought of moving those pages to
> a separate array which would always be released. That would need to
> deal with the transport requirements above.
>
> If nothing else, I would like to add mention of these requirements
> somewhere in the code too.
Strongly agree with that.
>
> What's your opinion?
To form a coherent opinion, I would need to know what that problem is.
I certainly accept that there could be performance problems in releasing
and re-allocating pages which might be resolved by batching, or by copying,
or by better tracking. But without knowing what hot-spot you want to
cool down, I cannot think about how that fits into the big picture.
So: what exactly is the problem that you see?
Thanks,
NeilBrown
>
>
> --
> Chuck Lever
> On Jan 31, 2021, at 6:45 PM, NeilBrown <[email protected]> wrote:
>
> On Fri, Jan 29 2021, Chuck Lever wrote:
>
>>> On Jan 29, 2021, at 5:43 PM, NeilBrown <[email protected]> wrote:
>>>
>>> On Fri, Jan 29 2021, Chuck Lever wrote:
>>>
>>>> Hi Neil-
>>>>
>>>> I'd like to reduce the amount of page allocation that NFSD does,
>>>> and was wondering about the release and reset of pages in
>>>> svc_xprt_release(). This logic was added when the socket transport
>>>> was converted to use kernel_sendpage() back in 2002. Do you
>>>> remember why releasing the result pages is necessary?
>>>>
>>>
>>> Hi Chuck,
>>> as I recall, kernel_sendpage() (or sock->ops->sendpage() as it was
>>> then) takes a reference to the page and will hold that reference until
>>> the content has been sent and ACKed. nfsd has no way to know when the
>>> ACK comes, so cannot know when the page can be re-used, so it must
>>> release the page and allocate a new one.
>>>
>>> This is the price we pay for zero-copy, and I acknowledge that it is a
>>> real price. I wouldn't be surprised if the trade-offs between
>>> zero-copy and single-copy change over time, and between different
>>> hardware.
>>
>> Very interesting, thanks for the history! Two observations:
>>
>> - I thought that without MSG_DONTWAIT, the sendpage operation would be
>> totally synchronous -- the caller would not be unblocked until the
>> network layer was done with retransmissions. But that's likely a
>> mistaken assumption on my part. That could be why sendmsg is so much
>> slower than sendpage in this particular application.
>>
>
> On the "send" side, I think MSG_DONTWAIT is primarily about memory
> allocation. sendmsg() can only return when the message is queued. If
> it needs to allocate memory (or wait for space in a restricted queue),
> then MSG_DONTWAIT says "fail instead". It certainly doesn't wait for
> successful xmit and ack.
Fair enough.
> On the "recv" side it is quite different of course.
>
>> - IIUC, nfsd_splice_read() replaces anonymous pages in rq_pages with
>> actual page cache pages. Those of course cannot be used to construct
>> subsequent RPC Replies, so that introduces a second release requirement.
>
> Yep. I wonder if those pages are protected against concurrent updates
> .. so that a computed checksum will remain accurate.
That thought has been lingering in the back of my mind too. But the
server has used sendpage() for many years without a reported issue
(since RQ_SPLICE_OK was added).
>> So I have a way to make the first case unnecessary for RPC/RDMA. It
>> has a reliable Send completion mechanism. Sounds like releasing is
>> still necessary for TCP, though; maybe that could be done in the
>> xpo_release_rqst callback.
>
> It isn't clear to me what particular cost you are trying to reduce. Is
> handing a page back from RDMA to nfsd cheaper than nfsd calling
> alloc_page(), or do you hope to keep batches of pages together to avoid
> multi-page overheads, or is this about cache-hot pages, or ???
RDMA gives consumers a reliable indication that the NIC is done with
each page. There's really no need to cycle the page at all (except
for the splice case).
I outline the savings below.
>> As far as nfsd_splice_read(), I had thought of moving those pages to
>> a separate array which would always be released. That would need to
>> deal with the transport requirements above.
>>
>> If nothing else, I would like to add mention of these requirements
>> somewhere in the code too.
>
> Strongly agree with that.
>
>>
>> What's your opinion?
>
> To form a coherent opinion, I would need to know what that problem is.
> I certainly accept that there could be performance problems in releasing
> and re-allocating pages which might be resolved by batching, or by copying,
> or by better tracking. But without knowing what hot-spot you want to
> cool down, I cannot think about how that fits into the big picture.
> So: what exactly is the problem that you see?
The problem is that each 1MB NFS READ, for example, hands 257 pages
back to the page allocator, then allocates another 257 pages. One page
is not terrible, but 510+ allocator calls for every RPC begins to get
costly.
Also, remember that both allocating and freeing a page means an irqsave
spin lock -- that will serialize all other page allocations, including
allocation done by other nfsd threads.
So I'd like to lower page allocator contention and the rate at which
IRQs are disabled and enabled when the NFS server becomes busy, as it
might with several 25 GbE NICs, for instance.
Holding onto the same pages means we can make better use of TLB
entries -- fewer TLB flushes is always a good thing.
I know that the network folks at Red Hat have been staring hard at
reducing memory allocation in the stack for several years. I recall
that Matthew Wilcox recently made similar improvements to the block
layer.
With the advent of 100GbE and Optane-like durable storage, the balance
of memory allocation cost to device latency has shifted so that
superfluous memory allocation is noticeable.
At first I thought of creating a page allocator API that could grab
or free an array of pages while taking the allocator locks once. But
now I wonder if there are opportunities to reduce the amount of page
allocator traffic.
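For concreteness, what I first had in mind was something along these
lines -- purely hypothetical, the names and semantics here are invented,
and 'pages_needed' stands for however many slots in rq_pages are empty:

    /* Fill an array with up to @count pages, taking the allocator
     * locks only once; returns the number of pages obtained. */
    int alloc_pages_bulk(gfp_t gfp, int count, struct page **pages);

    /* Hand an array of pages back to the allocator in one call. */
    void free_pages_bulk(int count, struct page **pages);

Then svc_alloc_arg() could refill rq_pages with one or two calls instead
of ~257 alloc_page() calls (ignoring, for this sketch, slots that are
already populated):

    int filled = 0;

    while (filled < pages_needed) {
        filled += alloc_pages_bulk(GFP_KERNEL, pages_needed - filled,
                                   rqstp->rq_pages + filled);
        if (filled < pages_needed)
            msleep(500);    /* allocator under pressure; back off */
    }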
--
Chuck Lever
On Mon, Feb 01 2021, Chuck Lever wrote:
>> On Jan 31, 2021, at 6:45 PM, NeilBrown <[email protected]> wrote:
>>
>> On Fri, Jan 29 2021, Chuck Lever wrote:
>>>
>>> What's your opinion?
>>
>> To form a coherent opinion, I would need to know what that problem is.
>> I certainly accept that there could be performance problems in releasing
>> and re-allocating pages which might be resolved by batching, or by copying,
>> or by better tracking. But without knowing what hot-spot you want to
>> cool down, I cannot think about how that fits into the big picture.
>> So: what exactly is the problem that you see?
>
> The problem is that each 1MB NFS READ, for example, hands 257 pages
> back to the page allocator, then allocates another 257 pages. One page
> is not terrible, but 510+ allocator calls for every RPC begins to get
> costly.
>
> Also, remember that both allocating and freeing a page means an irqsave
> spin lock -- that will serialize all other page allocations, including
> allocation done by other nfsd threads.
>
> So I'd like to lower page allocator contention and the rate at which
> IRQs are disabled and enabled when the NFS server becomes busy, as it
> might with several 25 GbE NICs, for instance.
>
> Holding onto the same pages means we can make better use of TLB
> entries -- fewer TLB flushes is always a good thing.
>
> I know that the network folks at Red Hat have been staring hard at
> reducing memory allocation in the stack for several years. I recall
> that Matthew Wilcox recently made similar improvements to the block
> layer.
>
> With the advent of 100GbE and Optane-like durable storage, the balance
> of memory allocation cost to device latency has shifted so that
> superfluous memory allocation is noticeable.
>
>
> At first I thought of creating a page allocator API that could grab
> or free an array of pages while taking the allocator locks once. But
> now I wonder if there are opportunities to reduce the amount of page
> allocator traffic.
Thanks. This helps me a lot.
I wonder if there is some low-hanging fruit here.
If I read the code correctly (which is not certain, but what I see does
seem to agree with vague memories of how it all works), we currently do
a lot of wasted alloc/frees for zero-copy reads.
We allocate lots of pages and store the pointers in ->rq_respages
(i.e. ->rq_pages). Then nfsd_splice_actor() frees many of those pages and
replaces the pointers with pointers to page-cache pages. Then we release
those page-cache pages.
We do need to have allocated those anonymous pages, but we don't need
to free them.
We can add some new array for storing them, have nfsd_splice_actor move
them to that array, and have svc_alloc_arg() move pages back from the
store rather than re-allocating them.
Or maybe something even more sophisticated where we only move them out
of the store when we actually need them.
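The basic version might look something like this (only a sketch:
rq_spare_pages and rq_spare_count are invented fields, 'buf_page' stands
for the page-cache page from the pipe buffer, and bounds checks and
allocation failures are glossed over):

    /* nfsd_splice_actor(): park the displaced anonymous page instead
     * of freeing it, then install the page-cache page */
    rqstp->rq_spare_pages[rqstp->rq_spare_count++] = rqstp->rq_pages[i];
    get_page(buf_page);
    rqstp->rq_pages[i] = buf_page;

    /* svc_alloc_arg(): drain the spare store before going anywhere
     * near the page allocator */
    while (i < pages && rqstp->rq_spare_count > 0)
        rqstp->rq_pages[i++] =
            rqstp->rq_spare_pages[--rqstp->rq_spare_count];
    for (; i < pages; i++)
        rqstp->rq_pages[i] = alloc_page(GFP_KERNEL);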
Having the RDMA layer return pages when they are finished with might
help. You might even be able to use atomics (cmpxchg) to handle the
contention. But I'm not convinced it would be worth it.
I *really* like your idea of a batch-API for page-alloc and page-free.
This would likely be useful for other users, and it would be worth
writing more code to get peak performance - things such as per-cpu
queues of returned pages and so-forth (which presumably already exist).
I cannot be sure that the batch-API would be better than a focused API
just for RDMA -> NFSD. But my guess is that it would be at least nearly
as good, and would likely get a lot more eyes on the code.
NeilBrown
> On Feb 1, 2021, at 6:27 PM, NeilBrown <[email protected]> wrote:
>
> On Mon, Feb 01 2021, Chuck Lever wrote:
>
>>> On Jan 31, 2021, at 6:45 PM, NeilBrown <[email protected]> wrote:
>>>
>>> On Fri, Jan 29 2021, Chuck Lever wrote:
>>>>
>>>> What's your opinion?
>>>
>>> To form a coherent opinion, I would need to know what that problem is.
>>> I certainly accept that there could be performance problems in releasing
>>> and re-allocating pages which might be resolved by batching, or by copying,
>>> or by better tracking. But without knowing what hot-spot you want to
>>> cool down, I cannot think about how that fits into the big picture.
>>> So: what exactly is the problem that you see?
>>
>> The problem is that each 1MB NFS READ, for example, hands 257 pages
>> back to the page allocator, then allocates another 257 pages. One page
>> is not terrible, but 510+ allocator calls for every RPC begins to get
>> costly.
>>
>> Also, remember that both allocating and freeing a page means an irqsave
>> spin lock -- that will serialize all other page allocations, including
>> allocation done by other nfsd threads.
>>
>> So I'd like to lower page allocator contention and the rate at which
>> IRQs are disabled and enabled when the NFS server becomes busy, as it
>> might with several 25 GbE NICs, for instance.
>>
>> Holding onto the same pages means we can make better use of TLB
>> entries -- fewer TLB flushes is always a good thing.
>>
>> I know that the network folks at Red Hat have been staring hard at
>> reducing memory allocation in the stack for several years. I recall
>> that Matthew Wilcox recently made similar improvements to the block
>> layer.
>>
>> With the advent of 100GbE and Optane-like durable storage, the balance
>> of memory allocation cost to device latency has shifted so that
>> superfluous memory allocation is noticeable.
>>
>>
>> At first I thought of creating a page allocator API that could grab
>> or free an array of pages while taking the allocator locks once. But
>> now I wonder if there are opportunities to reduce the amount of page
>> allocator traffic.
>
> Thanks. This helps me a lot.
>
> I wonder if there is some low-hanging fruit here.
>
> If I read the code correctly (which is not certain, but what I see does
> seem to agree with vague memories of how it all works), we currently do
> a lot of wasted alloc/frees for zero-copy reads.
>
> We allocate lots of pages and store the pointers in ->rq_respages
> (i.e. ->rq_pages). Then nfsd_splice_actor() frees many of those pages and
> replaces the pointers with pointers to page-cache pages. Then we release
> those page-cache pages.
>
> We do need to have allocated those anonymous pages, but we don't need
> to free them.
> We can add some new array for storing them, have nfsd_splice_actor move
> them to that array, and have svc_alloc_arg() move pages back from the
> store rather than re-allocating them.
>
> Or maybe something even more sophisticated where we only move them out
> of the store when we actually need them.
>
> Having the RDMA layer return pages when they are finished with might
> help. You might even be able to use atomics (cmpxchg) to handle the
> contention. But I'm not convinced it would be worth it.
>
> I *really* like your idea of a batch-API for page-alloc and page-free.
> This would likely be useful for other users, and it would be worth
> writing more code to get peak performance - things such as per-cpu
> queues of returned pages and so-forth (which presumably already exist).
Baby steps.
Because I'm perverse I started with bulk page freeing. In the course
of trying to invent a new API, I discovered there already is a batch
free_page() function called release_pages().
It seems to work as advertised for pages that are truly no longer
in use (ie RPC/RDMA pages) but not for pages that are still in flight
but released (ie TCP pages).
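The sort of call I mean is roughly this (a sketch; it assumes
rq_next_page has been advanced past the last page of the reply):

    /* after the reply has been transmitted: hand back all of this
     * RPC's response pages in one call rather than a loop of
     * put_page() calls */
    release_pages(rqstp->rq_respages,
                  rqstp->rq_next_page - rqstp->rq_respages);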
release_pages() chains the pages in the passed-in array onto a list
by their page->lru fields. This seems to be a problem if a page
is still in use.
> I cannot be sure that the batch-API would be better than a focused API
> just for RDMA -> NFSD. But my guess is that it would be at least nearly
> as good, and would likely get a lot more eyes on the code.
>
> NeilBrown
--
Chuck Lever
On Fri, Feb 05, 2021 at 08:20:28PM +0000, Chuck Lever wrote:
> Baby steps.
>
> Because I'm perverse I started with bulk page freeing. In the course
> of trying to invent a new API, I discovered there already is a batch
> free_page() function called release_pages().
>
> It seems to work as advertised for pages that are truly no longer
> in use (ie RPC/RDMA pages) but not for pages that are still in flight
> but released (ie TCP pages).
>
> release_pages() chains the pages in the passed-in array onto a list
> by their page->lru fields. This seems to be a problem if a page
> is still in use.
I thought I remembered reading an lwn article about bulk page
allocation. Looking around now all I can see is
https://lwn.net/Articles/684616/
https://lwn.net/Articles/711075/
and I can't tell if any of that work was ever finished.
--b.
> On Feb 5, 2021, at 4:13 PM, J. Bruce Fields <[email protected]> wrote:
>
> On Fri, Feb 05, 2021 at 08:20:28PM +0000, Chuck Lever wrote:
>> Baby steps.
>>
>> Because I'm perverse I started with bulk page freeing. In the course
>> of trying to invent a new API, I discovered there already is a batch
>> free_page() function called release_pages().
>>
>> It seems to work as advertised for pages that are truly no longer
>> in use (ie RPC/RDMA pages) but not for pages that are still in flight
>> but released (ie TCP pages).
>>
>> release_pages() chains the pages in the passed-in array onto a list
>> by their page->lru fields. This seems to be a problem if a page
>> is still in use.
>
> I thought I remembered reading an lwn article about bulk page
> allocation. Looking around now all I can see is
>
> https://lwn.net/Articles/684616/
> https://lwn.net/Articles/711075/
>
> and I can't tell if any of that work was ever finished.
Jesper is the engineer at Red Hat I alluded to earlier, and this is
similar to what I discussed with him. I didn't see anything like
alloc_pages_bulk() when sniffing around in v5.11-rc, but my search
was not exhaustive.
I think freeing pages, for NFSD, is complicated by the need to release
pages that are still in use (either by the page cache or by the network
layer). The page freeing logic in RPC/RDMA is freeing pages that are
not in use by anyone, and I have a couple of clear approaches that
eliminate the need for that logic completely.
Therefore, IMO concentrating on making svc_alloc_arg() more efficient
should provide the biggest bang for both socket and RDMA transports.
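For reference, the allocation hot path in svc_alloc_arg() is a
page-at-a-time loop, roughly like this (paraphrased, not verbatim):

    /* one pass per RPC: for a 1MB rsize that is ~257 trips
     * through the page allocator */
    for (i = 0; i < pages; i++) {
        while (!rqstp->rq_pages[i]) {
            struct page *p = alloc_page(GFP_KERNEL);

            if (!p) {
                /* allocator under pressure: nap and retry (the
                 * real code also checks for thread shutdown here) */
                set_current_state(TASK_INTERRUPTIBLE);
                schedule_timeout(msecs_to_jiffies(500));
                continue;
            }
            rqstp->rq_pages[i] = p;
        }
    }

A bulk fill, or reusing pages we already have in hand, would collapse
that loop into one or two allocator calls per RPC.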
--
Chuck Lever
On Fri, Feb 05 2021, Chuck Lever wrote:
>
> Baby steps.
>
> Because I'm perverse I started with bulk page freeing. In the course
> of trying to invent a new API, I discovered there already is a batch
> free_page() function called release_pages().
>
> It seems to work as advertised for pages that are truly no longer
> in use (ie RPC/RDMA pages) but not for pages that are still in flight
> but released (ie TCP pages).
>
> release_pages() chains the pages in the passed-in array onto a list
> by their page->lru fields. This seems to be a problem if a page
> is still in use.
But if a page is still in use, it is not put on the list.
    if (!put_page_testzero(page))
            continue;
NeilBrown
On Fri, Feb 05 2021, J. Bruce Fields wrote:
> On Fri, Feb 05, 2021 at 08:20:28PM +0000, Chuck Lever wrote:
>> Baby steps.
>>
>> Because I'm perverse I started with bulk page freeing. In the course
>> of trying to invent a new API, I discovered there already is a batch
>> free_page() function called release_pages().
>>
>> It seems to work as advertised for pages that are truly no longer
>> in use (ie RPC/RDMA pages) but not for pages that are still in flight
>> but released (ie TCP pages).
>>
>> release_pages() chains the pages in the passed-in array onto a list
>> by their page->lru fields. This seems to be a problem if a page
>> is still in use.
>
> I thought I remembered reading an lwn article about bulk page
> allocation. Looking around now all I can see is
>
> https://lwn.net/Articles/684616/
> https://lwn.net/Articles/711075/
The work in this last one seems to have been merged, but without the
alloc_pages_bulk() foretold in the article; i.e. only __rmqueue_pcplist()
has been split out.
Adding that alloc_pages_bulk() for testing should be fairly easy:
http://lkml.iu.edu/hypermail/linux/kernel/1701.1/00732.html
If you can show a benefit in a real use-case, it shouldn't be too hard
to get this API accepted.
> Therefore, IMO concentrating on making svc_alloc_arg() more efficient
> should provide the biggest bang for both socket and RDMA transports.
I agree that this is the best first step. It might make the other steps
irrelevant.
NeilBrown