2023-06-16 16:39:19

by Jesper Dangaard Brouer

Subject: Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API



On 16/06/2023 13.57, Yunsheng Lin wrote:
> On 2023/6/16 0:19, Jesper Dangaard Brouer wrote:
>
> ...
>
>> You have mentioned veth as the use-case. I know I acked adding page_pool
>> use-case to veth, for when we need to convert an SKB into an
>> xdp_buff/xdp-frame, but maybe it was the wrong hammer(?).
>> In this case in veth, the size is known at the page allocation time.
>> Thus, using the page_pool API is wasting memory. We did this for
>> performance reasons, but we are not using PP for what it was intended
>> for. We mostly use page_pool because it is an existing recycle return
>> path, and we were too lazy to add another alloc-type (see enum
>> xdp_mem_type).
>>
>> Maybe you/we can extend veth to use this dynamic size API, to show us
>> that this API is a better approach. I will sign up for benchmarking
>> this (and coordinating with CC Maryam, as she came with the use-case we
>> improved on).
>
> Thanks, let's find out if page pool is the right hammer for the
> veth XDP case.
>
> Below is the change for veth using the new api in this patch.
> Only compile tested, as I am not familiar enough with veth XDP and
> the testing environment for it.
> Please try it if it is helpful.
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 614f3e3efab0..8850394f1d29 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -736,7 +736,7 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> if (skb_shared(skb) || skb_head_is_locked(skb) ||
> skb_shinfo(skb)->nr_frags ||
> skb_headroom(skb) < XDP_PACKET_HEADROOM) {
> - u32 size, len, max_head_size, off;
> + u32 size, len, max_head_size, off, truesize, page_offset;
> struct sk_buff *nskb;
> struct page *page;
> int i, head_off;
> @@ -752,12 +752,15 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
> goto drop;
>
> + size = min_t(u32, skb->len, max_head_size);
> + truesize = size;
> +
> /* Allocate skb head */
> - page = page_pool_dev_alloc_pages(rq->page_pool);
> + page = page_pool_dev_alloc(rq->page_pool, &page_offset, &truesize);

Maybe rename API to:

addr = netmem_alloc(rq->page_pool, &truesize);

> if (!page)
> goto drop;
>
> - nskb = napi_build_skb(page_address(page), PAGE_SIZE);
> + nskb = napi_build_skb(page_address(page) + page_offset, truesize);

IMHO this illustrates that the API is strange/funky.
(I think this is what Alex Duyck is also pointing out).

This is the memory (virtual) address "pointer":
addr = page_address(page) + page_offset

This is what napi_build_skb() takes as input. (I looked at other users
of napi_build_skb(), who all pass a mem ptr "va" as the arg.)
So, why does your new API provide the "page" and not just the address?

As proposed above:
addr = netmem_alloc(rq->page_pool, &truesize);

Maybe the API should be renamed, to indicate this isn't returning a "page"?
We have talked about the name "netmem" before.
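As a concrete sketch of that proposal (the netmem_alloc name and exact
signature are assumptions for illustration, not part of the patchset):

/* Hypothetical wrapper over the page_pool_dev_alloc() API from this
 * patch: hand back the virtual address instead of the page. */
static inline void *netmem_alloc(struct page_pool *pool,
				 unsigned int *truesize)
{
	unsigned int page_offset;
	struct page *page;

	page = page_pool_dev_alloc(pool, &page_offset, truesize);
	if (!page)
		return NULL;

	return page_address(page) + page_offset;
}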

> if (!nskb) {
> page_pool_put_full_page(rq->page_pool, page, true);
> goto drop;
> @@ -767,7 +770,6 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> skb_copy_header(nskb, skb);
> skb_mark_for_recycle(nskb);
>
> - size = min_t(u32, skb->len, max_head_size);
> if (skb_copy_bits(skb, 0, nskb->data, size)) {
> consume_skb(nskb);
> goto drop;
> @@ -782,14 +784,17 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> len = skb->len - off;
>
> for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
> - page = page_pool_dev_alloc_pages(rq->page_pool);
> + size = min_t(u32, len, PAGE_SIZE);
> + truesize = size;
> +
> + page = page_pool_dev_alloc(rq->page_pool, &page_offset,
> + &truesize);
> if (!page) {
> consume_skb(nskb);
> goto drop;
> }
>
> - size = min_t(u32, len, PAGE_SIZE);
> - skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
> + skb_add_rx_frag(nskb, i, page, page_offset, size, truesize);

I guess this shows the opposite: that the "page" _is_ used by the
existing API.

> if (skb_copy_bits(skb, off, page_address(page),
> size)) {
> consume_skb(nskb);

--Jesper



2023-06-16 17:57:16

by Alexander H Duyck

Subject: Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

On Fri, Jun 16, 2023 at 9:31 AM Jesper Dangaard Brouer
<[email protected]> wrote:
>
>
>
> On 16/06/2023 13.57, Yunsheng Lin wrote:
> > On 2023/6/16 0:19, Jesper Dangaard Brouer wrote:
> >
> > ...
> >
> >> You have mentioned veth as the use-case. I know I acked adding page_pool
> >> use-case to veth, for when we need to convert an SKB into an
> >> xdp_buff/xdp-frame, but maybe it was the wrong hammer(?).
> >> In this case in veth, the size is known at the page allocation time.
> >> Thus, using the page_pool API is wasting memory. We did this for
> >> performance reasons, but we are not using PP for what it was intended
> >> for. We mostly use page_pool because it is an existing recycle return
> >> path, and we were too lazy to add another alloc-type (see enum
> >> xdp_mem_type).
> >>
> >> Maybe you/we can extend veth to use this dynamic size API, to show us
> >> that this API is a better approach. I will sign up for benchmarking
> >> this (and coordinating with CC Maryam, as she came with the use-case we
> >> improved on).
> >
> > Thanks, let's find out if page pool is the right hammer for the
> > veth XDP case.
> >
> > Below is the change for veth using the new api in this patch.
> > Only compile tested, as I am not familiar enough with veth XDP and
> > the testing environment for it.
> > Please try it if it is helpful.
> >
> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> > index 614f3e3efab0..8850394f1d29 100644
> > --- a/drivers/net/veth.c
> > +++ b/drivers/net/veth.c
> > @@ -736,7 +736,7 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> > if (skb_shared(skb) || skb_head_is_locked(skb) ||
> > skb_shinfo(skb)->nr_frags ||
> > skb_headroom(skb) < XDP_PACKET_HEADROOM) {
> > - u32 size, len, max_head_size, off;
> > + u32 size, len, max_head_size, off, truesize, page_offset;
> > struct sk_buff *nskb;
> > struct page *page;
> > int i, head_off;
> > @@ -752,12 +752,15 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> > if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
> > goto drop;
> >
> > + size = min_t(u32, skb->len, max_head_size);
> > + truesize = size;
> > +
> > /* Allocate skb head */
> > - page = page_pool_dev_alloc_pages(rq->page_pool);
> > + page = page_pool_dev_alloc(rq->page_pool, &page_offset, &truesize);
>
> Maybe rename API to:
>
> addr = netmem_alloc(rq->page_pool, &truesize);
>
> > if (!page)
> > goto drop;
> >
> > - nskb = napi_build_skb(page_address(page), PAGE_SIZE);
> > + nskb = napi_build_skb(page_address(page) + page_offset, truesize);
>
> IMHO this illustrates that the API is strange/funky.
> (I think this is what Alex Duyck is also pointing out).
>
> This is the memory (virtual) address "pointer":
> addr = page_address(page) + page_offset
>
> This is what napi_build_skb() takes as input. (I looked at other users
> of napi_build_skb(), who all pass a mem ptr "va" as the arg.)
> So, why does your new API provide the "page" and not just the address?
>
> As proposed above:
> addr = netmem_alloc(rq->page_pool, &truesize);
>
> Maybe the API should be renamed, to indicate this isn't returning a "page"?
> We have talked about the name "netmem" before.

Yeah, this is more-or-less what I was getting at. Keep in mind this is
likely the most common case, since most frames passed back and forth
aren't usually much larger than 1500B.

> > if (!nskb) {
> > page_pool_put_full_page(rq->page_pool, page, true);
> > goto drop;
> > @@ -767,7 +770,6 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> > skb_copy_header(nskb, skb);
> > skb_mark_for_recycle(nskb);
> >
> > - size = min_t(u32, skb->len, max_head_size);
> > if (skb_copy_bits(skb, 0, nskb->data, size)) {
> > consume_skb(nskb);
> > goto drop;
> > @@ -782,14 +784,17 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> > len = skb->len - off;
> >
> > for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
> > - page = page_pool_dev_alloc_pages(rq->page_pool);
> > + size = min_t(u32, len, PAGE_SIZE);
> > + truesize = size;
> > +
> > + page = page_pool_dev_alloc(rq->page_pool, &page_offset,
> > + &truesize);
> > if (!page) {
> > consume_skb(nskb);
> > goto drop;
> > }
> >
> > - size = min_t(u32, len, PAGE_SIZE);
> > - skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
> > + skb_add_rx_frag(nskb, i, page, page_offset, size, truesize);
>
> I guess this shows the opposite: that the "page" _is_ used by the
> existing API.

This is sort-of true. One thing that has come up as of late is that all
this stuff is being moved over to folios anyway and getting away from
pages. In addition, I am not sure how often we have to take this
path, as I don't know how many non-Tx frames end up having
fragments added to them. For something like veth it might be more
common though, since Tx becomes Rx in this case.

One thought I had on this is that we could look at adding a new
function that abstracts this away and makes use of netmem instead.
Then the whole page/folio thing would be that much further removed.

One other question I have now that I look at this code: why is
it using page_pool and not just a frag cache allocator, or pages
themselves? It doesn't seem like it has a DMA mapping to deal with,
since this is essentially copy-break code. It seems problematic to use
page_pool when there is no DMA involved here at all. This could be more easily
handled with just a single page_frag_cache-style allocator.
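For reference, a minimal sketch of what that could look like (the
function names and the per-rq placement are assumptions for
illustration; page_frag_alloc()/page_frag_free() are the existing
kernel APIs):

/* Slice buffers from an order-3 (32KB) compound page via the existing
 * page_frag_cache allocator; recycling is plain refcounting, with no
 * DMA mapping kept alive. */
static struct page_frag_cache veth_frag_cache;	/* would live per-rq */

static void *veth_frag_alloc(unsigned int size)
{
	return page_frag_alloc(&veth_frag_cache, size, GFP_ATOMIC);
}

static void veth_frag_free(void *addr)
{
	page_frag_free(addr);	/* drops a refcount on the backing page */
}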

2023-06-16 18:52:50

by Jesper Dangaard Brouer

Subject: Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API



On 16/06/2023 19.34, Alexander Duyck wrote:
> On Fri, Jun 16, 2023 at 9:31 AM Jesper Dangaard Brouer
> <[email protected]> wrote:
>>
>>
>>
>> On 16/06/2023 13.57, Yunsheng Lin wrote:
>>> On 2023/6/16 0:19, Jesper Dangaard Brouer wrote:
>>>
>>> ...
>>>
>>>> You have mentioned veth as the use-case. I know I acked adding page_pool
>>>> use-case to veth, for when we need to convert an SKB into an
>>>> xdp_buff/xdp-frame, but maybe it was the wrong hammer(?).
>>>> In this case in veth, the size is known at the page allocation time.
>>>> Thus, using the page_pool API is wasting memory. We did this for
>>>> performance reasons, but we are not using PP for what it was intended
>>>> for. We mostly use page_pool because it is an existing recycle return
>>>> path, and we were too lazy to add another alloc-type (see enum
>>>> xdp_mem_type).
>>>>
>>>> Maybe you/we can extend veth to use this dynamic size API, to show us
>>>> that this API is a better approach. I will sign up for benchmarking
>>>> this (and coordinating with CC Maryam, as she came with the use-case we
>>>> improved on).
>>>
>>> Thanks, let's find out if page pool is the right hammer for the
>>> veth XDP case.
>>>
>>> Below is the change for veth using the new api in this patch.
>>> Only compile tested, as I am not familiar enough with veth XDP and
>>> the testing environment for it.
>>> Please try it if it is helpful.
>>>
>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>> index 614f3e3efab0..8850394f1d29 100644
>>> --- a/drivers/net/veth.c
>>> +++ b/drivers/net/veth.c
>>> @@ -736,7 +736,7 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>> if (skb_shared(skb) || skb_head_is_locked(skb) ||
>>> skb_shinfo(skb)->nr_frags ||
>>> skb_headroom(skb) < XDP_PACKET_HEADROOM) {
>>> - u32 size, len, max_head_size, off;
>>> + u32 size, len, max_head_size, off, truesize, page_offset;
>>> struct sk_buff *nskb;
>>> struct page *page;
>>> int i, head_off;
>>> @@ -752,12 +752,15 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>> if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
>>> goto drop;
>>>
>>> + size = min_t(u32, skb->len, max_head_size);
>>> + truesize = size;
>>> +
>>> /* Allocate skb head */
>>> - page = page_pool_dev_alloc_pages(rq->page_pool);
>>> + page = page_pool_dev_alloc(rq->page_pool, &page_offset, &truesize);
>>
>> Maybe rename API to:
>>
>> addr = netmem_alloc(rq->page_pool, &truesize);
>>
>>> if (!page)
>>> goto drop;
>>>
>>> - nskb = napi_build_skb(page_address(page), PAGE_SIZE);
>>> + nskb = napi_build_skb(page_address(page) + page_offset, truesize);
>>
>> IMHO this illustrates that the API is strange/funky.
>> (I think this is what Alex Duyck is also pointing out).
>>
>> This is the memory (virtual) address "pointer":
>> addr = page_address(page) + page_offset
>>
>> This is what napi_build_skb() takes as input. (I looked at other users
>> of napi_build_skb(), who all pass a mem ptr "va" as the arg.)
>> So, why does your new API provide the "page" and not just the address?
>>
>> As proposed above:
>> addr = netmem_alloc(rq->page_pool, &truesize);
>>
>> Maybe the API should be renamed, to indicate this isn't returning a "page"?
>> We have talked about the name "netmem" before.
>
> Yeah, this is more-or-less what I was getting at. Keep in mind this is
> likely the most common case, since most frames passed back and forth
> aren't usually much larger than 1500B.
>

Good to have it confirmed that this is "more-or-less" your suggestion/direction.


>>> if (!nskb) {
>>> page_pool_put_full_page(rq->page_pool, page, true);
>>> goto drop;
>>> @@ -767,7 +770,6 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>> skb_copy_header(nskb, skb);
>>> skb_mark_for_recycle(nskb);
>>>
>>> - size = min_t(u32, skb->len, max_head_size);
>>> if (skb_copy_bits(skb, 0, nskb->data, size)) {
>>> consume_skb(nskb);
>>> goto drop;
>>> @@ -782,14 +784,17 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>> len = skb->len - off;
>>>
>>> for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
>>> - page = page_pool_dev_alloc_pages(rq->page_pool);
>>> + size = min_t(u32, len, PAGE_SIZE);
>>> + truesize = size;
>>> +
>>> + page = page_pool_dev_alloc(rq->page_pool, &page_offset,
>>> + &truesize);
>>> if (!page) {
>>> consume_skb(nskb);
>>> goto drop;
>>> }
>>>
>>> - size = min_t(u32, len, PAGE_SIZE);
>>> - skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
>>> + skb_add_rx_frag(nskb, i, page, page_offset, size, truesize);
>>
>> I guess this shows the opposite: that the "page" _is_ used by the
>> existing API.
>
> This is sort-of true. One thing that has come up as of late is that all
> this stuff is being moved over to folios anyway and getting away from
> pages. In addition, I am not sure how often we have to take this
> path, as I don't know how many non-Tx frames end up having
> fragments added to them. For something like veth it might be more
> common though, since Tx becomes Rx in this case.

I'm thinking that it is very unlikely that XDP has modified the
fragments. So, why are we allocating and copying the fragments?
Wouldn't it be possible for this veth code to bump the refcnt on these
fragments? (Maybe I missed some detail.)
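For illustration, a rough sketch of that refcnt idea (hedged: whether
this is safe depends on who owns the fragment pages, see the replies
below; the truesize handling is a simplification):

/* Instead of allocating and copying each fragment, take a reference
 * on the existing page and attach it to the new skb directly. */
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

	skb_frag_ref(skb, i);	/* bump the page refcount */
	skb_add_rx_frag(nskb, i, skb_frag_page(frag),
			skb_frag_off(frag), skb_frag_size(frag),
			skb_frag_size(frag) /* truesize: simplified */);
}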

>
> One thought I had on this is that we could look at adding a new
> function that abstracts this away and makes use of netmem instead.
> Then the whole page/folio thing would be that much further removed.

I like this "thought" of yours :-)

>
> One other question I have now that I look at this code as well. Why is
> it using page_pool and not just a frag cache allocator, or pages
> themselves? It doesn't seem like it has a DMA mapping to deal with
> since this is essentially copy-break code. Seems problematic that
> there is no DMA involved here at all. This could be more easily
> handled with just a single page_frag_cache style allocator.
>

Yes, precisely.
I distinctly remember that I tried to poke you and Eric about this approach
earlier, but I cannot find a link to that email.

I would really appreciate it if you, Alex, could give the approach in
veth_convert_skb_to_xdp_buff() some review, as I believe there is huge
potential for improvements that will lead to large performance
gains. (I'm sure Maryam will be eager to help re-test performance
for her use-cases.)

--Jesper


2023-06-16 19:00:36

by Jakub Kicinski

Subject: Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

On Fri, 16 Jun 2023 20:41:45 +0200 Jesper Dangaard Brouer wrote:
> > This is sort-of true. One thing that has come up as of late is that all
> > this stuff is being moved over to folios anyway and getting away from
> > pages. In addition, I am not sure how often we have to take this
> > path, as I don't know how many non-Tx frames end up having
> > fragments added to them. For something like veth it might be more
> > common though, since Tx becomes Rx in this case.
>
> I'm thinking that it is very unlikely that XDP has modified the
> fragments. So, why are we allocating and copying the fragments?
> Wouldn't it be possible for this veth code to bump the refcnt on these
> fragments? (Maybe I missed some detail.)

They may be page cache pages, AFAIU.

2023-06-16 20:06:40

by Alexander H Duyck

Subject: Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

On Fri, Jun 16, 2023 at 11:41 AM Jesper Dangaard Brouer
<[email protected]> wrote:
>
>
>
> On 16/06/2023 19.34, Alexander Duyck wrote:
> > On Fri, Jun 16, 2023 at 9:31 AM Jesper Dangaard Brouer
> > <[email protected]> wrote:
> >>
> >>
> >>
> >> On 16/06/2023 13.57, Yunsheng Lin wrote:
> >>> On 2023/6/16 0:19, Jesper Dangaard Brouer wrote:

[...]

>
>
> >>> if (!nskb) {
> >>> page_pool_put_full_page(rq->page_pool, page, true);
> >>> goto drop;
> >>> @@ -767,7 +770,6 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >>> skb_copy_header(nskb, skb);
> >>> skb_mark_for_recycle(nskb);
> >>>
> >>> - size = min_t(u32, skb->len, max_head_size);
> >>> if (skb_copy_bits(skb, 0, nskb->data, size)) {
> >>> consume_skb(nskb);
> >>> goto drop;
> >>> @@ -782,14 +784,17 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >>> len = skb->len - off;
> >>>
> >>> for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
> >>> - page = page_pool_dev_alloc_pages(rq->page_pool);
> >>> + size = min_t(u32, len, PAGE_SIZE);
> >>> + truesize = size;
> >>> +
> >>> + page = page_pool_dev_alloc(rq->page_pool, &page_offset,
> >>> + &truesize);
> >>> if (!page) {
> >>> consume_skb(nskb);
> >>> goto drop;
> >>> }
> >>>
> >>> - size = min_t(u32, len, PAGE_SIZE);
> >>> - skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
> >>> + skb_add_rx_frag(nskb, i, page, page_offset, size, truesize);
> >>
> >> I guess this shows the opposite: that the "page" _is_ used by the
> >> existing API.
> >
> > This is sort-of true. One thing that has come up as of late is that all
> > this stuff is being moved over to folios anyway and getting away from
> > pages. In addition, I am not sure how often we have to take this
> > path, as I don't know how many non-Tx frames end up having
> > fragments added to them. For something like veth it might be more
> > common though, since Tx becomes Rx in this case.
>
> I'm thinking that it is very unlikely that XDP has modified the
> fragments. So, why are we allocating and copying the fragments?
> Wouldn't it be possible for this veth code to bump the refcnt on these
> fragments? (Maybe I missed some detail.)

From what I can tell this is an exception case with multiple caveats
for shared, locked, or nonlinear frames.

As such, I suspect there may be some deprecated cases in there too,
since XDP multi-buf support has been around for a while, so the code
shouldn't need to reallocate to linearize a nonlinear frame.

> >
> > One other question I have now that I look at this code: why is
> > it using page_pool and not just a frag cache allocator, or pages
> > themselves? It doesn't seem like it has a DMA mapping to deal with,
> > since this is essentially copy-break code. It seems problematic to use
> > page_pool when there is no DMA involved here at all. This could be more easily
> > handled with just a single page_frag_cache-style allocator.
> >
>
> Yes, precisely.
> I distinctly remember that I tried to poke you and Eric about this approach
> earlier, but I cannot find a link to that email.
>
> I would really appreciate it if you, Alex, could give the approach in
> veth_convert_skb_to_xdp_buff() some review, as I believe there is huge
> potential for improvements that will lead to large performance
> gains. (I'm sure Maryam will be eager to help re-test performance
> for her use-cases.)

Well, just looking at it, the quick and dirty answer would be to look at
making use of something like page_frag_cache. I won't go into details
since it isn't too different from the frag allocator, but it is much
simpler since it is just doing reference count hacks instead of having
to do the extra overhead to keep the DMA mapping in place. The veth
would then just be sitting on at most an order-3 page while it is
waiting to fully consume it rather than waiting on a full pool of
pages.

Alternatively, it could do something similar to page_frag_alloc_align
itself and just bypass doing a custom allocator. If it went that route
it could do something almost like a ring buffer and greatly improve
the throughput, since it would be able to allocate a higher-order page
and just copy the entire skb in, so the entire thing would be linear
rather than having to allocate a bunch of single pages.
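A rough sketch of that alternative (hedged: rq->frag_cache is a
hypothetical field, and the tail sizing just follows the usual
skb_shared_info convention):

/* Allocate one slice big enough for the whole frame and linearize it,
 * instead of building a frag list from single pages. */
unsigned int size = XDP_PACKET_HEADROOM + skb->len +
		    SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
void *va = page_frag_alloc(&rq->frag_cache, size, GFP_ATOMIC);

if (!va)
	goto drop;
if (skb_copy_bits(skb, 0, va + XDP_PACKET_HEADROOM, skb->len)) {
	page_frag_free(va);
	goto drop;
}
/* build the xdp_buff around 'va'; the result is fully linear */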

2023-06-17 13:12:16

by Yunsheng Lin

Subject: Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

On 2023/6/17 1:34, Alexander Duyck wrote:
...

>>>
>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>> index 614f3e3efab0..8850394f1d29 100644
>>> --- a/drivers/net/veth.c
>>> +++ b/drivers/net/veth.c
>>> @@ -736,7 +736,7 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>> if (skb_shared(skb) || skb_head_is_locked(skb) ||
>>> skb_shinfo(skb)->nr_frags ||
>>> skb_headroom(skb) < XDP_PACKET_HEADROOM) {
>>> - u32 size, len, max_head_size, off;
>>> + u32 size, len, max_head_size, off, truesize, page_offset;
>>> struct sk_buff *nskb;
>>> struct page *page;
>>> int i, head_off;
>>> @@ -752,12 +752,15 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>> if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
>>> goto drop;
>>>
>>> + size = min_t(u32, skb->len, max_head_size);
>>> + truesize = size;
>>> +
>>> /* Allocate skb head */
>>> - page = page_pool_dev_alloc_pages(rq->page_pool);
>>> + page = page_pool_dev_alloc(rq->page_pool, &page_offset, &truesize);
>>
>> Maybe rename API to:
>>
>> addr = netmem_alloc(rq->page_pool, &truesize);

Unless we create a subsystem called netmem, I am not sure about
the name 'netmem'; it seems more confusing to use it here.

>>
>>> if (!page)
>>> goto drop;
>>>
>>> - nskb = napi_build_skb(page_address(page), PAGE_SIZE);
>>> + nskb = napi_build_skb(page_address(page) + page_offset, truesize);
>>
>> IMHO this illustrates that the API is strange/funky.
>> (I think this is what Alex Duyck is also pointing out).
>>
>> This is the memory (virtual) address "pointer":
>> addr = page_address(page) + page_offset
>>
>> This is what napi_build_skb() takes as input. (I looked at other users
>> of napi_build_skb(), who all pass a mem ptr "va" as the arg.)
>> So, why does your new API provide the "page" and not just the address?
>>
>> As proposed above:
>> addr = netmem_alloc(rq->page_pool, &truesize);
>>
>> Maybe the API should be renamed, to indicate this isn't returning a "page"?
>> We have talked about the name "netmem" before.
>
> Yeah, this is more-or-less what I was getting at. Keep in mind this is
> likely the most common case, since most frames passed back and forth
> aren't usually much larger than 1500B.

I do feel the pain here; that is why I used a per-CPU 'struct
page_pool_frag' to report the result back to the user, so that we
could report both 'va' and 'page' to the user in the RFC of this
patchset.

IMHO, compared to the above point, it is more important that
we have a unified implementation for both of them, instead of
a separate page frag implementation based on the page allocator.

Currently there are three implementations for page frag:
1. mm/page_alloc.c: net stack seems to be using it in the
rx part with 'struct page_frag_cache' and the main API
being page_frag_alloc_align().
2. net/core/sock.c: net stack seems to be using it in the
tx part with 'struct page_frag' and the main API being
skb_page_frag_refill().
3. drivers/vhost/net.c: vhost seems to be using it to build
the xdp frame, and its implementation seems to be a mix of
the above two.

Actually I have a patchset removing the third one, waiting
to be sent out after this one.

And I wonder if the first and second ones can be unified,
as it seems the only user-facing difference is that one
returns a va and the other returns a page. Other differences
seem to be implementation-specific; for example, one does
offset incrementing and the other does offset
decrementing.
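Side by side, the user-facing difference between the first two looks
like this (a sketch; GFP flags and fragsz are arbitrary):

/* 1) rx-style (mm/page_alloc.c): returns a virtual address */
void *va = page_frag_alloc(&cache, fragsz, GFP_ATOMIC);

/* 2) tx-style (net/core/sock.c): fills a struct page_frag with
 * page + offset, which the caller consumes and advances */
struct page_frag frag = {};

if (skb_page_frag_refill(fragsz, &frag, GFP_KERNEL)) {
	/* use frag.page at frag.offset for up to fragsz bytes ... */
	frag.offset += fragsz;
}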

2023-06-18 16:37:39

by Lorenzo Bianconi

Subject: Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

[...]
> >
> > Yes, precisely.
> > I distinctly remember that I tried to poke you and Eric about this approach
> > earlier, but I cannot find a link to that email.
> >
> > I would really appreciate it if you, Alex, could give the approach in
> > veth_convert_skb_to_xdp_buff() some review, as I believe there is huge
> > potential for improvements that will lead to large performance
> > gains. (I'm sure Maryam will be eager to help re-test performance
> > for her use-cases.)
>
> Well just looking at it the quick and dirty answer would be to look at
> making use of something like page_frag_cache. I won't go into details
> since it isn't too different from the frag allocator, but it is much
> simpler since it is just doing reference count hacks instead of having
> to do the extra overhead to keep the DMA mapping in place. The veth
> would then just be sitting on at most an order 3 page while it is
> waiting to fully consume it rather than waiting on a full pool of
> pages.

Hi,

I did some experiments using page_frag_cache/page_frag_alloc() instead of
page_pools in a simple environment I used to test XDP for the veth driver.
In particular, I allocate a new buffer in veth_convert_skb_to_xdp_buff() from
the page_frag_cache in order to copy the full skb into the new one, actually
"linearizing" the packet (since we know the original skb length).
I ran an iperf TCP connection over a veth pair where the
remote device runs the xdp_rxq_info sample (available in the kernel source
tree, with action XDP_PASS):

TCP client -- v0 === v1 (xdp_rxq_info) -- TCP server

net-next (page_pool):
- MTU 1500B: ~ 7.5 Gbps
- MTU 8000B: ~ 15.3 Gbps

net-next + page_frag_alloc:
- MTU 1500B: ~ 8.4 Gbps
- MTU 8000B: ~ 14.7 Gbps

It seems there is no clear "win" situation here (at least in this environment
and with this simple approach). Moreover:
- can the linearization introduce any issue whenever we perform XDP_REDIRECT
into a destination device?
- can the page_frag_cache introduce more memory fragmentation? (IIRC we were
experiencing this issue in mt76 before switching to page_pools.)

What do you think?

Regards,
Lorenzo

>
> Alternatively it could do something similar to page_frag_alloc_align
> itself and just bypass doing a custom allocator. If it went that route
> it could do something almost like a ring buffer and greatly improve
> the throughput since it would be able to allocate a higher order page
> and just copy the entire skb in so the entire thing would be linear
> rather than having to allocate a bunch of single pages.



2023-06-20 16:26:52

by Alexander H Duyck

Subject: Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

On Sun, Jun 18, 2023 at 8:05 AM Lorenzo Bianconi <[email protected]> wrote:
>
> [...]
> > >
> > > Yes, precisely.
> > > I distinctly remember that I tried to poke you and Eric about this approach
> > > earlier, but I cannot find a link to that email.
> > >
> > > I would really appreciate it if you, Alex, could give the approach in
> > > veth_convert_skb_to_xdp_buff() some review, as I believe there is huge
> > > potential for improvements that will lead to large performance
> > > gains. (I'm sure Maryam will be eager to help re-test performance
> > > for her use-cases.)
> >
> > Well just looking at it the quick and dirty answer would be to look at
> > making use of something like page_frag_cache. I won't go into details
> > since it isn't too different from the frag allocator, but it is much
> > simpler since it is just doing reference count hacks instead of having
> > to do the extra overhead to keep the DMA mapping in place. The veth
> > would then just be sitting on at most an order 3 page while it is
> > waiting to fully consume it rather than waiting on a full pool of
> > pages.
>
> Hi,
>
> I did some experiments using page_frag_cache/page_frag_alloc() instead of
> page_pools in a simple environment I used to test XDP for the veth driver.
> In particular, I allocate a new buffer in veth_convert_skb_to_xdp_buff() from
> the page_frag_cache in order to copy the full skb in the new one, actually
> "linearizing" the packet (since we know the original skb length).
> I ran an iperf TCP connection over a veth pair where the
> remote device runs the xdp_rxq_info sample (available in the kernel source
> tree, with action XDP_PASS):
>
> TCP client -- v0 === v1 (xdp_rxq_info) -- TCP server
>
> net-next (page_pool):
> - MTU 1500B: ~ 7.5 Gbps
> - MTU 8000B: ~ 15.3 Gbps
>
> net-next + page_frag_alloc:
> - MTU 1500B: ~ 8.4 Gbps
> - MTU 8000B: ~ 14.7 Gbps
>
> It seems there is no clear "win" situation here (at least in this environment
> and with this simple approach). Moreover:

For the 1500B packets it is a win, but for 8000B it looks like there
is a regression. Any idea what is causing it?

> - can the linearization introduce any issue whenever we perform XDP_REDIRECT
> into a destination device?

It shouldn't. If it does it would probably point to an issue w/ the
destination driver rather than an issue with the code doing this.

> - can the page_frag_cache introduce more memory fragmentation (IIRC we were
> experiencing this issue in mt76 before switching to page_pools).

I think it largely depends on where the packets are ending up. I know
this is the approach we are using for sockets, see
skb_page_frag_refill(). If nothing else, if you took a similar
approach to it you might be able to bypass the need for the
page_frag_cache itself, although you would likely still end up
allocating similar structures.

2023-06-20 21:45:52

by Lorenzo Bianconi

Subject: Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

[...]

> > I did some experiments using page_frag_cache/page_frag_alloc() instead of
> > page_pools in a simple environment I used to test XDP for the veth driver.
> > In particular, I allocate a new buffer in veth_convert_skb_to_xdp_buff() from
> > the page_frag_cache in order to copy the full skb in the new one, actually
> > "linearizing" the packet (since we know the original skb length).
> > I ran an iperf TCP connection over a veth pair where the
> > remote device runs the xdp_rxq_info sample (available in the kernel source
> > tree, with action XDP_PASS):
> >
> > TCP client -- v0 === v1 (xdp_rxq_info) -- TCP server
> >
> > net-next (page_pool):
> > - MTU 1500B: ~ 7.5 Gbps
> > - MTU 8000B: ~ 15.3 Gbps
> >
> > net-next + page_frag_alloc:
> > - MTU 1500B: ~ 8.4 Gbps
> > - MTU 8000B: ~ 14.7 Gbps
> >
> > It seems there is no clear "win" situation here (at least in this environment
> > and with this simple approach). Moreover:
>
> For the 1500B packets it is a win, but for 8000B it looks like there
> is a regression. Any idea what is causing it?

nope, I have not looked into it yet.

>
> > - can the linearization introduce any issue whenever we perform XDP_REDIRECT
> > into a destination device?
>
> It shouldn't. If it does it would probably point to an issue w/ the
> destination driver rather than an issue with the code doing this.

ack, fine.

>
> > - can the page_frag_cache introduce more memory fragmentation (IIRC we were
> > experiencing this issue in mt76 before switching to page_pools).
>
> I think it largely depends on where the packets are ending up. I know
> this is the approach we are using for sockets, see
> skb_page_frag_refill(). If nothing else, if you took a similar
> approach to it you might be able to bypass the need for the
> page_frag_cache itself, although you would likely still end up
> allocating similar structures.

ack.

Regards,
Lorenzo



2023-06-21 12:04:06

by Jesper Dangaard Brouer

Subject: Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API



On 20/06/2023 23.16, Lorenzo Bianconi wrote:
> [...]
>
>>> I did some experiments using page_frag_cache/page_frag_alloc() instead of
>>> page_pools in a simple environment I used to test XDP for the veth driver.
>>> In particular, I allocate a new buffer in veth_convert_skb_to_xdp_buff() from
>>> the page_frag_cache in order to copy the full skb in the new one, actually
>>> "linearizing" the packet (since we know the original skb length).
>>> I ran an iperf TCP connection over a veth pair where the
>>> remote device runs the xdp_rxq_info sample (available in the kernel source
>>> tree, with action XDP_PASS):
>>>
>>> TCP client -- v0 === v1 (xdp_rxq_info) -- TCP server
>>>
>>> net-next (page_pool):
>>> - MTU 1500B: ~ 7.5 Gbps
>>> - MTU 8000B: ~ 15.3 Gbps
>>>
>>> net-next + page_frag_alloc:
>>> - MTU 1500B: ~ 8.4 Gbps
>>> - MTU 8000B: ~ 14.7 Gbps
>>>
>>> It seems there is no clear "win" situation here (at least in this environment
>>> and with this simple approach). Moreover:
>>
>> For the 1500B packets it is a win, but for 8000B it looks like there
>> is a regression. Any idea what is causing it?
>
> nope, I have not looked into it yet.
>

I think I can explain this using micro-benchmark numbers.
(Lorenzo and I have discussed this over IRC, so this is our summary)

*** MTU 1500***

* The MTU 1500 case, where page_frag_alloc is faster than PP (page_pool):

The PP allocates a 4K page for MTU 1500. The cost of alloc + recycle via
the ptr_ring is 48 cycles (page_pool02_ptr_ring Per elem: 48 cycles(tsc)).

The page_frag_alloc API allocates a 32KB order-3 page, and chops it up
for packets. The order-3 alloc + free costs 514 cycles (page_bench01:
alloc_pages order:3(32768B) 514 cycles). The MTU 1500 needs alloc size
1514+320+256 = 2090 bytes. In 32KB we can fit 15 packets. Thus, the
amortized cost per packet is only 34.3 cycles (514/15).

Thus, this explains why the page_frag_alloc API has an advantage here, as
the amortized cost per packet is lower (for page_frag_alloc).


*** MTU 8000 ***

* The MTU 8000 case, where PP is faster than page_frag_alloc.

The page_frag_alloc API cannot slice the same 32KB into as many packets.
The MTU 8000 needs alloc size 8000+14+256+320 = 8590 bytes. A 32KB page can
only store 3 full packets (32768/8590 = 3.81).
Thus, cost is 514/3 = 171 cycles.

The PP is actually challenged at MTU 8000, because it unfortunately
leads to allocating 3 full pages (12KiB), due to the needed alloc size of 8590
bytes. Thus the cost is 3x 48 cycles = 144 cycles.
(There is also a chance of Jakub's "allow_direct" optimization in
page_pool_return_skb_page increasing performance for PP.)

Thus, this explains why PP is fastest in this case.


*** Surprising insights ***

My (maybe) surprising conclusion is that we should combine the two
approaches, which is basically what Lin's patchset is doing!
Thus, I've actually suddenly become a fan of this patchset...

The insight is that PP can also work with higher-order pages and the
cost of PP recycling via the ptr_ring will be the same, regardless of page
order size. Thus, we can reduce the order-3 cost of 514 cycles to
basically 48 cycles, and fit 15 packets (MTU 1500), resulting in an
amortized allocator cost of 48/15 = 3.2 cycles.

On the PP alloc side this will be amazingly fast. On the PP frag-recycle
side (see page_pool_defrag_page()) there is an atomic_sub operation.
I've measured atomic_inc to cost 17 cycles (for the optimal non-contended
case); thus 3.2+17 ≈ 20 cycles, and it should still be a win.
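As a sketch, the combined setup would be a frag-capable page_pool
backed by order-3 pages (values illustrative only; PP_FLAG_PAGE_FRAG
as in then-current net-next):

struct page_pool_params pp_params = {
	.flags		= PP_FLAG_PAGE_FRAG,	/* allow frag slicing */
	.order		= 3,			/* 32KB backing pages */
	.pool_size	= 256,
	.nid		= NUMA_NO_NODE,
};
struct page_pool *pp = page_pool_create(&pp_params);
/* page_pool_dev_alloc() can then slice ~15 MTU-1500 buffers out of
 * each 32KB page, while recycling via the ptr_ring costs the same
 * regardless of page order. */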


--Jesper


2023-06-24 14:49:51

by Yunsheng Lin

Subject: Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

On 2023/6/21 19:55, Jesper Dangaard Brouer wrote:
>
>
> On 20/06/2023 23.16, Lorenzo Bianconi wrote:
>> [...]
>>
>>>> I did some experiments using page_frag_cache/page_frag_alloc() instead of
>>>> page_pools in a simple environment I used to test XDP for the veth driver.
>>>> In particular, I allocate a new buffer in veth_convert_skb_to_xdp_buff() from
>>>> the page_frag_cache in order to copy the full skb in the new one, actually
>>>> "linearizing" the packet (since we know the original skb length).
>>>> I ran an iperf TCP connection over a veth pair where the
>>>> remote device runs the xdp_rxq_info sample (available in the kernel source
>>>> tree, with action XDP_PASS):
>>>>
>>>> TCP client -- v0 === v1 (xdp_rxq_info) -- TCP server
>>>>
>>>> net-next (page_pool):
>>>> - MTU 1500B: ~  7.5 Gbps
>>>> - MTU 8000B: ~ 15.3 Gbps
>>>>
>>>> net-next + page_frag_alloc:
>>>> - MTU 1500B: ~  8.4 Gbps
>>>> - MTU 8000B: ~ 14.7 Gbps
>>>>
>>>> It seems there is no clear "win" situation here (at least in this environment
>>>> and with this simple approach). Moreover:
>>>
>>> For the 1500B packets it is a win, but for 8000B it looks like there
>>> is a regression. Any idea what is causing it?
>>
>> nope, I have not looked into it yet.
>>
>
> I think I can explain this using micro-benchmark numbers.
> (Lorenzo and I have discussed this over IRC, so this is our summary)
>
> *** MTU 1500***
>
> * The MTU 1500 case, where page_frag_alloc is faster than PP (page_pool):
>
> The PP allocates a 4K page for MTU 1500. The cost of alloc + recycle via
> the ptr_ring is 48 cycles (page_pool02_ptr_ring Per elem: 48 cycles(tsc)).
>
> The page_frag_alloc API allocates a 32KB order-3 page, and chops it up
> for packets.  The order-3 alloc + free costs 514 cycles (page_bench01:
> alloc_pages order:3(32768B) 514 cycles). The MTU 1500 needs alloc size
> 1514+320+256 = 2090 bytes.  In 32KB we can fit 15 packets.  Thus, the
> amortized cost per packet is only 34.3 cycles (514/15).
>
> Thus, this explains why the page_frag_alloc API has an advantage here, as
> the amortized cost per packet is lower (for page_frag_alloc).
>
>
> *** MTU 8000 ***
>
> * The MTU 8000 case, where PP is faster than page_frag_alloc.
>
> The page_frag_alloc API cannot slice the same 32KB into as many packets.
> The MTU 8000 needs alloc size 8000+14+256+320 = 8590 bytes.  A 32KB page can
> only store 3 full packets (32768/8590 = 3.81).
> Thus, cost is 514/3 = 171 cycles.
>
> The PP is actually challenged at MTU 8000, because it unfortunately
> leads to allocating 3 full pages (12KiB), due to the needed alloc size of 8590
> bytes. Thus the cost is 3x 48 cycles = 144 cycles.
> (There is also a chance of Jakub's "allow_direct" optimization in page_pool_return_skb_page increasing performance for PP.)
>
> Thus, this explains why PP is fastest in this case.

Great analysis.
So the problem seems to be: can we optimize the page fragment cache
implementation so that it at least matches the performance of PP
for the above case, as Alexander seems to be against using PP for
the veth case when no DMA mapping is involved?

>
>
> *** Surprising insights ***
>
> My (maybe) surprising conclusion is that we should combine the two
> approaches, which is basically what Lin's patchset is doing!
> Thus, I've actually suddenly become a fan of this patchset...
>
> The insight is that PP can also work with higher-order pages and the
> cost of PP recycling via the ptr_ring will be the same, regardless of page
> order size.  Thus, we can reduce the order-3 cost of 514 cycles to
> basically 48 cycles, and fit 15 packets (MTU 1500), resulting in an
> amortized allocator cost of 48/15 = 3.2 cycles.
>
> On the PP alloc side this will be amazingly fast. On the PP frag-recycle
> side (see page_pool_defrag_page()) there is an atomic_sub operation.
> I've measured atomic_inc to cost 17 cycles (for the optimal non-contended
> case); thus 3.2+17 ≈ 20 cycles, and it should still be a win.
>
>
> --Jesper
>
>