2023-06-15 15:01:38

by Alexander H Duyck

Subject: Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

On Wed, Jun 14, 2023 at 11:39 PM Yunsheng Lin <[email protected]> wrote:
>
> On 2023/6/14 22:18, Alexander Duyck wrote:
> > On Tue, Jun 13, 2023 at 8:51 PM Yunsheng Lin <[email protected]> wrote:
> >>
> >> On 2023/6/13 22:36, Alexander Duyck wrote:
> >>> On Fri, Jun 9, 2023 at 6:20 AM Yunsheng Lin <[email protected]> wrote:
> >>
> >> ...
> >>
> >>>>
> >>>> +static inline struct page *page_pool_alloc(struct page_pool *pool,
> >>>> + unsigned int *offset,
> >>>> + unsigned int *size, gfp_t gfp)
> >>>> +{
> >>>> + unsigned int max_size = PAGE_SIZE << pool->p.order;
> >>>> + struct page *page;
> >>>> +
> >>>> + *size = ALIGN(*size, dma_get_cache_alignment());
> >>>> +
> >>>> + if (WARN_ON(*size > max_size))
> >>>> + return NULL;
> >>>> +
> >>>> + if ((*size << 1) > max_size || PAGE_POOL_DMA_USE_PP_FRAG_COUNT) {
> >>>> + *size = max_size;
> >>>> + *offset = 0;
> >>>> + return page_pool_alloc_pages(pool, gfp);
> >>>> + }
> >>>> +
> >>>> + page = __page_pool_alloc_frag(pool, offset, *size, gfp);
> >>>> + if (unlikely(!page))
> >>>> + return NULL;
> >>>> +
> >>>> + /* There is very likely not enough space for another frag, so append the
> >>>> + * remaining size to the current frag to avoid truesize underestimate
> >>>> + * problem.
> >>>> + */
> >>>> + if (pool->frag_offset + *size > max_size) {
> >>>> + *size = max_size - *offset;
> >>>> + pool->frag_offset = max_size;
> >>>> + }
> >>>> +
> >>>
> >>> Rather than preventing a truesize underestimation this will cause one.
> >>> You are adding memory to the size of the page reserved and not
> >>> accounting for it anywhere as this isn't reported up to the network
> >>> stack. I would suggest dropping this from your patch.
> >>
> >> I was thinking the driver author would report it up to the network
> >> stack using the new API, something like below:
> >>
> >> int truesize = size;
> >> struct page *page;
> >> int offset;
> >>
> >> page = page_pool_dev_alloc(pool, &offset, &truesize);
> >> if (unlikely(!page))
> >> goto err;
> >>
> >> skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
> >> offset, size, truesize);
> >>
> >> and similar handling for the *_build_skb() case too.
> >>
> >> Does that make sense? Or did I miss something obvious here?
> >
> > It is more the fact that you are creating a solution in search of a
> > problem. As I said before most of the drivers will deal with these
> > sorts of issues by just doing the fragmentation themselves or
> > allocating fixed size frags and knowing how it will be divided into
> > the page.
>
> It seems there are already some drivers using the page pool API with
> a different frag size for almost every call; virtio_net and veth are
> the obvious ones.
>
> When reviewing the page frag support for virtio_net, I found that it
> was manipulating the page_pool->frag_offset directly to do something
> as this patch does, see:
>
> https://lore.kernel.org/lkml/CAKhg4tL9PrUebqQHL+s7A6-xqNnju3erNQejMr7UFjwTaOduZw@mail.gmail.com/
>
> I am not sure we both agree that drivers should not be manipulating
> page_pool->frag_offset directly unless it is really necessary?

Agreed, they are doing something similar to this. The difference,
though, is that they have chosen to do that. With your change you are
forcing driver writers into a setup that will likely not work for
most.
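
For illustration, the fixed-size variant is as simple as something
like this (hypothetical sketch; the constant and function name are
made up, only page_pool_dev_alloc_frag() is existing API):

/* Fixed-size frags: the driver always asks for the same size, so it
 * knows exactly how each page will be divided up.
 */
#define MY_RX_FRAG_SIZE	2048

static struct page *my_rx_alloc_frag(struct page_pool *pool,
				     unsigned int *offset)
{
	return page_pool_dev_alloc_frag(pool, offset, MY_RX_FRAG_SIZE);
}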

> For the specific case of virtio_net, it seems we have the below options:
> 1. neither the driver nor the page pool handles it.
> 2. the driver handles it by manipulating page_pool->frag_offset
> directly.

I view 2 as the only acceptable approach. Otherwise we are pushing
drivers into a solution that may not fit and forcing yet another fork
of allocation setups. There is a reason vendors have already taken the
approach of manipulating frag_offset directly. In many cases trying to
pre-allocate things just isn't going to work.
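
For illustration only, the sort of thing a driver ends up doing looks
roughly like this (hypothetical sketch that reaches into page_pool
internals, which is exactly why it has to be the driver's deliberate
choice):

/* After the device reports the real length, hand the unused tail of
 * the reservation back to the pool. Only safe while no further frag
 * has been given out from this page.
 */
static void my_rx_trim_frag(struct page_pool *pool, struct page *page,
			    unsigned int reserved, unsigned int used)
{
	used = ALIGN(used, dma_get_cache_alignment());

	if (pool->frag_page == page && used < reserved)
		pool->frag_offset -= reserved - used;
}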

> 3. the page pool handles it as this patch does.

The problem is the page pool isn't handling it. It is forcing the
allocations larger without reporting that anywhere, as of this patch.
It is trying to forecast what the next request is going to be, which
is problematic since it doesn't have enough information to be making
such assumptions.
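
To make that concrete, assume 4K pages and a 64-byte
dma_get_cache_alignment(), with every caller requesting 1536 bytes:
the first caller gets offset 0 and frag_offset advances to 1536; the
second gets offset 1536 and frag_offset advances to 3072. Since
3072 + 1536 > 4096, the second caller's *size is silently bumped to
2560, appending the 1024-byte page tail to it. Nothing guarantees that
caller reports the bumped size, rather than the 1536 it asked for, as
truesize.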

> Are there any other options I missed for the specific case of virtio_net?
> Which option do you prefer? And why?

My advice would be to leave it to the driver.

What concerns me is that you seem to be taking the page pool API in a
direction other than what it was really intended for. For any physical
device you aren't going to necessarily know what size fragment you are
working with until you have already allocated the page and DMA has
been performed. That is why drivers such as the Mellanox driver do the
fragmenting in the driver instead of allocating pre-fragmented pages.
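
The pattern there is roughly the following (hypothetical sketch, not
actual mlx5 code; it assumes the pool was created with
PP_FLAG_PAGE_FRAG):

struct page *page = page_pool_dev_alloc_pages(pool);

if (unlikely(!page))
	return -ENOMEM;

/* post the whole page to the HW; only the completion tells us the
 * received length 'len'
 */

if (len <= PAGE_SIZE / 2)
	/* small packet: split the page after the fact by declaring two
	 * frag users, so the other half can be handed out later and
	 * the page recycles once both are released
	 */
	page_pool_fragment_page(page, 2);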

> >
> > If you are going to go down this path then you should have a consumer
> > for the API and fully implement it instead of taking half measures and
> > making truesize underreporting worse by evicting pages earlier.
>
> I am not sure I understand what you mean by "a consumer for the API".
> Do you mean adding a new API, something like page_pool_free(), to do
> something lightweight, such as decrementing the frag user count and
> adjusting the frag_offset, corresponding to the page_pool_alloc() API
> introduced in this patch?

What I was getting at is that if you are going to add an API you have
to have a consumer for it. That is rule #1 for kernel API development:
you don't add an API without a consumer. The changes you are making
are meant to support some future implementation, and I see them
breaking most of the existing implementations. That is my concern.


2023-06-15 16:43:05

by Jesper Dangaard Brouer

Subject: Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API


On 15/06/2023 16.45, Alexander Duyck wrote:
[..]
>
> What concerns me is that you seem to be taking the page pool API in a
> direction other than what it was really intended for. For any physical
> device you aren't going to necessarily know what size fragment you are
> working with until you have already allocated the page and DMA has
> been performed. That is why drivers such as the Mellanox driver are
> fragmenting in the driver instead of allocated pre-fragmented pages.
>

+1

I share Alexander Duyck's concerns here. As the inventor and
maintainer, I can say this is taking the page_pool API in a direction
I didn't intend or plan for. As Alex also says, the intent was
fixed-size memory chunks that are DMA ready. The use-case was the
physical device RX "early demux" problem, where the size is not known
beforehand.

I need to be convinced this is a good direction to take the page_pool
design/architecture in, i.e. allocations with dynamic sizes. Maybe it
is a good idea, but as noted below, consumers of the API are usually
the way to show that it is.

[...]
>
> What I was getting at is that if you are going to add an API you have
> to have a consumer for the API. That is rule #1 for kernel API
> development. You don't add API without a consumer for it. The changes
> you are making are to support some future implementation, and I see it
> breaking most of the existing implementation. That is my concern.
>

You have mentioned veth as the use-case. I know I acked adding the
page_pool use-case to veth, for when we need to convert an SKB into an
xdp_buff/xdp_frame, but maybe it was the wrong hammer(?).
In the veth case, the size is known at page allocation time, so using
the page_pool API wastes memory. We did it for performance reasons,
but we are not using PP for what it was intended for. We mostly use
page_pool because it provides an existing recycle return path, and we
were too lazy to add another alloc-type (see enum xdp_mem_type).
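
For reference, the allocator types defined there today are roughly:

enum xdp_mem_type {
	MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
	MEM_TYPE_PAGE_POOL,
	MEM_TYPE_XSK_BUFF_POOL,
	MEM_TYPE_MAX,
};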

Maybe you/we can extend veth to use this dynamic size API, to show us
that this API is a better approach. I will sign up for benchmarking
this (and coordinating with Maryam, CC'd, as she came up with the
use-case we improved on).

--Jesper