Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
MIME-Version: 1.0
References: <20230609131740.7496-1-linyunsheng@huawei.com> <20230609131740.7496-4-linyunsheng@huawei.com>
 <CAKgT0UfVwQ=ri7ZDNnsATH2RQpEz+zDBBb6YprvniMEWGdw+dQ@mail.gmail.com>
 <36366741-8df2-1137-0dd9-d498d0f770e4@huawei.com> <CAKgT0UdXTSv1fDHBX4UC6Ok9NXKMJ_9F88CEv5TK+mpzy0N21g@mail.gmail.com>
 <c06f6f59-6c35-4944-8f7a-7f6f0e076649@huawei.com>
In-Reply-To: <c06f6f59-6c35-4944-8f7a-7f6f0e076649@huawei.com>
From:   Alexander Duyck <alexander.duyck@gmail.com>
Date:   Thu, 15 Jun 2023 07:45:58 -0700
Message-ID: <CAKgT0UccmDe+CE6=zDYQHi1=3vXf5MptzDo+BsPrKdmP5j9kgQ@mail.gmail.com>
Subject: Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API
To:     Yunsheng Lin <linyunsheng@huawei.com>
Cc:     davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com,
        netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
        Lorenzo Bianconi <lorenzo@kernel.org>,
        Jesper Dangaard Brouer <hawk@kernel.org>,
        Ilias Apalodimas <ilias.apalodimas@linaro.org>,
        Eric Dumazet <edumazet@google.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Precedence: bulk

On Wed, Jun 14, 2023 at 11:39=E2=80=AFPM Yunsheng Lin <linyunsheng@huawei.c=
om> wrote:
>
> On 2023/6/14 22:18, Alexander Duyck wrote:
> > On Tue, Jun 13, 2023 at 8:51=E2=80=AFPM Yunsheng Lin <linyunsheng@huawe=
i.com> wrote:
> >>
> >> On 2023/6/13 22:36, Alexander Duyck wrote:
> >>> On Fri, Jun 9, 2023 at 6:20=E2=80=AFAM Yunsheng Lin <linyunsheng@huaw=
ei.com> wrote:
> >>
> >> ...
> >>
> >>>>
> >>>> +static inline struct page *page_pool_alloc(struct page_pool *pool,
> >>>> +                                          unsigned int *offset,
> >>>> +                                          unsigned int *size, gfp_t=
 gfp)
> >>>> +{
> >>>> +       unsigned int max_size =3D PAGE_SIZE << pool->p.order;
> >>>> +       struct page *page;
> >>>> +
> >>>> +       *size =3D ALIGN(*size, dma_get_cache_alignment());
> >>>> +
> >>>> +       if (WARN_ON(*size > max_size))
> >>>> +               return NULL;
> >>>> +
> >>>> +       if ((*size << 1) > max_size || PAGE_POOL_DMA_USE_PP_FRAG_COU=
NT) {
> >>>> +               *size =3D max_size;
> >>>> +               *offset =3D 0;
> >>>> +               return page_pool_alloc_pages(pool, gfp);
> >>>> +       }
> >>>> +
> >>>> +       page =3D __page_pool_alloc_frag(pool, offset, *size, gfp);
> >>>> +       if (unlikely(!page))
> >>>> +               return NULL;
> >>>> +
> >>>> +       /* There is very likely not enough space for another frag, s=
o append the
> >>>> +        * remaining size to the current frag to avoid truesize unde=
restimate
> >>>> +        * problem.
> >>>> +        */
> >>>> +       if (pool->frag_offset + *size > max_size) {
> >>>> +               *size =3D max_size - *offset;
> >>>> +               pool->frag_offset =3D max_size;
> >>>> +       }
> >>>> +
> >>>
> >>> Rather than preventing a truesize underestimation this will cause one=
.
> >>> You are adding memory to the size of the page reserved and not
> >>> accounting for it anywhere as this isn't reported up to the network
> >>> stack. I would suggest dropping this from your patch.
> >>
> >> I was thinking about the driver author reporting it up to the network
> >> stack using the new API as something like below:
> >>
> >> int truesize =3D size;
> >> struct page *page;
> >> int offset;
> >>
> >> page =3D page_pool_dev_alloc(pool, &offset, &truesize);
> >> if (unlikely(!page))
> >>         goto err;
> >>
> >> skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
> >>                 offset, size, truesize);
> >>
> >> and similiar handling for *_build_skb() case too.
> >>
> >> Does it make senses for that? or did I miss something obvious here?
> >
> > It is more the fact that you are creating a solution in search of a
> > problem. As I said before most of the drivers will deal with these
> > sorts of issues by just doing the fragmentation themselves or
> > allocating fixed size frags and knowing how it will be divided into
> > the page.
>
> It seems that there are already some drivers which using the page pool
> API with different frag size for almost every calling, the virtio_net
> and veth are the obvious ones.
>
> When reviewing the page frag support for virtio_net, I found that it
> was manipulating the page_pool->frag_offset directly to do something
> as this patch does, see:
>
> https://lore.kernel.org/lkml/CAKhg4tL9PrUebqQHL+s7A6-xqNnju3erNQejMr7UFjw=
TaOduZw@mail.gmail.com/
>
> I am not sure we are both agreed that drivers should not be manipulating
> the page_pool->frag_offset directly unless it is really necessary?

Agreed, they are doing something similar to this. The difference is
though that they have chosen to do that. With your change you are
forcing driver writers into a setup that will likely not work for
most.

> For the specific case for virtio_net, it seems we have the below options:
> 1. both the driver and page pool do not handle it.
> 2. the driver handles it by manipulating the page_pool->frag_offset
>    directly.

I view 2 as being the only acceptable approach. Otherwise we are
forcing drivers into a solution that may not fit and forcing yet
another fork of allocation setups. There is a reason vendors have
already taken the approach of manipulating frag_offset directly. In
many cases trying to pre-allocate things just isn't going to work.

> 3. the page pool handles it as this patch does.

The problem is the page pool isn't handling it. It is forcing the
allocations larger without reporting them as of this patch. It is
trying to forecast what the next request is going to be which is
problematic since it doesn't have enough info to necessarily be making
such assumptions.

> Is there any other options I missed for the specific case for virtio_net?
> What is your perfer option? And why?

My advice would be to leave it to the driver.

What concerns me is that you seem to be taking the page pool API in a
direction other than what it was really intended for. For any physical
device you aren't going to necessarily know what size fragment you are
working with until you have already allocated the page and DMA has
been performed. That is why drivers such as the Mellanox driver are
fragmenting in the driver instead of allocated pre-fragmented pages.

> >
> > If you are going to go down this path then you should have a consumer
> > for the API and fully implement it instead of taking half measures and
> > making truesize underreporting worse by evicting pages earlier.
>
> I am not sure I understand what do you mean by "a consumer for the API",
> Do you mean adding a new API something like page_pool_free() to do
> something ligthweight, such as decrementing the frag user and adjusting
> the frag_offset, which is corresponding to the page_pool_alloc() API
> introduced in this patch?

What I was getting at is that if you are going to add an API you have
to have a consumer for the API. That is rule #1 for kernel API
development. You don't add API without a consumer for it. The changes
you are making are to support some future implementation, and I see it
breaking most of the existing implementation. That is my concern.