Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Subject: Re: [PATCH net-next v2 1/3] page_pool: unify frag page and non-frag
 page handling
To:     Alexander H Duyck <alexander.duyck@gmail.com>,
        Yunsheng Lin <linyunsheng@huawei.com>, davem@davemloft.net,
        kuba@kernel.org, pabeni@redhat.com
Cc:     netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
        Lorenzo Bianconi <lorenzo@kernel.org>,
        Jesper Dangaard Brouer <hawk@kernel.org>,
        Ilias Apalodimas <ilias.apalodimas@linaro.org>,
        Eric Dumazet <edumazet@google.com>
References: <20230529092840.40413-1-linyunsheng@huawei.com>
 <20230529092840.40413-2-linyunsheng@huawei.com>
 <e8db47e3fe99349a998ded1ce4f8da88ea9cc660.camel@gmail.com>
 <5d728f88-2bd0-7e78-90d5-499e85dc6fdf@huawei.com>
 <160fea176559526b38a4080cc0f52666634a7a4f.camel@gmail.com>
From:   Yunsheng Lin <yunshenglin0825@gmail.com>
Message-ID: <21f2fe04-8674-cc73-7a6c-cb0765c84dba@gmail.com>
Date:   Sat, 3 Jun 2023 20:59:13 +0800
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
 Thunderbird/78.10.2
MIME-Version: 1.0
In-Reply-To: <160fea176559526b38a4080cc0f52666634a7a4f.camel@gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-GB
Content-Transfer-Encoding: 7bit
Precedence: bulk

On 2023/6/3 0:37, Alexander H Duyck wrote:
...

>>
>> Please let me know if the above makes sense, or if misunderstood your
>> concern here.
> 
> So my main concern is that what this is doing is masking things so that
> the veth and virtio_net drivers can essentially lie about the truesize
> of the memory they are using in order to improve their performance by
> misleading the socket layer about how much memory it is actually
> holding onto.
> 
> We have historically had an issue with reporting the truesize of
> fragments, but generally the underestimation was kept to something less
> than 100% because pages were generally split by at least half. Where it
> would get messy is if a misbehaviing socket held onto packets for an
> exceedingly long time.
> 
> What this patch set is doing is enabling explicit lying about the
> truesize, and it compounds that by allowing for mixing small
> allocations w/ large ones.
> 
>>>
>>> The problem is there are some architectures where we just cannot
>>> support having pp_frag_count due to the DMA size. So it makes sense to
>>> leave those with just basic page pool instead of trying to fake that it
>>> is a fragmented page.
>>
>> It kind of depend on how you veiw it, this patch view it as only supporting
>> one frag when we can't support having pp_frag_count, so I would not call it
>> faking.
> 
> So the big thing that make it "faking" is the truesize underestimation
> that will occur with these frames.

Let's discuss truesize issue in patch 2 instead of here.
Personally, I still believe that if the driver can compute the
truesize correctly by manipulating the page->pp_frag_count and
frag offset directly, the page pool can do that too.

> 
>>
>>>
>>>> ---

...

>>>
>>> What is the point of this line? It doesn't make much sense to me. Are
>>> you just trying to force an optiimization? You would be better off just
>>> taking the BUILD_BUG_ON contents and feeding them into an if statement
>>> below since the statement will compile out anyway.
>>
>> if the "if statement" you said refers to the below, then yes.
>>
>>>> +		if (!__builtin_constant_p(nr))
>>>> +			atomic_long_set(&page->pp_frag_count, 1);
>>
>> But it is a *BUILD*_BUG_ON(), isn't it compiled out anywhere we put it?
>>
>> Will move it down anyway to avoid confusion.
> 
> Actually now that I look at this more it is even more confusing. The
> whole point of this function was that we were supposed to be getting
> pp_frag_count to 0. However you are setting it to 1.
> 
> This is seriously flawed. If we are going to treat non-fragmented pages
> as mono-frags then that is what we should do. We should be pulling this
> acounting into all of the page pool freeing paths, not trying to force
> the value up to 1 for the non-fragmented case.

I am not sure I understand what do you mean by 'non-fragmented ',
'mono-frags', 'page pool freeing paths' and 'non-fragmented case'
here. maybe describe it more detailed with something like the
pseudocode?

> 
>>>
>>> It seems like what you would want here is:
>>> 	BUG_ON(!PAGE_POOL_DMA_USE_PP_FRAG_COUNT);
>>>
>>> Otherwise you are potentially writing to a variable that shouldn't
>>> exist.
>>
>> Not if the driver use the page_pool_alloc_frag() API instead of manipulating
>> the page->pp_frag_count directly using the page_pool_defrag_page() like mlx5.
>> The mlx5 call the page_pool_create() with with PP_FLAG_PAGE_FRAG set, and
>> it does not seems to have a failback for PAGE_POOL_DMA_USE_PP_FRAG_COUNT
>> case, and we may need to keep PP_FLAG_PAGE_FRAG for it. That's why we need
>> to keep the driver from implementation detail(pp_frag_count handling specifically)
>> of the frag support unless we have a very good reason.
>>
> 
> Getting the truesize is that "very good reason". The fact is the
> drivers were doing this long before page pool came around. Trying to
> pull that away from them is the wrong way to go in my opinion.

If the truesize is really the concern here, I think it make more
sense to enforce it in the page pool instead of each driver doing
their trick, so I also think we can do better here to handle
pp_frag_count in the page pool instead of driver handling it, so
let's continue the truesize disscussion in patch 2 to see if we
can come up with something better there.

> 
>>>>  	/* If nr == pp_frag_count then we have cleared all remaining
>>>>  	 * references to the page. No need to actually overwrite it, instead
>>>>  	 * we can leave this to be overwritten by the calling function.
>>>> @@ -311,19 +321,36 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
>>>>  	 * especially when dealing with a page that may be partitioned
>>>>  	 * into only 2 or 3 pieces.
>>>>  	 */
>>>> -	if (atomic_long_read(&page->pp_frag_count) == nr)
>>>> +	if (atomic_long_read(&page->pp_frag_count) == nr) {
>>>> +		/* As we have ensured nr is always one for constant case
>>>> +		 * using the BUILD_BUG_ON() as above, only need to handle
>>>> +		 * the non-constant case here for frag count draining.
>>>> +		 */
>>>> +		if (!__builtin_constant_p(nr))
>>>> +			atomic_long_set(&page->pp_frag_count, 1);
>>>> +
>>>>  		return 0;
>>>> +	}
>>>>  
> 
> The optimization here was already the comparison since we didn't have
> to do anything if pp_frag_count == nr. The whole point of pp_frag_count
> going to 0 is that is considered non-fragmented in that case and ready
> to be freed. By resetting it to 1 you are implying that there is still
> one *other* user that is holding a fragment so the page cannot be
> freed.
> 
> We weren't bothering with writing the value since the page is in the
> free path and this value is going to be unused until the page is
> reallocated anyway.

I am not sure what you meant above.
But I will describe what is this patch trying to do again:
When PP_FLAG_PAGE_FRAG is set and that flag is per page pool, not per
page, so page_pool_alloc_pages() is not allowed to be called as the
page->pp_frag_count is not setup correctly for the case.

So in order to allow calling page_pool_alloc_pages(), as best as I
can think of, either we need a per page flag/bit to decide whether
to do something like dec_and_test for page->pp_frag_count in
page_pool_is_last_frag(), or we unify the page->pp_frag_count handling
in page_pool_is_last_frag() so that we don't need a per page flag/bit.

This patch utilizes the optimization you mentioned above to unify the
page->pp_frag_count handling.

> 
>>>>  	ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>>>>  	WARN_ON(ret < 0);
>>>> +
>>>> +	/* Reset frag count back to 1, this should be the rare case when
>>>> +	 * two users call page_pool_defrag_page() currently.
>>>> +	 */
>>>> +	if (!ret)
>>>> +		atomic_long_set(&page->pp_frag_count, 1);
>>>> +
>>>>  	return ret;
>>>>  }
>>>>

...

>> As above, it is about unifying handling for frag and non-frag page in
>> page_pool_is_last_frag(). please let me know if there is any better way
>> to do it without adding statements here.
> 
> I get what you are trying to get at but I feel like the implementation
> is going to cause more problems than it helps. The problem is it is
> going to hurt base page pool performance and it just makes the
> fragmented pages that much more confusing to deal with.

For base page pool performance, as I mentioned before:
It remove PP_FLAG_PAGE_FRAG checking and only add the cost of
page_pool_fragment_page() in page_pool_set_pp_info(), which I
think it is negligible as we are already dirtying the same cache
line in page_pool_set_pp_info().

For the confusing, sometimes it is about personal taste, so I am
not going to argue with it:) But it would be good to provide a
non-confusing way to do that with minimal overhead. I feel like
you have provided it in the begin, but I am not able to understand
it yet.

> 
> My advice as a first step would be to look at first solving how to
> enable the PP_FLAG_PAGE_FRAG mode when you have
> PAGE_POOL_DMA_USE_PP_FRAG_COUNT as true. That should be creating mono-
> frags as we are calling them, and we should have a way to get the
> truesize for those so we know when we are consuming significant amount
> of memory.

Does the way to get the truesize in the below RFC make sense to you?
https://patchwork.kernel.org/project/netdevbpf/patch/20230516124801.2465-4-linyunsheng@huawei.com/

> 
> Once that is solved then we can look at what it would take to apply
> mono-frags to the standard page pool case. Ideally we would need to
> find a way to do it with minimal overhead.
> 
>