LinuxLists.cc - [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)

2023-07-05 16:40:57

Subject: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)

Add a couple intuitive helpers to hide Rx buffer implementation details
in the library and not multiplicate it between drivers. The settings are
optimized for Intel hardware, but nothing really HW-specific here.
Use the new page_pool_dev_alloc() to dynamically switch between
split-page and full-page modes depending on MTU, page size, required
headroom etc. For example, on x86_64 with the default driver settings
each page is shared between 2 buffers. Turning on XDP (not in this
series) -> increasing headroom requirement pushes truesize out of 2048
boundary, leading to that each buffer starts getting a full page.
The "ceiling" limit is %PAGE_SIZE, as only order-0 pages are used to
avoid compound overhead. For the above architecture, this means maximum
linear frame size of 3712 w/o XDP.

Signed-off-by: Alexander Lobakin <[email protected]>
---
drivers/net/ethernet/intel/Kconfig | 1 +
drivers/net/ethernet/intel/libie/rx.c | 54 +++++++++++++
include/linux/net/intel/libie/rx.h | 111 +++++++++++++++++++++++++-
3 files changed, 165 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
index 86ecedeac115..e187942b5a5c 100644
--- a/drivers/net/ethernet/intel/Kconfig
+++ b/drivers/net/ethernet/intel/Kconfig
@@ -86,6 +86,7 @@ config E1000E_HWTS

config LIBIE
tristate
+ select PAGE_POOL
help
libie (Intel Ethernet library) is a common library containing
routines shared by several Intel Ethernet drivers.
diff --git a/drivers/net/ethernet/intel/libie/rx.c b/drivers/net/ethernet/intel/libie/rx.c
index f503476d8eef..c60d7b20ed20 100644
--- a/drivers/net/ethernet/intel/libie/rx.c
+++ b/drivers/net/ethernet/intel/libie/rx.c
@@ -3,6 +3,60 @@

#include <linux/net/intel/libie/rx.h>

+/* Rx buffer management */
+
+/**
+ * libie_rx_sync_len - get the actual buffer size to be synced and passed to HW
+ * @dev: &net_device to calculate the size for
+ * @hr: headroom in front of each frame
+ *
+ * Returns the buffer size to pass it to HW and use for DMA synchronization
+ * accounting: MTU the @dev has, HW required alignment, minimum and maximum
+ * allowed values, and system's page size.
+ */
+static u32 libie_rx_sync_len(const struct net_device *dev, u32 hr)
+{
+ u32 len;
+
+ len = READ_ONCE(dev->mtu) + LIBIE_RX_LL_LEN;
+ len = ALIGN(len, LIBIE_RX_BUF_LEN_ALIGN);
+ len = clamp(len, LIBIE_MIN_RX_BUF_LEN, LIBIE_RX_BUF_LEN(hr));
+
+ return len;
+}
+
+/**
+ * libie_rx_page_pool_create - create a PP with the default libie settings
+ * @napi: &napi_struct covering this PP (no usage outside its poll loops)
+ * @size: size of the PP, usually simply Rx queue len
+ *
+ * Returns &page_pool on success, casted -errno on failure.
+ */
+struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
+ u32 size)
+{
+ struct page_pool_params pp = {
+ .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
+ .order = LIBIE_RX_PAGE_ORDER,
+ .pool_size = size,
+ .nid = NUMA_NO_NODE,
+ .dev = napi->dev->dev.parent,
+ .napi = napi,
+ .dma_dir = DMA_FROM_DEVICE,
+ .offset = LIBIE_SKB_HEADROOM,
+ };
+ size_t truesize;
+
+ pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);
+
+ /* "Wanted" truesize, passed to page_pool_dev_alloc() */
+ truesize = roundup_pow_of_two(SKB_HEAD_ALIGN(pp.offset + pp.max_len));
+ pp.init_arg = (void *)truesize;
+
+ return page_pool_create(&pp);
+}
+EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);
+
/* O(1) converting i40e/ice/iavf's 8/10-bit hardware packet type to a parsed
* bitfield struct.
*/
diff --git a/include/linux/net/intel/libie/rx.h b/include/linux/net/intel/libie/rx.h
index 58bd0f35d025..8c0ccdff9a37 100644
--- a/include/linux/net/intel/libie/rx.h
+++ b/include/linux/net/intel/libie/rx.h
@@ -4,7 +4,116 @@
#ifndef __LIBIE_RX_H
#define __LIBIE_RX_H

-#include <linux/netdevice.h>
+#include <linux/if_vlan.h>
+#include <net/page_pool.h>
+
+/* Rx MTU/buffer/truesize helpers. Mostly pure software-side; HW-defined values
+ * are valid for all Intel HW.
+ */
+
+/* Space reserved in front of each frame */
+#define LIBIE_SKB_HEADROOM (NET_SKB_PAD + NET_IP_ALIGN)
+/* Maximum headroom to calculate max MTU below */
+#define LIBIE_MAX_HEADROOM LIBIE_SKB_HEADROOM
+/* Link layer / L2 overhead: Ethernet, 2 VLAN tags (C + S), FCS */
+#define LIBIE_RX_LL_LEN (ETH_HLEN + 2 * VLAN_HLEN + ETH_FCS_LEN)
+
+/* Always use order-0 pages */
+#define LIBIE_RX_PAGE_ORDER 0
+/* Rx buffer size config is a multiple of 128 */
+#define LIBIE_RX_BUF_LEN_ALIGN 128
+/* HW-writeable space in one buffer: truesize - headroom/tailroom,
+ * HW-aligned
+ */
+#define __LIBIE_RX_BUF_LEN(hr) \
+ ALIGN_DOWN(SKB_MAX_ORDER(hr, LIBIE_RX_PAGE_ORDER), \
+ LIBIE_RX_BUF_LEN_ALIGN)
+/* The smallest and largest size for a single descriptor as per HW */
+#define LIBIE_MIN_RX_BUF_LEN 1024U
+#define LIBIE_MAX_RX_BUF_LEN 9728U
+/* "True" HW-writeable space: minimum from SW and HW values */
+#define LIBIE_RX_BUF_LEN(hr) min_t(u32, __LIBIE_RX_BUF_LEN(hr), \
+ LIBIE_MAX_RX_BUF_LEN)
+
+/* The maximum frame size as per HW (S/G) */
+#define __LIBIE_MAX_RX_FRM_LEN 16382U
+/* ATST, HW can chain up to 5 Rx descriptors */
+#define LIBIE_MAX_RX_FRM_LEN(hr) \
+ min_t(u32, __LIBIE_MAX_RX_FRM_LEN, LIBIE_RX_BUF_LEN(hr) * 5)
+/* Maximum frame size minus LL overhead */
+#define LIBIE_MAX_MTU \
+ (LIBIE_MAX_RX_FRM_LEN(LIBIE_MAX_HEADROOM) - LIBIE_RX_LL_LEN)
+
+/* Rx buffer management */
+
+/**
+ * struct libie_rx_buffer - structure representing an Rx buffer
+ * @page: page holding the buffer
+ * @offset: offset from the page start (to the headroom)
+ * @truesize: total space occupied by the buffer (w/ headroom and tailroom)
+ *
+ * Depending on the MTU, API switches between one-page-per-frame and shared
+ * page model (to conserve memory on bigger-page platforms). In case of the
+ * former, @offset is always 0 and @truesize is always %PAGE_SIZE.
+ */
+struct libie_rx_buffer {
+ struct page *page;
+ u32 offset;
+ u32 truesize;
+};
+
+struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
+ u32 size);
+
+/**
+ * libie_rx_alloc - allocate a new Rx buffer
+ * @pool: page_pool to allocate from
+ * @buf: buffer structure to populate
+ *
+ * Returns &dma_addr_t to be passed to HW for Rx, %DMA_MAPPING_ERROR otherwise.
+ */
+static inline dma_addr_t libie_rx_alloc(struct page_pool *pool,
+ struct libie_rx_buffer *buf)
+{
+ /* "Wanted" truesize, see libie_rx_page_pool_create() */
+ buf->truesize = (size_t)pool->p.init_arg;
+ buf->page = page_pool_dev_alloc(pool, &buf->offset, &buf->truesize);
+ if (!buf->page)
+ return DMA_MAPPING_ERROR;
+
+ return page_pool_get_dma_addr(buf->page) + buf->offset +
+ pool->p.offset;
+}
+
+/**
+ * libie_rx_sync_for_cpu - synchronize or recycle buffer post DMA
+ * @buf: buffer to process
+ * @len: frame length from the descriptor
+ *
+ * Process the buffer after it's written by HW. The regular path is to
+ * synchronize DMA for CPU, but in case of no data it will be immediately
+ * recycled back to its PP.
+ *
+ * Returns true when there's data to process, false otherwise.
+ */
+static inline bool __must_check
+libie_rx_sync_for_cpu(const struct libie_rx_buffer *buf, u32 len)
+{
+ struct page *page = buf->page;
+
+ /* Very rare, but possible case. The most common reason:
+ * the last fragment contained FCS only, which was then
+ * stripped by the HW.
+ */
+ if (unlikely(!len)) {
+ page_pool_recycle_direct(page->pp, page);
+ return false;
+ }
+
+ page_pool_dma_sync_for_cpu(page->pp, page, buf->offset, len);
+
+ return true;
+}

/* O(1) converting i40e/ice/iavf's 8/10-bit hardware packet type to a parsed
* bitfield struct.
--
2.41.0

2023-07-06 12:56:38

by Yunsheng Lin

[permalink] [raw]

Subject: Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)

On 2023/7/5 23:55, Alexander Lobakin wrote:

> +/**
> + * libie_rx_page_pool_create - create a PP with the default libie settings
> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
> + * @size: size of the PP, usually simply Rx queue len
> + *
> + * Returns &page_pool on success, casted -errno on failure.
> + */
> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
> + u32 size)
> +{
> + struct page_pool_params pp = {
> + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
> + .order = LIBIE_RX_PAGE_ORDER,
> + .pool_size = size,
> + .nid = NUMA_NO_NODE,
> + .dev = napi->dev->dev.parent,
> + .napi = napi,
> + .dma_dir = DMA_FROM_DEVICE,
> + .offset = LIBIE_SKB_HEADROOM,

I think it worth mentioning that the '.offset' is not really accurate
when the page is split, as we do not really know what is the offset of
the frag of a page except for the first frag.

> + };
> + size_t truesize;
> +
> + pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);
> +
> + /* "Wanted" truesize, passed to page_pool_dev_alloc() */
> + truesize = roundup_pow_of_two(SKB_HEAD_ALIGN(pp.offset + pp.max_len));
> + pp.init_arg = (void *)truesize;

I am not sure if it is correct to use pp.init_arg here, as it is supposed to
be used along with init_callback. And if we want to change the implemetation
of init_callback, we may stuck with it as the driver is using it very
differently here.

Is it possible to pass the 'wanted true size' by adding a parameter for
libie_rx_alloc()?

> +
> + return page_pool_create(&pp);
> +}
> +EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);

2023-07-06 16:57:13

by Alexander Lobakin

[permalink] [raw]

Subject: Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)

From: Yunsheng Lin <[email protected]>
Date: Thu, 6 Jul 2023 20:47:28 +0800

> On 2023/7/5 23:55, Alexander Lobakin wrote:
>
>> +/**
>> + * libie_rx_page_pool_create - create a PP with the default libie settings
>> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
>> + * @size: size of the PP, usually simply Rx queue len
>> + *
>> + * Returns &page_pool on success, casted -errno on failure.
>> + */
>> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
>> + u32 size)
>> +{
>> + struct page_pool_params pp = {
>> + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>> + .order = LIBIE_RX_PAGE_ORDER,
>> + .pool_size = size,
>> + .nid = NUMA_NO_NODE,
>> + .dev = napi->dev->dev.parent,
>> + .napi = napi,
>> + .dma_dir = DMA_FROM_DEVICE,
>> + .offset = LIBIE_SKB_HEADROOM,
>
> I think it worth mentioning that the '.offset' is not really accurate
> when the page is split, as we do not really know what is the offset of
> the frag of a page except for the first frag.

Yeah, this is read as "offset from the start of the page or frag to the
actual frame start, i.e. its Ethernet header" or "this is just
xdp->data - xdp->data_hard_start".

>
>> + };
>> + size_t truesize;
>> +
>> + pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);
>> +
>> + /* "Wanted" truesize, passed to page_pool_dev_alloc() */
>> + truesize = roundup_pow_of_two(SKB_HEAD_ALIGN(pp.offset + pp.max_len));
>> + pp.init_arg = (void *)truesize;
>
> I am not sure if it is correct to use pp.init_arg here, as it is supposed to
> be used along with init_callback. And if we want to change the implemetation

I know. I abused it to save 1 function argument :p It's safe since I
don't use init_callback (not an argument).
I was thinking also of having a union in PP params or even a new field
like "wanted true size", so that your function could even take values
from there in certain cases (e.g. if I pass 0 as parameter).

> of init_callback, we may stuck with it as the driver is using it very
> differently here.
>
> Is it possible to pass the 'wanted true size' by adding a parameter for
> libie_rx_alloc()?

Yes, or I could store it somewhere on the ring, but looks uglier =\ This
one does as well to some degree, but at least hidden in the library and
doesn't show up in the drivers :D

>
>> +
>> + return page_pool_create(&pp);
>> +}
>> +EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);

Thanks,
Olek

2023-07-09 05:17:52

by Yunsheng Lin

[permalink] [raw]

Subject: Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)

On 2023/7/7 0:28, Alexander Lobakin wrote:
> From: Yunsheng Lin <[email protected]>
> Date: Thu, 6 Jul 2023 20:47:28 +0800
>
>> On 2023/7/5 23:55, Alexander Lobakin wrote:
>>
>>> +/**
>>> + * libie_rx_page_pool_create - create a PP with the default libie settings
>>> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
>>> + * @size: size of the PP, usually simply Rx queue len
>>> + *
>>> + * Returns &page_pool on success, casted -errno on failure.
>>> + */
>>> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
>>> + u32 size)
>>> +{
>>> + struct page_pool_params pp = {
>>> + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>> + .order = LIBIE_RX_PAGE_ORDER,
>>> + .pool_size = size,
>>> + .nid = NUMA_NO_NODE,
>>> + .dev = napi->dev->dev.parent,
>>> + .napi = napi,
>>> + .dma_dir = DMA_FROM_DEVICE,
>>> + .offset = LIBIE_SKB_HEADROOM,
>>
>> I think it worth mentioning that the '.offset' is not really accurate
>> when the page is split, as we do not really know what is the offset of
>> the frag of a page except for the first frag.
>
> Yeah, this is read as "offset from the start of the page or frag to the
> actual frame start, i.e. its Ethernet header" or "this is just
> xdp->data - xdp->data_hard_start".

So the problem seems to be if most of drivers have a similar reading as
libie does here, as .offset seems to have a clear semantics which is used
to skip dma sync operation for buffer range that is not touched by the
dma operation. Even if it happens to have the same value of "offset from
the start of the page or frag to the actual frame start", I am not sure
it is future-proofing to reuse it.

When page frag is added, I didn't really give much thought about that as
we use it in a cache coherent system.
It seems we might need to extend or update that semantics if we really want
to skip dma sync operation for all the buffer ranges that are not touched
by the dma operation for page split case.
Or Skipping dma sync operation for all untouched ranges might not be worth
the effort, because it might need a per frag dma sync operation, which is
more costly than a batched per page dma sync operation. If it is true, page
pool already support that currently as my understanding, because the dma
sync operation is only done when the last frag is released/freed.

>
>>
>>> + };
>>> + size_t truesize;
>>> +
>>> + pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);

As mentioned above, if we depend on the last released/freed frag to do the
dma sync, the pp.max_len might need to cover all the frag.

>>> +
>>> + /* "Wanted" truesize, passed to page_pool_dev_alloc() */
>>> + truesize = roundup_pow_of_two(SKB_HEAD_ALIGN(pp.offset + pp.max_len));
>>> + pp.init_arg = (void *)truesize;
>>
>> I am not sure if it is correct to use pp.init_arg here, as it is supposed to
>> be used along with init_callback. And if we want to change the implemetation
>
> I know. I abused it to save 1 function argument :p It's safe since I
> don't use init_callback (not an argument).
> I was thinking also of having a union in PP params or even a new field
> like "wanted true size", so that your function could even take values
> from there in certain cases (e.g. if I pass 0 as parameter).
>
>> of init_callback, we may stuck with it as the driver is using it very
>> differently here.
>>
>> Is it possible to pass the 'wanted true size' by adding a parameter for
>> libie_rx_alloc()?
>
> Yes, or I could store it somewhere on the ring, but looks uglier =\ This
> one does as well to some degree, but at least hidden in the library and
> doesn't show up in the drivers :D

It seems most hw driver know the size of memory it needs when creating
the ring/queue, setting the frag size and deciding how many is a page
split into before allocation seems like a possible future optimization.

For now, it would be better to add helper to acess pp.init_arg at least
instead of acess pp.init_arg directly to make it more obvious and make
the future optimization more easier.

>
>>
>>> +
>>> + return page_pool_create(&pp);
>>> +}
>>> +EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);
>
> Thanks,
> Olek

2023-07-10 13:48:28

by Alexander Lobakin

[permalink] [raw]

Subject: Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)

From: Yunsheng Lin <[email protected]>
Date: Sun, 9 Jul 2023 13:16:33 +0800

> On 2023/7/7 0:28, Alexander Lobakin wrote:
>> From: Yunsheng Lin <[email protected]>
>> Date: Thu, 6 Jul 2023 20:47:28 +0800
>>
>>> On 2023/7/5 23:55, Alexander Lobakin wrote:
>>>
>>>> +/**
>>>> + * libie_rx_page_pool_create - create a PP with the default libie settings
>>>> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
>>>> + * @size: size of the PP, usually simply Rx queue len
>>>> + *
>>>> + * Returns &page_pool on success, casted -errno on failure.
>>>> + */
>>>> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
>>>> + u32 size)
>>>> +{
>>>> + struct page_pool_params pp = {
>>>> + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>>> + .order = LIBIE_RX_PAGE_ORDER,
>>>> + .pool_size = size,
>>>> + .nid = NUMA_NO_NODE,
>>>> + .dev = napi->dev->dev.parent,
>>>> + .napi = napi,
>>>> + .dma_dir = DMA_FROM_DEVICE,
>>>> + .offset = LIBIE_SKB_HEADROOM,
>>>
>>> I think it worth mentioning that the '.offset' is not really accurate
>>> when the page is split, as we do not really know what is the offset of
>>> the frag of a page except for the first frag.
>>
>> Yeah, this is read as "offset from the start of the page or frag to the
>> actual frame start, i.e. its Ethernet header" or "this is just
>> xdp->data - xdp->data_hard_start".
>
> So the problem seems to be if most of drivers have a similar reading as
> libie does here, as .offset seems to have a clear semantics which is used
> to skip dma sync operation for buffer range that is not touched by the
> dma operation. Even if it happens to have the same value of "offset from
> the start of the page or frag to the actual frame start", I am not sure
> it is future-proofing to reuse it.

Not sure I'm following :s

>
> When page frag is added, I didn't really give much thought about that as
> we use it in a cache coherent system.
> It seems we might need to extend or update that semantics if we really want
> to skip dma sync operation for all the buffer ranges that are not touched
> by the dma operation for page split case.
> Or Skipping dma sync operation for all untouched ranges might not be worth
> the effort, because it might need a per frag dma sync operation, which is
> more costly than a batched per page dma sync operation. If it is true, page
> pool already support that currently as my understanding, because the dma
> sync operation is only done when the last frag is released/freed.
>
>>
>>>
>>>> + };
>>>> + size_t truesize;
>>>> +
>>>> + pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);
>
> As mentioned above, if we depend on the last released/freed frag to do the
> dma sync, the pp.max_len might need to cover all the frag.

^^^^^^^^^^^^

You mean the whole page or...?
I think it's not the driver's duty to track all this. We always set
.offset to `data - data_hard_start` and .max_len to the maximum
HW-writeable length for one frame. We don't know whether PP will give us
a whole page or just a piece. DMA sync for device is performed in the PP
core code as well. Driver just creates a PP and don't care about the
internals.

>
>>>> +
>>>> + /* "Wanted" truesize, passed to page_pool_dev_alloc() */
>>>> + truesize = roundup_pow_of_two(SKB_HEAD_ALIGN(pp.offset + pp.max_len));
>>>> + pp.init_arg = (void *)truesize;
>>>
>>> I am not sure if it is correct to use pp.init_arg here, as it is supposed to
>>> be used along with init_callback. And if we want to change the implemetation
>>
>> I know. I abused it to save 1 function argument :p It's safe since I
>> don't use init_callback (not an argument).
>> I was thinking also of having a union in PP params or even a new field
>> like "wanted true size", so that your function could even take values
>> from there in certain cases (e.g. if I pass 0 as parameter).
>>
>>> of init_callback, we may stuck with it as the driver is using it very
>>> differently here.
>>>
>>> Is it possible to pass the 'wanted true size' by adding a parameter for
>>> libie_rx_alloc()?
>>
>> Yes, or I could store it somewhere on the ring, but looks uglier =\ This
>> one does as well to some degree, but at least hidden in the library and
>> doesn't show up in the drivers :D
>
> It seems most hw driver know the size of memory it needs when creating
> the ring/queue, setting the frag size and deciding how many is a page
> split into before allocation seems like a possible future optimization.
>
> For now, it would be better to add helper to acess pp.init_arg at least
> instead of acess pp.init_arg directly to make it more obvious and make
> the future optimization more easier.

Makes senses.

>
>>
>>>
>>>> +
>>>> + return page_pool_create(&pp);
>>>> +}
>>>> +EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);
>>
>> Thanks,
>> Olek
>

Thanks,
Olek

2023-07-11 17:07:07

by Alexander Lobakin

[permalink] [raw]

Subject: Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)

From: Yunsheng Lin <[email protected]>
Date: Tue, 11 Jul 2023 19:39:28 +0800

> On 2023/7/10 21:25, Alexander Lobakin wrote:
>> From: Yunsheng Lin <[email protected]>
>> Date: Sun, 9 Jul 2023 13:16:33 +0800
>>
>>> On 2023/7/7 0:28, Alexander Lobakin wrote:
>>>> From: Yunsheng Lin <[email protected]>
>>>> Date: Thu, 6 Jul 2023 20:47:28 +0800
>>>>
>>>>> On 2023/7/5 23:55, Alexander Lobakin wrote:
>>>>>
>>>>>> +/**
>>>>>> + * libie_rx_page_pool_create - create a PP with the default libie settings
>>>>>> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
>>>>>> + * @size: size of the PP, usually simply Rx queue len
>>>>>> + *
>>>>>> + * Returns &page_pool on success, casted -errno on failure.
>>>>>> + */
>>>>>> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
>>>>>> + u32 size)
>>>>>> +{
>>>>>> + struct page_pool_params pp = {
>>>>>> + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>>>>> + .order = LIBIE_RX_PAGE_ORDER,
>>>>>> + .pool_size = size,
>>>>>> + .nid = NUMA_NO_NODE,
>>>>>> + .dev = napi->dev->dev.parent,
>>>>>> + .napi = napi,
>>>>>> + .dma_dir = DMA_FROM_DEVICE,
>>>>>> + .offset = LIBIE_SKB_HEADROOM,
>>>>>
>>>>> I think it worth mentioning that the '.offset' is not really accurate
>>>>> when the page is split, as we do not really know what is the offset of
>>>>> the frag of a page except for the first frag.
>>>>
>>>> Yeah, this is read as "offset from the start of the page or frag to the
>>>> actual frame start, i.e. its Ethernet header" or "this is just
>>>> xdp->data - xdp->data_hard_start".
>>>
>>> So the problem seems to be if most of drivers have a similar reading as
>>> libie does here, as .offset seems to have a clear semantics which is used
>>> to skip dma sync operation for buffer range that is not touched by the
>>> dma operation. Even if it happens to have the same value of "offset from
>>> the start of the page or frag to the actual frame start", I am not sure
>>> it is future-proofing to reuse it.
>>
>> Not sure I'm following :s
>
> It would be better to avoid accessing the internal data of the page pool
> directly as much as possible, as that may be changed to different meaning
> or removed if the implememtation is changed.
>
> If it is common enough that most drivers are using it the same way, adding
> a helper for that would be great.

How comes page_pool_params is internal if it's defined purely by the
driver and then exists read-only :D I even got warned in the adjacent
thread that the Page Pool core code shouldn't change it anyhow.

>
>>
>>>
>>> When page frag is added, I didn't really give much thought about that as
>>> we use it in a cache coherent system.
>>> It seems we might need to extend or update that semantics if we really want
>>> to skip dma sync operation for all the buffer ranges that are not touched
>>> by the dma operation for page split case.
>>> Or Skipping dma sync operation for all untouched ranges might not be worth
>>> the effort, because it might need a per frag dma sync operation, which is
>>> more costly than a batched per page dma sync operation. If it is true, page
>>> pool already support that currently as my understanding, because the dma
>>> sync operation is only done when the last frag is released/freed.
>>>
>>>>
>>>>>
>>>>>> + };
>>>>>> + size_t truesize;
>>>>>> +
>>>>>> + pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);
>>>
>>> As mentioned above, if we depend on the last released/freed frag to do the
>>> dma sync, the pp.max_len might need to cover all the frag.
>>
>> ^^^^^^^^^^^^
>>
>> You mean the whole page or...?
>
> If we don't care about the accurate dma syncing, "cover all the frag" means
> the whole page here, as page pool doesn't have enough info to do accurate
> dma sync for now.
>
>> I think it's not the driver's duty to track all this. We always set
>> .offset to `data - data_hard_start` and .max_len to the maximum
>> HW-writeable length for one frame. We don't know whether PP will give us
>> a whole page or just a piece. DMA sync for device is performed in the PP
>> core code as well. Driver just creates a PP and don't care about the
>> internals.
>
> There problem is that when page_pool_put_page() is called with a split
> page, the page pool does not know which frag is freeing too.
>
> setting 'maximum HW-writeable length for one frame' only sync the first
> frag of a page as below:

Maybe Page Pool should synchronize DMA even when !last_frag then?
Setting .max_len to anything bigger than the maximum frame size you're
planning to receive is counter-intuitive.
All three xdp_buff, xdp_frame and skb always have all info needed to
determine which piece of the page we're recycling, it should be possible
to do with no complications. Hypothetical forcing drivers to do DMA
syncs on their own when they use frags is counter-intuitive as well,
Page Pool should be able to handle this itself.

Alternatively, Page Pool may do as follows:

1. !last_frag -- do nothing, same as today.
2. last_frag -- sync, but not [offset, offset + max_len), but
[offset, PAGE_SIZE).

This would also cover non-HW-writeable pieces like 2th-nth frame's
headroom and each frame's skb_shared_info, but it's the only alternative
to syncing each frag separately.
Yes, it's almost the same as to set .max_len to %PAGE_SIZE, but as I
said, it feels weird to set .max_len to 4k when you allocate 2k frags.
You don't know anyway how much of a page will be used.

For example, when I turn on driver-side XDP, increased headroom makes
truesize cross the 2k border with 1500 MTU, so that 2 frag per page
converts into 1 frag per page. In fact, instead of using the whole 4k,
I use ~2200 and don't need to sync the entire 4k.
Setting .max_len to 4k gives you way heavier overhead to non
DMA-coherent systems that per-frag-syncing would do.

>
> https://elixir.free-electrons.com/linux/v6.4-rc6/source/net/core/page_pool.c#L325
>

Thanks,
Olek