From: Matteo Croce <[email protected]>
This is a respin of [1]
This patchset shows the plans for allowing page_pool to handle and
maintain DMA map/unmap of the pages it serves to the driver. For this
to work a return hook in the network core is introduced.
The overall purpose is to simplify drivers, by providing a page
allocation API that does recycling, such that each driver doesn't have
to reinvent its own recycling scheme. Using page_pool in a driver
does not require implementing XDP support, but it makes it trivially
easy to do so. Instead of allocating buffers specifically for SKBs
we now allocate a generic buffer and either wrap it on an SKB
(via build_skb) or create an XDP frame.
The recycling code leverages the XDP recycle APIs.
The Marvell mvpp2 and mvneta drivers are used in this patchset to
demonstrate how to use the API, and tested on a MacchiatoBIN
and EspressoBIN boards respectively.
Please let this going in on a future -rc1 so to allow enough time
to have wider tests.
[1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/
v2 -> v3:
- added missing SOBs
- CCed the MM people
v1 -> v2:
- fix a commit message
- avoid setting pp_recycle multiple times on mvneta
- squash two patches to avoid breaking bisect
Ilias Apalodimas (1):
page_pool: Allow drivers to hint on SKB recycling
Jesper Dangaard Brouer (1):
xdp: reduce size of struct xdp_mem_info
Matteo Croce (3):
mm: add a signature in struct page
mvpp2: recycle buffers
mvneta: recycle buffers
.../chelsio/inline_crypto/ch_ktls/chcr_ktls.c | 2 +-
drivers/net/ethernet/marvell/mvneta.c | 7 ++-
.../net/ethernet/marvell/mvpp2/mvpp2_main.c | 17 +++----
drivers/net/ethernet/marvell/sky2.c | 2 +-
drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +-
include/linux/mm_types.h | 1 +
include/linux/skbuff.h | 35 ++++++++++++--
include/net/page_pool.h | 15 ++++++
include/net/xdp.h | 5 +-
net/core/page_pool.c | 47 +++++++++++++++++++
net/core/skbuff.c | 20 +++++++-
net/core/xdp.c | 14 ++++--
net/tls/tls_device.c | 2 +-
13 files changed, 142 insertions(+), 27 deletions(-)
--
2.30.2
From: Jesper Dangaard Brouer <[email protected]>
It is possible to compress/reduce the size of struct xdp_mem_info.
This change reduce struct xdp_mem_info from 8 bytes to 4 bytes.
The member xdp_mem_info.id can be reduced to u16, as the mem_id_ht
rhashtable in net/core/xdp.c is already limited by MEM_ID_MAX=0xFFFE
which can safely fit in u16.
The member xdp_mem_info.type could be reduced more than u16, as it stores
the enum xdp_mem_type, but due to alignment it is only reduced to u16.
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
---
include/net/xdp.h | 4 ++--
net/core/xdp.c | 8 ++++----
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/include/net/xdp.h b/include/net/xdp.h
index a5bc214a49d9..c35864d59113 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -48,8 +48,8 @@ enum xdp_mem_type {
#define XDP_XMIT_FLAGS_MASK XDP_XMIT_FLUSH
struct xdp_mem_info {
- u32 type; /* enum xdp_mem_type, but known size type */
- u32 id;
+ u16 type; /* enum xdp_mem_type, but known size type */
+ u16 id;
};
struct page_pool;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 05354976c1fc..3dd47ed83778 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -35,11 +35,11 @@ static struct rhashtable *mem_id_ht;
static u32 xdp_mem_id_hashfn(const void *data, u32 len, u32 seed)
{
- const u32 *k = data;
- const u32 key = *k;
+ const u16 *k = data;
+ const u16 key = *k;
BUILD_BUG_ON(sizeof_field(struct xdp_mem_allocator, mem.id)
- != sizeof(u32));
+ != sizeof(u16));
/* Use cyclic increasing ID as direct hash key */
return key;
@@ -49,7 +49,7 @@ static int xdp_mem_id_cmp(struct rhashtable_compare_arg *arg,
const void *ptr)
{
const struct xdp_mem_allocator *xa = ptr;
- u32 mem_id = *(u32 *)arg->key;
+ u16 mem_id = *(u16 *)arg->key;
return xa->mem.id != mem_id;
}
--
2.30.2
From: Matteo Croce <[email protected]>
This is needed by the page_pool to avoid recycling a page not allocated
via page_pool.
Signed-off-by: Matteo Croce <[email protected]>
---
include/linux/mm_types.h | 1 +
include/net/page_pool.h | 2 ++
net/core/page_pool.c | 4 ++++
3 files changed, 7 insertions(+)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a8894..ef2d0d5f62e4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -101,6 +101,7 @@ struct page {
* 32-bit architectures.
*/
dma_addr_t dma_addr;
+ unsigned long signature;
};
struct { /* slab, slob and slub */
union {
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index b5b195305346..b30405e84b5e 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -63,6 +63,8 @@
*/
#define PP_ALLOC_CACHE_SIZE 128
#define PP_ALLOC_CACHE_REFILL 64
+#define PP_SIGNATURE 0x20210303
+
struct pp_alloc_cache {
u32 count;
void *cache[PP_ALLOC_CACHE_SIZE];
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ad8b0707af04..2ae9b554ef98 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -232,6 +232,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
skip_dma_map:
+ page->signature = PP_SIGNATURE;
+
/* Track how many pages are held 'in-flight' */
pool->pages_state_hold_cnt++;
@@ -302,6 +304,8 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
DMA_ATTR_SKIP_CPU_SYNC);
page->dma_addr = 0;
skip_dma_unmap:
+ page->signature = 0;
+
/* This may be the last page returned, releasing the pool, so
* it is not safe to reference pool afterwards.
*/
--
2.30.2
From: Ilias Apalodimas <[email protected]>
Up to now several high speed NICs have custom mechanisms of recycling
the allocated memory they use for their payloads.
Our page_pool API already has recycling capabilities that are always
used when we are running in 'XDP mode'. So let's tweak the API and the
kernel network stack slightly and allow the recycling to happen even
during the standard operation.
The API doesn't take into account 'split page' policies used by those
drivers currently, but can be extended once we have users for that.
The idea is to be able to intercept the packet on skb_release_data().
If it's a buffer coming from our page_pool API recycle it back to the
pool for further usage or just release the packet entirely.
To achieve that we introduce a bit in struct sk_buff (pp_recycle:1) and
store the xdp_mem_info in page->private. Storing the information in
page->private allows us to recycle both SKBs and their fragments.
The SKB bit is needed for a couple of reasons. First of all in an
effort to affect the free path as less as possible, reading a single bit,
is better that trying to derive identical information for the page stored
data. Moreover page->private is used by skb_copy_ubufs. We do have a
special mark in the page, that won't allow this to happen, but again
deciding without having to read the entire page is preferable.
The driver has to take care of the sync operations on it's own
during the buffer recycling since the buffer is, after opting-in to the
recycling, never unmapped.
Since the gain on the drivers depends on the architecture, we are not
enabling recycling by default if the page_pool API is used on a driver.
In order to enable recycling the driver must call skb_mark_for_recycle()
to store the information we need for recycling in page->private and
enabling the recycling bit, or page_pool_store_mem_info() for a fragment
Since we added an extra argument on __skb_frag_unref() to handle
recycling, update the current users of the function with that.
Co-developed-by: Jesper Dangaard Brouer <[email protected]>
Co-developed-by: Matteo Croce <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Matteo Croce <[email protected]>
Signed-off-by: Ilias Apalodimas <[email protected]>
---
.../chelsio/inline_crypto/ch_ktls/chcr_ktls.c | 2 +-
drivers/net/ethernet/marvell/sky2.c | 2 +-
drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +-
include/linux/skbuff.h | 35 ++++++++++++---
include/net/page_pool.h | 13 ++++++
include/net/xdp.h | 1 +
net/core/page_pool.c | 43 +++++++++++++++++++
net/core/skbuff.c | 20 ++++++++-
net/core/xdp.c | 6 +++
net/tls/tls_device.c | 2 +-
10 files changed, 115 insertions(+), 11 deletions(-)
diff --git a/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c b/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c
index 1115b8f9ea4e..8f815ebb59ae 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c
+++ b/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c
@@ -2125,7 +2125,7 @@ static int chcr_ktls_xmit(struct sk_buff *skb, struct net_device *dev)
/* clear the frag ref count which increased locally before */
for (i = 0; i < record->num_frags; i++) {
/* clear the frag ref count */
- __skb_frag_unref(&record->frags[i]);
+ __skb_frag_unref(&record->frags[i], false);
}
/* if any failure, come out from the loop. */
if (ret) {
diff --git a/drivers/net/ethernet/marvell/sky2.c b/drivers/net/ethernet/marvell/sky2.c
index 68c154d715d6..9dc25c4fb359 100644
--- a/drivers/net/ethernet/marvell/sky2.c
+++ b/drivers/net/ethernet/marvell/sky2.c
@@ -2503,7 +2503,7 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
if (length == 0) {
/* don't need this page */
- __skb_frag_unref(frag);
+ __skb_frag_unref(frag, false);
--skb_shinfo(skb)->nr_frags;
} else {
size = min(length, (unsigned) PAGE_SIZE);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index e35e4d7ef4d1..cea62b8f554c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -526,7 +526,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
fail:
while (nr > 0) {
nr--;
- __skb_frag_unref(skb_shinfo(skb)->frags + nr);
+ __skb_frag_unref(skb_shinfo(skb)->frags + nr, false);
}
return 0;
}
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index dbf820a50a39..869f248204b9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -40,6 +40,9 @@
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
#include <linux/netfilter/nf_conntrack_common.h>
#endif
+#if IS_BUILTIN(CONFIG_PAGE_POOL)
+#include <net/page_pool.h>
+#endif
/* The interface for checksum offload between the stack and networking drivers
* is as follows...
@@ -247,6 +250,7 @@ struct napi_struct;
struct bpf_prog;
union bpf_attr;
struct skb_ext;
+struct xdp_mem_info;
#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
struct nf_bridge_info {
@@ -667,6 +671,8 @@ typedef unsigned char *sk_buff_data_t;
* @head_frag: skb was allocated from page fragments,
* not allocated by kmalloc() or vmalloc().
* @pfmemalloc: skbuff was allocated from PFMEMALLOC reserves
+ * @pp_recycle: mark the packet for recycling instead of freeing (implies
+ * page_pool support on driver)
* @active_extensions: active extensions (skb_ext_id types)
* @ndisc_nodetype: router type (from link layer)
* @ooo_okay: allow the mapping of a socket to a queue to be changed
@@ -791,10 +797,12 @@ struct sk_buff {
fclone:2,
peeked:1,
head_frag:1,
- pfmemalloc:1;
+ pfmemalloc:1,
+ pp_recycle:1; /* page_pool recycle indicator */
#ifdef CONFIG_SKB_EXTENSIONS
__u8 active_extensions;
#endif
+
/* fields enclosed in headers_start/headers_end are copied
* using a single memcpy() in __copy_skb_header()
*/
@@ -3081,12 +3089,20 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
/**
* __skb_frag_unref - release a reference on a paged fragment.
* @frag: the paged fragment
+ * @recycle: recycle the page if allocated via page_pool
*
- * Releases a reference on the paged fragment @frag.
+ * Releases a reference on the paged fragment @frag
+ * or recycles the page via the page_pool API.
*/
-static inline void __skb_frag_unref(skb_frag_t *frag)
+static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
{
- put_page(skb_frag_page(frag));
+ struct page *page = skb_frag_page(frag);
+
+#if IS_BUILTIN(CONFIG_PAGE_POOL)
+ if (recycle && page_pool_return_skb_page(page_address(page)))
+ return;
+#endif
+ put_page(page);
}
/**
@@ -3098,7 +3114,7 @@ static inline void __skb_frag_unref(skb_frag_t *frag)
*/
static inline void skb_frag_unref(struct sk_buff *skb, int f)
{
- __skb_frag_unref(&skb_shinfo(skb)->frags[f]);
+ __skb_frag_unref(&skb_shinfo(skb)->frags[f], skb->pp_recycle);
}
/**
@@ -4697,5 +4713,14 @@ static inline u64 skb_get_kcov_handle(struct sk_buff *skb)
#endif
}
+#if IS_BUILTIN(CONFIG_PAGE_POOL)
+static inline void skb_mark_for_recycle(struct sk_buff *skb, struct page *page,
+ struct xdp_mem_info *mem)
+{
+ skb->pp_recycle = 1;
+ page_pool_store_mem_info(page, mem);
+}
+#endif
+
#endif /* __KERNEL__ */
#endif /* _LINUX_SKBUFF_H */
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index b30405e84b5e..75fffc15788b 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -65,6 +65,8 @@
#define PP_ALLOC_CACHE_REFILL 64
#define PP_SIGNATURE 0x20210303
+struct xdp_mem_info;
+
struct pp_alloc_cache {
u32 count;
void *cache[PP_ALLOC_CACHE_SIZE];
@@ -148,6 +150,8 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
return pool->p.dma_dir;
}
+bool page_pool_return_skb_page(void *data);
+
struct page_pool *page_pool_create(const struct page_pool_params *params);
#ifdef CONFIG_PAGE_POOL
@@ -243,4 +247,13 @@ static inline void page_pool_ring_unlock(struct page_pool *pool)
spin_unlock_bh(&pool->ring.producer_lock);
}
+/* Store mem_info on struct page and use it while recycling skb frags */
+static inline
+void page_pool_store_mem_info(struct page *page, struct xdp_mem_info *mem)
+{
+ u32 *xmi = (u32 *)mem;
+
+ set_page_private(page, *xmi);
+}
+
#endif /* _NET_PAGE_POOL_H */
diff --git a/include/net/xdp.h b/include/net/xdp.h
index c35864d59113..5d7316f1f195 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -235,6 +235,7 @@ void xdp_return_buff(struct xdp_buff *xdp);
void xdp_flush_frame_bulk(struct xdp_frame_bulk *bq);
void xdp_return_frame_bulk(struct xdp_frame *xdpf,
struct xdp_frame_bulk *bq);
+void xdp_return_skb_frame(void *data, struct xdp_mem_info *mem);
/* When sending xdp_frame into the network stack, then there is no
* return point callback, which is needed to release e.g. DMA-mapping
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 2ae9b554ef98..43bfd2e3d8df 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -9,6 +9,7 @@
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/device.h>
+#include <linux/skbuff.h>
#include <net/page_pool.h>
#include <net/xdp.h>
@@ -17,12 +18,19 @@
#include <linux/dma-mapping.h>
#include <linux/page-flags.h>
#include <linux/mm.h> /* for __put_page() */
+#include <net/xdp.h>
#include <trace/events/page_pool.h>
#define DEFER_TIME (msecs_to_jiffies(1000))
#define DEFER_WARN_INTERVAL (60 * HZ)
+/* Used to store/retrieve hi/lo bytes from xdp_mem_info to page->private */
+union page_pool_xmi {
+ u32 raw;
+ struct xdp_mem_info mem_info;
+};
+
static int page_pool_init(struct page_pool *pool,
const struct page_pool_params *params)
{
@@ -587,3 +595,38 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
}
}
EXPORT_SYMBOL(page_pool_update_nid);
+
+bool page_pool_return_skb_page(void *data)
+{
+ struct xdp_mem_info mem_info;
+ union page_pool_xmi info;
+ struct page *page;
+
+ page = virt_to_head_page(data);
+ if (unlikely(page->signature != PP_SIGNATURE))
+ return false;
+
+ info.raw = page_private(page);
+ mem_info = info.mem_info;
+
+ /* If a buffer is marked for recycle and does not belong to
+ * MEM_TYPE_PAGE_POOL, the buffers will be eventually freed from the
+ * network stack and kfree_skb, but the DMA region will *not* be
+ * correctly unmapped. WARN here for the recycling misusage
+ */
+ if (unlikely(mem_info.type != MEM_TYPE_PAGE_POOL)) {
+ WARN_ONCE(true, "Tried to recycle non MEM_TYPE_PAGE_POOL");
+ return false;
+ }
+
+ /* Driver set this to memory recycling info. Reset it on recycle
+ * This will *not* work for NIC using a split-page memory model.
+ * The page will be returned to the pool here regardless of the
+ * 'flipped' fragment being in use or not
+ */
+ set_page_private(page, 0);
+ xdp_return_skb_frame(data, &mem_info);
+
+ return true;
+}
+EXPORT_SYMBOL(page_pool_return_skb_page);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3ad9e8425ab2..650f517565dd 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -69,6 +69,9 @@
#include <net/xfrm.h>
#include <net/mpls.h>
#include <net/mptcp.h>
+#if IS_BUILTIN(CONFIG_PAGE_POOL)
+#include <net/page_pool.h>
+#endif
#include <linux/uaccess.h>
#include <trace/events/skb.h>
@@ -644,6 +647,11 @@ static void skb_free_head(struct sk_buff *skb)
{
unsigned char *head = skb->head;
+#if IS_BUILTIN(CONFIG_PAGE_POOL)
+ if (skb->pp_recycle && page_pool_return_skb_page(head))
+ return;
+#endif
+
if (skb->head_frag)
skb_free_frag(head);
else
@@ -663,7 +671,7 @@ static void skb_release_data(struct sk_buff *skb)
skb_zcopy_clear(skb, true);
for (i = 0; i < shinfo->nr_frags; i++)
- __skb_frag_unref(&shinfo->frags[i]);
+ __skb_frag_unref(&shinfo->frags[i], skb->pp_recycle);
if (shinfo->frag_list)
kfree_skb_list(shinfo->frag_list);
@@ -1045,6 +1053,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
n->nohdr = 0;
n->peeked = 0;
C(pfmemalloc);
+ C(pp_recycle);
n->destructor = NULL;
C(tail);
C(end);
@@ -3494,7 +3503,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
fragto = &skb_shinfo(tgt)->frags[merge];
skb_frag_size_add(fragto, skb_frag_size(fragfrom));
- __skb_frag_unref(fragfrom);
+ __skb_frag_unref(fragfrom, skb->pp_recycle);
}
/* Reposition in the original skb */
@@ -5275,6 +5284,13 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
if (skb_cloned(to))
return false;
+ /* We can't coalesce skb that are allocated from slab and page_pool
+ * The recycle mark is on the skb, so that might end up trying to
+ * recycle slab allocated skb->head
+ */
+ if (to->pp_recycle != from->pp_recycle)
+ return false;
+
if (len <= skb_tailroom(to)) {
if (len)
BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len));
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 3dd47ed83778..d89b827e54a9 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -372,6 +372,12 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
}
}
+void xdp_return_skb_frame(void *data, struct xdp_mem_info *mem)
+{
+ __xdp_return(data, mem, false, NULL);
+}
+EXPORT_SYMBOL_GPL(xdp_return_skb_frame);
+
void xdp_return_frame(struct xdp_frame *xdpf)
{
__xdp_return(xdpf->data, &xdpf->mem, false, NULL);
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 790c6b7ecb26..8027a58c76a2 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -127,7 +127,7 @@ static void destroy_record(struct tls_record_info *record)
int i;
for (i = 0; i < record->num_frags; i++)
- __skb_frag_unref(&record->frags[i]);
+ __skb_frag_unref(&record->frags[i], false);
kfree(record);
}
--
2.30.2
From: Matteo Croce <[email protected]>
Use the new recycling API for page_pool.
In a drop rate test, the packet rate is more than doubled,
from 962 Kpps to 2047 Kpps.
perf top on a stock system shows:
Overhead Shared Object Symbol
30.67% [kernel] [k] page_pool_release_page
8.37% [kernel] [k] get_page_from_freelist
7.34% [kernel] [k] free_unref_page
6.47% [mvpp2] [k] mvpp2_rx
4.69% [kernel] [k] eth_type_trans
4.55% [kernel] [k] __netif_receive_skb_core
4.40% [kernel] [k] build_skb
4.29% [kernel] [k] kmem_cache_free
4.00% [kernel] [k] kmem_cache_alloc
3.81% [kernel] [k] dev_gro_receive
With packet rate stable at 962 Kpps:
tx: 0 bps 0 pps rx: 477.4 Mbps 962.6 Kpps
tx: 0 bps 0 pps rx: 477.6 Mbps 962.8 Kpps
tx: 0 bps 0 pps rx: 477.6 Mbps 962.9 Kpps
tx: 0 bps 0 pps rx: 477.2 Mbps 962.1 Kpps
tx: 0 bps 0 pps rx: 477.5 Mbps 962.7 Kpps
And this is the same output with recycling enabled:
Overhead Shared Object Symbol
12.75% [mvpp2] [k] mvpp2_rx
9.56% [kernel] [k] __netif_receive_skb_core
9.29% [kernel] [k] build_skb
9.27% [kernel] [k] eth_type_trans
8.39% [kernel] [k] kmem_cache_alloc
7.85% [kernel] [k] kmem_cache_free
7.36% [kernel] [k] page_pool_put_page
6.45% [kernel] [k] dev_gro_receive
4.72% [kernel] [k] __xdp_return
3.06% [kernel] [k] page_pool_refill_alloc_cache
With packet rate above 2000 Kpps:
tx: 0 bps 0 pps rx: 1015 Mbps 2046 Kpps
tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps
tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps
tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps
tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps
The major performance increase is explained by the fact that the most CPU
consuming functions (page_pool_release_page, get_page_from_freelist
and free_unref_page) are no longer called on a per packet basis.
The test was done by sending to the macchiatobin 64 byte ethernet frames
with an invalid ethertype, so the packets are dropped early in the RX path.
Signed-off-by: Matteo Croce <[email protected]>
---
drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index ec706d614cac..633b39cfef65 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -3847,6 +3847,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi,
struct mvpp2_pcpu_stats ps = {};
enum dma_data_direction dma_dir;
struct bpf_prog *xdp_prog;
+ struct xdp_rxq_info *rxqi;
struct xdp_buff xdp;
int rx_received;
int rx_done = 0;
@@ -3912,15 +3913,15 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi,
else
frag_size = bm_pool->frag_size;
- if (xdp_prog) {
- struct xdp_rxq_info *xdp_rxq;
+ if (bm_pool->pkt_size == MVPP2_BM_SHORT_PKT_SIZE)
+ rxqi = &rxq->xdp_rxq_short;
+ else
+ rxqi = &rxq->xdp_rxq_long;
- if (bm_pool->pkt_size == MVPP2_BM_SHORT_PKT_SIZE)
- xdp_rxq = &rxq->xdp_rxq_short;
- else
- xdp_rxq = &rxq->xdp_rxq_long;
+ if (xdp_prog) {
+ xdp.rxq = rxqi;
- xdp_init_buff(&xdp, PAGE_SIZE, xdp_rxq);
+ xdp_init_buff(&xdp, PAGE_SIZE, rxqi);
xdp_prepare_buff(&xdp, data,
MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM,
rx_bytes, false);
@@ -3964,7 +3965,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi,
}
if (pp)
- page_pool_release_page(pp, virt_to_page(data));
+ skb_mark_for_recycle(skb, virt_to_page(data), &rxqi->mem);
else
dma_unmap_single_attrs(dev->dev.parent, dma_addr,
bm_pool->buf_size, DMA_FROM_DEVICE,
--
2.30.2
From: Matteo Croce <[email protected]>
Use the new recycling API for page_pool.
In a drop rate test, the packet rate increased di 10%,
from 269 Kpps to 296 Kpps.
perf top on a stock system shows:
Overhead Shared Object Symbol
21.78% [kernel] [k] __pi___inval_dcache_area
21.66% [mvneta] [k] mvneta_rx_swbm
7.00% [kernel] [k] kmem_cache_alloc
6.05% [kernel] [k] eth_type_trans
4.44% [kernel] [k] kmem_cache_free.part.0
3.80% [kernel] [k] __netif_receive_skb_core
3.68% [kernel] [k] dev_gro_receive
3.65% [kernel] [k] get_page_from_freelist
3.43% [kernel] [k] page_pool_release_page
3.35% [kernel] [k] free_unref_page
And this is the same output with recycling enabled:
Overhead Shared Object Symbol
24.10% [kernel] [k] __pi___inval_dcache_area
23.02% [mvneta] [k] mvneta_rx_swbm
7.19% [kernel] [k] kmem_cache_alloc
6.50% [kernel] [k] eth_type_trans
4.93% [kernel] [k] __netif_receive_skb_core
4.77% [kernel] [k] kmem_cache_free.part.0
3.93% [kernel] [k] dev_gro_receive
3.03% [kernel] [k] build_skb
2.91% [kernel] [k] page_pool_put_page
2.85% [kernel] [k] __xdp_return
The test was done with mausezahn on the TX side with 64 byte raw
ethernet frames.
Signed-off-by: Matteo Croce <[email protected]>
---
drivers/net/ethernet/marvell/mvneta.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index f20dfd1d7a6b..f88c189b60a4 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2331,7 +2331,7 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
if (!skb)
return ERR_PTR(-ENOMEM);
- page_pool_release_page(rxq->page_pool, virt_to_page(xdp->data));
+ skb_mark_for_recycle(skb, virt_to_page(xdp->data), &xdp->rxq->mem);
skb_reserve(skb, xdp->data - xdp->data_hard_start);
skb_put(skb, xdp->data_end - xdp->data);
@@ -2343,7 +2343,10 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
skb_frag_page(frag), skb_frag_off(frag),
skb_frag_size(frag), PAGE_SIZE);
- page_pool_release_page(rxq->page_pool, skb_frag_page(frag));
+ /* We don't need to reset pp_recycle here. It's already set, so
+ * just mark fragments for recycling.
+ */
+ page_pool_store_mem_info(skb_frag_page(frag), &xdp->rxq->mem);
}
return skb;
--
2.30.2
Hi Matteo,
[...]
> +bool page_pool_return_skb_page(void *data);
> +
> struct page_pool *page_pool_create(const struct page_pool_params *params);
>
> #ifdef CONFIG_PAGE_POOL
> @@ -243,4 +247,13 @@ static inline void page_pool_ring_unlock(struct page_pool *pool)
> spin_unlock_bh(&pool->ring.producer_lock);
> }
>
> +/* Store mem_info on struct page and use it while recycling skb frags */
> +static inline
> +void page_pool_store_mem_info(struct page *page, struct xdp_mem_info *mem)
> +{
> + u32 *xmi = (u32 *)mem;
> +
I just noticed this changed from the original patchset I was carrying.
On the original, I had a union containing a u32 member to explicitly avoid
this casting. Let's wait for comments on the rest of the series, but i'd like
to change that back in a v4. Aplogies, I completely missed this on the
previous postings ...
Thanks
/Ilias
On Sat, Apr 10, 2021 at 2:11 AM Ilias Apalodimas
<[email protected]> wrote:
>
> Hi Matteo,
>
> [...]
> > +bool page_pool_return_skb_page(void *data);
> > +
> > struct page_pool *page_pool_create(const struct page_pool_params *params);
> >
> > #ifdef CONFIG_PAGE_POOL
> > @@ -243,4 +247,13 @@ static inline void page_pool_ring_unlock(struct page_pool *pool)
> > spin_unlock_bh(&pool->ring.producer_lock);
> > }
> >
> > +/* Store mem_info on struct page and use it while recycling skb frags */
> > +static inline
> > +void page_pool_store_mem_info(struct page *page, struct xdp_mem_info *mem)
> > +{
> > + u32 *xmi = (u32 *)mem;
> > +
>
> I just noticed this changed from the original patchset I was carrying.
> On the original, I had a union containing a u32 member to explicitly avoid
> this casting. Let's wait for comments on the rest of the series, but i'd like
> to change that back in a v4. Aplogies, I completely missed this on the
> previous postings ...
>
Hi,
I had to change this because including net/xdp.h here caused a
circular dependency.
I think that the safest thing we can do is to use memcpy(), which will
handle the alignments correctly, depending on the architecture.
Cheers,
--
per aspera ad upstream
On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:
> This is needed by the page_pool to avoid recycling a page not allocated
> via page_pool.
Is the PageType mechanism more appropriate to your needs? It wouldn't
be if you use page->_mapcount (ie mapping it to userspace).
> Signed-off-by: Matteo Croce <[email protected]>
> ---
> include/linux/mm_types.h | 1 +
> include/net/page_pool.h | 2 ++
> net/core/page_pool.c | 4 ++++
> 3 files changed, 7 insertions(+)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6613b26a8894..ef2d0d5f62e4 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -101,6 +101,7 @@ struct page {
> * 32-bit architectures.
> */
> dma_addr_t dma_addr;
> + unsigned long signature;
> };
> struct { /* slab, slob and slub */
> union {
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index b5b195305346..b30405e84b5e 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -63,6 +63,8 @@
> */
> #define PP_ALLOC_CACHE_SIZE 128
> #define PP_ALLOC_CACHE_REFILL 64
> +#define PP_SIGNATURE 0x20210303
> +
> struct pp_alloc_cache {
> u32 count;
> void *cache[PP_ALLOC_CACHE_SIZE];
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index ad8b0707af04..2ae9b554ef98 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -232,6 +232,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
>
> skip_dma_map:
> + page->signature = PP_SIGNATURE;
> +
> /* Track how many pages are held 'in-flight' */
> pool->pages_state_hold_cnt++;
>
> @@ -302,6 +304,8 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
> DMA_ATTR_SKIP_CPU_SYNC);
> page->dma_addr = 0;
> skip_dma_unmap:
> + page->signature = 0;
> +
> /* This may be the last page returned, releasing the pool, so
> * it is not safe to reference pool afterwards.
> */
> --
> 2.30.2
>
Hi Matthew
On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote:
> On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:
> > This is needed by the page_pool to avoid recycling a page not allocated
> > via page_pool.
>
> Is the PageType mechanism more appropriate to your needs? It wouldn't
> be if you use page->_mapcount (ie mapping it to userspace).
Interesting!
Please keep in mind this was written ~2018 and was stale on my branches for
quite some time. So back then I did try to use PageType, but had not free
bits. Looking at it again though, it's cleaned up. So yes I think this can
be much much cleaner. Should we go and define a new PG_pagepool?
Thanks!
/Ilias
>
> > Signed-off-by: Matteo Croce <[email protected]>
> > ---
> > include/linux/mm_types.h | 1 +
> > include/net/page_pool.h | 2 ++
> > net/core/page_pool.c | 4 ++++
> > 3 files changed, 7 insertions(+)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 6613b26a8894..ef2d0d5f62e4 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -101,6 +101,7 @@ struct page {
> > * 32-bit architectures.
> > */
> > dma_addr_t dma_addr;
> > + unsigned long signature;
> > };
> > struct { /* slab, slob and slub */
> > union {
> > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > index b5b195305346..b30405e84b5e 100644
> > --- a/include/net/page_pool.h
> > +++ b/include/net/page_pool.h
> > @@ -63,6 +63,8 @@
> > */
> > #define PP_ALLOC_CACHE_SIZE 128
> > #define PP_ALLOC_CACHE_REFILL 64
> > +#define PP_SIGNATURE 0x20210303
> > +
> > struct pp_alloc_cache {
> > u32 count;
> > void *cache[PP_ALLOC_CACHE_SIZE];
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index ad8b0707af04..2ae9b554ef98 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -232,6 +232,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> > page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
> >
> > skip_dma_map:
> > + page->signature = PP_SIGNATURE;
> > +
> > /* Track how many pages are held 'in-flight' */
> > pool->pages_state_hold_cnt++;
> >
> > @@ -302,6 +304,8 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
> > DMA_ATTR_SKIP_CPU_SYNC);
> > page->dma_addr = 0;
> > skip_dma_unmap:
> > + page->signature = 0;
> > +
> > /* This may be the last page returned, releasing the pool, so
> > * it is not safe to reference pool afterwards.
> > */
> > --
> > 2.30.2
> >
On Sat, Apr 10, 2021 at 9:16 AM Ilias Apalodimas
<[email protected]> wrote:
>
> Hi Matthew
>
> On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote:
> > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:
> > > This is needed by the page_pool to avoid recycling a page not allocated
> > > via page_pool.
> >
> > Is the PageType mechanism more appropriate to your needs? It wouldn't
> > be if you use page->_mapcount (ie mapping it to userspace).
>
> Interesting!
> Please keep in mind this was written ~2018 and was stale on my branches for
> quite some time. So back then I did try to use PageType, but had not free
> bits. Looking at it again though, it's cleaned up. So yes I think this can
> be much much cleaner. Should we go and define a new PG_pagepool?
>
>
Can this page_pool be used for TCP RX zerocopy? If yes then PageType
can not be used.
There is a recent discussion [1] on memcg accounting of TCP RX
zerocopy and I am wondering if this work can somehow help in that
regard. I will take a look at the series.
[1] https://lore.kernel.org/linux-mm/[email protected]/
Hi Shakeel,
On Sat, Apr 10, 2021 at 10:42:30AM -0700, Shakeel Butt wrote:
> On Sat, Apr 10, 2021 at 9:16 AM Ilias Apalodimas
> <[email protected]> wrote:
> >
> > Hi Matthew
> >
> > On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote:
> > > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:
> > > > This is needed by the page_pool to avoid recycling a page not allocated
> > > > via page_pool.
> > >
> > > Is the PageType mechanism more appropriate to your needs? It wouldn't
> > > be if you use page->_mapcount (ie mapping it to userspace).
> >
> > Interesting!
> > Please keep in mind this was written ~2018 and was stale on my branches for
> > quite some time. So back then I did try to use PageType, but had not free
> > bits. Looking at it again though, it's cleaned up. So yes I think this can
> > be much much cleaner. Should we go and define a new PG_pagepool?
> >
> >
>
> Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> can not be used.
Yes it can, since it's going to be used as your default allocator for
payloads, which might end up on an SKB.
So we have to keep the extra added field on struct page for our mark.
Matthew had an intersting idea. He suggested keeping it, but changing the
magic number, so it can't be a kernel address, but I'll let him follow
up on the details.
>
> There is a recent discussion [1] on memcg accounting of TCP RX
> zerocopy and I am wondering if this work can somehow help in that
> regard. I will take a look at the series.
>
I'll try having a look on this as well. The idea behind the patchset is to
allow lower speed NICs that use the API already, gain recycling 'easily'.
Using page_pool for the driver comes with a penalty to begin with.
Allocating pages instead of SKBs has a measurable difference. By enabling them
to recycle they'll get better performance, since you skip the
reallocation/remapping and only care for syncing the buffers correctly.
> [1] https://lore.kernel.org/linux-mm/[email protected]/
On Sat, Apr 10, 2021 at 09:27:31PM +0300, Ilias Apalodimas wrote:
> > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > can not be used.
>
> Yes it can, since it's going to be used as your default allocator for
> payloads, which might end up on an SKB.
> So we have to keep the extra added field on struct page for our mark.
> Matthew had an intersting idea. He suggested keeping it, but changing the
> magic number, so it can't be a kernel address, but I'll let him follow
> up on the details.
Sure! So, given the misalignment problem I discovered yesterday [1],
we probably want a page_pool page to look like:
unsigned long flags;
unsigned long pp_magic;
unsigned long xmi;
unsigned long _pp_mapping_pad;
dma_addr_t dma_addr; /* might be one or two words */
The only real restriction here is that pp_magic should not be a valid
pointer, and it must have the bottom bit clear. I'd recommend something
like:
#define PP_MAGIC (0x20 + POISON_POINTER_DELTA)
This leaves page->mapping as NULL, so you don't have to worry about
clearing it before free.
[1] https://lore.kernel.org/linux-mm/[email protected]/
On Sat, 10 Apr 2021 20:39:55 +0100
Matthew Wilcox <[email protected]> wrote:
> On Sat, Apr 10, 2021 at 09:27:31PM +0300, Ilias Apalodimas wrote:
> > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > can not be used.
> >
> > Yes it can, since it's going to be used as your default allocator for
> > payloads, which might end up on an SKB.
> > So we have to keep the extra added field on struct page for our mark.
> > Matthew had an intersting idea. He suggested keeping it, but changing the
> > magic number, so it can't be a kernel address, but I'll let him follow
> > up on the details.
>
> Sure! So, given the misalignment problem I discovered yesterday [1],
> we probably want a page_pool page to look like:
>
> unsigned long flags;
> unsigned long pp_magic;
> unsigned long xmi;
> unsigned long _pp_mapping_pad;
> dma_addr_t dma_addr; /* might be one or two words */
>
> The only real restriction here is that pp_magic should not be a valid
> pointer, and it must have the bottom bit clear. I'd recommend something
> like:
>
> #define PP_MAGIC (0x20 + POISON_POINTER_DELTA)
>
> This leaves page->mapping as NULL, so you don't have to worry about
> clearing it before free.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
I didn't see this, before asking[2] for explaining your intent.
I still worry about page->index, see [2].
[2] https://lore.kernel.org/netdev/20210411114307.5087f958@carbon/
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
On Sat, 10 Apr 2021 21:27:31 +0300
Ilias Apalodimas <[email protected]> wrote:
> On Sat, Apr 10, 2021 at 10:42:30AM -0700, Shakeel Butt wrote:
>
> > On Sat, Apr 10, 2021 at 9:16 AM Ilias Apalodimas
> > <[email protected]> wrote:
> > >
> > > Hi Matthew
> > >
> > > On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote:
> > > > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:
> > > > > This is needed by the page_pool to avoid recycling a page not allocated
> > > > > via page_pool.
> > > >
> > > > Is the PageType mechanism more appropriate to your needs? It wouldn't
> > > > be if you use page->_mapcount (ie mapping it to userspace).
> > >
> > > Interesting!
> > > Please keep in mind this was written ~2018 and was stale on my branches for
> > > quite some time. So back then I did try to use PageType, but had not free
> > > bits. Looking at it again though, it's cleaned up. So yes I think this can
> > > be much much cleaner. Should we go and define a new PG_pagepool?
> > >
> >
> > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > can not be used.
>
> Yes it can, since it's going to be used as your default allocator for
> payloads, which might end up on an SKB.
I'm not sure we want or should "allow" page_pool be used for TCP RX
zerocopy.
For several reasons.
(1) This implies mapping these pages page to userspace, which AFAIK
means using page->mapping and page->index members (right?).
(2) It feels wrong (security wise) to keep the DMA-mapping (for the
device) and also map this page into userspace.
(3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
zerocopy will bump the refcnt, which means the page_pool will not
recycle the page when it see the elevated refcnt (it will instead
release its DMA-mapping).
(4) I remember vaguely that this code path for (TCP RX zerocopy) uses
page->private for tricks. And our patch [3/5] use page->private for
storing xdp_mem_info.
IMHO when the SKB travel into this TCP RX zerocopy code path, we should
call page_pool_release_page() to release its DMA-mapping.
> > [1] https://lore.kernel.org/linux-mm/[email protected]/
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
<[email protected]> wrote:
>
[...]
> > >
> > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > can not be used.
> >
> > Yes it can, since it's going to be used as your default allocator for
> > payloads, which might end up on an SKB.
>
> I'm not sure we want or should "allow" page_pool be used for TCP RX
> zerocopy.
> For several reasons.
>
> (1) This implies mapping these pages page to userspace, which AFAIK
> means using page->mapping and page->index members (right?).
>
No, only page->_mapcount is used.
> (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> device) and also map this page into userspace.
>
I think this is already the case i.e pages still DMA-mapped and also
mapped into userspace.
> (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> zerocopy will bump the refcnt, which means the page_pool will not
> recycle the page when it see the elevated refcnt (it will instead
> release its DMA-mapping).
Yes this is right but the userspace might have already consumed and
unmapped the page before the driver considers to recycle the page.
>
> (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> page->private for tricks. And our patch [3/5] use page->private for
> storing xdp_mem_info.
>
> IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> call page_pool_release_page() to release its DMA-mapping.
>
I will let TCP RX zerocopy experts respond to this but from my high
level code inspection, I didn't see page->private usage.
On Wed, Apr 14, 2021 at 10:09 PM Shakeel Butt <[email protected]> wrote:
>
> I will let TCP RX zerocopy experts respond to this but from my high
> level code inspection, I didn't see page->private usage.
Indeed, we do not use page->private, since we do not own the page(s).
On Wed, Apr 14, 2021 at 01:09:47PM -0700, Shakeel Butt wrote:
> On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
> <[email protected]> wrote:
> >
> [...]
> > > >
> > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > can not be used.
> > >
> > > Yes it can, since it's going to be used as your default allocator for
> > > payloads, which might end up on an SKB.
> >
> > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > zerocopy.
> > For several reasons.
> >
> > (1) This implies mapping these pages page to userspace, which AFAIK
> > means using page->mapping and page->index members (right?).
> >
>
> No, only page->_mapcount is used.
>
I am not sure I like leaving out TCP RX zerocopy. Since we want driver to
adopt the recycling mechanism we should try preserving the current
functionality of the network stack.
The question is how does it work with the current drivers that already have an
internal page recycling mechanism.
> > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > device) and also map this page into userspace.
> >
>
> I think this is already the case i.e pages still DMA-mapped and also
> mapped into userspace.
>
> > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > zerocopy will bump the refcnt, which means the page_pool will not
> > recycle the page when it see the elevated refcnt (it will instead
> > release its DMA-mapping).
>
> Yes this is right but the userspace might have already consumed and
> unmapped the page before the driver considers to recycle the page.
Same question here. I'll have a closer look in a few days and make sure we are
not breaking anything wrt zerocopy.
>
> >
> > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> > page->private for tricks. And our patch [3/5] use page->private for
> > storing xdp_mem_info.
> >
> > IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> > call page_pool_release_page() to release its DMA-mapping.
> >
>
> I will let TCP RX zerocopy experts respond to this but from my high
> level code inspection, I didn't see page->private usage.
Shakeel are you aware of any 'easy' way I can have rx zerocopy running?
Thanks!
/Ilias
On Wed, 14 Apr 2021 13:09:47 -0700
Shakeel Butt <[email protected]> wrote:
> On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
> <[email protected]> wrote:
> >
> [...]
> > > >
> > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > can not be used.
> > >
> > > Yes it can, since it's going to be used as your default allocator for
> > > payloads, which might end up on an SKB.
> >
> > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > zerocopy.
> > For several reasons.
> >
> > (1) This implies mapping these pages page to userspace, which AFAIK
> > means using page->mapping and page->index members (right?).
> >
>
> No, only page->_mapcount is used.
Good to know.
I will admit that I don't fully understand the usage of page->mapping
and page->index members.
> > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > device) and also map this page into userspace.
> >
>
> I think this is already the case i.e pages still DMA-mapped and also
> mapped into userspace.
True, other drivers are doing the same.
> > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > zerocopy will bump the refcnt, which means the page_pool will not
> > recycle the page when it see the elevated refcnt (it will instead
> > release its DMA-mapping).
>
> Yes this is right but the userspace might have already consumed and
> unmapped the page before the driver considers to recycle the page.
That is a good point. So, there is a race window where it is possible
to gain recycling.
It seems my page_pool co-maintainer Ilias is interested in taking up the
challenge to get this working with TCP RX zerocopy. So, lets see how
this is doable.
> >
> > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> > page->private for tricks. And our patch [3/5] use page->private for
> > storing xdp_mem_info.
> >
> > IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> > call page_pool_release_page() to release its DMA-mapping.
> >
>
> I will let TCP RX zerocopy experts respond to this but from my high
> level code inspection, I didn't see page->private usage.
I trust when Eric says page->private isn't used in this code path.
So, it might actually be possible :-)
I will challenge Ilias and Matteo to pull this off. (But I know that
both of them are busy for personal reasons, so be patient with them).
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
On Mon, Apr 19, 2021 at 01:22:04PM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 14 Apr 2021 13:09:47 -0700
> Shakeel Butt <[email protected]> wrote:
>
> > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
> > <[email protected]> wrote:
> > >
> > [...]
> > > > >
> > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > > can not be used.
> > > >
> > > > Yes it can, since it's going to be used as your default allocator for
> > > > payloads, which might end up on an SKB.
> > >
> > > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > > zerocopy.
> > > For several reasons.
> > >
> > > (1) This implies mapping these pages page to userspace, which AFAIK
> > > means using page->mapping and page->index members (right?).
> > >
> >
> > No, only page->_mapcount is used.
>
> Good to know.
> I will admit that I don't fully understand the usage of page->mapping
> and page->index members.
That's fair. It's not well-documented, and it's complicated.
For a page mapped into userspace, page->mapping is one of:
- NULL
- A pointer to a file's address_space
- A pointer to an anonymous page's anon_vma
If a page isn't mapped into userspace, you can use the space in page->mapping
for anything you like (eg slab uses it)
page->index is only used for indicating pfmemalloc today (and I want to
move that indicator). I think it can also be used to merge VMAs (if
some other conditions are also true), but failing to merge VMAs isn't
a big deal for this kind of situation.
> > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > > device) and also map this page into userspace.
> >
> > I think this is already the case i.e pages still DMA-mapped and also
> > mapped into userspace.
>
> True, other drivers are doing the same.
And the contents of this page already came from that device ... if it
wanted to write bad data, it could already have done so.
> > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > > zerocopy will bump the refcnt, which means the page_pool will not
> > > recycle the page when it see the elevated refcnt (it will instead
> > > release its DMA-mapping).
> >
> > Yes this is right but the userspace might have already consumed and
> > unmapped the page before the driver considers to recycle the page.
>
> That is a good point. So, there is a race window where it is possible
> to gain recycling.
>
> It seems my page_pool co-maintainer Ilias is interested in taking up the
> challenge to get this working with TCP RX zerocopy. So, lets see how
> this is doable.
You could also check page_ref_count() - page_mapcount() instead of
just checking page_ref_count(). Assuming mapping/unmapping can't
race with recycling?
On Sun, Apr 18, 2021 at 10:12 PM Ilias Apalodimas
<[email protected]> wrote:
>
> On Wed, Apr 14, 2021 at 01:09:47PM -0700, Shakeel Butt wrote:
> > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
> > <[email protected]> wrote:
> > >
> > [...]
> > > > >
> > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > > can not be used.
> > > >
> > > > Yes it can, since it's going to be used as your default allocator for
> > > > payloads, which might end up on an SKB.
> > >
> > > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > > zerocopy.
> > > For several reasons.
> > >
> > > (1) This implies mapping these pages page to userspace, which AFAIK
> > > means using page->mapping and page->index members (right?).
> > >
> >
> > No, only page->_mapcount is used.
> >
>
> I am not sure I like leaving out TCP RX zerocopy. Since we want driver to
> adopt the recycling mechanism we should try preserving the current
> functionality of the network stack.
>
> The question is how does it work with the current drivers that already have an
> internal page recycling mechanism.
>
I think the current drivers check page_ref_count(page) to decide to
reuse (or not) the already allocated pages.
Some examples from the drivers:
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:ixgbe_can_reuse_rx_page()
drivers/net/ethernet/intel/igb/igb_main.c:igb_can_reuse_rx_page()
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:mlx5e_rx_cache_get()
> > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > > device) and also map this page into userspace.
> > >
> >
> > I think this is already the case i.e pages still DMA-mapped and also
> > mapped into userspace.
> >
> > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > > zerocopy will bump the refcnt, which means the page_pool will not
> > > recycle the page when it see the elevated refcnt (it will instead
> > > release its DMA-mapping).
> >
> > Yes this is right but the userspace might have already consumed and
> > unmapped the page before the driver considers to recycle the page.
>
> Same question here. I'll have a closer look in a few days and make sure we are
> not breaking anything wrt zerocopy.
>
Pages mapped into the userspace have their refcnt elevated, so the
page_ref_count() check by the drivers indicates to not reuse such
pages.
> >
> > >
> > > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> > > page->private for tricks. And our patch [3/5] use page->private for
> > > storing xdp_mem_info.
> > >
> > > IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> > > call page_pool_release_page() to release its DMA-mapping.
> > >
> >
> > I will let TCP RX zerocopy experts respond to this but from my high
> > level code inspection, I didn't see page->private usage.
>
> Shakeel are you aware of any 'easy' way I can have rx zerocopy running?
>
I would recommend tools/testing/selftests/net/tcp_mmap.c.
Hi Shakeel,
On Mon, Apr 19, 2021 at 07:57:03AM -0700, Shakeel Butt wrote:
> On Sun, Apr 18, 2021 at 10:12 PM Ilias Apalodimas
> <[email protected]> wrote:
> >
> > On Wed, Apr 14, 2021 at 01:09:47PM -0700, Shakeel Butt wrote:
> > > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
> > > <[email protected]> wrote:
> > > >
> > > [...]
> > > > > >
> > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > > > can not be used.
> > > > >
> > > > > Yes it can, since it's going to be used as your default allocator for
> > > > > payloads, which might end up on an SKB.
> > > >
> > > > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > > > zerocopy.
> > > > For several reasons.
> > > >
> > > > (1) This implies mapping these pages page to userspace, which AFAIK
> > > > means using page->mapping and page->index members (right?).
> > > >
> > >
> > > No, only page->_mapcount is used.
> > >
> >
> > I am not sure I like leaving out TCP RX zerocopy. Since we want driver to
> > adopt the recycling mechanism we should try preserving the current
> > functionality of the network stack.
> >
> > The question is how does it work with the current drivers that already have an
> > internal page recycling mechanism.
> >
>
> I think the current drivers check page_ref_count(page) to decide to
> reuse (or not) the already allocated pages.
>
> Some examples from the drivers:
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:ixgbe_can_reuse_rx_page()
> drivers/net/ethernet/intel/igb/igb_main.c:igb_can_reuse_rx_page()
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:mlx5e_rx_cache_get()
>
Yes, that's how internal recycling is done in drivers. As Jesper mentioned the
refcnt of the page is 1 for the page_pool owned pages and that's how we decide
what to do with the page.
> > > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > > > device) and also map this page into userspace.
> > > >
> > >
> > > I think this is already the case i.e pages still DMA-mapped and also
> > > mapped into userspace.
> > >
> > > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > > > zerocopy will bump the refcnt, which means the page_pool will not
> > > > recycle the page when it see the elevated refcnt (it will instead
> > > > release its DMA-mapping).
> > >
> > > Yes this is right but the userspace might have already consumed and
> > > unmapped the page before the driver considers to recycle the page.
> >
> > Same question here. I'll have a closer look in a few days and make sure we are
> > not breaking anything wrt zerocopy.
> >
>
> Pages mapped into the userspace have their refcnt elevated, so the
> page_ref_count() check by the drivers indicates to not reuse such
> pages.
>
When tcp_zerocopy_receive() is invoked it will call tcp_zerocopy_vm_insert_batch()
which will end up doing a get_page().
What you are saying is that once the zerocopy is done though, skb_release_data()
won't be called, but instead put_page() will be? If that's the case then we are
indeed leaking DMA mappings and memory. That sounds weird though, since the
refcnt will be one in that case (zerocopy will do +1/-1 once it's done), so who
eventually frees the page?
If kfree_skb() (or any wrapper that calls skb_release_data()) is called
eventually, we'll end up properly recycling the page into our pool.
> > >
> > > >
> > > > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> > > > page->private for tricks. And our patch [3/5] use page->private for
> > > > storing xdp_mem_info.
> > > >
> > > > IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> > > > call page_pool_release_page() to release its DMA-mapping.
> > > >
> > >
> > > I will let TCP RX zerocopy experts respond to this but from my high
> > > level code inspection, I didn't see page->private usage.
> >
> > Shakeel are you aware of any 'easy' way I can have rx zerocopy running?
> >
>
> I would recommend tools/testing/selftests/net/tcp_mmap.c.
Ok, thanks I'll have a look.
Cheers
/Ilias
On Mon, Apr 19, 2021 at 8:43 AM Ilias Apalodimas
<[email protected]> wrote:
>
[...]
> > Pages mapped into the userspace have their refcnt elevated, so the
> > page_ref_count() check by the drivers indicates to not reuse such
> > pages.
> >
>
> When tcp_zerocopy_receive() is invoked it will call tcp_zerocopy_vm_insert_batch()
> which will end up doing a get_page().
> What you are saying is that once the zerocopy is done though, skb_release_data()
> won't be called, but instead put_page() will be? If that's the case then we are
> indeed leaking DMA mappings and memory. That sounds weird though, since the
> refcnt will be one in that case (zerocopy will do +1/-1 once it's done), so who
> eventually frees the page?
> If kfree_skb() (or any wrapper that calls skb_release_data()) is called
> eventually, we'll end up properly recycling the page into our pool.
>
From what I understand (Eric, please correct me if I'm wrong) for
simple cases there are 3 page references taken. One by the driver,
second by skb and third by page table.
In tcp_zerocopy_receive(), tcp_zerocopy_vm_insert_batch() gets one
page ref through insert_page_into_pte_locked(). However before
returning from tcp_zerocopy_receive(), the skb references are dropped
through tcp_recv_skb(). So, whenever the user unmaps the page and
drops the page ref only then that page can be reused by the driver.
In my understanding, for zerocopy rx the skb_release_data() is called
on the pages while they are still mapped into the userspace. So,
skb_release_data() might not be the right place to recycle the page
for zerocopy. The email chain at [1] has some discussion on how to
bundle the recycling of pages with their lifetime.
[1] https://lore.kernel.org/linux-mm/[email protected]/
On Mon, Apr 19, 2021 at 09:21:55AM -0700, Shakeel Butt wrote:
> On Mon, Apr 19, 2021 at 8:43 AM Ilias Apalodimas
> <[email protected]> wrote:
> >
> [...]
> > > Pages mapped into the userspace have their refcnt elevated, so the
> > > page_ref_count() check by the drivers indicates to not reuse such
> > > pages.
> > >
> >
> > When tcp_zerocopy_receive() is invoked it will call tcp_zerocopy_vm_insert_batch()
> > which will end up doing a get_page().
> > What you are saying is that once the zerocopy is done though, skb_release_data()
> > won't be called, but instead put_page() will be? If that's the case then we are
> > indeed leaking DMA mappings and memory. That sounds weird though, since the
> > refcnt will be one in that case (zerocopy will do +1/-1 once it's done), so who
> > eventually frees the page?
> > If kfree_skb() (or any wrapper that calls skb_release_data()) is called
> > eventually, we'll end up properly recycling the page into our pool.
> >
>
> From what I understand (Eric, please correct me if I'm wrong) for
> simple cases there are 3 page references taken. One by the driver,
> second by skb and third by page table.
>
> In tcp_zerocopy_receive(), tcp_zerocopy_vm_insert_batch() gets one
> page ref through insert_page_into_pte_locked(). However before
> returning from tcp_zerocopy_receive(), the skb references are dropped
> through tcp_recv_skb(). So, whenever the user unmaps the page and
> drops the page ref only then that page can be reused by the driver.
>
> In my understanding, for zerocopy rx the skb_release_data() is called
> on the pages while they are still mapped into the userspace. So,
> skb_release_data() might not be the right place to recycle the page
> for zerocopy. The email chain at [1] has some discussion on how to
> bundle the recycling of pages with their lifetime.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
Ah right, you mentioned the same email before and I completely forgot about
it! In the past we had thoughts of 'stealing' the page on put_page instead of
skb_release_data(). We were afraid that this would cause a measurable
performance hit, so we tried to limit it within the skb lifecycle.
However I don't think this will be a problem. Assuming we are right here and
skb_release_data() is called while the userspace holds an extra reference from
the mapping here's what will happen:
skb_release_data() -> skb_free_head() -> page_pool_return_skb_page() ->
set_page_private() -> xdp_return_skb_frame() -> __xdp_return() ->
page_pool_put_full_page() -> page_pool_put_page() -> __page_pool_put_page()
When we call __page_pool_put_page(), the refcnt will be != 1 (because a user
mapping is still active), so we won't try to recycle it. Instead we'll remove
the DMA mappings and decrease the refcnt.
So although the recycling won't 'work', nothing bad will happen (famous last
words).
In any case, I'll double check with the test you pointed out before v4.
Thanks!
/Ilias
Hi Matthew,
[...]
>
> And the contents of this page already came from that device ... if it
> wanted to write bad data, it could already have done so.
>
> > > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > > > zerocopy will bump the refcnt, which means the page_pool will not
> > > > recycle the page when it see the elevated refcnt (it will instead
> > > > release its DMA-mapping).
> > >
> > > Yes this is right but the userspace might have already consumed and
> > > unmapped the page before the driver considers to recycle the page.
> >
> > That is a good point. So, there is a race window where it is possible
> > to gain recycling.
> >
> > It seems my page_pool co-maintainer Ilias is interested in taking up the
> > challenge to get this working with TCP RX zerocopy. So, lets see how
> > this is doable.
>
> You could also check page_ref_count() - page_mapcount() instead of
> just checking page_ref_count(). Assuming mapping/unmapping can't
> race with recycling?
>
That's not a bad idea. As I explained on my last reply to Shakeel, I don't
think the current patch will blow up anywhere. If the page is unmapped prior
to kfree_skb() it will be recycled. If it's done in a reverse order, we'll
just free the page entirely and will have to re-allocate it.
The only thing I need to test is potential races (assuming those can even
happen?).
Trying to recycle the page outside of kfree_skb() means we'd have to 'steal'
the page, during put_page() (or some function that's outside the networking
scope). I think this is going to have a measurable performance penalty though
not in networking, but in general.
In any case, that should be orthogonal to the current patchset. So unless
someone feels strongly about it, I'd prefer keeping the current code and
trying to enable recycling in the skb zc case, when we have enough users of
the API.
Thanks
/Ilias
On 2021/4/10 6:37, Matteo Croce wrote:
> From: Matteo Croce <[email protected]>
>
> This is a respin of [1]
>
> This patchset shows the plans for allowing page_pool to handle and
> maintain DMA map/unmap of the pages it serves to the driver. For this
> to work a return hook in the network core is introduced.
>
> The overall purpose is to simplify drivers, by providing a page
> allocation API that does recycling, such that each driver doesn't have
> to reinvent its own recycling scheme. Using page_pool in a driver
> does not require implementing XDP support, but it makes it trivially
> easy to do so. Instead of allocating buffers specifically for SKBs
> we now allocate a generic buffer and either wrap it on an SKB
> (via build_skb) or create an XDP frame.
> The recycling code leverages the XDP recycle APIs.
>
> The Marvell mvpp2 and mvneta drivers are used in this patchset to
> demonstrate how to use the API, and tested on a MacchiatoBIN
> and EspressoBIN boards respectively.
>
Hi, Matteo
I added the skb frag page recycling in hns3 based on this patchset,
and it has above 10%~20% performance improvement for one thread iperf
TCP flow(IOMMU is off, there may be more performance improvement if
considering the DMA map/unmap avoiding for IOMMU), thanks for the job.
The skb frag page recycling support in hns3 driver is not so simple
as the mvpp2 and mvneta driver, because:
1. the hns3 driver do not have XDP support yet, so "struct xdp_rxq_info"
is added to assist relation binding between the "struct page" and
"struct page_pool".
2. the hns3 driver has already a page reusing based on page spliting and
page reference count, but it may not work if the upper stack can not
handle skb and release the corresponding page fast enough.
3. the hns3 driver support page reference count updating batching, see:
aeda9bf87a45 ("net: hns3: batch the page reference count updates")
So it would be better if:
1. skb frag page recycling do not need "struct xdp_rxq_info" or
"struct xdp_mem_info" to bond the relation between "struct page" and
"struct page_pool", which seems uncessary at this point if bonding
a "struct page_pool" pointer directly in "struct page" does not cause
space increasing.
2. it would be good to do the page reference count updating batching
in page pool instead of specific driver.
page_pool_atomic_sub_if_positive() is added to decide who can call
page_pool_put_full_page(), because the driver and stack may hold
reference to the same page, only if last one which hold complete
reference to a page can call page_pool_put_full_page() to decide if
recycling is possible, if not, the page is released, so I am wondering
if a similar page_pool_atomic_sub_if_positive() can added to specific
user space address unmapping path to allow skb recycling for RX zerocopy
too?
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index c21dd11..8b01a7d 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2566,7 +2566,10 @@ static int hns3_alloc_buffer(struct hns3_enet_ring *ring,
unsigned int order = hns3_page_order(ring);
struct page *p;
- p = dev_alloc_pages(order);
+ if (ring->page_pool)
+ p = page_pool_dev_alloc_pages(ring->page_pool);
+ else
+ p = dev_alloc_pages(order);
if (!p)
return -ENOMEM;
@@ -2582,13 +2585,32 @@ static int hns3_alloc_buffer(struct hns3_enet_ring *ring,
return 0;
}
+static void hns3_page_frag_cache_drain(struct hns3_enet_ring *ring,
+ struct hns3_desc_cb *cb)
+{
+ if (ring->page_pool) {
+ struct page *p = cb->priv;
+
+ if (page_pool_atomic_sub_if_positive(p, cb->pagecnt_bias))
+ return;
+
+ if (cb->pagecnt_bias > 1)
+ page_ref_sub(p, cb->pagecnt_bias - 1);
+
+ page_pool_put_full_page(ring->page_pool, p, false);
+ return;
+ }
+
+ __page_frag_cache_drain(cb->priv, cb->pagecnt_bias);
+}
+
static void hns3_free_buffer(struct hns3_enet_ring *ring,
struct hns3_desc_cb *cb, int budget)
{
if (cb->type == DESC_TYPE_SKB)
napi_consume_skb(cb->priv, budget);
else if (!HNAE3_IS_TX_RING(ring) && cb->pagecnt_bias)
- __page_frag_cache_drain(cb->priv, cb->pagecnt_bias);
+ hns3_page_frag_cache_drain(ring, cb);
memset(cb, 0, sizeof(*cb));
}
@@ -2892,13 +2914,15 @@ static void hns3_nic_reuse_page(struct sk_buff *skb, int i,
skb_add_rx_frag(skb, i, desc_cb->priv, desc_cb->page_offset + pull_len,
size - pull_len, truesize);
+ skb_mark_for_recycle(skb, desc_cb->priv, &ring->rxq_info.mem);
+
/* Avoid re-using remote and pfmemalloc pages, or the stack is still
* using the page when page_offset rollback to zero, flag default
* unreuse
*/
if (!dev_page_is_reusable(desc_cb->priv) ||
(!desc_cb->page_offset && !hns3_can_reuse_page(desc_cb))) {
- __page_frag_cache_drain(desc_cb->priv, desc_cb->pagecnt_bias);
+ hns3_page_frag_cache_drain(ring, desc_cb);
return;
}
@@ -2911,7 +2935,7 @@ static void hns3_nic_reuse_page(struct sk_buff *skb, int i,
desc_cb->reuse_flag = 1;
desc_cb->page_offset = 0;
} else if (desc_cb->pagecnt_bias) {
- __page_frag_cache_drain(desc_cb->priv, desc_cb->pagecnt_bias);
+ hns3_page_frag_cache_drain(ring, desc_cb);
return;
}
@@ -3156,8 +3180,7 @@ static int hns3_alloc_skb(struct hns3_enet_ring *ring, unsigned int length,
if (dev_page_is_reusable(desc_cb->priv))
desc_cb->reuse_flag = 1;
else /* This page cannot be reused so discard it */
- __page_frag_cache_drain(desc_cb->priv,
- desc_cb->pagecnt_bias);
+ hns3_page_frag_cache_drain(ring, desc_cb);
hns3_rx_ring_move_fw(ring);
return 0;
@@ -4028,6 +4051,33 @@ static int hns3_alloc_ring_memory(struct hns3_enet_ring *ring)
goto out_with_desc_cb;
if (!HNAE3_IS_TX_RING(ring)) {
+ struct page_pool_params pp_params = {
+ /* internal DMA mapping in page_pool */
+ .flags = 0,
+ .order = 0,
+ .pool_size = 1024,
+ .nid = dev_to_node(ring_to_dev(ring)),
+ .dev = ring_to_dev(ring),
+ .dma_dir = DMA_FROM_DEVICE,
+ .offset = 0,
+ .max_len = 0,
+ };
+
+ ring->page_pool = page_pool_create(&pp_params);
+ if (IS_ERR(ring->page_pool)) {
+ dev_err(ring_to_dev(ring), "page pool creation failed\n");
+ ring->page_pool = NULL;
+ }
+
+ ret = xdp_rxq_info_reg(&ring->rxq_info, ring_to_netdev(ring), ring->queue_index, 0);
+ if (ret)
+ dev_err(ring_to_dev(ring), "xdp_rxq_info_reg failed\n");
+
+ ret = xdp_rxq_info_reg_mem_model(&ring->rxq_info, MEM_TYPE_PAGE_POOL,
+ ring->page_pool);
+ if (ret)
+ dev_err(ring_to_dev(ring), "xdp_rxq_info_reg_mem_model failed\n");
+
ret = hns3_alloc_ring_buffers(ring);
if (ret)
goto out_with_desc;
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
index daa04ae..fd53fcc 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
@@ -6,6 +6,9 @@
#include <linux/if_vlan.h>
+#include <net/page_pool.h>
+#include <net/xdp.h>
+
#include "hnae3.h"
enum hns3_nic_state {
@@ -408,6 +411,8 @@ struct hns3_enet_ring {
struct hnae3_queue *tqp;
int queue_index;
struct device *dev; /* will be used for DMA mapping of descriptors */
+ struct page_pool *page_pool;
struct hnae3_queue *tqp;
int queue_index;
struct device *dev; /* will be used for DMA mapping of descriptors */
+ struct page_pool *page_pool;
+ struct xdp_rxq_info rxq_info;
/* statistic */
struct ring_stats stats;
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 75fffc1..70c310e 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -195,6 +195,8 @@ static inline void page_pool_put_full_page(struct page_pool *pool,
#endif
}
+bool page_pool_atomic_sub_if_positive(struct page *page, int i);
+
/* Same as above but the caller must guarantee safe context. e.g NAPI */
static inline void page_pool_recycle_direct(struct page_pool *pool,
struct page *page)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 43bfd2e..8bc8b7e 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -596,6 +596,26 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
}
EXPORT_SYMBOL(page_pool_update_nid);
+bool page_pool_atomic_sub_if_positive(struct page *page, int i)
+{
+ atomic_t *v = &page->_refcount;
+ int dec, c;
+
+ do {
+ c = atomic_read(v);
+
+ dec = c - i;
+ if (unlikely(dec == 0))
+ return false;
+ else if (unlikely(dec < 0)) {
+ pr_err("c: %d, dec: %d, i: %d\n", c, dec, i);
+ return false;
+ }
+ } while (!atomic_try_cmpxchg(v, &c, dec));
+
+ return true;
+}
+
bool page_pool_return_skb_page(void *data)
{
struct xdp_mem_info mem_info;
@@ -606,6 +626,9 @@ bool page_pool_return_skb_page(void *data)
if (unlikely(page->signature != PP_SIGNATURE))
return false;
+ if (page_pool_atomic_sub_if_positive(page, 1))
+ return true;
+
info.raw = page_private(page);
mem_info = info.mem_info;
Hi Yunsheng,
On Thu, Apr 29, 2021 at 04:27:21PM +0800, Yunsheng Lin wrote:
> On 2021/4/10 6:37, Matteo Croce wrote:
> > From: Matteo Croce <[email protected]>
> >
> > This is a respin of [1]
> >
> > This patchset shows the plans for allowing page_pool to handle and
> > maintain DMA map/unmap of the pages it serves to the driver. For this
> > to work a return hook in the network core is introduced.
> >
> > The overall purpose is to simplify drivers, by providing a page
> > allocation API that does recycling, such that each driver doesn't have
> > to reinvent its own recycling scheme. Using page_pool in a driver
> > does not require implementing XDP support, but it makes it trivially
> > easy to do so. Instead of allocating buffers specifically for SKBs
> > we now allocate a generic buffer and either wrap it on an SKB
> > (via build_skb) or create an XDP frame.
> > The recycling code leverages the XDP recycle APIs.
> >
> > The Marvell mvpp2 and mvneta drivers are used in this patchset to
> > demonstrate how to use the API, and tested on a MacchiatoBIN
> > and EspressoBIN boards respectively.
> >
>
> Hi, Matteo
> I added the skb frag page recycling in hns3 based on this patchset,
> and it has above 10%~20% performance improvement for one thread iperf
> TCP flow(IOMMU is off, there may be more performance improvement if
> considering the DMA map/unmap avoiding for IOMMU), thanks for the job.
>
> The skb frag page recycling support in hns3 driver is not so simple
> as the mvpp2 and mvneta driver, because:
>
> 1. the hns3 driver do not have XDP support yet, so "struct xdp_rxq_info"
> is added to assist relation binding between the "struct page" and
> "struct page_pool".
>
> 2. the hns3 driver has already a page reusing based on page spliting and
> page reference count, but it may not work if the upper stack can not
> handle skb and release the corresponding page fast enough.
>
> 3. the hns3 driver support page reference count updating batching, see:
> aeda9bf87a45 ("net: hns3: batch the page reference count updates")
>
> So it would be better if:
>
> 1. skb frag page recycling do not need "struct xdp_rxq_info" or
> "struct xdp_mem_info" to bond the relation between "struct page" and
> "struct page_pool", which seems uncessary at this point if bonding
> a "struct page_pool" pointer directly in "struct page" does not cause
> space increasing.
We can't do that. The reason we need those structs is that we rely on the
existing XDP code, which already recycles it's buffers, to enable
recycling. Since we allocate a page per packet when using page_pool for a
driver , the same ideas apply to an SKB and XDP frame. We just recycle the
payload and we don't really care what's in that. We could rename the functions
to something more generic in the future though ?
>
> 2. it would be good to do the page reference count updating batching
> in page pool instead of specific driver.
>
>
> page_pool_atomic_sub_if_positive() is added to decide who can call
> page_pool_put_full_page(), because the driver and stack may hold
> reference to the same page, only if last one which hold complete
> reference to a page can call page_pool_put_full_page() to decide if
> recycling is possible, if not, the page is released, so I am wondering
> if a similar page_pool_atomic_sub_if_positive() can added to specific
> user space address unmapping path to allow skb recycling for RX zerocopy
> too?
>
I would prefer a different page pool type if we wanted to support the split
page model. The changes as is are quite intrusive, since they change the
entire skb return path. So I would prefer introducing the changes one at a
time.
The fundamental difference between having the recycling in the driver vs
having it in a generic API is pretty straightforward. When a driver holds
the extra page references he is free to decide what to reuse, when he is about
to refill his Rx descriptors. So TCP zerocopy might work even if the
userspace applications hold the buffers for an X amount of time.
On this proposal though we *need* to decide what to do with the buffer when we
are about to free the skb.
[...]
Cheers
/Ilias
On 2021/4/30 2:51, Ilias Apalodimas wrote:
> Hi Yunsheng,
>
> On Thu, Apr 29, 2021 at 04:27:21PM +0800, Yunsheng Lin wrote:
>> On 2021/4/10 6:37, Matteo Croce wrote:
>>> From: Matteo Croce <[email protected]>
[...]
>>
>> 1. skb frag page recycling do not need "struct xdp_rxq_info" or
>> "struct xdp_mem_info" to bond the relation between "struct page" and
>> "struct page_pool", which seems uncessary at this point if bonding
>> a "struct page_pool" pointer directly in "struct page" does not cause
>> space increasing.
>
> We can't do that. The reason we need those structs is that we rely on the
> existing XDP code, which already recycles it's buffers, to enable
> recycling. Since we allocate a page per packet when using page_pool for a
> driver , the same ideas apply to an SKB and XDP frame. We just recycle the
I am not really familar with XDP here, but a packet from hw is either a
"struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack,
a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at
the same time, right?
What does not really make sense to me is that the page has to be from page
pool when a skb's frag page can be recycled, right? If it is ture, the switch
case in __xdp_return() does not really make sense for skb recycling, why go
all the trouble of checking the mem->type and mem->id to find the page_pool
pointer when recyclable page for skb can only be from page pool?
> payload and we don't really care what's in that. We could rename the functions
> to something more generic in the future though ?
>
>>
>> 2. it would be good to do the page reference count updating batching
>> in page pool instead of specific driver.
>>
>>
>> page_pool_atomic_sub_if_positive() is added to decide who can call
>> page_pool_put_full_page(), because the driver and stack may hold
>> reference to the same page, only if last one which hold complete
>> reference to a page can call page_pool_put_full_page() to decide if
>> recycling is possible, if not, the page is released, so I am wondering
>> if a similar page_pool_atomic_sub_if_positive() can added to specific
>> user space address unmapping path to allow skb recycling for RX zerocopy
>> too?
>>
>
> I would prefer a different page pool type if we wanted to support the split
> page model. The changes as is are quite intrusive, since they change the
> entire skb return path. So I would prefer introducing the changes one at a
> time.
I understand there may be fundamental semantic change when split page model
is supported by page pool, but the split page support change mainly affect the
skb recycling path and the driver that uses page pool(XDP too) if we are careful
enough, not the entire skb return path as my understanding.
Anyway, one changes at a time is always prefered if supporting split page is
proved to be non-trivel and intrusive.
>
> The fundamental difference between having the recycling in the driver vs
> having it in a generic API is pretty straightforward. When a driver holds
> the extra page references he is free to decide what to reuse, when he is about
> to refill his Rx descriptors. So TCP zerocopy might work even if the
> userspace applications hold the buffers for an X amount of time.
> On this proposal though we *need* to decide what to do with the buffer when we
> are about to free the skb.
I am not sure I understand what you meant by "free the skb", does it mean
that kfree_skb() is called to free the skb.
As my understanding, if the skb completely own the page(which means page_count()
== 1) when kfree_skb() is called, __page_pool_put_page() is called, otherwise
page_ref_dec() is called, which is exactly what page_pool_atomic_sub_if_positive()
try to handle it atomically.
>
> [...]
>
>
> Cheers
> /Ilias
>
> .
>
[...]
> >>
> >> 1. skb frag page recycling do not need "struct xdp_rxq_info" or
> >> "struct xdp_mem_info" to bond the relation between "struct page" and
> >> "struct page_pool", which seems uncessary at this point if bonding
> >> a "struct page_pool" pointer directly in "struct page" does not cause
> >> space increasing.
> >
> > We can't do that. The reason we need those structs is that we rely on the
> > existing XDP code, which already recycles it's buffers, to enable
> > recycling. Since we allocate a page per packet when using page_pool for a
> > driver , the same ideas apply to an SKB and XDP frame. We just recycle the
>
> I am not really familar with XDP here, but a packet from hw is either a
> "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack,
> a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at
> the same time, right?
>
Yes, but the payload is irrelevant in both cases and that's what we use
page_pool for. You can't use this patchset unless your driver usues
build_skb(). So in both cases you just allocate memory for the payload and
decide what the wrap the buffer with (XDP or SKB) later.
> What does not really make sense to me is that the page has to be from page
> pool when a skb's frag page can be recycled, right? If it is ture, the switch
> case in __xdp_return() does not really make sense for skb recycling, why go
> all the trouble of checking the mem->type and mem->id to find the page_pool
> pointer when recyclable page for skb can only be from page pool?
In any case you need to find in which pool the buffer you try to recycle
belongs. In order to make the whole idea generic and be able to recycle skb
fragments instead of just the skb head you need to store some information on
struct page. That's the fundamental difference of this patchset compared to
the RFC we sent a few years back [1] which was just storing information on the
skb. The way this is done on the current patchset is that we store the
struct xdp_mem_info in page->private and then look it up on xdp_return().
Now that being said Matthew recently reworked struct page, so we could see if
we can store the page pool pointer directly instead of the struct
xdp_mem_info. That would allow us to call into page pool functions directly.
But we'll have to agree if that makes sense to go into struct page to begin
with and make sure the pointer is still valid when we take the recycling path.
> > payload and we don't really care what's in that. We could rename the functions
> > to something more generic in the future though ?
> >
> >>
> >> 2. it would be good to do the page reference count updating batching
> >> in page pool instead of specific driver.
> >>
> >>
> >> page_pool_atomic_sub_if_positive() is added to decide who can call
> >> page_pool_put_full_page(), because the driver and stack may hold
> >> reference to the same page, only if last one which hold complete
> >> reference to a page can call page_pool_put_full_page() to decide if
> >> recycling is possible, if not, the page is released, so I am wondering
> >> if a similar page_pool_atomic_sub_if_positive() can added to specific
> >> user space address unmapping path to allow skb recycling for RX zerocopy
> >> too?
> >>
> >
> > I would prefer a different page pool type if we wanted to support the split
> > page model. The changes as is are quite intrusive, since they change the
> > entire skb return path. So I would prefer introducing the changes one at a
> > time.
>
> I understand there may be fundamental semantic change when split page model
> is supported by page pool, but the split page support change mainly affect the
> skb recycling path and the driver that uses page pool(XDP too) if we are careful
> enough, not the entire skb return path as my understanding.
It affects those drivers only, but in order to do so is intercepts the
packet in skb_free_head(), which pretty much affects the entire network path.
>
> Anyway, one changes at a time is always prefered if supporting split page is
> proved to be non-trivel and intrusive.
>
> >
> > The fundamental difference between having the recycling in the driver vs
> > having it in a generic API is pretty straightforward. When a driver holds
> > the extra page references he is free to decide what to reuse, when he is about
> > to refill his Rx descriptors. So TCP zerocopy might work even if the
> > userspace applications hold the buffers for an X amount of time.
> > On this proposal though we *need* to decide what to do with the buffer when we
> > are about to free the skb.
>
> I am not sure I understand what you meant by "free the skb", does it mean
> that kfree_skb() is called to free the skb.
Yes
>
> As my understanding, if the skb completely own the page(which means page_count()
> == 1) when kfree_skb() is called, __page_pool_put_page() is called, otherwise
> page_ref_dec() is called, which is exactly what page_pool_atomic_sub_if_positive()
> try to handle it atomically.
>
Not really, the opposite is happening here. If the pp_recycle bit is set we
will always call page_pool_return_skb_page(). If the page signature matches
the 'magic' set by page pool we will always call xdp_return_skb_frame() will
end up calling __page_pool_put_page(). If the refcnt is 1 we'll try
to recycle the page. If it's not we'll release it from page_pool (releasing
some internal references we keep) unmap the buffer and decrement the refcnt.
[1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/
Cheers
/Ilias
(-cc invalid emails)
Replying to my self here but....
[...]
> > >
> > > We can't do that. The reason we need those structs is that we rely on the
> > > existing XDP code, which already recycles it's buffers, to enable
> > > recycling. Since we allocate a page per packet when using page_pool for a
> > > driver , the same ideas apply to an SKB and XDP frame. We just recycle the
> >
> > I am not really familar with XDP here, but a packet from hw is either a
> > "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack,
> > a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at
> > the same time, right?
> >
>
> Yes, but the payload is irrelevant in both cases and that's what we use
> page_pool for. You can't use this patchset unless your driver usues
> build_skb(). So in both cases you just allocate memory for the payload and
> decide what the wrap the buffer with (XDP or SKB) later.
>
> > What does not really make sense to me is that the page has to be from page
> > pool when a skb's frag page can be recycled, right? If it is ture, the switch
> > case in __xdp_return() does not really make sense for skb recycling, why go
> > all the trouble of checking the mem->type and mem->id to find the page_pool
> > pointer when recyclable page for skb can only be from page pool?
>
> In any case you need to find in which pool the buffer you try to recycle
> belongs. In order to make the whole idea generic and be able to recycle skb
> fragments instead of just the skb head you need to store some information on
> struct page. That's the fundamental difference of this patchset compared to
> the RFC we sent a few years back [1] which was just storing information on the
> skb. The way this is done on the current patchset is that we store the
> struct xdp_mem_info in page->private and then look it up on xdp_return().
>
> Now that being said Matthew recently reworked struct page, so we could see if
> we can store the page pool pointer directly instead of the struct
> xdp_mem_info. That would allow us to call into page pool functions directly.
> But we'll have to agree if that makes sense to go into struct page to begin
> with and make sure the pointer is still valid when we take the recycling path.
>
Thinking more about it the reason that prevented us from storing a
page pool pointer directly is not there anymore. Jesper fixed that
already a while back. So we might as well store the page_pool ptr in
page->private and call into the functions directly. I'll have a look
before v4.
[...]
Thanks
/Ilias
On Fri, 30 Apr 2021 20:32:07 +0300
Ilias Apalodimas <[email protected]> wrote:
> (-cc invalid emails)
> Replying to my self here but....
>
> [...]
> > > >
> > > > We can't do that. The reason we need those structs is that we rely on the
> > > > existing XDP code, which already recycles it's buffers, to enable
> > > > recycling. Since we allocate a page per packet when using page_pool for a
> > > > driver , the same ideas apply to an SKB and XDP frame. We just recycle the
> > >
> > > I am not really familar with XDP here, but a packet from hw is either a
> > > "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack,
> > > a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at
> > > the same time, right?
> > >
> >
> > Yes, but the payload is irrelevant in both cases and that's what we use
> > page_pool for. You can't use this patchset unless your driver usues
> > build_skb(). So in both cases you just allocate memory for the payload and
> > decide what the wrap the buffer with (XDP or SKB) later.
> >
> > > What does not really make sense to me is that the page has to be from page
> > > pool when a skb's frag page can be recycled, right? If it is ture, the switch
> > > case in __xdp_return() does not really make sense for skb recycling, why go
> > > all the trouble of checking the mem->type and mem->id to find the page_pool
> > > pointer when recyclable page for skb can only be from page pool?
> >
> > In any case you need to find in which pool the buffer you try to recycle
> > belongs. In order to make the whole idea generic and be able to recycle skb
> > fragments instead of just the skb head you need to store some information on
> > struct page. That's the fundamental difference of this patchset compared to
> > the RFC we sent a few years back [1] which was just storing information on the
> > skb. The way this is done on the current patchset is that we store the
> > struct xdp_mem_info in page->private and then look it up on xdp_return().
> >
> > Now that being said Matthew recently reworked struct page, so we could see if
> > we can store the page pool pointer directly instead of the struct
> > xdp_mem_info. That would allow us to call into page pool functions directly.
> > But we'll have to agree if that makes sense to go into struct page to begin
> > with and make sure the pointer is still valid when we take the recycling path.
> >
>
> Thinking more about it the reason that prevented us from storing a
> page pool pointer directly is not there anymore. Jesper fixed that
> already a while back. So we might as well store the page_pool ptr in
> page->private and call into the functions directly. I'll have a look
> before v4.
I want to give credit to Jonathan Lemon whom came up with the idea of
storing the page_pool object that "owns" the page directly in struct
page. I see this as an optimization that we can add later, so it
doesn't block this patchset. As Ilias mention, it required some
work/changes[1]+[2] to guarantee that the page_pool object life-time
were longer than all the outstanding in-flight page-objects, but that
have been stable for some/many kernel releases now. This is already
need/used for making sure the DMA-mappings can be safely released[1],
but I on-purpose enabled the same in-flight tracking for page_pool
users that doesn't use the DMA-mapping feature (making sure the code is
exercised).
[1] 99c07c43c4ea ("xdp: tracking page_pool resources and safe removal")
[2] c3f812cea0d7 ("page_pool: do not release pool until inflight == 0.")
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
On Thu, May 06, 2021 at 08:34:48PM +0800, Yunsheng Lin wrote:
> On 2021/5/1 0:24, Ilias Apalodimas wrote:
> > [...]
> >>>>
> >>>> 1. skb frag page recycling do not need "struct xdp_rxq_info" or
> >>>> "struct xdp_mem_info" to bond the relation between "struct page" and
> >>>> "struct page_pool", which seems uncessary at this point if bonding
> >>>> a "struct page_pool" pointer directly in "struct page" does not cause
> >>>> space increasing.
> >>>
> >>> We can't do that. The reason we need those structs is that we rely on the
> >>> existing XDP code, which already recycles it's buffers, to enable
> >>> recycling. Since we allocate a page per packet when using page_pool for a
> >>> driver , the same ideas apply to an SKB and XDP frame. We just recycle the
> >>
> >> I am not really familar with XDP here, but a packet from hw is either a
> >> "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack,
> >> a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at
> >> the same time, right?
> >>
> >
> > Yes, but the payload is irrelevant in both cases and that's what we use
> > page_pool for. You can't use this patchset unless your driver usues
> > build_skb(). So in both cases you just allocate memory for the payload and
>
> I am not sure I understood why build_skb() matters here. If the head data of
> a skb is a page frag and is from page pool, then it's page->signature should be
> PP_SIGNATURE, otherwise it's page->signature is zero, so a recyclable skb does
> not require it's head data being from a page pool, right?
>
Correct, and that's the big improvement compared to the original RFC.
The wording was a bit off in my initial response. I was trying to point out
you can recycle *any* buffer coming from page_pool and one of the ways you can
do that in your driver, is use build_skb() while the payload is allocated by
page_pool.
> > decide what the wrap the buffer with (XDP or SKB) later.
>
> [...]
>
> >>
> >> I am not sure I understand what you meant by "free the skb", does it mean
> >> that kfree_skb() is called to free the skb.
> >
> > Yes
> >
> >>
> >> As my understanding, if the skb completely own the page(which means page_count()
> >> == 1) when kfree_skb() is called, __page_pool_put_page() is called, otherwise
> >> page_ref_dec() is called, which is exactly what page_pool_atomic_sub_if_positive()
> >> try to handle it atomically.
> >>
> >
> > Not really, the opposite is happening here. If the pp_recycle bit is set we
> > will always call page_pool_return_skb_page(). If the page signature matches
> > the 'magic' set by page pool we will always call xdp_return_skb_frame() will
> > end up calling __page_pool_put_page(). If the refcnt is 1 we'll try
> > to recycle the page. If it's not we'll release it from page_pool (releasing
> > some internal references we keep) unmap the buffer and decrement the refcnt.
>
> Yes, I understood the above is what the page pool do now.
>
> But the question is who is still holding an extral reference to the page when
> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral
> reference to the same page? So why not just do a page_ref_dec() if the orginal skb
> is freed first, and call __page_pool_put_page() when the cloned skb is freed later?
> So that we can always reuse the recyclable page from a recyclable skb. This may
> make the page_pool_destroy() process delays longer than before, I am supposed the
> page_pool_destroy() delaying for cloned skb case does not really matters here.
>
> If the above works, I think the samiliar handling can be added to RX zerocopy if
> the RX zerocopy also hold extral references to the recyclable page from a recyclable
> skb too?
>
Right, this sounds doable, but I'll have to go back code it and see if it
really makes sense. However I'd still prefer the support to go in as-is
(including the struct xdp_mem_info in struct page, instead of a page_pool
pointer).
There's a couple of reasons for that. If we keep the struct xdp_mem_info we
can in the future recycle different kind of buffers using __xdp_return().
And this is a non intrusive change if we choose to store the page pool address
directly in the future. It just affects the internal contract between the
page_pool code and struct page. So it won't affect any drivers that already
use the feature.
Regarding the page_ref_dec(), which as I said sounds doable, I'd prefer
playing it safe for now and getting rid of the buffers that somehow ended up
holding an extra reference. Once this gets approved we can go back and try to
save the extra space. I hope I am not wrong but the changes required to
support a few extra refcounts should not change the current patches much.
Thanks for taking the time on this!
/Ilias
> >
> > [1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/
> >
> > Cheers
> > /Ilias
> >
> > .
> >
>
On 2021/5/1 0:24, Ilias Apalodimas wrote:
> [...]
>>>>
>>>> 1. skb frag page recycling do not need "struct xdp_rxq_info" or
>>>> "struct xdp_mem_info" to bond the relation between "struct page" and
>>>> "struct page_pool", which seems uncessary at this point if bonding
>>>> a "struct page_pool" pointer directly in "struct page" does not cause
>>>> space increasing.
>>>
>>> We can't do that. The reason we need those structs is that we rely on the
>>> existing XDP code, which already recycles it's buffers, to enable
>>> recycling. Since we allocate a page per packet when using page_pool for a
>>> driver , the same ideas apply to an SKB and XDP frame. We just recycle the
>>
>> I am not really familar with XDP here, but a packet from hw is either a
>> "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack,
>> a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at
>> the same time, right?
>>
>
> Yes, but the payload is irrelevant in both cases and that's what we use
> page_pool for. You can't use this patchset unless your driver usues
> build_skb(). So in both cases you just allocate memory for the payload and
I am not sure I understood why build_skb() matters here. If the head data of
a skb is a page frag and is from page pool, then it's page->signature should be
PP_SIGNATURE, otherwise it's page->signature is zero, so a recyclable skb does
not require it's head data being from a page pool, right?
> decide what the wrap the buffer with (XDP or SKB) later.
[...]
>>
>> I am not sure I understand what you meant by "free the skb", does it mean
>> that kfree_skb() is called to free the skb.
>
> Yes
>
>>
>> As my understanding, if the skb completely own the page(which means page_count()
>> == 1) when kfree_skb() is called, __page_pool_put_page() is called, otherwise
>> page_ref_dec() is called, which is exactly what page_pool_atomic_sub_if_positive()
>> try to handle it atomically.
>>
>
> Not really, the opposite is happening here. If the pp_recycle bit is set we
> will always call page_pool_return_skb_page(). If the page signature matches
> the 'magic' set by page pool we will always call xdp_return_skb_frame() will
> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try
> to recycle the page. If it's not we'll release it from page_pool (releasing
> some internal references we keep) unmap the buffer and decrement the refcnt.
Yes, I understood the above is what the page pool do now.
But the question is who is still holding an extral reference to the page when
kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral
reference to the same page? So why not just do a page_ref_dec() if the orginal skb
is freed first, and call __page_pool_put_page() when the cloned skb is freed later?
So that we can always reuse the recyclable page from a recyclable skb. This may
make the page_pool_destroy() process delays longer than before, I am supposed the
page_pool_destroy() delaying for cloned skb case does not really matters here.
If the above works, I think the samiliar handling can be added to RX zerocopy if
the RX zerocopy also hold extral references to the recyclable page from a recyclable
skb too?
>
> [1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/
>
> Cheers
> /Ilias
>
> .
>
On 2021/5/6 20:58, Ilias Apalodimas wrote:
>>>>
>>>
>>> Not really, the opposite is happening here. If the pp_recycle bit is set we
>>> will always call page_pool_return_skb_page(). If the page signature matches
>>> the 'magic' set by page pool we will always call xdp_return_skb_frame() will
>>> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try
>>> to recycle the page. If it's not we'll release it from page_pool (releasing
>>> some internal references we keep) unmap the buffer and decrement the refcnt.
>>
>> Yes, I understood the above is what the page pool do now.
>>
>> But the question is who is still holding an extral reference to the page when
>> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral
>> reference to the same page? So why not just do a page_ref_dec() if the orginal skb
>> is freed first, and call __page_pool_put_page() when the cloned skb is freed later?
>> So that we can always reuse the recyclable page from a recyclable skb. This may
>> make the page_pool_destroy() process delays longer than before, I am supposed the
>> page_pool_destroy() delaying for cloned skb case does not really matters here.
>>
>> If the above works, I think the samiliar handling can be added to RX zerocopy if
>> the RX zerocopy also hold extral references to the recyclable page from a recyclable
>> skb too?
>>
>
> Right, this sounds doable, but I'll have to go back code it and see if it
> really makes sense. However I'd still prefer the support to go in as-is
> (including the struct xdp_mem_info in struct page, instead of a page_pool
> pointer).
>
> There's a couple of reasons for that. If we keep the struct xdp_mem_info we
> can in the future recycle different kind of buffers using __xdp_return().
> And this is a non intrusive change if we choose to store the page pool address
> directly in the future. It just affects the internal contract between the
> page_pool code and struct page. So it won't affect any drivers that already
> use the feature.
This patchset has embeded a signature field in "struct page", and xdp_mem_info
is stored in page_private(), which seems not considering the case for associating
the page pool with "struct page" directly yet? Is the page pool also stored in
page_private() and a different signature is used to indicate that?
I am not saying we have to do it in this patchset, but we have to consider it
while we are adding new signature field to "struct page", right?
> Regarding the page_ref_dec(), which as I said sounds doable, I'd prefer
> playing it safe for now and getting rid of the buffers that somehow ended up
> holding an extra reference. Once this gets approved we can go back and try to
> save the extra space. I hope I am not wrong but the changes required to
> support a few extra refcounts should not change the current patches much.
>
> Thanks for taking the time on this!
Thanks all invovled in the effort improving page pool too:)
> /Ilias
>
>>>
>>> [1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/
>>>
>>> Cheers
>>> /Ilias
>>>
>>> .
>>>
>>
>
> .
>
On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote:
> On 2021/5/6 20:58, Ilias Apalodimas wrote:
> >>>>
> >>>
> >>> Not really, the opposite is happening here. If the pp_recycle bit is set we
> >>> will always call page_pool_return_skb_page(). If the page signature matches
> >>> the 'magic' set by page pool we will always call xdp_return_skb_frame() will
> >>> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try
> >>> to recycle the page. If it's not we'll release it from page_pool (releasing
> >>> some internal references we keep) unmap the buffer and decrement the refcnt.
> >>
> >> Yes, I understood the above is what the page pool do now.
> >>
> >> But the question is who is still holding an extral reference to the page when
> >> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral
> >> reference to the same page? So why not just do a page_ref_dec() if the orginal skb
> >> is freed first, and call __page_pool_put_page() when the cloned skb is freed later?
> >> So that we can always reuse the recyclable page from a recyclable skb. This may
> >> make the page_pool_destroy() process delays longer than before, I am supposed the
> >> page_pool_destroy() delaying for cloned skb case does not really matters here.
> >>
> >> If the above works, I think the samiliar handling can be added to RX zerocopy if
> >> the RX zerocopy also hold extral references to the recyclable page from a recyclable
> >> skb too?
> >>
> >
> > Right, this sounds doable, but I'll have to go back code it and see if it
> > really makes sense. However I'd still prefer the support to go in as-is
> > (including the struct xdp_mem_info in struct page, instead of a page_pool
> > pointer).
> >
> > There's a couple of reasons for that. If we keep the struct xdp_mem_info we
> > can in the future recycle different kind of buffers using __xdp_return().
> > And this is a non intrusive change if we choose to store the page pool address
> > directly in the future. It just affects the internal contract between the
> > page_pool code and struct page. So it won't affect any drivers that already
> > use the feature.
>
> This patchset has embeded a signature field in "struct page", and xdp_mem_info
> is stored in page_private(), which seems not considering the case for associating
> the page pool with "struct page" directly yet?
Correct
> Is the page pool also stored in
> page_private() and a different signature is used to indicate that?
No only struct xdp_mem_info as you mentioned before
>
> I am not saying we have to do it in this patchset, but we have to consider it
> while we are adding new signature field to "struct page", right?
We won't need a new signature. The signature in both cases is there to
guarantee the page you are trying to recycle was indeed allocated by page_pool.
Basically we got two design choices here:
- We store the page_pool ptr address directly in page->private and then,
we call into page_pool APIs directly to do the recycling.
That would eliminate the lookup through xdp_mem_info and the
XDP helpers to locate page pool pointer (through __xdp_return).
- You store the xdp_mem_info on page_private. In that case you need to go
through __xdp_return() to locate the page_pool pointer. Although we might
loose some performance that would allow us to recycle additional memory types
and not only MEM_TYPE_PAGE_POOL (in case we ever need it).
I think both choices are sane. What I am trying to explain here, is
regardless of what we choose now, we can change it in the future without
affecting the API consumers at all. What will change internally is the way we
lookup the page pool pointer we are trying to recycle.
[...]
Cheers
/Ilias
On 2021/5/7 15:06, Ilias Apalodimas wrote:
> On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote:
>> On 2021/5/6 20:58, Ilias Apalodimas wrote:
>>>>>>
>>>>>
>>>>> Not really, the opposite is happening here. If the pp_recycle bit is set we
>>>>> will always call page_pool_return_skb_page(). If the page signature matches
>>>>> the 'magic' set by page pool we will always call xdp_return_skb_frame() will
>>>>> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try
>>>>> to recycle the page. If it's not we'll release it from page_pool (releasing
>>>>> some internal references we keep) unmap the buffer and decrement the refcnt.
>>>>
>>>> Yes, I understood the above is what the page pool do now.
>>>>
>>>> But the question is who is still holding an extral reference to the page when
>>>> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral
>>>> reference to the same page? So why not just do a page_ref_dec() if the orginal skb
>>>> is freed first, and call __page_pool_put_page() when the cloned skb is freed later?
>>>> So that we can always reuse the recyclable page from a recyclable skb. This may
>>>> make the page_pool_destroy() process delays longer than before, I am supposed the
>>>> page_pool_destroy() delaying for cloned skb case does not really matters here.
>>>>
>>>> If the above works, I think the samiliar handling can be added to RX zerocopy if
>>>> the RX zerocopy also hold extral references to the recyclable page from a recyclable
>>>> skb too?
>>>>
>>>
>>> Right, this sounds doable, but I'll have to go back code it and see if it
>>> really makes sense. However I'd still prefer the support to go in as-is
>>> (including the struct xdp_mem_info in struct page, instead of a page_pool
>>> pointer).
>>>
>>> There's a couple of reasons for that. If we keep the struct xdp_mem_info we
>>> can in the future recycle different kind of buffers using __xdp_return().
>>> And this is a non intrusive change if we choose to store the page pool address
>>> directly in the future. It just affects the internal contract between the
>>> page_pool code and struct page. So it won't affect any drivers that already
>>> use the feature.
>>
>> This patchset has embeded a signature field in "struct page", and xdp_mem_info
>> is stored in page_private(), which seems not considering the case for associating
>> the page pool with "struct page" directly yet?
>
> Correct
>
>> Is the page pool also stored in
>> page_private() and a different signature is used to indicate that?
>
> No only struct xdp_mem_info as you mentioned before
>
>>
>> I am not saying we have to do it in this patchset, but we have to consider it
>> while we are adding new signature field to "struct page", right?
>
> We won't need a new signature. The signature in both cases is there to
> guarantee the page you are trying to recycle was indeed allocated by page_pool.
>
> Basically we got two design choices here:
> - We store the page_pool ptr address directly in page->private and then,
> we call into page_pool APIs directly to do the recycling.
> That would eliminate the lookup through xdp_mem_info and the
> XDP helpers to locate page pool pointer (through __xdp_return).
> - You store the xdp_mem_info on page_private. In that case you need to go
> through __xdp_return() to locate the page_pool pointer. Although we might
> loose some performance that would allow us to recycle additional memory types
> and not only MEM_TYPE_PAGE_POOL (in case we ever need it).
So the signature field in "struct page" is used to only indicate a page is
from a page pool, then how do we tell the content of page_private() if both of
the above choices are needed, we might still need an extra indicator to tell
page_private() is page_pool ptr or xdp_mem_info.
It seems storing the page pool ptr in page_private() is clear for recyclable
page from a recyclable skb use case, and the use case for storing xdp_mem_info
in page_private() is unclear yet? As XDP seems to have the xdp_mem_info in the
"struct xdp_frame", so it does not need the xdp_mem_info from page_private().
If the above is true, what does not really makes sense to me here is that:
why do we first implement a unclear use case for storing xdp_mem_info in
page_private(), why not implement the clear use case for storing page pool ptr
in page_private() first?
>
>
> I think both choices are sane. What I am trying to explain here, is
> regardless of what we choose now, we can change it in the future without
> affecting the API consumers at all. What will change internally is the way we
> lookup the page pool pointer we are trying to recycle.
It seems the below API need changing?
+static inline void skb_mark_for_recycle(struct sk_buff *skb, struct page *page,
+ struct xdp_mem_info *mem)
>
> [...]
>
>
> Cheers
> /Ilias
>
> .
>
On Fri, 7 May 2021 16:28:30 +0800
Yunsheng Lin <[email protected]> wrote:
> On 2021/5/7 15:06, Ilias Apalodimas wrote:
> > On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote:
> >> On 2021/5/6 20:58, Ilias Apalodimas wrote:
> >>>>>>
> >>>>>
> >>>>> Not really, the opposite is happening here. If the pp_recycle bit is set we
> >>>>> will always call page_pool_return_skb_page(). If the page signature matches
> >>>>> the 'magic' set by page pool we will always call xdp_return_skb_frame() will
> >>>>> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try
> >>>>> to recycle the page. If it's not we'll release it from page_pool (releasing
> >>>>> some internal references we keep) unmap the buffer and decrement the refcnt.
> >>>>
> >>>> Yes, I understood the above is what the page pool do now.
> >>>>
> >>>> But the question is who is still holding an extral reference to the page when
> >>>> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral
> >>>> reference to the same page? So why not just do a page_ref_dec() if the orginal skb
> >>>> is freed first, and call __page_pool_put_page() when the cloned skb is freed later?
> >>>> So that we can always reuse the recyclable page from a recyclable skb. This may
> >>>> make the page_pool_destroy() process delays longer than before, I am supposed the
> >>>> page_pool_destroy() delaying for cloned skb case does not really matters here.
> >>>>
> >>>> If the above works, I think the samiliar handling can be added to RX zerocopy if
> >>>> the RX zerocopy also hold extral references to the recyclable page from a recyclable
> >>>> skb too?
> >>>>
> >>>
> >>> Right, this sounds doable, but I'll have to go back code it and see if it
> >>> really makes sense. However I'd still prefer the support to go in as-is
> >>> (including the struct xdp_mem_info in struct page, instead of a page_pool
> >>> pointer).
> >>>
> >>> There's a couple of reasons for that. If we keep the struct xdp_mem_info we
> >>> can in the future recycle different kind of buffers using __xdp_return().
> >>> And this is a non intrusive change if we choose to store the page pool address
> >>> directly in the future. It just affects the internal contract between the
> >>> page_pool code and struct page. So it won't affect any drivers that already
> >>> use the feature.
> >>
> >> This patchset has embeded a signature field in "struct page", and xdp_mem_info
> >> is stored in page_private(), which seems not considering the case for associating
> >> the page pool with "struct page" directly yet?
> >
> > Correct
> >
> >> Is the page pool also stored in
> >> page_private() and a different signature is used to indicate that?
> >
> > No only struct xdp_mem_info as you mentioned before
> >
> >>
> >> I am not saying we have to do it in this patchset, but we have to consider it
> >> while we are adding new signature field to "struct page", right?
> >
> > We won't need a new signature. The signature in both cases is there to
> > guarantee the page you are trying to recycle was indeed allocated by page_pool.
> >
> > Basically we got two design choices here:
> > - We store the page_pool ptr address directly in page->private and then,
> > we call into page_pool APIs directly to do the recycling.
> > That would eliminate the lookup through xdp_mem_info and the
> > XDP helpers to locate page pool pointer (through __xdp_return).
> > - You store the xdp_mem_info on page_private. In that case you need to go
> > through __xdp_return() to locate the page_pool pointer. Although we might
> > loose some performance that would allow us to recycle additional memory types
> > and not only MEM_TYPE_PAGE_POOL (in case we ever need it).
>
> So the signature field in "struct page" is used to only indicate a page is
> from a page pool, then how do we tell the content of page_private() if both of
> the above choices are needed, we might still need an extra indicator to tell
> page_private() is page_pool ptr or xdp_mem_info.
The signature field in "struct page" and "xdp_mem_info" is a double
construct that was introduced in this patchset. AFAIK Matteo took the
idea from Jonathan's patchset. I'm not convinced we need both, maybe
later we do (when someone introduce a new mem model ala NetGPU).
I think Jonathan's use-case was NetGPU[1] (which got shutdown due to
Nvidia[2] being involved which I think was unfair). The general idea
behind NetGPU makes sense to me, to allow packet headers to live in
first page, and second page belongs to hardware. This implies that an
SKB can can point to two different pages with different memory types,
which need to be handled correctly when freeing the SKB and the pages it
points to. Thus, placing (xdp_)mem_info in SKB is wrong as it implies
all pages belong the same mem_info.type.
The point is, when designing this I want us to think about how our
design can handle other memory models than just page_pool.
In this patchset design, we use a single bit in SKB to indicate that
the pages pointed comes from another memory model, in this case
page_pool is the only user of this bit. The remaining info about the
memory model (page_pool) is stored in struct-page, which we look at
when freeing the pages that the SKB points to (that is at the layer
above the MM-calls that would free the page for real).
[1] https://linuxplumbersconf.org/event/7/contributions/670/
[2] https://lwn.net/Articles/827596/
> It seems storing the page pool ptr in page_private() is clear for recyclable
> page from a recyclable skb use case, and the use case for storing xdp_mem_info
> in page_private() is unclear yet? As XDP seems to have the xdp_mem_info in the
> "struct xdp_frame", so it does not need the xdp_mem_info from page_private().
>
> If the above is true, what does not really makes sense to me here is that:
> why do we first implement a unclear use case for storing xdp_mem_info in
> page_private(), why not implement the clear use case for storing page pool ptr
> in page_private() first?
I'm actually not against storing page_pool object ptr directly in
struct-page. It is a nice optimization. Perhaps we should implement
this optimization outside this patchset first, and let __xdp_return()
for XDP-redirected packets also take advantage to this optimization?
Then it would feel more natural to also used this optimization in this
patchset, right?
> >
> >
> > I think both choices are sane. What I am trying to explain here, is
> > regardless of what we choose now, we can change it in the future without
> > affecting the API consumers at all. What will change internally is the way we
> > lookup the page pool pointer we are trying to recycle.
>
> It seems the below API need changing?
> +static inline void skb_mark_for_recycle(struct sk_buff *skb, struct page *page,
> + struct xdp_mem_info *mem)
I don't think we need to change this API, to support future memory
models. Notice that xdp_mem_info have a 'type' member.
Naming in Computer Science is a hard problem ;-). Something that seems
to confuse a lot of people is the naming of the struct "xdp_mem_info".
Maybe we should have named it "mem_info" instead or "net_mem_info", as
it doesn't indicate that the device is running XDP.
I see XDP as the RX-layer before the network stack, that helps drivers
to support different memory models, also for handling normal packets
that doesn't get process by XDP, and the drivers doesn't even need to
support XDP to use the "xdp_mem_info" type.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
struct xdp_mem_info {
u32 type; /* enum xdp_mem_type, but known size type */
u32 id;
};
enum xdp_mem_type {
MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
MEM_TYPE_PAGE_ORDER0, /* Orig XDP full page model */
MEM_TYPE_PAGE_POOL,
MEM_TYPE_XSK_BUFF_POOL,
MEM_TYPE_MAX,
};
On Fri, May 07, 2021 at 12:19:53PM +0200, Jesper Dangaard Brouer wrote:
> Nvidia[2] being involved which I think was unfair). The general idea
> behind NetGPU makes sense to me,
Sorry, but that is utter bullshit. It was rejected because it did
nothing but injecting hooks for an out of tree module while ignoring the
existing kernel infrastructure for much of what it tries to archive.
Jesper Dangaard Brouer <[email protected]> writes:
> On Fri, 7 May 2021 16:28:30 +0800
> Yunsheng Lin <[email protected]> wrote:
>
>> On 2021/5/7 15:06, Ilias Apalodimas wrote:
>> > On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote:
>> >> On 2021/5/6 20:58, Ilias Apalodimas wrote:
>> >>>>>>
>> >>>>>
> ...
>> >
>> >
>> > I think both choices are sane. What I am trying to explain
>> > here, is
>> > regardless of what we choose now, we can change it in the
>> > future without
>> > affecting the API consumers at all. What will change
>> > internally is the way we
>> > lookup the page pool pointer we are trying to recycle.
>>
>> It seems the below API need changing?
>> +static inline void skb_mark_for_recycle(struct sk_buff *skb,
>> struct page *page,
>> + struct xdp_mem_info *mem)
>
> I don't think we need to change this API, to support future
> memory
> models. Notice that xdp_mem_info have a 'type' member.
Hi,
Providing that we will (possibly as a future optimization) store
the pointer to the page pool in struct page instead of strcut
xdp_mem_info, passing
xdp_mem_info * instead of struct page_pool * would mean that for
every packet we'll need to call
xa = rhashtable_lookup(mem_id_ht, &mem->id,
mem_id_rht_params);
xa->page_pool;
which might pressure the Dcache to fetch a pointer that might be
present already in cache as part of driver's data-structures.
I tend to agree with Yunsheng that it makes more sense to adjust
the API for the clear use-case now rather than using xdp_mem_info
indirection. It seems to me like
the page signature provides the same information anyway and allows
to support different memory types.
Shay
>
> Naming in Computer Science is a hard problem ;-). Something that
> seems
> to confuse a lot of people is the naming of the struct
> "xdp_mem_info".
> Maybe we should have named it "mem_info" instead or
> "net_mem_info", as
> it doesn't indicate that the device is running XDP.
>
> I see XDP as the RX-layer before the network stack, that helps
> drivers
> to support different memory models, also for handling normal
> packets
> that doesn't get process by XDP, and the drivers doesn't even
> need to
> support XDP to use the "xdp_mem_info" type.
On 2021/5/7 18:19, Jesper Dangaard Brouer wrote:
> On Fri, 7 May 2021 16:28:30 +0800
> Yunsheng Lin <[email protected]> wrote:
>
[...]
>
> I'm actually not against storing page_pool object ptr directly in
> struct-page. It is a nice optimization. Perhaps we should implement
> this optimization outside this patchset first, and let __xdp_return()
> for XDP-redirected packets also take advantage to this optimization?
>
> Then it would feel more natural to also used this optimization in this
> patchset, right?
Yes, it would be good if XDP can take advantage of this optimization too.
Then it seems we can remove the "mem_id_ht"? if we want to support different
type of page pool in the future, the type field is in the page pool too
when page_pool object ptr directly in struct-page.
>
Hi Shay,
On Sun, May 09, 2021 at 08:11:35AM +0300, Shay Agroskin wrote:
>
> Jesper Dangaard Brouer <[email protected]> writes:
>
> > On Fri, 7 May 2021 16:28:30 +0800
> > Yunsheng Lin <[email protected]> wrote:
> >
> > > On 2021/5/7 15:06, Ilias Apalodimas wrote:
> > > > On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote: >>
> > > On 2021/5/6 20:58, Ilias Apalodimas wrote: >>>>>> >>>>>
> > ...
> > > > > > I think both choices are sane. What I am trying to explain >
> > > here, is
> > > > regardless of what we choose now, we can change it in the > future
> > > without
> > > > affecting the API consumers at all. What will change > internally
> > > is the way we
> > > > lookup the page pool pointer we are trying to recycle.
> > >
> > > It seems the below API need changing?
> > > +static inline void skb_mark_for_recycle(struct sk_buff *skb, struct
> > > page *page,
> > > + struct xdp_mem_info *mem)
> >
> > I don't think we need to change this API, to support future memory
> > models. Notice that xdp_mem_info have a 'type' member.
>
> Hi,
> Providing that we will (possibly as a future optimization) store the pointer
> to the page pool in struct page instead of strcut xdp_mem_info, passing
> xdp_mem_info * instead of struct page_pool * would mean that for every
> packet we'll need to call
> xa = rhashtable_lookup(mem_id_ht, &mem->id,
> mem_id_rht_params);
> xa->page_pool;
>
> which might pressure the Dcache to fetch a pointer that might be present
> already in cache as part of driver's data-structures.
>
> I tend to agree with Yunsheng that it makes more sense to adjust the API for
> the clear use-case now rather than using xdp_mem_info indirection. It seems
> to me like
> the page signature provides the same information anyway and allows to
> support different memory types.
We've switched the patches already. We didn't notice any performance boost
by doing so (tested on a machiattobin), but I agree as well. As I
explained the only thing that will change if we ever the need the struct
xdp_mem_info in struct page is the internal contract between struct page
and the recycling function, so let's start clean and see if we ever need
that.
Cheers
/Ilias
>
> Shay
>
> >
> > Naming in Computer Science is a hard problem ;-). Something that seems
> > to confuse a lot of people is the naming of the struct "xdp_mem_info".
> > Maybe we should have named it "mem_info" instead or "net_mem_info", as
> > it doesn't indicate that the device is running XDP.
> >
> > I see XDP as the RX-layer before the network stack, that helps drivers
> > to support different memory models, also for handling normal packets
> > that doesn't get process by XDP, and the drivers doesn't even need to
> > support XDP to use the "xdp_mem_info" type.
>