2023-08-04 18:30:34

by Alexander Lobakin

Subject: [PATCH net-next v4 0/6] page_pool: a couple of assorted optimizations

This series was initially a spin-off of the IAVF PP series[0], but has
grown (and shrunk) quite a bit since then. In fact, it consists of three
semi-independent blocks:

* #1-2: Compile-time optimization. Split page_pool.h into two headers so
that consumers which don't need the complex inline helpers aren't
bloated, then stop including it in skbuff.h at all (see the include
sketch right after this list). The first patch is also a prerequisite
for the whole series.
* #3: Improve cacheline locality for users of the Page Pool frag API.
* #4-6: Use direct cache recycling more aggressively when it is obviously
safe. In addition, make sure nobody tries to use the Page Pool API with
interrupts disabled.
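
A rough sketch of the include convention the split aims for (the foo_*
driver file, struct and function names below are purely illustrative):

/* drivers/net/ethernet/foo/foo.h -- a header usually only needs the
 * types and external declarations, so it includes the lightweight header:
 */
#include <net/page_pool/types.h>

struct foo_rx_ring {
        struct page_pool *pp;   /* pointer only, no inline helpers used */
};

/* drivers/net/ethernet/foo/foo_rx.c -- a C source using the inline
 * fast-path wrappers includes the full helpers header:
 */
#include <net/page_pool/helpers.h>

static struct page *foo_alloc_rx_page(struct foo_rx_ring *ring)
{
        return page_pool_dev_alloc_pages(ring->pp);
}

/* After patch #2, <linux/skbuff.h> includes neither of them. */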

Patches #1 and #5 are authored by Yunsheng and Jakub respectively, with
small modifications from my side as per ML discussions.
For the perf numbers for #3-6, please see individual commit messages.

Also available on my GH with many more Page Pool goodies[1].

[0] https://lore.kernel.org/netdev/[email protected]
[1] https://github.com/alobakin/linux/commits/iavf-pp-frag

Alexander Lobakin (4):
net: skbuff: don't include <net/page_pool/types.h> to <linux/skbuff.h>
page_pool: place frag_* fields in one cacheline
net: skbuff: avoid accessing page_pool if !napi_safe when returning
page
net: skbuff: always try to recycle PP pages directly when in softirq

Jakub Kicinski (1):
page_pool: add a lockdep check for recycling in hardirq

Yunsheng Lin (1):
page_pool: split types and declarations from page_pool.h

MAINTAINERS | 2 +-
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 2 +-
drivers/net/ethernet/engleder/tsnep_main.c | 1 +
drivers/net/ethernet/freescale/fec_main.c | 1 +
.../net/ethernet/hisilicon/hns3/hns3_enet.c | 1 +
.../net/ethernet/hisilicon/hns3/hns3_enet.h | 2 +-
drivers/net/ethernet/marvell/mvneta.c | 2 +-
drivers/net/ethernet/marvell/mvpp2/mvpp2.h | 2 +-
.../net/ethernet/marvell/mvpp2/mvpp2_main.c | 1 +
.../marvell/octeontx2/nic/otx2_common.c | 1 +
.../ethernet/marvell/octeontx2/nic/otx2_pf.c | 1 +
drivers/net/ethernet/mediatek/mtk_eth_soc.c | 1 +
drivers/net/ethernet/mediatek/mtk_eth_soc.h | 2 +-
.../ethernet/mellanox/mlx5/core/en/params.c | 1 +
.../net/ethernet/mellanox/mlx5/core/en/trap.c | 1 -
.../net/ethernet/mellanox/mlx5/core/en/xdp.c | 1 +
.../net/ethernet/mellanox/mlx5/core/en_main.c | 2 +-
.../net/ethernet/mellanox/mlx5/core/en_rx.c | 2 +-
.../ethernet/mellanox/mlx5/core/en_stats.c | 2 +-
.../ethernet/microchip/lan966x/lan966x_fdma.c | 1 +
.../ethernet/microchip/lan966x/lan966x_main.h | 2 +-
drivers/net/ethernet/socionext/netsec.c | 2 +-
drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 +-
.../net/ethernet/stmicro/stmmac/stmmac_main.c | 1 +
drivers/net/ethernet/ti/cpsw.c | 2 +-
drivers/net/ethernet/ti/cpsw_new.c | 2 +-
drivers/net/ethernet/ti/cpsw_priv.c | 2 +-
drivers/net/ethernet/wangxun/libwx/wx_lib.c | 2 +-
drivers/net/veth.c | 2 +-
drivers/net/wireless/mediatek/mt76/mac80211.c | 1 -
drivers/net/wireless/mediatek/mt76/mt76.h | 1 +
drivers/net/xen-netfront.c | 2 +-
include/linux/lockdep.h | 7 +
include/linux/skbuff.h | 5 +-
.../net/{page_pool.h => page_pool/helpers.h} | 242 +-----------------
include/net/page_pool/types.h | 236 +++++++++++++++++
include/trace/events/page_pool.h | 2 +-
net/bpf/test_run.c | 2 +-
net/core/page_pool.c | 43 +---
net/core/skbuff.c | 49 +++-
net/core/xdp.c | 2 +-
42 files changed, 335 insertions(+), 305 deletions(-)
rename include/net/{page_pool.h => page_pool/helpers.h} (51%)
create mode 100644 include/net/page_pool/types.h

---
From v3[2]:
* rebase again after the bpf-next merge ._.;
* #5: pick the Acked-by from Jesper.

From v2[3]:
* just rebase on top of Jakub's kdoc changes that landed last minute.

From v1[4]:
* move the "avoid calling no-op DMA sync ops" piece out of the series --
it will join some other series or turn into something else (Jakub et al.);
* #1: restore accidentally removed path in MAINTAINERS (Yunsheng);
* #1: prefer `include/net/page_pool/` over `include/net/page_pool/*` in
MAINTAINERS (Yunsheng);
* #2: rename page_pool_return_skb_page() to napi_pp_put_page() -- the old
name hasn't made much sense for quite some time already (Yunsheng);
* #2: guard napi_pp_put_page() with CONFIG_PAGE_POOL in skbuff.c (to not
compile it when there are no users) (Yunsheng).

From RFC v2[5]:
* drop the dependency on the hybrid allocation series (and thus the
"RFC" prefix) -- it wasn't a strict dep and it's not in the trees yet;
* add Yunsheng's [slightly reworked] patch which splits page_pool.h into
two headers -- merge conflict hell otherwise.
Also fix a typo while nobody looks (Simon);
* #3 (former #2): word the commitmsg a bit better, mention the main
reason for the change more clearly (Ilias);
* add Jakub's hardirq assertion as a prereq for the last patch;
* #9 (former #7): add a comment mentioning that the hardirq case is not
checked there since the assertion checks it later (yes, it is illegal to
use Page Pool with interrupts disabled or when in the TH) (Jakub).

From RFC v1[6]:
* #1: move the entire function to skbuff.c, don't try to split it (Alex);
* #2-4: new;
* #5: use internal flags field added in #4 and don't modify driver-defined
structure (Alex, Jakub);
* #6: new;
* drop "add new NAPI state" as a redundant complication;
* #7: replace the check for the new NAPI state with just in_softirq(); this
should be fine (Jakub).

[2] https://lore.kernel.org/netdev/[email protected]
[3] https://lore.kernel.org/netdev/[email protected]
[4] https://lore.kernel.org/netdev/[email protected]
[5] https://lore.kernel.org/netdev/[email protected]
[6] https://lore.kernel.org/netdev/[email protected]
--
2.41.0



2023-08-04 18:47:04

by Alexander Lobakin

Subject: [PATCH net-next v4 2/6] net: skbuff: don't include <net/page_pool/types.h> to <linux/skbuff.h>

Currently, touching <net/page_pool/types.h> triggers a rebuild of more
than half of the kernel. That's because it's included in
<linux/skbuff.h>. And each new include added to page_pool/types.h means
more [useless] data for the toolchain to process for each source file
from that pile.

In commit 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB
recycling"), Matteo included it to be able to call a couple of functions
defined there. Then, in commit 57f05bc2ab24 ("page_pool: keep pp info as
long as page pool owns the page"), one of the calls was removed, so only
one was left: the call to page_pool_return_skb_page() in
napi_frag_unref(). The function is external and doesn't have any
dependencies. Having the very niche page_pool/types.h included only for
that looks like overkill.

As %PP_SIGNATURE is not local to page_pool.c (it was only in the
early submissions), nothing ties this function to that file. Teleport
page_pool_return_skb_page() to skbuff.c, right next to its main consumer,
skb_pp_recycle(), and rename it to napi_pp_put_page(), as it doesn't
work with skbs at all and the former name told nothing. The #if guards
here only prevent it from being compiled into vmlinux when it's not
needed -- both call sites are already guarded.
Now, touching page_pool/types.h only triggers rebuilding of the drivers
using it and a couple of core networking files.

Suggested-by: Jakub Kicinski <[email protected]> # make skbuff.h less heavy
Suggested-by: Alexander Duyck <[email protected]> # move to skbuff.c
Signed-off-by: Alexander Lobakin <[email protected]>
---
include/linux/skbuff.h | 5 ++--
include/net/page_pool/types.h | 2 --
net/core/page_pool.c | 39 ------------------------------
net/core/skbuff.c | 45 +++++++++++++++++++++++++++++++++--
4 files changed, 46 insertions(+), 45 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 888e3d7e74c1..aa57e2eca33b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -32,7 +32,6 @@
#include <linux/if_packet.h>
#include <linux/llist.h>
#include <net/flow.h>
-#include <net/page_pool/types.h>
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
#include <linux/netfilter/nf_conntrack_common.h>
#endif
@@ -3421,13 +3420,15 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
__skb_frag_ref(&skb_shinfo(skb)->frags[f]);
}

+bool napi_pp_put_page(struct page *page, bool napi_safe);
+
static inline void
napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe)
{
struct page *page = skb_frag_page(frag);

#ifdef CONFIG_PAGE_POOL
- if (recycle && page_pool_return_skb_page(page, napi_safe))
+ if (recycle && napi_pp_put_page(page, napi_safe))
return;
#endif
put_page(page);
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 9ac39191bed7..fcb846523398 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -185,8 +185,6 @@ struct page_pool {
struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
struct page *page_pool_alloc_frag(struct page_pool *pool, unsigned int *offset,
unsigned int size, gfp_t gfp);
-bool page_pool_return_skb_page(struct page *page, bool napi_safe);
-
struct page_pool *page_pool_create(const struct page_pool_params *params);

struct xdp_mem_info;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index cd28c1f14002..03ad74d25959 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -935,42 +935,3 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
}
}
EXPORT_SYMBOL(page_pool_update_nid);
-
-bool page_pool_return_skb_page(struct page *page, bool napi_safe)
-{
- struct napi_struct *napi;
- struct page_pool *pp;
- bool allow_direct;
-
- page = compound_head(page);
-
- /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
- * in order to preserve any existing bits, such as bit 0 for the
- * head page of compound page and bit 1 for pfmemalloc page, so
- * mask those bits for freeing side when doing below checking,
- * and page_is_pfmemalloc() is checked in __page_pool_put_page()
- * to avoid recycling the pfmemalloc page.
- */
- if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
- return false;
-
- pp = page->pp;
-
- /* Allow direct recycle if we have reasons to believe that we are
- * in the same context as the consumer would run, so there's
- * no possible race.
- */
- napi = READ_ONCE(pp->p.napi);
- allow_direct = napi_safe && napi &&
- READ_ONCE(napi->list_owner) == smp_processor_id();
-
- /* Driver set this to memory recycling info. Reset it on recycle.
- * This will *not* work for NIC using a split-page memory model.
- * The page will be returned to the pool here regardless of the
- * 'flipped' fragment being in use or not.
- */
- page_pool_put_full_page(pp, page, allow_direct);
-
- return true;
-}
-EXPORT_SYMBOL(page_pool_return_skb_page);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d3bed964123c..acc5844a0de1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,7 +73,7 @@
#include <net/mpls.h>
#include <net/mptcp.h>
#include <net/mctp.h>
-#include <net/page_pool/types.h>
+#include <net/page_pool/helpers.h>
#include <net/dropreason.h>

#include <linux/uaccess.h>
@@ -879,11 +879,52 @@ static void skb_clone_fraglist(struct sk_buff *skb)
skb_get(list);
}

+#if IS_ENABLED(CONFIG_PAGE_POOL)
+bool napi_pp_put_page(struct page *page, bool napi_safe)
+{
+ struct napi_struct *napi;
+ struct page_pool *pp;
+ bool allow_direct;
+
+ page = compound_head(page);
+
+ /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
+ * in order to preserve any existing bits, such as bit 0 for the
+ * head page of compound page and bit 1 for pfmemalloc page, so
+ * mask those bits for freeing side when doing below checking,
+ * and page_is_pfmemalloc() is checked in __page_pool_put_page()
+ * to avoid recycling the pfmemalloc page.
+ */
+ if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
+ return false;
+
+ pp = page->pp;
+
+ /* Allow direct recycle if we have reasons to believe that we are
+ * in the same context as the consumer would run, so there's
+ * no possible race.
+ */
+ napi = READ_ONCE(pp->p.napi);
+ allow_direct = napi_safe && napi &&
+ READ_ONCE(napi->list_owner) == smp_processor_id();
+
+ /* Driver set this to memory recycling info. Reset it on recycle.
+ * This will *not* work for NIC using a split-page memory model.
+ * The page will be returned to the pool here regardless of the
+ * 'flipped' fragment being in use or not.
+ */
+ page_pool_put_full_page(pp, page, allow_direct);
+
+ return true;
+}
+EXPORT_SYMBOL(napi_pp_put_page);
+#endif
+
static bool skb_pp_recycle(struct sk_buff *skb, void *data, bool napi_safe)
{
if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
return false;
- return page_pool_return_skb_page(virt_to_page(data), napi_safe);
+ return napi_pp_put_page(virt_to_page(data), napi_safe);
}

static void skb_kfree_head(void *head, unsigned int end_offset)
--
2.41.0


2023-08-04 19:08:05

by Alexander Lobakin

Subject: [PATCH net-next v4 5/6] page_pool: add a lockdep check for recycling in hardirq

From: Jakub Kicinski <[email protected]>

Page pool use in hardirq is prohibited; add debug checks
to catch misuses. IIRC we previously discussed using
DEBUG_NET_WARN_ON_ONCE() for this, but there were concerns
that people may have DEBUG_NET enabled in perf testing.
I don't think anyone enables lockdep in perf testing,
so use lockdep to avoid pushback and arguing :)
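
To illustrate the kind of misuse the assertion catches, here is a
hypothetical driver snippet (the foo_* names are made up, not from this
series) that would now warn with lockdep enabled:

#include <linux/interrupt.h>
#include <net/page_pool/helpers.h>

static irqreturn_t foo_irq_handler(int irq, void *data)
{
        struct foo_rx_ring *ring = data;        /* made-up driver structure */
        struct page *page = foo_take_completed_page(ring); /* made-up helper */

        /* BUG: we are in hardirq context here, so this ends up in
         * __page_pool_put_page() and trips lockdep_assert_no_hardirq()
         * (it also fires if the caller merely has hardirqs disabled).
         * Recycling must instead be deferred to the NAPI poll (BH).
         */
        page_pool_put_full_page(ring->pp, page, false);

        return IRQ_HANDLED;
}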

Signed-off-by: Jakub Kicinski <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Alexander Lobakin <[email protected]>
---
include/linux/lockdep.h | 7 +++++++
net/core/page_pool.c | 2 ++
2 files changed, 9 insertions(+)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 310f85903c91..dc2844b071c2 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -625,6 +625,12 @@ do { \
WARN_ON_ONCE(__lockdep_enabled && !this_cpu_read(hardirq_context)); \
} while (0)

+#define lockdep_assert_no_hardirq() \
+do { \
+ WARN_ON_ONCE(__lockdep_enabled && (this_cpu_read(hardirq_context) || \
+ !this_cpu_read(hardirqs_enabled))); \
+} while (0)
+
#define lockdep_assert_preemption_enabled() \
do { \
WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \
@@ -659,6 +665,7 @@ do { \
# define lockdep_assert_irqs_enabled() do { } while (0)
# define lockdep_assert_irqs_disabled() do { } while (0)
# define lockdep_assert_in_irq() do { } while (0)
+# define lockdep_assert_no_hardirq() do { } while (0)

# define lockdep_assert_preemption_enabled() do { } while (0)
# define lockdep_assert_preemption_disabled() do { } while (0)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 03ad74d25959..77cb75e63aca 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -587,6 +587,8 @@ static __always_inline struct page *
__page_pool_put_page(struct page_pool *pool, struct page *page,
unsigned int dma_sync_size, bool allow_direct)
{
+ lockdep_assert_no_hardirq();
+
/* This allocator is optimized for the XDP mode that uses
* one-frame-per-page, but have fallbacks that act like the
* regular page allocator APIs.
--
2.41.0


2023-08-04 19:08:57

by Alexander Lobakin

Subject: [PATCH net-next v4 1/6] page_pool: split types and declarations from page_pool.h

From: Yunsheng Lin <[email protected]>

Split types and pure function declarations out of page_pool.h
and add them to page_pool/types.h, so that C sources can
include page_pool.h while headers should generally only include
page_pool/types.h, as suggested by Jakub.
Rename page_pool.h to page_pool/helpers.h to have both in
one place.

Signed-off-by: Yunsheng Lin <[email protected]>
Suggested-by: Jakub Kicinski <[email protected]>
Signed-off-by: Alexander Lobakin <[email protected]>
---
MAINTAINERS | 2 +-
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 2 +-
drivers/net/ethernet/engleder/tsnep_main.c | 1 +
drivers/net/ethernet/freescale/fec_main.c | 1 +
.../net/ethernet/hisilicon/hns3/hns3_enet.c | 1 +
.../net/ethernet/hisilicon/hns3/hns3_enet.h | 2 +-
drivers/net/ethernet/marvell/mvneta.c | 2 +-
drivers/net/ethernet/marvell/mvpp2/mvpp2.h | 2 +-
.../net/ethernet/marvell/mvpp2/mvpp2_main.c | 1 +
.../marvell/octeontx2/nic/otx2_common.c | 1 +
.../ethernet/marvell/octeontx2/nic/otx2_pf.c | 1 +
drivers/net/ethernet/mediatek/mtk_eth_soc.c | 1 +
drivers/net/ethernet/mediatek/mtk_eth_soc.h | 2 +-
.../ethernet/mellanox/mlx5/core/en/params.c | 1 +
.../net/ethernet/mellanox/mlx5/core/en/trap.c | 1 -
.../net/ethernet/mellanox/mlx5/core/en/xdp.c | 1 +
.../net/ethernet/mellanox/mlx5/core/en_main.c | 2 +-
.../net/ethernet/mellanox/mlx5/core/en_rx.c | 2 +-
.../ethernet/mellanox/mlx5/core/en_stats.c | 2 +-
.../ethernet/microchip/lan966x/lan966x_fdma.c | 1 +
.../ethernet/microchip/lan966x/lan966x_main.h | 2 +-
drivers/net/ethernet/socionext/netsec.c | 2 +-
drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 +-
.../net/ethernet/stmicro/stmmac/stmmac_main.c | 1 +
drivers/net/ethernet/ti/cpsw.c | 2 +-
drivers/net/ethernet/ti/cpsw_new.c | 2 +-
drivers/net/ethernet/ti/cpsw_priv.c | 2 +-
drivers/net/ethernet/wangxun/libwx/wx_lib.c | 2 +-
drivers/net/veth.c | 2 +-
drivers/net/wireless/mediatek/mt76/mac80211.c | 1 -
drivers/net/wireless/mediatek/mt76/mt76.h | 1 +
drivers/net/xen-netfront.c | 2 +-
include/linux/skbuff.h | 2 +-
.../net/{page_pool.h => page_pool/helpers.h} | 242 +-----------------
include/net/page_pool/types.h | 238 +++++++++++++++++
include/trace/events/page_pool.h | 2 +-
net/bpf/test_run.c | 2 +-
net/core/page_pool.c | 2 +-
net/core/skbuff.c | 2 +-
net/core/xdp.c | 2 +-
41 files changed, 280 insertions(+), 264 deletions(-)
rename include/net/{page_pool.h => page_pool/helpers.h} (51%)
create mode 100644 include/net/page_pool/types.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 5e2bb1059ab6..08bcf3a7c482 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16020,7 +16020,7 @@ M: Ilias Apalodimas <[email protected]>
L: [email protected]
S: Supported
F: Documentation/networking/page_pool.rst
-F: include/net/page_pool.h
+F: include/net/page_pool/
F: include/trace/events/page_pool.h
F: net/core/page_pool.c

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 6a643aae7802..eb168ca983b7 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -54,7 +54,7 @@
#include <net/pkt_cls.h>
#include <linux/hwmon.h>
#include <linux/hwmon-sysfs.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <linux/align.h>
#include <net/netdev_queues.h>

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index 2ce46d7affe4..96f5ca778c67 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -15,7 +15,7 @@
#include <linux/bpf.h>
#include <linux/bpf_trace.h>
#include <linux/filter.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include "bnxt_hsi.h"
#include "bnxt.h"
#include "bnxt_xdp.h"
diff --git a/drivers/net/ethernet/engleder/tsnep_main.c b/drivers/net/ethernet/engleder/tsnep_main.c
index 079f9f6ae21a..f61bd89734c5 100644
--- a/drivers/net/ethernet/engleder/tsnep_main.c
+++ b/drivers/net/ethernet/engleder/tsnep_main.c
@@ -28,6 +28,7 @@
#include <linux/iopoll.h>
#include <linux/bpf.h>
#include <linux/bpf_trace.h>
+#include <net/page_pool/helpers.h>
#include <net/xdp_sock_drv.h>

#define TSNEP_RX_OFFSET (max(NET_SKB_PAD, XDP_PACKET_HEADROOM) + NET_IP_ALIGN)
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 43f14cec91e9..3bd0bf03aedb 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -38,6 +38,7 @@
#include <linux/in.h>
#include <linux/ip.h>
#include <net/ip.h>
+#include <net/page_pool/helpers.h>
#include <net/selftests.h>
#include <net/tso.h>
#include <linux/tcp.h>
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 9f6890059666..e5e37a33fd81 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -18,6 +18,7 @@
#include <net/gre.h>
#include <net/gro.h>
#include <net/ip6_checksum.h>
+#include <net/page_pool/helpers.h>
#include <net/pkt_cls.h>
#include <net/pkt_sched.h>
#include <net/tcp.h>
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
index 88af34bbee34..acd756b0c7c9 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
@@ -6,7 +6,7 @@

#include <linux/dim.h>
#include <linux/if_vlan.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>
#include <asm/barrier.h>

#include "hnae3.h"
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index acf4f6ba73a6..d483b8c00ec0 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -37,7 +37,7 @@
#include <net/ip.h>
#include <net/ipv6.h>
#include <net/tso.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <net/pkt_sched.h>
#include <linux/bpf_trace.h>

diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
index 11e603686a27..e809f91c08fb 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
@@ -16,7 +16,7 @@
#include <linux/phy.h>
#include <linux/phylink.h>
#include <net/flow_offload.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>
#include <linux/bpf.h>
#include <net/xdp.h>

diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index 9e1b596c8f08..eb74ccddb440 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -35,6 +35,7 @@
#include <uapi/linux/ppp_defs.h>
#include <net/ip.h>
#include <net/ipv6.h>
+#include <net/page_pool/helpers.h>
#include <net/tso.h>
#include <linux/bpf_trace.h>

diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
index 8cdd92dd9762..8336cea16aff 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
@@ -7,6 +7,7 @@

#include <linux/interrupt.h>
#include <linux/pci.h>
+#include <net/page_pool/helpers.h>
#include <net/tso.h>
#include <linux/bitfield.h>

diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
index 61f62a6ec662..70b9065f7d10 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
@@ -16,6 +16,7 @@
#include <linux/bpf.h>
#include <linux/bpf_trace.h>
#include <linux/bitfield.h>
+#include <net/page_pool/types.h>

#include "otx2_reg.h"
#include "otx2_common.h"
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 1b89f800f6df..fe05c9020269 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -26,6 +26,7 @@
#include <linux/bitfield.h>
#include <net/dsa.h>
#include <net/dst_metadata.h>
+#include <net/page_pool/helpers.h>

#include "mtk_eth_soc.h"
#include "mtk_wed.h"
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index 80d17729e557..4a2470fbad2c 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -18,7 +18,7 @@
#include <linux/rhashtable.h>
#include <linux/dim.h>
#include <linux/bitfield.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>
#include <linux/bpf_trace.h>
#include "mtk_ppe.h"

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
index 5ce28ff7685f..e097f336e1c4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
@@ -6,6 +6,7 @@
#include "en/port.h"
#include "en_accel/en_accel.h"
#include "en_accel/ipsec.h"
+#include <net/page_pool/types.h>
#include <net/xdp_sock_drv.h>

static u8 mlx5e_mpwrq_min_page_shift(struct mlx5_core_dev *mdev)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/trap.c b/drivers/net/ethernet/mellanox/mlx5/core/en/trap.c
index 201ac7dd338f..698647cc8c0f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/trap.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/trap.c
@@ -1,7 +1,6 @@
// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
/* Copyright (c) 2020 Mellanox Technologies */

-#include <net/page_pool.h>
#include "en/txrx.h"
#include "en/params.h"
#include "en/trap.h"
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 40589cebb773..12f56d0db0af 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -35,6 +35,7 @@
#include "en/xdp.h"
#include "en/params.h"
#include <linux/bitfield.h>
+#include <net/page_pool/helpers.h>

int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk)
{
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 1c820119e438..c8ec6467d4d1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -38,7 +38,7 @@
#include <linux/debugfs.h>
#include <linux/if_bridge.h>
#include <linux/filter.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>
#include <net/pkt_sched.h>
#include <net/xdp_sock_drv.h>
#include "eswitch.h"
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index f7bb5f4aaaca..3fd11b0761e0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -36,7 +36,7 @@
#include <linux/bitmap.h>
#include <linux/filter.h>
#include <net/ip6_checksum.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <net/inet_ecn.h>
#include <net/gro.h>
#include <net/udp.h>
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 4d77055abd4b..07b84d668fcc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -38,7 +38,7 @@
#include "en/port.h"

#ifdef CONFIG_PAGE_POOL_STATS
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#endif

static unsigned int stats_grps_num(struct mlx5e_priv *priv)
diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c b/drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c
index bd72fbc2220f..3960534ac2ad 100644
--- a/drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c
+++ b/drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c
@@ -2,6 +2,7 @@

#include <linux/bpf.h>
#include <linux/filter.h>
+#include <net/page_pool/helpers.h>

#include "lan966x_main.h"

diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_main.h b/drivers/net/ethernet/microchip/lan966x/lan966x_main.h
index aebc9154693a..caa9e0533c96 100644
--- a/drivers/net/ethernet/microchip/lan966x/lan966x_main.h
+++ b/drivers/net/ethernet/microchip/lan966x/lan966x_main.h
@@ -10,7 +10,7 @@
#include <linux/phy.h>
#include <linux/phylink.h>
#include <linux/ptp_clock_kernel.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>
#include <net/pkt_cls.h>
#include <net/pkt_sched.h>
#include <net/switchdev.h>
diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
index 0dcd6a568b06..f358ea003193 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -15,7 +15,7 @@
#include <linux/bpf_trace.h>

#include <net/tcp.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <net/ip6_checksum.h>

#define NETSEC_REG_SOFT_RST 0x104
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index a6d034968654..3401e888a9f6 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -21,7 +21,7 @@
#include <linux/ptp_clock_kernel.h>
#include <linux/net_tstamp.h>
#include <linux/reset.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>
#include <net/xdp.h>
#include <uapi/linux/bpf.h>

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index e1f1c034d325..380ac6c7046f 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -39,6 +39,7 @@
#include <linux/phylink.h>
#include <linux/udp.h>
#include <linux/bpf_trace.h>
+#include <net/page_pool/helpers.h>
#include <net/pkt_cls.h>
#include <net/xdp_sock_drv.h>
#include "stmmac_ptp.h"
diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index f9cd566d1c9b..ca4d4548f85e 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -31,7 +31,7 @@
#include <linux/if_vlan.h>
#include <linux/kmemleak.h>
#include <linux/sys_soc.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <linux/bpf.h>
#include <linux/bpf_trace.h>

diff --git a/drivers/net/ethernet/ti/cpsw_new.c b/drivers/net/ethernet/ti/cpsw_new.c
index c61e4e44a78f..0e4f526b1753 100644
--- a/drivers/net/ethernet/ti/cpsw_new.c
+++ b/drivers/net/ethernet/ti/cpsw_new.c
@@ -30,7 +30,7 @@
#include <linux/sys_soc.h>

#include <net/switchdev.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <net/pkt_cls.h>
#include <net/devlink.h>

diff --git a/drivers/net/ethernet/ti/cpsw_priv.c b/drivers/net/ethernet/ti/cpsw_priv.c
index ae52cdbcf8cc..0ec85635dfd6 100644
--- a/drivers/net/ethernet/ti/cpsw_priv.c
+++ b/drivers/net/ethernet/ti/cpsw_priv.c
@@ -18,7 +18,7 @@
#include <linux/platform_device.h>
#include <linux/pm_runtime.h>
#include <linux/skbuff.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <net/pkt_cls.h>
#include <net/pkt_sched.h>

diff --git a/drivers/net/ethernet/wangxun/libwx/wx_lib.c b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
index 2c3f08be8c37..e04d4a5eed7b 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_lib.c
+++ b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
@@ -3,7 +3,7 @@

#include <linux/etherdevice.h>
#include <net/ip6_checksum.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <net/inet_ecn.h>
#include <linux/iopoll.h>
#include <linux/sctp.h>
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 614f3e3efab0..953f6d8f8db0 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -26,7 +26,7 @@
#include <linux/ptr_ring.h>
#include <linux/bpf_trace.h>
#include <linux/net_tstamp.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>

#define DRV_NAME "veth"
#define DRV_VERSION "1.0"
diff --git a/drivers/net/wireless/mediatek/mt76/mac80211.c b/drivers/net/wireless/mediatek/mt76/mac80211.c
index 467afef98ba2..c8f7f80746e8 100644
--- a/drivers/net/wireless/mediatek/mt76/mac80211.c
+++ b/drivers/net/wireless/mediatek/mt76/mac80211.c
@@ -4,7 +4,6 @@
*/
#include <linux/sched.h>
#include <linux/of.h>
-#include <net/page_pool.h>
#include "mt76.h"

#define CHAN2G(_idx, _freq) { \
diff --git a/drivers/net/wireless/mediatek/mt76/mt76.h b/drivers/net/wireless/mediatek/mt76/mt76.h
index 6b07b8fafec2..81192a07f17f 100644
--- a/drivers/net/wireless/mediatek/mt76/mt76.h
+++ b/drivers/net/wireless/mediatek/mt76/mt76.h
@@ -15,6 +15,7 @@
#include <linux/average.h>
#include <linux/soc/mediatek/mtk_wed.h>
#include <net/mac80211.h>
+#include <net/page_pool/helpers.h>
#include "util.h"
#include "testmode.h"

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 47d54d8ea59d..ad29f370034e 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -45,7 +45,7 @@
#include <linux/slab.h>
#include <net/ip.h>
#include <linux/bpf.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>
#include <linux/bpf_trace.h>

#include <xen/xen.h>
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 16a49ba534e4..888e3d7e74c1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -32,7 +32,7 @@
#include <linux/if_packet.h>
#include <linux/llist.h>
#include <net/flow.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
#include <linux/netfilter/nf_conntrack_common.h>
#endif
diff --git a/include/net/page_pool.h b/include/net/page_pool/helpers.h
similarity index 51%
rename from include/net/page_pool.h
rename to include/net/page_pool/helpers.h
index 73d4f786418d..78df91804c87 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool/helpers.h
@@ -1,6 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0
*
- * page_pool.h
+ * page_pool/helpers.h
* Author: Jesper Dangaard Brouer <[email protected]>
* Copyright (C) 2016 Red Hat, Inc.
*/
@@ -26,126 +26,12 @@
* will release the DMA mapping and in-flight state accounting. We
* hope to lift this requirement in the future.
*/
-#ifndef _NET_PAGE_POOL_H
-#define _NET_PAGE_POOL_H
+#ifndef _NET_PAGE_POOL_HELPERS_H
+#define _NET_PAGE_POOL_HELPERS_H

-#include <linux/mm.h> /* Needed by ptr_ring */
-#include <linux/ptr_ring.h>
-#include <linux/dma-direction.h>
-
-#define PP_FLAG_DMA_MAP BIT(0) /* Should page_pool do the DMA
- * map/unmap
- */
-#define PP_FLAG_DMA_SYNC_DEV BIT(1) /* If set all pages that the driver gets
- * from page_pool will be
- * DMA-synced-for-device according to
- * the length provided by the device
- * driver.
- * Please note DMA-sync-for-CPU is still
- * device driver responsibility
- */
-#define PP_FLAG_PAGE_FRAG BIT(2) /* for page frag feature */
-#define PP_FLAG_ALL (PP_FLAG_DMA_MAP |\
- PP_FLAG_DMA_SYNC_DEV |\
- PP_FLAG_PAGE_FRAG)
-
-/*
- * Fast allocation side cache array/stack
- *
- * The cache size and refill watermark is related to the network
- * use-case. The NAPI budget is 64 packets. After a NAPI poll the RX
- * ring is usually refilled and the max consumed elements will be 64,
- * thus a natural max size of objects needed in the cache.
- *
- * Keeping room for more objects, is due to XDP_DROP use-case. As
- * XDP_DROP allows the opportunity to recycle objects directly into
- * this array, as it shares the same softirq/NAPI protection. If
- * cache is already full (or partly full) then the XDP_DROP recycles
- * would have to take a slower code path.
- */
-#define PP_ALLOC_CACHE_SIZE 128
-#define PP_ALLOC_CACHE_REFILL 64
-struct pp_alloc_cache {
- u32 count;
- struct page *cache[PP_ALLOC_CACHE_SIZE];
-};
-
-/**
- * struct page_pool_params - page pool parameters
- * @flags: PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV, PP_FLAG_PAGE_FRAG
- * @order: 2^order pages on allocation
- * @pool_size: size of the ptr_ring
- * @nid: NUMA node id to allocate from pages from
- * @dev: device, for DMA pre-mapping purposes
- * @napi: NAPI which is the sole consumer of pages, otherwise NULL
- * @dma_dir: DMA mapping direction
- * @max_len: max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV
- * @offset: DMA sync address offset for PP_FLAG_DMA_SYNC_DEV
- */
-struct page_pool_params {
- unsigned int flags;
- unsigned int order;
- unsigned int pool_size;
- int nid;
- struct device *dev;
- struct napi_struct *napi;
- enum dma_data_direction dma_dir;
- unsigned int max_len;
- unsigned int offset;
-/* private: used by test code only */
- void (*init_callback)(struct page *page, void *arg);
- void *init_arg;
-};
+#include <net/page_pool/types.h>

#ifdef CONFIG_PAGE_POOL_STATS
-/**
- * struct page_pool_alloc_stats - allocation statistics
- * @fast: successful fast path allocations
- * @slow: slow path order-0 allocations
- * @slow_high_order: slow path high order allocations
- * @empty: ptr ring is empty, so a slow path allocation was forced
- * @refill: an allocation which triggered a refill of the cache
- * @waive: pages obtained from the ptr ring that cannot be added to
- * the cache due to a NUMA mismatch
- */
-struct page_pool_alloc_stats {
- u64 fast;
- u64 slow;
- u64 slow_high_order;
- u64 empty;
- u64 refill;
- u64 waive;
-};
-
-/**
- * struct page_pool_recycle_stats - recycling (freeing) statistics
- * @cached: recycling placed page in the page pool cache
- * @cache_full: page pool cache was full
- * @ring: page placed into the ptr ring
- * @ring_full: page released from page pool because the ptr ring was full
- * @released_refcnt: page released (and not recycled) because refcnt > 1
- */
-struct page_pool_recycle_stats {
- u64 cached;
- u64 cache_full;
- u64 ring;
- u64 ring_full;
- u64 released_refcnt;
-};
-
-/**
- * struct page_pool_stats - combined page pool use statistics
- * @alloc_stats: see struct page_pool_alloc_stats
- * @recycle_stats: see struct page_pool_recycle_stats
- *
- * Wrapper struct for combining page pool stats with different storage
- * requirements.
- */
-struct page_pool_stats {
- struct page_pool_alloc_stats alloc_stats;
- struct page_pool_recycle_stats recycle_stats;
-};
-
int page_pool_ethtool_stats_get_count(void);
u8 *page_pool_ethtool_stats_get_strings(u8 *data);
u64 *page_pool_ethtool_stats_get(u64 *data, void *stats);
@@ -158,7 +44,6 @@ u64 *page_pool_ethtool_stats_get(u64 *data, void *stats);
bool page_pool_get_stats(struct page_pool *pool,
struct page_pool_stats *stats);
#else
-
static inline int page_pool_ethtool_stats_get_count(void)
{
return 0;
@@ -173,72 +58,7 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
{
return data;
}
-
-#endif
-
-struct page_pool {
- struct page_pool_params p;
-
- struct delayed_work release_dw;
- void (*disconnect)(void *);
- unsigned long defer_start;
- unsigned long defer_warn;
-
- u32 pages_state_hold_cnt;
- unsigned int frag_offset;
- struct page *frag_page;
- long frag_users;
-
-#ifdef CONFIG_PAGE_POOL_STATS
- /* these stats are incremented while in softirq context */
- struct page_pool_alloc_stats alloc_stats;
#endif
- u32 xdp_mem_id;
-
- /*
- * Data structure for allocation side
- *
- * Drivers allocation side usually already perform some kind
- * of resource protection. Piggyback on this protection, and
- * require driver to protect allocation side.
- *
- * For NIC drivers this means, allocate a page_pool per
- * RX-queue. As the RX-queue is already protected by
- * Softirq/BH scheduling and napi_schedule. NAPI schedule
- * guarantee that a single napi_struct will only be scheduled
- * on a single CPU (see napi_schedule).
- */
- struct pp_alloc_cache alloc ____cacheline_aligned_in_smp;
-
- /* Data structure for storing recycled pages.
- *
- * Returning/freeing pages is more complicated synchronization
- * wise, because free's can happen on remote CPUs, with no
- * association with allocation resource.
- *
- * Use ptr_ring, as it separates consumer and producer
- * effeciently, it a way that doesn't bounce cache-lines.
- *
- * TODO: Implement bulk return pages into this structure.
- */
- struct ptr_ring ring;
-
-#ifdef CONFIG_PAGE_POOL_STATS
- /* recycle stats are per-cpu to avoid locking */
- struct page_pool_recycle_stats __percpu *recycle_stats;
-#endif
- atomic_t pages_state_release_cnt;
-
- /* A page_pool is strictly tied to a single RX-queue being
- * protected by NAPI, due to above pp_alloc_cache. This
- * refcnt serves purpose is to simplify drivers error handling.
- */
- refcount_t user_cnt;
-
- u64 destroy_cnt;
-};
-
-struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);

/**
* page_pool_dev_alloc_pages() - allocate a page.
@@ -253,9 +73,6 @@ static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
return page_pool_alloc_pages(pool, gfp);
}

-struct page *page_pool_alloc_frag(struct page_pool *pool, unsigned int *offset,
- unsigned int size, gfp_t gfp);
-
static inline struct page *page_pool_dev_alloc_frag(struct page_pool *pool,
unsigned int *offset,
unsigned int size)
@@ -278,44 +95,6 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
return pool->p.dma_dir;
}

-bool page_pool_return_skb_page(struct page *page, bool napi_safe);
-
-struct page_pool *page_pool_create(const struct page_pool_params *params);
-
-struct xdp_mem_info;
-
-#ifdef CONFIG_PAGE_POOL
-void page_pool_unlink_napi(struct page_pool *pool);
-void page_pool_destroy(struct page_pool *pool);
-void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *),
- struct xdp_mem_info *mem);
-void page_pool_put_page_bulk(struct page_pool *pool, void **data,
- int count);
-#else
-static inline void page_pool_unlink_napi(struct page_pool *pool)
-{
-}
-
-static inline void page_pool_destroy(struct page_pool *pool)
-{
-}
-
-static inline void page_pool_use_xdp_mem(struct page_pool *pool,
- void (*disconnect)(void *),
- struct xdp_mem_info *mem)
-{
-}
-
-static inline void page_pool_put_page_bulk(struct page_pool *pool, void **data,
- int count)
-{
-}
-#endif
-
-void page_pool_put_defragged_page(struct page_pool *pool, struct page *page,
- unsigned int dma_sync_size,
- bool allow_direct);
-
/* pp_frag_count represents the number of writers who can update the page
* either by updating skb->data or via DMA mappings for the device.
* We can't rely on the page refcnt for that as we don't know who might be
@@ -445,26 +224,15 @@ static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
page->dma_addr_upper = upper_32_bits(addr);
}

-static inline bool is_page_pool_compiled_in(void)
-{
-#ifdef CONFIG_PAGE_POOL
- return true;
-#else
- return false;
-#endif
-}
-
static inline bool page_pool_put(struct page_pool *pool)
{
return refcount_dec_and_test(&pool->user_cnt);
}

-/* Caller must provide appropriate safe context, e.g. NAPI. */
-void page_pool_update_nid(struct page_pool *pool, int new_nid);
static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid)
{
if (unlikely(pool->p.nid != new_nid))
page_pool_update_nid(pool, new_nid);
}

-#endif /* _NET_PAGE_POOL_H */
+#endif /* _NET_PAGE_POOL_HELPERS_H */
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
new file mode 100644
index 000000000000..9ac39191bed7
--- /dev/null
+++ b/include/net/page_pool/types.h
@@ -0,0 +1,238 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _NET_PAGE_POOL_TYPES_H
+#define _NET_PAGE_POOL_TYPES_H
+
+#include <linux/dma-direction.h>
+#include <linux/ptr_ring.h>
+
+#define PP_FLAG_DMA_MAP BIT(0) /* Should page_pool do the DMA
+ * map/unmap
+ */
+#define PP_FLAG_DMA_SYNC_DEV BIT(1) /* If set all pages that the driver gets
+ * from page_pool will be
+ * DMA-synced-for-device according to
+ * the length provided by the device
+ * driver.
+ * Please note DMA-sync-for-CPU is still
+ * device driver responsibility
+ */
+#define PP_FLAG_PAGE_FRAG BIT(2) /* for page frag feature */
+#define PP_FLAG_ALL (PP_FLAG_DMA_MAP |\
+ PP_FLAG_DMA_SYNC_DEV |\
+ PP_FLAG_PAGE_FRAG)
+
+/*
+ * Fast allocation side cache array/stack
+ *
+ * The cache size and refill watermark is related to the network
+ * use-case. The NAPI budget is 64 packets. After a NAPI poll the RX
+ * ring is usually refilled and the max consumed elements will be 64,
+ * thus a natural max size of objects needed in the cache.
+ *
+ * Keeping room for more objects, is due to XDP_DROP use-case. As
+ * XDP_DROP allows the opportunity to recycle objects directly into
+ * this array, as it shares the same softirq/NAPI protection. If
+ * cache is already full (or partly full) then the XDP_DROP recycles
+ * would have to take a slower code path.
+ */
+#define PP_ALLOC_CACHE_SIZE 128
+#define PP_ALLOC_CACHE_REFILL 64
+struct pp_alloc_cache {
+ u32 count;
+ struct page *cache[PP_ALLOC_CACHE_SIZE];
+};
+
+/**
+ * struct page_pool_params - page pool parameters
+ * @flags: PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV, PP_FLAG_PAGE_FRAG
+ * @order: 2^order pages on allocation
+ * @pool_size: size of the ptr_ring
+ * @nid: NUMA node id to allocate from pages from
+ * @dev: device, for DMA pre-mapping purposes
+ * @napi: NAPI which is the sole consumer of pages, otherwise NULL
+ * @dma_dir: DMA mapping direction
+ * @max_len: max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV
+ * @offset: DMA sync address offset for PP_FLAG_DMA_SYNC_DEV
+ */
+struct page_pool_params {
+ unsigned int flags;
+ unsigned int order;
+ unsigned int pool_size;
+ int nid;
+ struct device *dev;
+ struct napi_struct *napi;
+ enum dma_data_direction dma_dir;
+ unsigned int max_len;
+ unsigned int offset;
+/* private: used by test code only */
+ void (*init_callback)(struct page *page, void *arg);
+ void *init_arg;
+};
+
+#ifdef CONFIG_PAGE_POOL_STATS
+/**
+ * struct page_pool_alloc_stats - allocation statistics
+ * @fast: successful fast path allocations
+ * @slow: slow path order-0 allocations
+ * @slow_high_order: slow path high order allocations
+ * @empty: ptr ring is empty, so a slow path allocation was forced
+ * @refill: an allocation which triggered a refill of the cache
+ * @waive: pages obtained from the ptr ring that cannot be added to
+ * the cache due to a NUMA mismatch
+ */
+struct page_pool_alloc_stats {
+ u64 fast;
+ u64 slow;
+ u64 slow_high_order;
+ u64 empty;
+ u64 refill;
+ u64 waive;
+};
+
+/**
+ * struct page_pool_recycle_stats - recycling (freeing) statistics
+ * @cached: recycling placed page in the page pool cache
+ * @cache_full: page pool cache was full
+ * @ring: page placed into the ptr ring
+ * @ring_full: page released from page pool because the ptr ring was full
+ * @released_refcnt: page released (and not recycled) because refcnt > 1
+ */
+struct page_pool_recycle_stats {
+ u64 cached;
+ u64 cache_full;
+ u64 ring;
+ u64 ring_full;
+ u64 released_refcnt;
+};
+
+/**
+ * struct page_pool_stats - combined page pool use statistics
+ * @alloc_stats: see struct page_pool_alloc_stats
+ * @recycle_stats: see struct page_pool_recycle_stats
+ *
+ * Wrapper struct for combining page pool stats with different storage
+ * requirements.
+ */
+struct page_pool_stats {
+ struct page_pool_alloc_stats alloc_stats;
+ struct page_pool_recycle_stats recycle_stats;
+};
+#endif
+
+struct page_pool {
+ struct page_pool_params p;
+
+ struct delayed_work release_dw;
+ void (*disconnect)(void *pool);
+ unsigned long defer_start;
+ unsigned long defer_warn;
+
+ u32 pages_state_hold_cnt;
+ unsigned int frag_offset;
+ struct page *frag_page;
+ long frag_users;
+
+#ifdef CONFIG_PAGE_POOL_STATS
+ /* these stats are incremented while in softirq context */
+ struct page_pool_alloc_stats alloc_stats;
+#endif
+ u32 xdp_mem_id;
+
+ /*
+ * Data structure for allocation side
+ *
+ * Drivers allocation side usually already perform some kind
+ * of resource protection. Piggyback on this protection, and
+ * require driver to protect allocation side.
+ *
+ * For NIC drivers this means, allocate a page_pool per
+ * RX-queue. As the RX-queue is already protected by
+ * Softirq/BH scheduling and napi_schedule. NAPI schedule
+ * guarantee that a single napi_struct will only be scheduled
+ * on a single CPU (see napi_schedule).
+ */
+ struct pp_alloc_cache alloc ____cacheline_aligned_in_smp;
+
+ /* Data structure for storing recycled pages.
+ *
+ * Returning/freeing pages is more complicated synchronization
+ * wise, because free's can happen on remote CPUs, with no
+ * association with allocation resource.
+ *
+ * Use ptr_ring, as it separates consumer and producer
+ * efficiently, it a way that doesn't bounce cache-lines.
+ *
+ * TODO: Implement bulk return pages into this structure.
+ */
+ struct ptr_ring ring;
+
+#ifdef CONFIG_PAGE_POOL_STATS
+ /* recycle stats are per-cpu to avoid locking */
+ struct page_pool_recycle_stats __percpu *recycle_stats;
+#endif
+ atomic_t pages_state_release_cnt;
+
+ /* A page_pool is strictly tied to a single RX-queue being
+ * protected by NAPI, due to above pp_alloc_cache. This
+ * refcnt serves purpose is to simplify drivers error handling.
+ */
+ refcount_t user_cnt;
+
+ u64 destroy_cnt;
+};
+
+struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
+struct page *page_pool_alloc_frag(struct page_pool *pool, unsigned int *offset,
+ unsigned int size, gfp_t gfp);
+bool page_pool_return_skb_page(struct page *page, bool napi_safe);
+
+struct page_pool *page_pool_create(const struct page_pool_params *params);
+
+struct xdp_mem_info;
+
+#ifdef CONFIG_PAGE_POOL
+void page_pool_unlink_napi(struct page_pool *pool);
+void page_pool_destroy(struct page_pool *pool);
+void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *),
+ struct xdp_mem_info *mem);
+void page_pool_put_page_bulk(struct page_pool *pool, void **data,
+ int count);
+#else
+static inline void page_pool_unlink_napi(struct page_pool *pool)
+{
+}
+
+static inline void page_pool_destroy(struct page_pool *pool)
+{
+}
+
+static inline void page_pool_use_xdp_mem(struct page_pool *pool,
+ void (*disconnect)(void *),
+ struct xdp_mem_info *mem)
+{
+}
+
+static inline void page_pool_put_page_bulk(struct page_pool *pool, void **data,
+ int count)
+{
+}
+#endif
+
+void page_pool_put_defragged_page(struct page_pool *pool, struct page *page,
+ unsigned int dma_sync_size,
+ bool allow_direct);
+
+static inline bool is_page_pool_compiled_in(void)
+{
+#ifdef CONFIG_PAGE_POOL
+ return true;
+#else
+ return false;
+#endif
+}
+
+/* Caller must provide appropriate safe context, e.g. NAPI. */
+void page_pool_update_nid(struct page_pool *pool, int new_nid);
+
+#endif /* _NET_PAGE_POOL_H */
diff --git a/include/trace/events/page_pool.h b/include/trace/events/page_pool.h
index ca534501158b..6834356b2d2a 100644
--- a/include/trace/events/page_pool.h
+++ b/include/trace/events/page_pool.h
@@ -9,7 +9,7 @@
#include <linux/tracepoint.h>

#include <trace/events/mmflags.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>

TRACE_EVENT(page_pool_release,

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 0aac76c13fd4..57a7a64b84ed 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -15,7 +15,7 @@
#include <net/sock.h>
#include <net/tcp.h>
#include <net/net_namespace.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <linux/error-injection.h>
#include <linux/smp.h>
#include <linux/sock_diag.h>
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 5d615a169718..cd28c1f14002 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -10,7 +10,7 @@
#include <linux/slab.h>
#include <linux/device.h>

-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <net/xdp.h>

#include <linux/dma-direction.h>
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c6f98245582c..d3bed964123c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,7 +73,7 @@
#include <net/mpls.h>
#include <net/mptcp.h>
#include <net/mctp.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>
#include <net/dropreason.h>

#include <linux/uaccess.h>
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 8362130bf085..a70670fe9a2d 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -14,7 +14,7 @@
#include <linux/idr.h>
#include <linux/rhashtable.h>
#include <linux/bug.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>

#include <net/xdp.h>
#include <net/xdp_priv.h> /* struct xdp_mem_allocator */
--
2.41.0


2023-08-04 19:45:43

by Alexander Lobakin

Subject: [PATCH net-next v4 4/6] net: skbuff: avoid accessing page_pool if !napi_safe when returning page

Currently, pp->p.napi is always read, but the variable it gets
assigned to is only used when @napi_safe is true. For the !napi_safe
cases, which are still plentiful, it's an unneeded operation.
Moreover, it can lead to premature or even redundant page_pool
cacheline access, for example when page_pool_is_last_frag() returns
false (with the recent frag improvements).
Thus, read it only when @napi_safe is true. This also allows moving
@napi inside the conditional block itself. Constify it while we are
here, because why not.

Signed-off-by: Alexander Lobakin <[email protected]>
---
net/core/skbuff.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index acc5844a0de1..85f82a6a08dc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -882,9 +882,8 @@ static void skb_clone_fraglist(struct sk_buff *skb)
#if IS_ENABLED(CONFIG_PAGE_POOL)
bool napi_pp_put_page(struct page *page, bool napi_safe)
{
- struct napi_struct *napi;
+ bool allow_direct = false;
struct page_pool *pp;
- bool allow_direct;

page = compound_head(page);

@@ -904,9 +903,12 @@ bool napi_pp_put_page(struct page *page, bool napi_safe)
* in the same context as the consumer would run, so there's
* no possible race.
*/
- napi = READ_ONCE(pp->p.napi);
- allow_direct = napi_safe && napi &&
- READ_ONCE(napi->list_owner) == smp_processor_id();
+ if (napi_safe) {
+ const struct napi_struct *napi = READ_ONCE(pp->p.napi);
+
+ allow_direct = napi &&
+ READ_ONCE(napi->list_owner) == smp_processor_id();
+ }

/* Driver set this to memory recycling info. Reset it on recycle.
* This will *not* work for NIC using a split-page memory model.
--
2.41.0


2023-08-04 20:10:11

by Alexander Lobakin

Subject: [PATCH net-next v4 3/6] page_pool: place frag_* fields in one cacheline

On x86_64, the frag_* fields of struct page_pool are scattered across two
cachelines despite their combined size of just 24 bytes. All three fields
are used in pretty much the same places, but the last one, ::frag_users,
is pushed out to the next CL, provoking unwanted false-sharing on the
hotpath (frag allocation code).
There are some holes and cold members to move around. Move frag_* one
block up, placing them right after &page_pool_params, perfectly at the
beginning of CL2. This doesn't do anything meaningful to the second block,
as those are some destroy-path cold structures, and doesn't do anything to
::alloc_stats, which still starts at a 200-byte offset, 8 bytes after CL3
(still fitting into one cacheline).
On my setup, this yields 1-2% more Mpps when using PP frags actively.
As for 32-bit architectures with a 32-byte CL: &page_pool_params plus
::pad is 44 bytes, and the block being moved is 16 bytes within one CL,
so there should be at least no regressions from the actual change.
::pages_state_hold_cnt is not directly related to that triple, but it is
currently paired with ::frag_offset, and decoupling them would mean
either two 4-byte holes or more invasive layout changes.
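
As a compile-time illustration of the intended placement, here is a
sketch of assertions one could drop next to the struct (not part of this
patch; it only assumes the field order from the hunk below and that no
padding is inserted after ::p):

#include <linux/build_bug.h>
#include <linux/cache.h>
#include <linux/stddef.h>
#include <net/page_pool/types.h>

/* The frag_* triple plus ::pages_state_hold_cnt should start right after
 * ::p and span no more than one cacheline on SMP builds.
 */
static_assert(offsetof(struct page_pool, frag_users) ==
              sizeof(struct page_pool_params));
static_assert(offsetofend(struct page_pool, pages_state_hold_cnt) -
              offsetof(struct page_pool, frag_users) <= SMP_CACHE_BYTES);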

Signed-off-by: Alexander Lobakin <[email protected]>
---
include/net/page_pool/types.h | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index fcb846523398..887e7946a597 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -123,16 +123,16 @@ struct page_pool_stats {
struct page_pool {
struct page_pool_params p;

+ long frag_users;
+ struct page *frag_page;
+ unsigned int frag_offset;
+ u32 pages_state_hold_cnt;
+
struct delayed_work release_dw;
void (*disconnect)(void *pool);
unsigned long defer_start;
unsigned long defer_warn;

- u32 pages_state_hold_cnt;
- unsigned int frag_offset;
- struct page *frag_page;
- long frag_users;
-
#ifdef CONFIG_PAGE_POOL_STATS
/* these stats are incremented while in softirq context */
struct page_pool_alloc_stats alloc_stats;
--
2.41.0


2023-08-04 20:51:02

by Alexander Lobakin

Subject: [PATCH net-next v4 6/6] net: skbuff: always try to recycle PP pages directly when in softirq

Commit 8c48eea3adf3 ("page_pool: allow caching from safely localized
NAPI") allowed direct recycling of skb pages to their PP for some cases,
but unfortunately missed a couple of other major ones.
For example, %XDP_DROP in skb mode. The netstack just calls kfree_skb(),
which unconditionally passes `false` as @napi_safe. Thus, all pages go
through the ptr_ring and locks, although most of the time we're actually
inside the NAPI poll this PP is linked with, so it would be perfectly
safe to recycle pages directly.
Let's address such cases. If @napi_safe is true, we're fine; don't change
anything for this path. But if it's false, check whether we are in
softirq context. That will most likely be the case, and then, if
->list_owner is our current CPU, we're good to use direct recycling even
though @napi_safe is false -- concurrent access is excluded. The
in_softirq() check is needed mostly because we can hit this place from
process context (not from hardirq, though).
For the mentioned XDP-drop-skb-mode case, the improvement I got is 3-4%
in Mpps. As for the page_pool stats, recycle_ring is now 0 and the
alloc_slow counter doesn't change most of the time, which means the MM
layer is not even called to allocate any new pages.
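
Condensed, the resulting check boils down to something like this (an
illustrative sketch, not a verbatim excerpt from net/core/skbuff.c):

/* Sketch of the decision made in napi_pp_put_page() after this patch. */
static bool pp_allow_direct_recycle(const struct page_pool *pp, bool napi_safe)
{
	const struct napi_struct *napi;

	/* Hardirq / irqs-off callers are rejected earlier by the lockdep
	 * assert added to __page_pool_put_page() in patch 5/6.
	 */
	if (!napi_safe && !in_softirq())
		return false;

	/* Direct recycling is safe only when the pool's NAPI is currently
	 * being polled by this very CPU.
	 */
	napi = READ_ONCE(pp->p.napi);
	return napi && READ_ONCE(napi->list_owner) == smp_processor_id();
}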

Suggested-by: Jakub Kicinski <[email protected]> # in_softirq()
Signed-off-by: Alexander Lobakin <[email protected]>
---
net/core/skbuff.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 85f82a6a08dc..33fdf04d4334 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -902,8 +902,10 @@ bool napi_pp_put_page(struct page *page, bool napi_safe)
/* Allow direct recycle if we have reasons to believe that we are
* in the same context as the consumer would run, so there's
* no possible race.
+ * __page_pool_put_page() makes sure we're not in hardirq context
+ * and interrupts are enabled prior to accessing the cache.
*/
- if (napi_safe) {
+ if (napi_safe || in_softirq()) {
const struct napi_struct *napi = READ_ONCE(pp->p.napi);

allow_direct = napi &&
--
2.41.0


2023-08-07 16:26:44

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH net-next v4 0/6] page_pool: a couple of assorted optimizations

On Fri, 2023-08-04 at 20:05 +0200, Alexander Lobakin wrote:
> That initially was a spin-off of the IAVF PP series[0], but has grown
> (and shrunk) since then a bunch. In fact, it consists of three
> semi-independent blocks:
>
> * #1-2: Compile-time optimization. Split page_pool.h into 2 headers to
> not overbloat the consumers not needing complex inline helpers and
> then stop including it in skbuff.h at all. The first patch is also
> prereq for the whole series.
> * #3: Improve cacheline locality for users of the Page Pool frag API.
> * #4-6: Use direct cache recycling more aggressively, when it is safe
> obviously. In addition, make sure nobody wants to use Page Pool API
> with disabled interrupts.
>
> Patches #1 and #5 are authored by Yunsheng and Jakub respectively, with
> small modifications from my side as per ML discussions.
> For the perf numbers for #3-6, please see individual commit messages.
>
> Also available on my GH with many more Page Pool goodies[1].
>
> [0] https://lore.kernel.org/netdev/[email protected]
> [1] https://github.com/alobakin/linux/commits/iavf-pp-frag
>
> Alexander Lobakin (4):
> net: skbuff: don't include <net/page_pool/types.h> to <linux/skbuff.h>
> page_pool: place frag_* fields in one cacheline
> net: skbuff: avoid accessing page_pool if !napi_safe when returning
> page
> net: skbuff: always try to recycle PP pages directly when in softirq
>
> Jakub Kicinski (1):
> page_pool: add a lockdep check for recycling in hardirq
>
> Yunsheng Lin (1):
> page_pool: split types and declarations from page_pool.h

So the series mostly looks good to me. My only concern would be with
patch 5, since I am not sure why we are just throwing a WARN_ON when we
could take action on the info to prevent the problem in the first
place. That said, the change doesn't hurt anything as-is, so I would be
good with us thinking about changing that as a follow-up.

Reviewed-by: Alexander Duyck <[email protected]>


2023-08-07 16:27:14

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH net-next v4 5/6] page_pool: add a lockdep check for recycling in hardirq

On Fri, 2023-08-04 at 20:05 +0200, Alexander Lobakin wrote:
> From: Jakub Kicinski <[email protected]>
>
> Page pool use in hardirq is prohibited, add debug checks
> to catch misuses. IIRC we previously discussed using
> DEBUG_NET_WARN_ON_ONCE() for this, but there were concerns
> that people will have DEBUG_NET enabled in perf testing.
> I don't think anyone enables lockdep in perf testing,
> so use lockdep to avoid pushback and arguing :)
>
> Signed-off-by: Jakub Kicinski <[email protected]>
> Acked-by: Jesper Dangaard Brouer <[email protected]>
> Signed-off-by: Alexander Lobakin <[email protected]>
> ---
> include/linux/lockdep.h | 7 +++++++
> net/core/page_pool.c | 2 ++
> 2 files changed, 9 insertions(+)
>
> diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
> index 310f85903c91..dc2844b071c2 100644
> --- a/include/linux/lockdep.h
> +++ b/include/linux/lockdep.h
> @@ -625,6 +625,12 @@ do { \
> WARN_ON_ONCE(__lockdep_enabled && !this_cpu_read(hardirq_context)); \
> } while (0)
>
> +#define lockdep_assert_no_hardirq() \
> +do { \
> + WARN_ON_ONCE(__lockdep_enabled && (this_cpu_read(hardirq_context) || \
> + !this_cpu_read(hardirqs_enabled))); \
> +} while (0)
> +
> #define lockdep_assert_preemption_enabled() \
> do { \
> WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \
> @@ -659,6 +665,7 @@ do { \
> # define lockdep_assert_irqs_enabled() do { } while (0)
> # define lockdep_assert_irqs_disabled() do { } while (0)
> # define lockdep_assert_in_irq() do { } while (0)
> +# define lockdep_assert_no_hardirq() do { } while (0)
>
> # define lockdep_assert_preemption_enabled() do { } while (0)
> # define lockdep_assert_preemption_disabled() do { } while (0)
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 03ad74d25959..77cb75e63aca 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -587,6 +587,8 @@ static __always_inline struct page *
> __page_pool_put_page(struct page_pool *pool, struct page *page,
> unsigned int dma_sync_size, bool allow_direct)
> {
> + lockdep_assert_no_hardirq();
> +
> /* This allocator is optimized for the XDP mode that uses
> * one-frame-per-page, but have fallbacks that act like the
> * regular page allocator APIs.

So two points.

First, could we look at moving this inside the if statement just before
we return the page, as there isn't a risk of needing a lock until we get
into that path.

Secondly, rather than returning an error, is there any reason why we
couldn't just look at not returning the page and instead drop into the
release path, which wouldn't take the locks in the first place? Either
that, or I would even be good with some combination of the two where we
threw a warning but still just dropped the page, so we reduce our risk
of actually locking things up further.
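
Roughly something along these lines is what I have in mind (a hand-wavy
sketch inside __page_pool_put_page(), untested):

	/* Hypothetical: warn, but drop the page into the lock-free release
	 * path instead of attempting to recycle it.
	 */
	if (unlikely(in_hardirq() || irqs_disabled())) {
		WARN_ON_ONCE(1);
		page_pool_return_page(pool, page);
		return NULL;
	}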

2023-08-07 21:03:09

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next v4 1/6] page_pool: split types and declarations from page_pool.h

On Fri, 4 Aug 2023 20:05:24 +0200 Alexander Lobakin wrote:
> Split types and pure function declarations from page_pool.h
> and add them in page_page/types.h, so that C sources can
> include page_pool.h and headers should generally only include
> page_pool/types.h as suggested by jakub.
> Rename page_pool.h to page_pool/helpers.h to have both in
> one place.

I had to touch this one up a little when merging, to fix the build for
the mana driver that landed in the meantime and to update the patch in
Documentation.

2023-08-07 21:47:46

by patchwork-bot+netdevbpf

[permalink] [raw]
Subject: Re: [PATCH net-next v4 0/6] page_pool: a couple of assorted optimizations

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <[email protected]>:

On Fri, 4 Aug 2023 20:05:23 +0200 you wrote:
> That initially was a spin-off of the IAVF PP series[0], but has grown
> (and shrunk) since then a bunch. In fact, it consists of three
> semi-independent blocks:
>
> * #1-2: Compile-time optimization. Split page_pool.h into 2 headers to
> not overbloat the consumers not needing complex inline helpers and
> then stop including it in skbuff.h at all. The first patch is also
> prereq for the whole series.
> * #3: Improve cacheline locality for users of the Page Pool frag API.
> * #4-6: Use direct cache recycling more aggressively, when it is safe
> obviously. In addition, make sure nobody wants to use Page Pool API
> with disabled interrupts.
>
> [...]

Here is the summary with links:
- [net-next,v4,1/6] page_pool: split types and declarations from page_pool.h
https://git.kernel.org/netdev/net-next/c/a9ca9f9ceff3
- [net-next,v4,2/6] net: skbuff: don't include <net/page_pool/types.h> to <linux/skbuff.h>
https://git.kernel.org/netdev/net-next/c/75eaf63ea7af
- [net-next,v4,3/6] page_pool: place frag_* fields in one cacheline
https://git.kernel.org/netdev/net-next/c/06d0fbdad612
- [net-next,v4,4/6] net: skbuff: avoid accessing page_pool if !napi_safe when returning page
https://git.kernel.org/netdev/net-next/c/5b899c33b3b8
- [net-next,v4,5/6] page_pool: add a lockdep check for recycling in hardirq
https://git.kernel.org/netdev/net-next/c/ff4e538c8c3e
- [net-next,v4,6/6] net: skbuff: always try to recycle PP pages directly when in softirq
https://git.kernel.org/netdev/net-next/c/4a36d0180c45

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



2023-08-08 17:29:27

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next v4 0/6] page_pool: a couple of assorted optimizations

From: Patchwork-Bot+netdevbpf <[email protected]>
Date: Mon, 07 Aug 2023 20:20:23 +0000

> Hello:
>
> This series was applied to netdev/net-next.git (main)
> by Jakub Kicinski <[email protected]>:

Just got back from the long weekend and saw that mail, nice :D Thanks!

>
> On Fri, 4 Aug 2023 20:05:23 +0200 you wrote:
>> That initially was a spin-off of the IAVF PP series[0], but has grown
>> (and shrunk) since then a bunch. In fact, it consists of three
>> semi-independent blocks:
>>
>> * #1-2: Compile-time optimization. Split page_pool.h into 2 headers to
>> not overbloat the consumers not needing complex inline helpers and
>> then stop including it in skbuff.h at all. The first patch is also
>> prereq for the whole series.
>> * #3: Improve cacheline locality for users of the Page Pool frag API.
>> * #4-6: Use direct cache recycling more aggressively, when it is safe
>> obviously. In addition, make sure nobody wants to use Page Pool API
>> with disabled interrupts.
>>
>> [...]
>
> Here is the summary with links:
> - [net-next,v4,1/6] page_pool: split types and declarations from page_pool.h
> https://git.kernel.org/netdev/net-next/c/a9ca9f9ceff3
> - [net-next,v4,2/6] net: skbuff: don't include <net/page_pool/types.h> to <linux/skbuff.h>
> https://git.kernel.org/netdev/net-next/c/75eaf63ea7af
> - [net-next,v4,3/6] page_pool: place frag_* fields in one cacheline
> https://git.kernel.org/netdev/net-next/c/06d0fbdad612
> - [net-next,v4,4/6] net: skbuff: avoid accessing page_pool if !napi_safe when returning page
> https://git.kernel.org/netdev/net-next/c/5b899c33b3b8
> - [net-next,v4,5/6] page_pool: add a lockdep check for recycling in hardirq
> https://git.kernel.org/netdev/net-next/c/ff4e538c8c3e
> - [net-next,v4,6/6] net: skbuff: always try to recycle PP pages directly when in softirq
> https://git.kernel.org/netdev/net-next/c/4a36d0180c45
>
> You are awesome, thank you!

Thanks,
Olek

2023-08-08 17:47:39

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH net-next v4 5/6] page_pool: add a lockdep check for recycling in hardirq

On Tue, Aug 8, 2023 at 6:59 AM Alexander Lobakin
<[email protected]> wrote:
>
> From: Alexander Duyck <[email protected]>
> Date: Tue, 8 Aug 2023 06:45:26 -0700
>
> > On Tue, Aug 8, 2023 at 6:16 AM Alexander Lobakin
> > <[email protected]> wrote:
> >>
> >> From: Alexander H Duyck <[email protected]>
> >> Date: Mon, 07 Aug 2023 07:48:54 -0700
> >>
> >>> On Fri, 2023-08-04 at 20:05 +0200, Alexander Lobakin wrote:
> >>>> From: Jakub Kicinski <[email protected]>
> >>>>
> >>>> Page pool use in hardirq is prohibited, add debug checks
> >>>> to catch misuses. IIRC we previously discussed using
> >>>> DEBUG_NET_WARN_ON_ONCE() for this, but there were concerns
> >>>> that people will have DEBUG_NET enabled in perf testing.
> >>>> I don't think anyone enables lockdep in perf testing,
> >>>> so use lockdep to avoid pushback and arguing :)
> >>>>
> >>>> Signed-off-by: Jakub Kicinski <[email protected]>
> >>>> Acked-by: Jesper Dangaard Brouer <[email protected]>
> >>>> Signed-off-by: Alexander Lobakin <[email protected]>
> >>>> ---
> >>>> include/linux/lockdep.h | 7 +++++++
> >>>> net/core/page_pool.c | 2 ++
> >>>> 2 files changed, 9 insertions(+)
> >>>>
> >>>> diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
> >>>> index 310f85903c91..dc2844b071c2 100644
> >>>> --- a/include/linux/lockdep.h
> >>>> +++ b/include/linux/lockdep.h
> >>>> @@ -625,6 +625,12 @@ do { \
> >>>> WARN_ON_ONCE(__lockdep_enabled && !this_cpu_read(hardirq_context)); \
> >>>> } while (0)
> >>>>
> >>>> +#define lockdep_assert_no_hardirq() \
> >>>> +do { \
> >>>> + WARN_ON_ONCE(__lockdep_enabled && (this_cpu_read(hardirq_context) || \
> >>>> + !this_cpu_read(hardirqs_enabled))); \
> >>>> +} while (0)
> >>>> +
> >>>> #define lockdep_assert_preemption_enabled() \
> >>>> do { \
> >>>> WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \
> >>>> @@ -659,6 +665,7 @@ do { \
> >>>> # define lockdep_assert_irqs_enabled() do { } while (0)
> >>>> # define lockdep_assert_irqs_disabled() do { } while (0)
> >>>> # define lockdep_assert_in_irq() do { } while (0)
> >>>> +# define lockdep_assert_no_hardirq() do { } while (0)
> >>>>
> >>>> # define lockdep_assert_preemption_enabled() do { } while (0)
> >>>> # define lockdep_assert_preemption_disabled() do { } while (0)
> >>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> >>>> index 03ad74d25959..77cb75e63aca 100644
> >>>> --- a/net/core/page_pool.c
> >>>> +++ b/net/core/page_pool.c
> >>>> @@ -587,6 +587,8 @@ static __always_inline struct page *
> >>>> __page_pool_put_page(struct page_pool *pool, struct page *page,
> >>>> unsigned int dma_sync_size, bool allow_direct)
> >>>> {
> >>>> + lockdep_assert_no_hardirq();
> >>>> +
> >>>> /* This allocator is optimized for the XDP mode that uses
> >>>> * one-frame-per-page, but have fallbacks that act like the
> >>>> * regular page allocator APIs.
> >>>
> >>> So two points.
> >>>
> >>> First could we look at moving this inside the if statement just before
> >>> we return the page, as there isn't a risk until we get into that path
> >>> of needing a lock.
> >>>
> >>> Secondly rather than returning an error is there any reason why we
> >>> couldn't just look at not returning page and instead just drop into the
> >>> release path which wouldn't take the locks in the first place? Either
> >>
> >> That is exception path to quickly catch broken drivers and fix them, why
> >> bother? It's not something we have to live with.
> >
> > My concern is that the current "fix" consists of stalling a Tx ring.
> > We need to have a way to allow forward progress when somebody mixes
> > xdp_frame and skb traffic as I suspect we will end up with a number of
> > devices doing this since they cannot handle recycling the pages in
> > hardirq context.
>
> You could've seen that several vendors already disabled recycling XDP
> buffers when in hardirq (= netpoll) in their drivers. hardirq is in
> general not for networking-related operations.

The whole idea behind the netpoll cleanup is to get the Tx buffers out
of the way so that we can transmit even after the system has crashed.
The idea isn't to transmit XDP buffers, but to get the buffers out of
the way in the cases where somebody is combining both xdp_frame and
sk_buff on the same queue due to a limited number of rings being
present on the device.

My concern is that at some point in the near future somebody is going
to have a system crash and instead of being able to get the crash log
message out via their netconsole it is going to get cut off because
the driver stopped cleaning the Tx ring because somebody was also
using it as an XDP redirect destination.

> >
> > The only reason why the skbs don't have the problem is that they are
> > queued and then cleaned up in the net_tx_action. That is why I wonder
> > if we shouldn't look at adding some sort of support for doing
> > something like that with xdp_frame as well. Something like a
> > dev_kfree_pp_page_any to go along with the dev_kfree_skb_any.
>
> I still don't get why we may need to clean XDP buffers in hardirq, maybe
> someone could give me some links to read why we may need this and how
> that happens? netpoll is a very specific thing for some debug
> operations, isn't it? XDP shouldn't in general be enabled when this
> happens, should it?

I think I kind of explained it above. It isn't so much about cleaning
the XDP buffers as getting them off of the ring and out of the way. If
we block a Tx queue because of an XDP buffer then we cannot use that
Tx queue. I would be good with us just deferring the cleanup like we
do with an sk_buff in dev_kfree_skb_irq, the only issue is we don't
have the ability to put them on a queue since they don't have
prev/next pointers.

I suppose an alternative to cleaning them might be to make a mandatory
requirement that you cannot support netpoll and mix xdp_frame and
sk_buff on the same queue. If we enforced that then my concern about
them blocking a queue would be addressed.

> (unrelated: 6:58 AM West Coast, you use to wake up early or traveling?
> :D)

I am usually up pretty early, especially this time of year. Sunrise
here is 6AM and I am usually up a little before that. :)

2023-08-08 17:51:47

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next v4 5/6] page_pool: add a lockdep check for recycling in hardirq

From: Alexander Duyck <[email protected]>
Date: Tue, 8 Aug 2023 06:45:26 -0700

> On Tue, Aug 8, 2023 at 6:16 AM Alexander Lobakin
> <[email protected]> wrote:
>>
>> From: Alexander H Duyck <[email protected]>
>> Date: Mon, 07 Aug 2023 07:48:54 -0700
>>
>>> On Fri, 2023-08-04 at 20:05 +0200, Alexander Lobakin wrote:
>>>> From: Jakub Kicinski <[email protected]>
>>>>
>>>> Page pool use in hardirq is prohibited, add debug checks
>>>> to catch misuses. IIRC we previously discussed using
>>>> DEBUG_NET_WARN_ON_ONCE() for this, but there were concerns
>>>> that people will have DEBUG_NET enabled in perf testing.
>>>> I don't think anyone enables lockdep in perf testing,
>>>> so use lockdep to avoid pushback and arguing :)
>>>>
>>>> Signed-off-by: Jakub Kicinski <[email protected]>
>>>> Acked-by: Jesper Dangaard Brouer <[email protected]>
>>>> Signed-off-by: Alexander Lobakin <[email protected]>
>>>> ---
>>>> include/linux/lockdep.h | 7 +++++++
>>>> net/core/page_pool.c | 2 ++
>>>> 2 files changed, 9 insertions(+)
>>>>
>>>> diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
>>>> index 310f85903c91..dc2844b071c2 100644
>>>> --- a/include/linux/lockdep.h
>>>> +++ b/include/linux/lockdep.h
>>>> @@ -625,6 +625,12 @@ do { \
>>>> WARN_ON_ONCE(__lockdep_enabled && !this_cpu_read(hardirq_context)); \
>>>> } while (0)
>>>>
>>>> +#define lockdep_assert_no_hardirq() \
>>>> +do { \
>>>> + WARN_ON_ONCE(__lockdep_enabled && (this_cpu_read(hardirq_context) || \
>>>> + !this_cpu_read(hardirqs_enabled))); \
>>>> +} while (0)
>>>> +
>>>> #define lockdep_assert_preemption_enabled() \
>>>> do { \
>>>> WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \
>>>> @@ -659,6 +665,7 @@ do { \
>>>> # define lockdep_assert_irqs_enabled() do { } while (0)
>>>> # define lockdep_assert_irqs_disabled() do { } while (0)
>>>> # define lockdep_assert_in_irq() do { } while (0)
>>>> +# define lockdep_assert_no_hardirq() do { } while (0)
>>>>
>>>> # define lockdep_assert_preemption_enabled() do { } while (0)
>>>> # define lockdep_assert_preemption_disabled() do { } while (0)
>>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>>>> index 03ad74d25959..77cb75e63aca 100644
>>>> --- a/net/core/page_pool.c
>>>> +++ b/net/core/page_pool.c
>>>> @@ -587,6 +587,8 @@ static __always_inline struct page *
>>>> __page_pool_put_page(struct page_pool *pool, struct page *page,
>>>> unsigned int dma_sync_size, bool allow_direct)
>>>> {
>>>> + lockdep_assert_no_hardirq();
>>>> +
>>>> /* This allocator is optimized for the XDP mode that uses
>>>> * one-frame-per-page, but have fallbacks that act like the
>>>> * regular page allocator APIs.
>>>
>>> So two points.
>>>
>>> First could we look at moving this inside the if statement just before
>>> we return the page, as there isn't a risk until we get into that path
>>> of needing a lock.
>>>
>>> Secondly rather than returning an error is there any reason why we
>>> couldn't just look at not returning page and instead just drop into the
>>> release path which wouldn't take the locks in the first place? Either
>>
>> That is exception path to quickly catch broken drivers and fix them, why
>> bother? It's not something we have to live with.
>
> My concern is that the current "fix" consists of stalling a Tx ring.
> We need to have a way to allow forward progress when somebody mixes
> xdp_frame and skb traffic as I suspect we will end up with a number of
> devices doing this since they cannot handle recycling the pages in
> hardirq context.

You may have seen that several vendors have already disabled recycling
of XDP buffers when in hardirq (= netpoll) in their drivers. hardirq is
in general not meant for networking-related operations.

>
> The only reason why the skbs don't have the problem is that they are
> queued and then cleaned up in the net_tx_action. That is why I wonder
> if we shouldn't look at adding some sort of support for doing
> something like that with xdp_frame as well. Something like a
> dev_kfree_pp_page_any to go along with the dev_kfree_skb_any.

I still don't get why we may need to clean XDP buffers in hardirq, maybe
someone could give me some links to read why we may need this and how
that happens? netpoll is a very specific thing for some debug
operations, isn't it? XDP shouldn't in general be enabled when this
happens, should it?

(unrelated: 6:58 AM West Coast, you use to wake up early or traveling?
:D)

Thanks,
Olek

2023-08-08 17:55:32

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next v4 5/6] page_pool: add a lockdep check for recycling in hardirq

From: Alexander H Duyck <[email protected]>
Date: Mon, 07 Aug 2023 07:48:54 -0700

> On Fri, 2023-08-04 at 20:05 +0200, Alexander Lobakin wrote:
>> From: Jakub Kicinski <[email protected]>
>>
>> Page pool use in hardirq is prohibited, add debug checks
>> to catch misuses. IIRC we previously discussed using
>> DEBUG_NET_WARN_ON_ONCE() for this, but there were concerns
>> that people will have DEBUG_NET enabled in perf testing.
>> I don't think anyone enables lockdep in perf testing,
>> so use lockdep to avoid pushback and arguing :)
>>
>> Signed-off-by: Jakub Kicinski <[email protected]>
>> Acked-by: Jesper Dangaard Brouer <[email protected]>
>> Signed-off-by: Alexander Lobakin <[email protected]>
>> ---
>> include/linux/lockdep.h | 7 +++++++
>> net/core/page_pool.c | 2 ++
>> 2 files changed, 9 insertions(+)
>>
>> diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
>> index 310f85903c91..dc2844b071c2 100644
>> --- a/include/linux/lockdep.h
>> +++ b/include/linux/lockdep.h
>> @@ -625,6 +625,12 @@ do { \
>> WARN_ON_ONCE(__lockdep_enabled && !this_cpu_read(hardirq_context)); \
>> } while (0)
>>
>> +#define lockdep_assert_no_hardirq() \
>> +do { \
>> + WARN_ON_ONCE(__lockdep_enabled && (this_cpu_read(hardirq_context) || \
>> + !this_cpu_read(hardirqs_enabled))); \
>> +} while (0)
>> +
>> #define lockdep_assert_preemption_enabled() \
>> do { \
>> WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \
>> @@ -659,6 +665,7 @@ do { \
>> # define lockdep_assert_irqs_enabled() do { } while (0)
>> # define lockdep_assert_irqs_disabled() do { } while (0)
>> # define lockdep_assert_in_irq() do { } while (0)
>> +# define lockdep_assert_no_hardirq() do { } while (0)
>>
>> # define lockdep_assert_preemption_enabled() do { } while (0)
>> # define lockdep_assert_preemption_disabled() do { } while (0)
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index 03ad74d25959..77cb75e63aca 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -587,6 +587,8 @@ static __always_inline struct page *
>> __page_pool_put_page(struct page_pool *pool, struct page *page,
>> unsigned int dma_sync_size, bool allow_direct)
>> {
>> + lockdep_assert_no_hardirq();
>> +
>> /* This allocator is optimized for the XDP mode that uses
>> * one-frame-per-page, but have fallbacks that act like the
>> * regular page allocator APIs.
>
> So two points.
>
> First could we look at moving this inside the if statement just before
> we return the page, as there isn't a risk until we get into that path
> of needing a lock.
>
> Secondly rather than returning an error is there any reason why we
> couldn't just look at not returning page and instead just drop into the
> release path which wouldn't take the locks in the first place? Either

That is an exception path to quickly catch broken drivers and fix them,
so why bother? It's not something we have to live with.

> that or I would even be good with some combination of the two where we
> threw a warning, but still just dropped the page so we reduce our risk
> further of actually locking things up.

Thanks,
Olek

2023-08-08 18:56:00

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH net-next v4 5/6] page_pool: add a lockdep check for recycling in hardirq

On Tue, Aug 8, 2023 at 8:06 AM Alexander Lobakin
<[email protected]> wrote:
>
> From: Alexander Duyck <[email protected]>
> Date: Tue, 8 Aug 2023 07:52:32 -0700
>
> > On Tue, Aug 8, 2023 at 6:59 AM Alexander Lobakin
> > <[email protected]> wrote:
> >>
> >> From: Alexander Duyck <[email protected]>
> >> Date: Tue, 8 Aug 2023 06:45:26 -0700
>
> [...]
>
> >>>>> Secondly rather than returning an error is there any reason why we
> >>>>> couldn't just look at not returning page and instead just drop into the
> >>>>> release path which wouldn't take the locks in the first place? Either
> >>>>
> >>>> That is exception path to quickly catch broken drivers and fix them, why
> >>>> bother? It's not something we have to live with.
> >>>
> >>> My concern is that the current "fix" consists of stalling a Tx ring.
> >>> We need to have a way to allow forward progress when somebody mixes
> >>> xdp_frame and skb traffic as I suspect we will end up with a number of
> >>> devices doing this since they cannot handle recycling the pages in
> >>> hardirq context.
> >>
> >> You could've seen that several vendors already disabled recycling XDP
> >> buffers when in hardirq (= netpoll) in their drivers. hardirq is in
> >> general not for networking-related operations.
> >
> > The whole idea behind the netpoll cleanup is to get the Tx buffers out
> > of the way so that we can transmit even after the system has crashed.
> > The idea isn't to transmit XDP buffers, but to get the buffers out of
> > the way in the cases where somebody is combining both xdp_frame and
> > sk_buff on the same queue due to a limited number of rings being
> > present on the device.
>
> I see now, thanks a lot!
>
> >
> > My concern is that at some point in the near future somebody is going
> > to have a system crash and instead of being able to get the crash log
> > message out via their netconsole it is going to get cut off because
> > the driver stopped cleaning the Tx ring because somebody was also
> > using it as an XDP redirect destination.
> >
> >>>
> >>> The only reason why the skbs don't have the problem is that they are
> >>> queued and then cleaned up in the net_tx_action. That is why I wonder
> >>> if we shouldn't look at adding some sort of support for doing
> >>> something like that with xdp_frame as well. Something like a
> >>> dev_kfree_pp_page_any to go along with the dev_kfree_skb_any.
> >>
> >> I still don't get why we may need to clean XDP buffers in hardirq, maybe
> >> someone could give me some links to read why we may need this and how
> >> that happens? netpoll is a very specific thing for some debug
> >> operations, isn't it? XDP shouldn't in general be enabled when this
> >> happens, should it?
> >
> > I think I kind of explained it above. It isn't so much about cleaning
> > the XDP buffers as getting them off of the ring and out of the way. If
> > we block a Tx queue because of an XDP buffer then we cannot use that
> > Tx queue. I would be good with us just deferring the cleanup like we
> > do with an sk_buff in dev_kfree_skb_irq, the only issue is we don't
> > have the ability to put them on a queue since they don't have
> > prev/next pointers.
> >
> > I suppose an alternative to cleaning them might be to make a mandatory
> > requirement that you cannot support netpoll and mix xdp_frame and
> > sk_buff on the same queue. If we enforced that then my concern about
> > them blocking a queue would be addressed.
>
> I'm leaning more towards this one TBH. I don't feel sole netpoll as
> a solid argument for introducing XDP frame deferred queues :s

That was kind of my line of thought as well. That is why I was
thinking that, instead of bothering with a queue, it might work just as
well to throw all recycling out the window, call put_page when we are
dealing with XDP in netpoll, and force the page into the free path.
Then it becomes more of an "_any" type handler.
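
Roughly like this, hand-waving the DMA unmap and accounting details (the
helper name is made up):

/* Hypothetical helper, name made up for illustration only. */
static inline void page_pool_put_page_any(struct page_pool *pool,
					  struct page *page)
{
	if (in_hardirq() || irqs_disabled()) {
		/* Skip the recycling caches and their locks entirely and
		 * force the page into the free path; a real version would
		 * also have to deal with the DMA mapping and hold counter.
		 */
		put_page(page);
		return;
	}

	page_pool_put_full_page(pool, page, false);
}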

2023-08-08 19:01:46

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next v4 5/6] page_pool: add a lockdep check for recycling in hardirq

From: Alexander Duyck <[email protected]>
Date: Tue, 8 Aug 2023 07:52:32 -0700

> On Tue, Aug 8, 2023 at 6:59 AM Alexander Lobakin
> <[email protected]> wrote:
>>
>> From: Alexander Duyck <[email protected]>
>> Date: Tue, 8 Aug 2023 06:45:26 -0700

[...]

>>>>> Secondly rather than returning an error is there any reason why we
>>>>> couldn't just look at not returning page and instead just drop into the
>>>>> release path which wouldn't take the locks in the first place? Either
>>>>
>>>> That is exception path to quickly catch broken drivers and fix them, why
>>>> bother? It's not something we have to live with.
>>>
>>> My concern is that the current "fix" consists of stalling a Tx ring.
>>> We need to have a way to allow forward progress when somebody mixes
>>> xdp_frame and skb traffic as I suspect we will end up with a number of
>>> devices doing this since they cannot handle recycling the pages in
>>> hardirq context.
>>
>> You could've seen that several vendors already disabled recycling XDP
>> buffers when in hardirq (= netpoll) in their drivers. hardirq is in
>> general not for networking-related operations.
>
> The whole idea behind the netpoll cleanup is to get the Tx buffers out
> of the way so that we can transmit even after the system has crashed.
> The idea isn't to transmit XDP buffers, but to get the buffers out of
> the way in the cases where somebody is combining both xdp_frame and
> sk_buff on the same queue due to a limited number of rings being
> present on the device.

I see now, thanks a lot!

>
> My concern is that at some point in the near future somebody is going
> to have a system crash and instead of being able to get the crash log
> message out via their netconsole it is going to get cut off because
> the driver stopped cleaning the Tx ring because somebody was also
> using it as an XDP redirect destination.
>
>>>
>>> The only reason why the skbs don't have the problem is that they are
>>> queued and then cleaned up in the net_tx_action. That is why I wonder
>>> if we shouldn't look at adding some sort of support for doing
>>> something like that with xdp_frame as well. Something like a
>>> dev_kfree_pp_page_any to go along with the dev_kfree_skb_any.
>>
>> I still don't get why we may need to clean XDP buffers in hardirq, maybe
>> someone could give me some links to read why we may need this and how
>> that happens? netpoll is a very specific thing for some debug
>> operations, isn't it? XDP shouldn't in general be enabled when this
>> happens, should it?
>
> I think I kind of explained it above. It isn't so much about cleaning
> the XDP buffers as getting them off of the ring and out of the way. If
> we block a Tx queue because of an XDP buffer then we cannot use that
> Tx queue. I would be good with us just deferring the cleanup like we
> do with an sk_buff in dev_kfree_skb_irq, the only issue is we don't
> have the ability to put them on a queue since they don't have
> prev/next pointers.
>
> I suppose an alternative to cleaning them might be to make a mandatory
> requirement that you cannot support netpoll and mix xdp_frame and
> sk_buff on the same queue. If we enforced that then my concern about
> them blocking a queue would be addressed.

I'm leaning more towards this one TBH. I don't see netpoll alone as
a solid argument for introducing deferred queues for XDP frames :s

>
>> (unrelated: 6:58 AM West Coast, you use to wake up early or traveling?
>> :D)
>
> I am usually up pretty early, especially this time of year. Sunrise
> here is 6AM and I am usually up a little before that.. :)

Nice!

Thanks,
Olek

2023-08-08 20:06:11

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH net-next v4 5/6] page_pool: add a lockdep check for recycling in hardirq

On Tue, Aug 8, 2023 at 6:16 AM Alexander Lobakin
<[email protected]> wrote:
>
> From: Alexander H Duyck <[email protected]>
> Date: Mon, 07 Aug 2023 07:48:54 -0700
>
> > On Fri, 2023-08-04 at 20:05 +0200, Alexander Lobakin wrote:
> >> From: Jakub Kicinski <[email protected]>
> >>
> >> Page pool use in hardirq is prohibited, add debug checks
> >> to catch misuses. IIRC we previously discussed using
> >> DEBUG_NET_WARN_ON_ONCE() for this, but there were concerns
> >> that people will have DEBUG_NET enabled in perf testing.
> >> I don't think anyone enables lockdep in perf testing,
> >> so use lockdep to avoid pushback and arguing :)
> >>
> >> Signed-off-by: Jakub Kicinski <[email protected]>
> >> Acked-by: Jesper Dangaard Brouer <[email protected]>
> >> Signed-off-by: Alexander Lobakin <[email protected]>
> >> ---
> >> include/linux/lockdep.h | 7 +++++++
> >> net/core/page_pool.c | 2 ++
> >> 2 files changed, 9 insertions(+)
> >>
> >> diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
> >> index 310f85903c91..dc2844b071c2 100644
> >> --- a/include/linux/lockdep.h
> >> +++ b/include/linux/lockdep.h
> >> @@ -625,6 +625,12 @@ do { \
> >> WARN_ON_ONCE(__lockdep_enabled && !this_cpu_read(hardirq_context)); \
> >> } while (0)
> >>
> >> +#define lockdep_assert_no_hardirq() \
> >> +do { \
> >> + WARN_ON_ONCE(__lockdep_enabled && (this_cpu_read(hardirq_context) || \
> >> + !this_cpu_read(hardirqs_enabled))); \
> >> +} while (0)
> >> +
> >> #define lockdep_assert_preemption_enabled() \
> >> do { \
> >> WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \
> >> @@ -659,6 +665,7 @@ do { \
> >> # define lockdep_assert_irqs_enabled() do { } while (0)
> >> # define lockdep_assert_irqs_disabled() do { } while (0)
> >> # define lockdep_assert_in_irq() do { } while (0)
> >> +# define lockdep_assert_no_hardirq() do { } while (0)
> >>
> >> # define lockdep_assert_preemption_enabled() do { } while (0)
> >> # define lockdep_assert_preemption_disabled() do { } while (0)
> >> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> >> index 03ad74d25959..77cb75e63aca 100644
> >> --- a/net/core/page_pool.c
> >> +++ b/net/core/page_pool.c
> >> @@ -587,6 +587,8 @@ static __always_inline struct page *
> >> __page_pool_put_page(struct page_pool *pool, struct page *page,
> >> unsigned int dma_sync_size, bool allow_direct)
> >> {
> >> + lockdep_assert_no_hardirq();
> >> +
> >> /* This allocator is optimized for the XDP mode that uses
> >> * one-frame-per-page, but have fallbacks that act like the
> >> * regular page allocator APIs.
> >
> > So two points.
> >
> > First could we look at moving this inside the if statement just before
> > we return the page, as there isn't a risk until we get into that path
> > of needing a lock.
> >
> > Secondly rather than returning an error is there any reason why we
> > couldn't just look at not returning page and instead just drop into the
> > release path which wouldn't take the locks in the first place? Either
>
> That is exception path to quickly catch broken drivers and fix them, why
> bother? It's not something we have to live with.

My concern is that the current "fix" consists of stalling a Tx ring.
We need to have a way to allow forward progress when somebody mixes
xdp_frame and skb traffic as I suspect we will end up with a number of
devices doing this since they cannot handle recycling the pages in
hardirq context.

The only reason why the skbs don't have the problem is that they are
queued and then cleaned up in the net_tx_action. That is why I wonder
if we shouldn't look at adding some sort of support for doing
something like that with xdp_frame as well. Something like a
dev_kfree_pp_page_any to go along with the dev_kfree_skb_any.
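
Roughly mirroring dev_kfree_skb_any(), it could look something like the
sketch below. Everything here is hypothetical, including both helper
names, since xdp_frame/page currently has no way to be queued:

/* Hypothetical only; neither helper exists today. */
void dev_kfree_pp_page_any(struct page_pool *pool, struct page *page)
{
	if (in_hardirq() || irqs_disabled())
		/* Defer to softirq, like dev_kfree_skb_irq() does via the
		 * per-CPU completion queue.
		 */
		pp_defer_page_to_softirq(pool, page);
	else
		page_pool_put_full_page(pool, page, false);
}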