2023-07-27 15:15:15

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH net-next 0/9] page_pool: a couple of assorted optimizations

That initially was a spin-off of the IAVF PP series[0], but has grown since
then a bunch. In fact, it consists of 4 semi-independent blocks:

* #1-2: Compile-time optimization. Split page_pool.h into 2 headers to
not overbloat the consumers not needing complex inline helpers and
then stop including it in skbuff.h at all. The first patch is also
prereq for the whole series.
* #3: Improve cacheline locality for users of the Page Pool frag API.
* #4-6: Don't call DMA API when it would end with a no-op, i.e. on
systems with coherent DMA and w/o enabled IOMMU or swiotlb.
* #7-9: Use direct cache recycling more aggressively, when it is safe
obviously. In addition, make sure nobody wants to use Page Pool API
with disabled interrupts.

Patches #1 and #8 are authored by Yunsheng and Jakub respectively, with
small modifications from my side as per ML discussions.
For the perf numbers for #3-9, please see individual commit messages.

[0] https://lore.kernel.org/netdev/[email protected]

Alexander Lobakin (7):
net: skbuff: don't include <net/page_pool/types.h> to <linux/skbuff.h>
page_pool: place frag_* fields in one cacheline
page_pool: shrink &page_pool_params a tiny bit
page_pool: don't use driver-set flags field directly
page_pool: avoid calling no-op externals when possible
net: skbuff: avoid accessing page_pool if !napi_safe when returning
page
net: skbuff: always try to recycle PP pages directly when in softirq

Jakub Kicinski (1):
page_pool: add a lockdep check for recycling in hardirq

Yunsheng Lin (1):
page_pool: split types and declarations from page_pool.h

MAINTAINERS | 3 +-
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 2 +-
drivers/net/ethernet/engleder/tsnep_main.c | 1 +
drivers/net/ethernet/freescale/fec_main.c | 1 +
.../net/ethernet/hisilicon/hns3/hns3_enet.c | 1 +
.../net/ethernet/hisilicon/hns3/hns3_enet.h | 2 +-
drivers/net/ethernet/marvell/mvneta.c | 2 +-
drivers/net/ethernet/marvell/mvpp2/mvpp2.h | 2 +-
.../net/ethernet/marvell/mvpp2/mvpp2_main.c | 1 +
.../marvell/octeontx2/nic/otx2_common.c | 1 +
.../ethernet/marvell/octeontx2/nic/otx2_pf.c | 1 +
drivers/net/ethernet/mediatek/mtk_eth_soc.c | 1 +
drivers/net/ethernet/mediatek/mtk_eth_soc.h | 2 +-
.../ethernet/mellanox/mlx5/core/en/params.c | 1 +
.../net/ethernet/mellanox/mlx5/core/en/trap.c | 1 -
.../net/ethernet/mellanox/mlx5/core/en/xdp.c | 1 +
.../net/ethernet/mellanox/mlx5/core/en_main.c | 2 +-
.../net/ethernet/mellanox/mlx5/core/en_rx.c | 2 +-
.../ethernet/mellanox/mlx5/core/en_stats.c | 2 +-
.../ethernet/microchip/lan966x/lan966x_fdma.c | 1 +
.../ethernet/microchip/lan966x/lan966x_main.h | 2 +-
drivers/net/ethernet/socionext/netsec.c | 2 +-
drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 +-
.../net/ethernet/stmicro/stmmac/stmmac_main.c | 1 +
drivers/net/ethernet/ti/cpsw.c | 2 +-
drivers/net/ethernet/ti/cpsw_new.c | 2 +-
drivers/net/ethernet/ti/cpsw_priv.c | 2 +-
drivers/net/ethernet/wangxun/libwx/wx_lib.c | 2 +-
drivers/net/veth.c | 2 +-
drivers/net/wireless/mediatek/mt76/mac80211.c | 1 -
drivers/net/wireless/mediatek/mt76/mt76.h | 1 +
drivers/net/xen-netfront.c | 2 +-
include/linux/lockdep.h | 7 +
include/linux/skbuff.h | 3 +-
include/net/page_pool/helpers.h | 193 +++++++++++++++
.../net/{page_pool.h => page_pool/types.h} | 221 ++----------------
include/trace/events/page_pool.h | 2 +-
net/bpf/test_run.c | 2 +-
net/core/page_pool.c | 81 ++-----
net/core/skbuff.c | 45 +++-
net/core/xdp.c | 2 +-
42 files changed, 325 insertions(+), 284 deletions(-)
create mode 100644 include/net/page_pool/helpers.h
rename include/net/{page_pool.h => page_pool/types.h} (50%)

---
Yunsheng, Jakub, pls make sure you agree with your patches that went
into the series, as both were modified :z

From RFC v2[1]:
* drop the dependency on the hybrid allocation series (and thus the
"RFC" prefix) -- it wasn't a strict dep and it's not in the trees yet;
* add [slightly reworked] Yunsheng's patch which splits page_pool.h into
2 headers -- merge conflict hell otherwise.
Also fix a typo while nobody looks (Simon);
* #3 (former #2): word the commitmsg a bit better, mention the main
reason for the change more clearly (Ilias);
* add Jakub's hardirq assertion as a prereq for the last patch;
* #9 (former #7): add comment mentioning that the hardirq case is not
checked due to the assertion checking it later (yes, it is illegal to
use Page Pool with the interrupts disabled or when in TH) (Jakub);

From RFC v1[2]:
* #1: move the entire function to skbuff.c, don't try to split it (Alex);
* #2-4: new;
* #5: use internal flags field added in #4 and don't modify driver-defined
structure (Alex, Jakub);
* #6: new;
* drop "add new NAPI state" as a redundant complication;
* #7: replace the check for the new NAPI state to just in_softirq(), should
be fine (Jakub).

[1] https://lore.kernel.org/netdev/[email protected]
[2] https://lore.kernel.org/netdev/[email protected]
--
2.41.0



2023-07-27 15:20:37

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH net-next 4/9] page_pool: shrink &page_pool_params a tiny bit

For now, this structure takes a whole 64-byte cacheline on x86_64.
But in fact, it has a 4-byte hole before ::init_callback() (yet not
sufficient to change its sizeof()).
::dma_dir is whole 4 bytes, although its values can only be 0 and 2.
Merge it with ::flags and, so that its slot gets freed and reduces
the structure's size to 56 bytes. This adds instruction when reading
that field, but the upcoming change will make those reads happen way
less often.
Pad the freed slot explicitly in &page_pool to not alter cacheline
layout while it's not used.

Signed-off-by: Alexander Lobakin <[email protected]>
---
include/net/page_pool/types.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 664a787948e1..c86f65e57614 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -44,13 +44,13 @@ struct pp_alloc_cache {
};

struct page_pool_params {
- unsigned int flags;
+ unsigned int flags:30;
+ enum dma_data_direction dma_dir:2; /* DMA mapping direction */
unsigned int order;
unsigned int pool_size;
int nid; /* Numa node id to allocate from pages from */
struct device *dev; /* device, for DMA pre-mapping purposes */
struct napi_struct *napi; /* Sole consumer of pages, otherwise NULL */
- enum dma_data_direction dma_dir; /* DMA mapping direction */
unsigned int max_len; /* max DMA sync memory size */
unsigned int offset; /* DMA addr offset */
void (*init_callback)(struct page *page, void *arg);
@@ -93,6 +93,7 @@ struct page_pool_stats {

struct page_pool {
struct page_pool_params p;
+ long pad;

long frag_users;
struct page *frag_page;
--
2.41.0


2023-07-27 15:24:59

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH net-next 1/9] page_pool: split types and declarations from page_pool.h

From: Yunsheng Lin <[email protected]>

Split types and pure function declarations from page_pool.h
and add them in page_page/types.h, so that C sources can
include page_pool.h and headers should generally only include
page_pool/types.h as suggested by jakub.
Rename page_pool.h to page_pool/helpers.h to have both in
one place.

Signed-off-by: Yunsheng Lin <[email protected]>
Suggested-by: Jakub Kicinski <[email protected]>
Signed-off-by: Alexander Lobakin <[email protected]>
---
MAINTAINERS | 3 +-
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 2 +-
drivers/net/ethernet/engleder/tsnep_main.c | 1 +
drivers/net/ethernet/freescale/fec_main.c | 1 +
.../net/ethernet/hisilicon/hns3/hns3_enet.c | 1 +
.../net/ethernet/hisilicon/hns3/hns3_enet.h | 2 +-
drivers/net/ethernet/marvell/mvneta.c | 2 +-
drivers/net/ethernet/marvell/mvpp2/mvpp2.h | 2 +-
.../net/ethernet/marvell/mvpp2/mvpp2_main.c | 1 +
.../marvell/octeontx2/nic/otx2_common.c | 1 +
.../ethernet/marvell/octeontx2/nic/otx2_pf.c | 1 +
drivers/net/ethernet/mediatek/mtk_eth_soc.c | 1 +
drivers/net/ethernet/mediatek/mtk_eth_soc.h | 2 +-
.../ethernet/mellanox/mlx5/core/en/params.c | 1 +
.../net/ethernet/mellanox/mlx5/core/en/trap.c | 1 -
.../net/ethernet/mellanox/mlx5/core/en/xdp.c | 1 +
.../net/ethernet/mellanox/mlx5/core/en_main.c | 2 +-
.../net/ethernet/mellanox/mlx5/core/en_rx.c | 2 +-
.../ethernet/mellanox/mlx5/core/en_stats.c | 2 +-
.../ethernet/microchip/lan966x/lan966x_fdma.c | 1 +
.../ethernet/microchip/lan966x/lan966x_main.h | 2 +-
drivers/net/ethernet/socionext/netsec.c | 2 +-
drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 +-
.../net/ethernet/stmicro/stmmac/stmmac_main.c | 1 +
drivers/net/ethernet/ti/cpsw.c | 2 +-
drivers/net/ethernet/ti/cpsw_new.c | 2 +-
drivers/net/ethernet/ti/cpsw_priv.c | 2 +-
drivers/net/ethernet/wangxun/libwx/wx_lib.c | 2 +-
drivers/net/veth.c | 2 +-
drivers/net/wireless/mediatek/mt76/mac80211.c | 1 -
drivers/net/wireless/mediatek/mt76/mt76.h | 1 +
drivers/net/xen-netfront.c | 2 +-
include/linux/skbuff.h | 2 +-
include/net/page_pool/helpers.h | 193 +++++++++++++++++
.../net/{page_pool.h => page_pool/types.h} | 197 +-----------------
include/trace/events/page_pool.h | 2 +-
net/bpf/test_run.c | 2 +-
net/core/page_pool.c | 2 +-
net/core/skbuff.c | 2 +-
net/core/xdp.c | 2 +-
41 files changed, 235 insertions(+), 220 deletions(-)
create mode 100644 include/net/page_pool/helpers.h
rename include/net/{page_pool.h => page_pool/types.h} (51%)

diff --git a/MAINTAINERS b/MAINTAINERS
index d0553ad37865..30037d39b82d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16015,8 +16015,7 @@ M: Ilias Apalodimas <[email protected]>
L: [email protected]
S: Supported
F: Documentation/networking/page_pool.rst
-F: include/net/page_pool.h
-F: include/trace/events/page_pool.h
+F: include/net/page_pool/*.h
F: net/core/page_pool.c

PAGE TABLE CHECK
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index a3bbd13c070f..2e8a1b79bf3f 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -54,7 +54,7 @@
#include <net/pkt_cls.h>
#include <linux/hwmon.h>
#include <linux/hwmon-sysfs.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <linux/align.h>
#include <net/netdev_queues.h>

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index 5b6fbdc4dc40..0fc2c7514aa6 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -15,7 +15,7 @@
#include <linux/bpf.h>
#include <linux/bpf_trace.h>
#include <linux/filter.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include "bnxt_hsi.h"
#include "bnxt.h"
#include "bnxt_xdp.h"
diff --git a/drivers/net/ethernet/engleder/tsnep_main.c b/drivers/net/ethernet/engleder/tsnep_main.c
index 079f9f6ae21a..f61bd89734c5 100644
--- a/drivers/net/ethernet/engleder/tsnep_main.c
+++ b/drivers/net/ethernet/engleder/tsnep_main.c
@@ -28,6 +28,7 @@
#include <linux/iopoll.h>
#include <linux/bpf.h>
#include <linux/bpf_trace.h>
+#include <net/page_pool/helpers.h>
#include <net/xdp_sock_drv.h>

#define TSNEP_RX_OFFSET (max(NET_SKB_PAD, XDP_PACKET_HEADROOM) + NET_IP_ALIGN)
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 03ac7690b5c4..27420f3447fc 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -38,6 +38,7 @@
#include <linux/in.h>
#include <linux/ip.h>
#include <net/ip.h>
+#include <net/page_pool/helpers.h>
#include <net/selftests.h>
#include <net/tso.h>
#include <linux/tcp.h>
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 9f6890059666..e5e37a33fd81 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -18,6 +18,7 @@
#include <net/gre.h>
#include <net/gro.h>
#include <net/ip6_checksum.h>
+#include <net/page_pool/helpers.h>
#include <net/pkt_cls.h>
#include <net/pkt_sched.h>
#include <net/tcp.h>
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
index 88af34bbee34..acd756b0c7c9 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
@@ -6,7 +6,7 @@

#include <linux/dim.h>
#include <linux/if_vlan.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>
#include <asm/barrier.h>

#include "hnae3.h"
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index acf4f6ba73a6..d483b8c00ec0 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -37,7 +37,7 @@
#include <net/ip.h>
#include <net/ipv6.h>
#include <net/tso.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <net/pkt_sched.h>
#include <linux/bpf_trace.h>

diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
index 11e603686a27..e809f91c08fb 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
@@ -16,7 +16,7 @@
#include <linux/phy.h>
#include <linux/phylink.h>
#include <net/flow_offload.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>
#include <linux/bpf.h>
#include <net/xdp.h>

diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index 1fec84b4c068..68fef2027221 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -36,6 +36,7 @@
#include <uapi/linux/ppp_defs.h>
#include <net/ip.h>
#include <net/ipv6.h>
+#include <net/page_pool/helpers.h>
#include <net/tso.h>
#include <linux/bpf_trace.h>

diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
index 8cdd92dd9762..8336cea16aff 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
@@ -7,6 +7,7 @@

#include <linux/interrupt.h>
#include <linux/pci.h>
+#include <net/page_pool/helpers.h>
#include <net/tso.h>
#include <linux/bitfield.h>

diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
index 9551b422622a..bfcc13ad9514 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
@@ -16,6 +16,7 @@
#include <linux/bpf.h>
#include <linux/bpf_trace.h>
#include <linux/bitfield.h>
+#include <net/page_pool/types.h>

#include "otx2_reg.h"
#include "otx2_common.h"
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 7490d48000c2..b72e127d7d3a 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -25,6 +25,7 @@
#include <linux/bitfield.h>
#include <net/dsa.h>
#include <net/dst_metadata.h>
+#include <net/page_pool/helpers.h>

#include "mtk_eth_soc.h"
#include "mtk_wed.h"
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index 28adda0c90c0..5ebe151f0763 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -18,7 +18,7 @@
#include <linux/rhashtable.h>
#include <linux/dim.h>
#include <linux/bitfield.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>
#include <linux/bpf_trace.h>
#include "mtk_ppe.h"

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
index 5ce28ff7685f..e097f336e1c4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
@@ -6,6 +6,7 @@
#include "en/port.h"
#include "en_accel/en_accel.h"
#include "en_accel/ipsec.h"
+#include <net/page_pool/types.h>
#include <net/xdp_sock_drv.h>

static u8 mlx5e_mpwrq_min_page_shift(struct mlx5_core_dev *mdev)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/trap.c b/drivers/net/ethernet/mellanox/mlx5/core/en/trap.c
index 201ac7dd338f..698647cc8c0f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/trap.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/trap.c
@@ -1,7 +1,6 @@
// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
/* Copyright (c) 2020 Mellanox Technologies */

-#include <net/page_pool.h>
#include "en/txrx.h"
#include "en/params.h"
#include "en/trap.h"
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 40589cebb773..12f56d0db0af 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -35,6 +35,7 @@
#include "en/xdp.h"
#include "en/params.h"
#include <linux/bitfield.h>
+#include <net/page_pool/helpers.h>

int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk)
{
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index defb1efccb78..d9b17ab75fa4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -38,7 +38,7 @@
#include <linux/debugfs.h>
#include <linux/if_bridge.h>
#include <linux/filter.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>
#include <net/pkt_sched.h>
#include <net/xdp_sock_drv.h>
#include "eswitch.h"
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 41d37159e027..d01b7f501d9a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -36,7 +36,7 @@
#include <linux/bitmap.h>
#include <linux/filter.h>
#include <net/ip6_checksum.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <net/inet_ecn.h>
#include <net/gro.h>
#include <net/udp.h>
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 4d77055abd4b..07b84d668fcc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -38,7 +38,7 @@
#include "en/port.h"

#ifdef CONFIG_PAGE_POOL_STATS
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#endif

static unsigned int stats_grps_num(struct mlx5e_priv *priv)
diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c b/drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c
index bd72fbc2220f..3960534ac2ad 100644
--- a/drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c
+++ b/drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c
@@ -2,6 +2,7 @@

#include <linux/bpf.h>
#include <linux/filter.h>
+#include <net/page_pool/helpers.h>

#include "lan966x_main.h"

diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_main.h b/drivers/net/ethernet/microchip/lan966x/lan966x_main.h
index 27f272831ea5..b0be99fa8174 100644
--- a/drivers/net/ethernet/microchip/lan966x/lan966x_main.h
+++ b/drivers/net/ethernet/microchip/lan966x/lan966x_main.h
@@ -10,7 +10,7 @@
#include <linux/phy.h>
#include <linux/phylink.h>
#include <linux/ptp_clock_kernel.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>
#include <net/pkt_cls.h>
#include <net/pkt_sched.h>
#include <net/switchdev.h>
diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
index 2d7347b71c41..0cde22a01e03 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -15,7 +15,7 @@
#include <linux/bpf_trace.h>

#include <net/tcp.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <net/ip6_checksum.h>

#define NETSEC_REG_SOFT_RST 0x104
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index 4ce5eaaae513..a57f5f362ff5 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -21,7 +21,7 @@
#include <linux/ptp_clock_kernel.h>
#include <linux/net_tstamp.h>
#include <linux/reset.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>
#include <uapi/linux/bpf.h>

struct stmmac_resources {
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 19e46e8f626a..be676ffefeca 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -39,6 +39,7 @@
#include <linux/phylink.h>
#include <linux/udp.h>
#include <linux/bpf_trace.h>
+#include <net/page_pool/helpers.h>
#include <net/pkt_cls.h>
#include <net/xdp_sock_drv.h>
#include "stmmac_ptp.h"
diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index f9cd566d1c9b..ca4d4548f85e 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -31,7 +31,7 @@
#include <linux/if_vlan.h>
#include <linux/kmemleak.h>
#include <linux/sys_soc.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <linux/bpf.h>
#include <linux/bpf_trace.h>

diff --git a/drivers/net/ethernet/ti/cpsw_new.c b/drivers/net/ethernet/ti/cpsw_new.c
index c61e4e44a78f..0e4f526b1753 100644
--- a/drivers/net/ethernet/ti/cpsw_new.c
+++ b/drivers/net/ethernet/ti/cpsw_new.c
@@ -30,7 +30,7 @@
#include <linux/sys_soc.h>

#include <net/switchdev.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <net/pkt_cls.h>
#include <net/devlink.h>

diff --git a/drivers/net/ethernet/ti/cpsw_priv.c b/drivers/net/ethernet/ti/cpsw_priv.c
index e966dd47e2db..1e81e95bcc4b 100644
--- a/drivers/net/ethernet/ti/cpsw_priv.c
+++ b/drivers/net/ethernet/ti/cpsw_priv.c
@@ -18,7 +18,7 @@
#include <linux/platform_device.h>
#include <linux/pm_runtime.h>
#include <linux/skbuff.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <net/pkt_cls.h>
#include <net/pkt_sched.h>

diff --git a/drivers/net/ethernet/wangxun/libwx/wx_lib.c b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
index 2c3f08be8c37..e04d4a5eed7b 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_lib.c
+++ b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
@@ -3,7 +3,7 @@

#include <linux/etherdevice.h>
#include <net/ip6_checksum.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <net/inet_ecn.h>
#include <linux/iopoll.h>
#include <linux/sctp.h>
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 614f3e3efab0..953f6d8f8db0 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -26,7 +26,7 @@
#include <linux/ptr_ring.h>
#include <linux/bpf_trace.h>
#include <linux/net_tstamp.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>

#define DRV_NAME "veth"
#define DRV_VERSION "1.0"
diff --git a/drivers/net/wireless/mediatek/mt76/mac80211.c b/drivers/net/wireless/mediatek/mt76/mac80211.c
index 467afef98ba2..c8f7f80746e8 100644
--- a/drivers/net/wireless/mediatek/mt76/mac80211.c
+++ b/drivers/net/wireless/mediatek/mt76/mac80211.c
@@ -4,7 +4,6 @@
*/
#include <linux/sched.h>
#include <linux/of.h>
-#include <net/page_pool.h>
#include "mt76.h"

#define CHAN2G(_idx, _freq) { \
diff --git a/drivers/net/wireless/mediatek/mt76/mt76.h b/drivers/net/wireless/mediatek/mt76/mt76.h
index 6b07b8fafec2..81192a07f17f 100644
--- a/drivers/net/wireless/mediatek/mt76/mt76.h
+++ b/drivers/net/wireless/mediatek/mt76/mt76.h
@@ -15,6 +15,7 @@
#include <linux/average.h>
#include <linux/soc/mediatek/mtk_wed.h>
#include <net/mac80211.h>
+#include <net/page_pool/helpers.h>
#include "util.h"
#include "testmode.h"

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 47d54d8ea59d..ad29f370034e 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -45,7 +45,7 @@
#include <linux/slab.h>
#include <net/ip.h>
#include <linux/bpf.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>
#include <linux/bpf_trace.h>

#include <xen/xen.h>
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 16a49ba534e4..888e3d7e74c1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -32,7 +32,7 @@
#include <linux/if_packet.h>
#include <linux/llist.h>
#include <net/flow.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
#include <linux/netfilter/nf_conntrack_common.h>
#endif
diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
new file mode 100644
index 000000000000..e2d8d3a8810c
--- /dev/null
+++ b/include/net/page_pool/helpers.h
@@ -0,0 +1,193 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * page_pool/helpers.h
+ * Author: Jesper Dangaard Brouer <[email protected]>
+ * Copyright (C) 2016 Red Hat, Inc.
+ */
+
+/**
+ * DOC: page_pool allocator
+ *
+ * This page_pool allocator is optimized for the XDP mode that
+ * uses one-frame-per-page, but have fallbacks that act like the
+ * regular page allocator APIs.
+ *
+ * Basic use involve replacing alloc_pages() calls with the
+ * page_pool_alloc_pages() call. Drivers should likely use
+ * page_pool_dev_alloc_pages() replacing dev_alloc_pages().
+ *
+ * API keeps track of in-flight pages, in-order to let API user know
+ * when it is safe to dealloactor page_pool object. Thus, API users
+ * must call page_pool_put_page() where appropriate and only attach
+ * the page to a page_pool-aware objects, like skbs marked for recycling.
+ *
+ * API user must only call page_pool_put_page() once on a page, as it
+ * will either recycle the page, or in case of elevated refcnt, it
+ * will release the DMA mapping and in-flight state accounting. We
+ * hope to lift this requirement in the future.
+ */
+#ifndef _NET_PAGE_POOL_HELPERS_H
+#define _NET_PAGE_POOL_HELPERS_H
+
+#include <net/page_pool/types.h>
+
+#ifdef CONFIG_PAGE_POOL_STATS
+int page_pool_ethtool_stats_get_count(void);
+u8 *page_pool_ethtool_stats_get_strings(u8 *data);
+u64 *page_pool_ethtool_stats_get(u64 *data, void *stats);
+
+/*
+ * Drivers that wish to harvest page pool stats and report them to users
+ * (perhaps via ethtool, debugfs, or another mechanism) can allocate a
+ * struct page_pool_stats call page_pool_get_stats to get stats for the specified pool.
+ */
+bool page_pool_get_stats(struct page_pool *pool,
+ struct page_pool_stats *stats);
+#else
+static inline int page_pool_ethtool_stats_get_count(void)
+{
+ return 0;
+}
+
+static inline u8 *page_pool_ethtool_stats_get_strings(u8 *data)
+{
+ return data;
+}
+
+static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
+{
+ return data;
+}
+#endif
+
+static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
+{
+ gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
+
+ return page_pool_alloc_pages(pool, gfp);
+}
+
+static inline struct page *page_pool_dev_alloc_frag(struct page_pool *pool,
+ unsigned int *offset,
+ unsigned int size)
+{
+ gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
+
+ return page_pool_alloc_frag(pool, offset, size, gfp);
+}
+
+/* get the stored dma direction. A driver might decide to treat this locally and
+ * avoid the extra cache line from page_pool to determine the direction
+ */
+static
+inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
+{
+ return pool->p.dma_dir;
+}
+
+/* pp_frag_count represents the number of writers who can update the page
+ * either by updating skb->data or via DMA mappings for the device.
+ * We can't rely on the page refcnt for that as we don't know who might be
+ * holding page references and we can't reliably destroy or sync DMA mappings
+ * of the fragments.
+ *
+ * When pp_frag_count reaches 0 we can either recycle the page if the page
+ * refcnt is 1 or return it back to the memory allocator and destroy any
+ * mappings we have.
+ */
+static inline void page_pool_fragment_page(struct page *page, long nr)
+{
+ atomic_long_set(&page->pp_frag_count, nr);
+}
+
+static inline long page_pool_defrag_page(struct page *page, long nr)
+{
+ long ret;
+
+ /* If nr == pp_frag_count then we have cleared all remaining
+ * references to the page. No need to actually overwrite it, instead
+ * we can leave this to be overwritten by the calling function.
+ *
+ * The main advantage to doing this is that an atomic_read is
+ * generally a much cheaper operation than an atomic update,
+ * especially when dealing with a page that may be partitioned
+ * into only 2 or 3 pieces.
+ */
+ if (atomic_long_read(&page->pp_frag_count) == nr)
+ return 0;
+
+ ret = atomic_long_sub_return(nr, &page->pp_frag_count);
+ WARN_ON(ret < 0);
+ return ret;
+}
+
+static inline bool page_pool_is_last_frag(struct page_pool *pool,
+ struct page *page)
+{
+ /* If fragments aren't enabled or count is 0 we were the last user */
+ return !(pool->p.flags & PP_FLAG_PAGE_FRAG) ||
+ (page_pool_defrag_page(page, 1) == 0);
+}
+
+static inline void page_pool_put_page(struct page_pool *pool,
+ struct page *page,
+ unsigned int dma_sync_size,
+ bool allow_direct)
+{
+ /* When page_pool isn't compiled-in, net/core/xdp.c doesn't
+ * allow registering MEM_TYPE_PAGE_POOL, but shield linker.
+ */
+#ifdef CONFIG_PAGE_POOL
+ if (!page_pool_is_last_frag(pool, page))
+ return;
+
+ page_pool_put_defragged_page(pool, page, dma_sync_size, allow_direct);
+#endif
+}
+
+/* Same as above but will try to sync the entire area pool->max_len */
+static inline void page_pool_put_full_page(struct page_pool *pool,
+ struct page *page, bool allow_direct)
+{
+ page_pool_put_page(pool, page, -1, allow_direct);
+}
+
+/* Same as above but the caller must guarantee safe context. e.g NAPI */
+static inline void page_pool_recycle_direct(struct page_pool *pool,
+ struct page *page)
+{
+ page_pool_put_full_page(pool, page, true);
+}
+
+#define PAGE_POOL_DMA_USE_PP_FRAG_COUNT \
+ (sizeof(dma_addr_t) > sizeof(unsigned long))
+
+static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
+{
+ dma_addr_t ret = page->dma_addr;
+
+ if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
+ ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
+
+ return ret;
+}
+
+static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
+{
+ page->dma_addr = addr;
+ if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
+ page->dma_addr_upper = upper_32_bits(addr);
+}
+
+static inline bool page_pool_put(struct page_pool *pool)
+{
+ return refcount_dec_and_test(&pool->user_cnt);
+}
+
+static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid)
+{
+ if (unlikely(pool->p.nid != new_nid))
+ page_pool_update_nid(pool, new_nid);
+}
+
+#endif /* _NET_PAGE_POOL_HELPERS_H */
diff --git a/include/net/page_pool.h b/include/net/page_pool/types.h
similarity index 51%
rename from include/net/page_pool.h
rename to include/net/page_pool/types.h
index f1d5cc1fa13b..4a0270291deb 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool/types.h
@@ -1,37 +1,10 @@
-/* SPDX-License-Identifier: GPL-2.0
- *
- * page_pool.h
- * Author: Jesper Dangaard Brouer <[email protected]>
- * Copyright (C) 2016 Red Hat, Inc.
- */
+/* SPDX-License-Identifier: GPL-2.0 */

-/**
- * DOC: page_pool allocator
- *
- * This page_pool allocator is optimized for the XDP mode that
- * uses one-frame-per-page, but have fallbacks that act like the
- * regular page allocator APIs.
- *
- * Basic use involve replacing alloc_pages() calls with the
- * page_pool_alloc_pages() call. Drivers should likely use
- * page_pool_dev_alloc_pages() replacing dev_alloc_pages().
- *
- * API keeps track of in-flight pages, in-order to let API user know
- * when it is safe to dealloactor page_pool object. Thus, API users
- * must call page_pool_put_page() where appropriate and only attach
- * the page to a page_pool-aware objects, like skbs marked for recycling.
- *
- * API user must only call page_pool_put_page() once on a page, as it
- * will either recycle the page, or in case of elevated refcnt, it
- * will release the DMA mapping and in-flight state accounting. We
- * hope to lift this requirement in the future.
- */
-#ifndef _NET_PAGE_POOL_H
-#define _NET_PAGE_POOL_H
+#ifndef _NET_PAGE_POOL_TYPES_H
+#define _NET_PAGE_POOL_TYPES_H

-#include <linux/mm.h> /* Needed by ptr_ring */
-#include <linux/ptr_ring.h>
#include <linux/dma-direction.h>
+#include <linux/ptr_ring.h>

#define PP_FLAG_DMA_MAP BIT(0) /* Should page_pool do the DMA
* map/unmap
@@ -116,35 +89,6 @@ struct page_pool_stats {
struct page_pool_alloc_stats alloc_stats;
struct page_pool_recycle_stats recycle_stats;
};
-
-int page_pool_ethtool_stats_get_count(void);
-u8 *page_pool_ethtool_stats_get_strings(u8 *data);
-u64 *page_pool_ethtool_stats_get(u64 *data, void *stats);
-
-/*
- * Drivers that wish to harvest page pool stats and report them to users
- * (perhaps via ethtool, debugfs, or another mechanism) can allocate a
- * struct page_pool_stats call page_pool_get_stats to get stats for the specified pool.
- */
-bool page_pool_get_stats(struct page_pool *pool,
- struct page_pool_stats *stats);
-#else
-
-static inline int page_pool_ethtool_stats_get_count(void)
-{
- return 0;
-}
-
-static inline u8 *page_pool_ethtool_stats_get_strings(u8 *data)
-{
- return data;
-}
-
-static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
-{
- return data;
-}
-
#endif

struct page_pool {
@@ -188,7 +132,7 @@ struct page_pool {
* association with allocation resource.
*
* Use ptr_ring, as it separates consumer and producer
- * effeciently, it a way that doesn't bounce cache-lines.
+ * efficiently, it a way that doesn't bounce cache-lines.
*
* TODO: Implement bulk return pages into this structure.
*/
@@ -210,35 +154,8 @@ struct page_pool {
};

struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
-
-static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
-{
- gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
-
- return page_pool_alloc_pages(pool, gfp);
-}
-
struct page *page_pool_alloc_frag(struct page_pool *pool, unsigned int *offset,
unsigned int size, gfp_t gfp);
-
-static inline struct page *page_pool_dev_alloc_frag(struct page_pool *pool,
- unsigned int *offset,
- unsigned int size)
-{
- gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
-
- return page_pool_alloc_frag(pool, offset, size, gfp);
-}
-
-/* get the stored dma direction. A driver might decide to treat this locally and
- * avoid the extra cache line from page_pool to determine the direction
- */
-static
-inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
-{
- return pool->p.dma_dir;
-}
-
bool page_pool_return_skb_page(struct page *page, bool napi_safe);

struct page_pool *page_pool_create(const struct page_pool_params *params);
@@ -277,100 +194,6 @@ void page_pool_put_defragged_page(struct page_pool *pool, struct page *page,
unsigned int dma_sync_size,
bool allow_direct);

-/* pp_frag_count represents the number of writers who can update the page
- * either by updating skb->data or via DMA mappings for the device.
- * We can't rely on the page refcnt for that as we don't know who might be
- * holding page references and we can't reliably destroy or sync DMA mappings
- * of the fragments.
- *
- * When pp_frag_count reaches 0 we can either recycle the page if the page
- * refcnt is 1 or return it back to the memory allocator and destroy any
- * mappings we have.
- */
-static inline void page_pool_fragment_page(struct page *page, long nr)
-{
- atomic_long_set(&page->pp_frag_count, nr);
-}
-
-static inline long page_pool_defrag_page(struct page *page, long nr)
-{
- long ret;
-
- /* If nr == pp_frag_count then we have cleared all remaining
- * references to the page. No need to actually overwrite it, instead
- * we can leave this to be overwritten by the calling function.
- *
- * The main advantage to doing this is that an atomic_read is
- * generally a much cheaper operation than an atomic update,
- * especially when dealing with a page that may be partitioned
- * into only 2 or 3 pieces.
- */
- if (atomic_long_read(&page->pp_frag_count) == nr)
- return 0;
-
- ret = atomic_long_sub_return(nr, &page->pp_frag_count);
- WARN_ON(ret < 0);
- return ret;
-}
-
-static inline bool page_pool_is_last_frag(struct page_pool *pool,
- struct page *page)
-{
- /* If fragments aren't enabled or count is 0 we were the last user */
- return !(pool->p.flags & PP_FLAG_PAGE_FRAG) ||
- (page_pool_defrag_page(page, 1) == 0);
-}
-
-static inline void page_pool_put_page(struct page_pool *pool,
- struct page *page,
- unsigned int dma_sync_size,
- bool allow_direct)
-{
- /* When page_pool isn't compiled-in, net/core/xdp.c doesn't
- * allow registering MEM_TYPE_PAGE_POOL, but shield linker.
- */
-#ifdef CONFIG_PAGE_POOL
- if (!page_pool_is_last_frag(pool, page))
- return;
-
- page_pool_put_defragged_page(pool, page, dma_sync_size, allow_direct);
-#endif
-}
-
-/* Same as above but will try to sync the entire area pool->max_len */
-static inline void page_pool_put_full_page(struct page_pool *pool,
- struct page *page, bool allow_direct)
-{
- page_pool_put_page(pool, page, -1, allow_direct);
-}
-
-/* Same as above but the caller must guarantee safe context. e.g NAPI */
-static inline void page_pool_recycle_direct(struct page_pool *pool,
- struct page *page)
-{
- page_pool_put_full_page(pool, page, true);
-}
-
-#define PAGE_POOL_DMA_USE_PP_FRAG_COUNT \
- (sizeof(dma_addr_t) > sizeof(unsigned long))
-
-static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
-{
- dma_addr_t ret = page->dma_addr;
-
- if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
- ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
-
- return ret;
-}
-
-static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
-{
- page->dma_addr = addr;
- if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
- page->dma_addr_upper = upper_32_bits(addr);
-}
-
static inline bool is_page_pool_compiled_in(void)
{
#ifdef CONFIG_PAGE_POOL
@@ -380,17 +203,7 @@ static inline bool is_page_pool_compiled_in(void)
#endif
}

-static inline bool page_pool_put(struct page_pool *pool)
-{
- return refcount_dec_and_test(&pool->user_cnt);
-}
-
/* Caller must provide appropriate safe context, e.g. NAPI. */
void page_pool_update_nid(struct page_pool *pool, int new_nid);
-static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid)
-{
- if (unlikely(pool->p.nid != new_nid))
- page_pool_update_nid(pool, new_nid);
-}

#endif /* _NET_PAGE_POOL_H */
diff --git a/include/trace/events/page_pool.h b/include/trace/events/page_pool.h
index ca534501158b..6834356b2d2a 100644
--- a/include/trace/events/page_pool.h
+++ b/include/trace/events/page_pool.h
@@ -9,7 +9,7 @@
#include <linux/tracepoint.h>

#include <trace/events/mmflags.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>

TRACE_EVENT(page_pool_release,

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 7d47f53f20c1..f892698c8829 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -15,7 +15,7 @@
#include <net/sock.h>
#include <net/tcp.h>
#include <net/net_namespace.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <linux/error-injection.h>
#include <linux/smp.h>
#include <linux/sock_diag.h>
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 7ca456bfab71..2a75f61264c5 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -10,7 +10,7 @@
#include <linux/slab.h>
#include <linux/device.h>

-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <net/xdp.h>

#include <linux/dma-direction.h>
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a298992060e6..ac8f421f8ab3 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,7 +73,7 @@
#include <net/mpls.h>
#include <net/mptcp.h>
#include <net/mctp.h>
-#include <net/page_pool.h>
+#include <net/page_pool/types.h>
#include <net/dropreason.h>

#include <linux/uaccess.h>
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 8362130bf085..a70670fe9a2d 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -14,7 +14,7 @@
#include <linux/idr.h>
#include <linux/rhashtable.h>
#include <linux/bug.h>
-#include <net/page_pool.h>
+#include <net/page_pool/helpers.h>

#include <net/xdp.h>
#include <net/xdp_priv.h> /* struct xdp_mem_allocator */
--
2.41.0


2023-07-27 15:29:58

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH net-next 5/9] page_pool: don't use driver-set flags field directly

page_pool::p is driver-defined params, copied directly from the
structure passed via page_pool_create(). The structure isn't meant
to be modified by the Page Pool core code and this even might look
confusing[0][1].
In order to be able to alter some flags, let's define our own, internal
fields. Use the slot freed earlier to stay within the same cacheline as
before (or almost if it's shorter than 64 bytes).
The flags indicating whether to perform DMA mapping and use frags can
be bool; as for DMA sync, define it as an enum to be able to extend it
later on. They are defined as bits in the driver-set params, leave them
so here as well, to not waste byte-per-bit or so. Now there are 29 free
bits left in those 4 bytes + 4 free bytes more before the cacheline
boundary.
We could've defined only new flags here or only the ones we may need
to alter, but checking some flags in one place while others in another
doesn't sound convenient or intuitive.

Suggested-by: Jakub Kicinski <[email protected]>
Link[0]: https://lore.kernel.org/netdev/[email protected]
Suggested-by: Alexander Duyck <[email protected]>
Link[1]: https://lore.kernel.org/netdev/CAKgT0UfZCGnWgOH96E4GV3ZP6LLbROHM7SHE8NKwq+exX+Gk_Q@mail.gmail.com
Signed-off-by: Alexander Lobakin <[email protected]>
---
include/net/page_pool/helpers.h | 2 +-
include/net/page_pool/types.h | 8 +++++++-
net/core/page_pool.c | 33 +++++++++++++++++----------------
3 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index e2d8d3a8810c..a09ba80b889e 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -125,7 +125,7 @@ static inline bool page_pool_is_last_frag(struct page_pool *pool,
struct page *page)
{
/* If fragments aren't enabled or count is 0 we were the last user */
- return !(pool->p.flags & PP_FLAG_PAGE_FRAG) ||
+ return !pool->page_frag ||
(page_pool_defrag_page(page, 1) == 0);
}

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index c86f65e57614..dd26f4b2b66c 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -93,7 +93,13 @@ struct page_pool_stats {

struct page_pool {
struct page_pool_params p;
- long pad;
+
+ bool dma_map:1; /* Perform DMA mapping */
+ enum {
+ PP_DMA_SYNC_ACT_DISABLED = 0, /* Driver didn't ask to sync */
+ PP_DMA_SYNC_ACT_DO, /* Perform DMA sync ops */
+ } dma_sync_act:1;
+ bool page_frag:1; /* Allow page fragments */

long frag_users;
struct page *frag_page;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 7a23ca6b1124..6a8f105e2df5 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -183,6 +183,8 @@ static int page_pool_init(struct page_pool *pool,
if ((pool->p.dma_dir != DMA_FROM_DEVICE) &&
(pool->p.dma_dir != DMA_BIDIRECTIONAL))
return -EINVAL;
+
+ pool->dma_map = true;
}

if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) {
@@ -195,13 +197,15 @@ static int page_pool_init(struct page_pool *pool,
if (!pool->p.max_len)
return -EINVAL;

+ pool->dma_sync_act = PP_DMA_SYNC_ACT_DO;
+
/* pool->p.offset has to be set according to the address
* offset used by the DMA engine to start copying rx data
*/
}

- if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
- pool->p.flags & PP_FLAG_PAGE_FRAG)
+ pool->page_frag = !!(pool->p.flags & PP_FLAG_PAGE_FRAG);
+ if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT && pool->page_frag)
return -EINVAL;

#ifdef CONFIG_PAGE_POOL_STATS
@@ -218,7 +222,7 @@ static int page_pool_init(struct page_pool *pool,
/* Driver calling page_pool_create() also call page_pool_destroy() */
refcount_set(&pool->user_cnt, 1);

- if (pool->p.flags & PP_FLAG_DMA_MAP)
+ if (pool->dma_map)
get_device(pool->p.dev);

return 0;
@@ -346,7 +350,7 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)

page_pool_set_dma_addr(page, dma);

- if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+ if (pool->dma_sync_act == PP_DMA_SYNC_ACT_DO)
page_pool_dma_sync_for_device(pool, page, pool->p.max_len);

return true;
@@ -377,8 +381,7 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
if (unlikely(!page))
return NULL;

- if ((pool->p.flags & PP_FLAG_DMA_MAP) &&
- unlikely(!page_pool_dma_map(pool, page))) {
+ if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page))) {
put_page(page);
return NULL;
}
@@ -398,8 +401,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
gfp_t gfp)
{
const int bulk = PP_ALLOC_CACHE_REFILL;
- unsigned int pp_flags = pool->p.flags;
unsigned int pp_order = pool->p.order;
+ bool dma_map = pool->dma_map;
struct page *page;
int i, nr_pages;

@@ -424,8 +427,7 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
*/
for (i = 0; i < nr_pages; i++) {
page = pool->alloc.cache[i];
- if ((pp_flags & PP_FLAG_DMA_MAP) &&
- unlikely(!page_pool_dma_map(pool, page))) {
+ if (dma_map && unlikely(!page_pool_dma_map(pool, page))) {
put_page(page);
continue;
}
@@ -497,7 +499,7 @@ static void page_pool_return_page(struct page_pool *pool, struct page *page)
dma_addr_t dma;
int count;

- if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+ if (!pool->dma_map)
/* Always account for inflight pages, even if we didn't
* map them
*/
@@ -563,7 +565,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
}

/* If the page refcnt == 1, this will try to recycle the page.
- * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
+ * if pool->dma_sync_act is set, we'll try to sync the DMA area for
* the configured size min(dma_sync_size, pool->max_len).
* If the page refcnt != 1, then the page will be returned to memory
* subsystem.
@@ -584,7 +586,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
/* Read barrier done in page_ref_count / READ_ONCE */

- if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+ if (pool->dma_sync_act == PP_DMA_SYNC_ACT_DO)
page_pool_dma_sync_for_device(pool, page,
dma_sync_size);

@@ -683,7 +685,7 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
return NULL;

if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
- if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+ if (pool->dma_sync_act == PP_DMA_SYNC_ACT_DO)
page_pool_dma_sync_for_device(pool, page, -1);

return page;
@@ -713,8 +715,7 @@ struct page *page_pool_alloc_frag(struct page_pool *pool,
unsigned int max_size = PAGE_SIZE << pool->p.order;
struct page *page = pool->frag_page;

- if (WARN_ON(!(pool->p.flags & PP_FLAG_PAGE_FRAG) ||
- size > max_size))
+ if (WARN_ON(!pool->page_frag || size > max_size))
return NULL;

size = ALIGN(size, dma_get_cache_alignment());
@@ -774,7 +775,7 @@ static void page_pool_free(struct page_pool *pool)

ptr_ring_cleanup(&pool->ring, NULL);

- if (pool->p.flags & PP_FLAG_DMA_MAP)
+ if (pool->dma_map)
put_device(pool->p.dev);

#ifdef CONFIG_PAGE_POOL_STATS
--
2.41.0


2023-07-27 15:30:31

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH net-next 9/9] net: skbuff: always try to recycle PP pages directly when in softirq

Commit 8c48eea3adf3 ("page_pool: allow caching from safely localized
NAPI") allowed direct recycling of skb pages to their PP for some cases,
but unfortunately missed a couple of other majors.
For example, %XDP_DROP in skb mode. The netstack just calls kfree_skb(),
which unconditionally passes `false` as @napi_safe. Thus, all pages go
through ptr_ring and locks, although most of time we're actually inside
the NAPI polling this PP is linked with, so that it would be perfectly
safe to recycle pages directly.
Let's address such. If @napi_safe is true, we're fine, don't change
anything for this path. But if it's false, check whether we are in the
softirq context. It will most likely be so and then if ->list_owner
is our current CPU, we're good to use direct recycling, even though
@napi_safe is false -- concurrent access is excluded. in_softirq()
protection is needed mostly due to we can hit this place in the
process context (not the hardirq though).
For the mentioned xdp-drop-skb-mode case, the improvement I got is
3-4% in Mpps. As for page_pool stats, recycle_ring is now 0 and
alloc_slow counter doesn't change most of time, which means the
MM layer is not even called to allocate any new pages.

Suggested-by: Jakub Kicinski <[email protected]> # in_softirq()
Signed-off-by: Alexander Lobakin <[email protected]>
---
net/core/skbuff.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e701401092d7..5ba3948cceed 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -901,8 +901,10 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
/* Allow direct recycle if we have reasons to believe that we are
* in the same context as the consumer would run, so there's
* no possible race.
+ * __page_pool_put_page() makes sure we're not in hardirq context
+ * and interrupts are enabled prior to accessing the cache.
*/
- if (napi_safe) {
+ if (napi_safe || in_softirq()) {
const struct napi_struct *napi = READ_ONCE(pp->p.napi);

allow_direct = napi &&
--
2.41.0


2023-07-27 15:33:47

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH net-next 8/9] page_pool: add a lockdep check for recycling in hardirq

From: Jakub Kicinski <[email protected]>

Page pool use in hardirq is prohibited, add debug checks
to catch misuses. IIRC we previously discussed using
DEBUG_NET_WARN_ON_ONCE() for this, but there were concerns
that people will have DEBUG_NET enabled in perf testing.
I don't think anyone enables lockdep in perf testing,
so use lockdep to avoid pushback and arguing :)

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Alexander Lobakin <[email protected]>
---
include/linux/lockdep.h | 7 +++++++
net/core/page_pool.c | 2 ++
2 files changed, 9 insertions(+)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 310f85903c91..dc2844b071c2 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -625,6 +625,12 @@ do { \
WARN_ON_ONCE(__lockdep_enabled && !this_cpu_read(hardirq_context)); \
} while (0)

+#define lockdep_assert_no_hardirq() \
+do { \
+ WARN_ON_ONCE(__lockdep_enabled && (this_cpu_read(hardirq_context) || \
+ !this_cpu_read(hardirqs_enabled))); \
+} while (0)
+
#define lockdep_assert_preemption_enabled() \
do { \
WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \
@@ -659,6 +665,7 @@ do { \
# define lockdep_assert_irqs_enabled() do { } while (0)
# define lockdep_assert_irqs_disabled() do { } while (0)
# define lockdep_assert_in_irq() do { } while (0)
+# define lockdep_assert_no_hardirq() do { } while (0)

# define lockdep_assert_preemption_enabled() do { } while (0)
# define lockdep_assert_preemption_disabled() do { } while (0)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 529e4b41e9eb..be9371cd9ac7 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -579,6 +579,8 @@ static __always_inline struct page *
__page_pool_put_page(struct page_pool *pool, struct page *page,
unsigned int dma_sync_size, bool allow_direct)
{
+ lockdep_assert_no_hardirq();
+
/* This allocator is optimized for the XDP mode that uses
* one-frame-per-page, but have fallbacks that act like the
* regular page allocator APIs.
--
2.41.0


2023-07-27 16:00:54

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH net-next 2/9] net: skbuff: don't include <net/page_pool/types.h> to <linux/skbuff.h>

Currently, touching <net/page_pool/types.h> triggers a rebuild of more
than half of the kernel. That's because it's included in
<linux/skbuff.h>. And each new include to page_pool/types.h adds more
[useless] data for the toolchain to process per each source file from
that pile.

In commit 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB
recycling"), Matteo included it to be able to call a couple of functions
defined there. Then, in commit 57f05bc2ab24 ("page_pool: keep pp info as
long as page pool owns the page") one of the calls was removed, so only
one was left. It's the call to page_pool_return_skb_page() in
napi_frag_unref(). The function is external and doesn't have any
dependencies. Having very niche page_pool_types.h included only for that
looks like an overkill.

As %PP_SIGNATURE is not local to page_pool.c (was only in the
early submissions), nothing holds this function there. Teleport
page_pool_return_skb_page() to skbuff.c, just next to the main
consumer, skb_pp_recycle(). It's used also in napi_frag_unref() ->
{__,}skb_frag_unref(), so no `static` unfortunately. Maybe next time.
Now, touching page_pool_types.h only triggers rebuilding of the drivers
using it and a couple of core networking files.

Suggested-by: Jakub Kicinski <[email protected]> # make skbuff.h less heavy
Suggested-by: Alexander Duyck <[email protected]> # move to skbuff.c
Signed-off-by: Alexander Lobakin <[email protected]>
---
include/linux/skbuff.h | 3 ++-
include/net/page_pool/types.h | 2 --
net/core/page_pool.c | 39 ---------------------------------
net/core/skbuff.c | 41 ++++++++++++++++++++++++++++++++++-
4 files changed, 42 insertions(+), 43 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 888e3d7e74c1..7effd94efd6c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -32,7 +32,6 @@
#include <linux/if_packet.h>
#include <linux/llist.h>
#include <net/flow.h>
-#include <net/page_pool/types.h>
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
#include <linux/netfilter/nf_conntrack_common.h>
#endif
@@ -3421,6 +3420,8 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
__skb_frag_ref(&skb_shinfo(skb)->frags[f]);
}

+bool page_pool_return_skb_page(struct page *page, bool napi_safe);
+
static inline void
napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe)
{
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 4a0270291deb..c7aef6c75935 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -156,8 +156,6 @@ struct page_pool {
struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
struct page *page_pool_alloc_frag(struct page_pool *pool, unsigned int *offset,
unsigned int size, gfp_t gfp);
-bool page_pool_return_skb_page(struct page *page, bool napi_safe);
-
struct page_pool *page_pool_create(const struct page_pool_params *params);

struct xdp_mem_info;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 2a75f61264c5..7a23ca6b1124 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -906,42 +906,3 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
}
}
EXPORT_SYMBOL(page_pool_update_nid);
-
-bool page_pool_return_skb_page(struct page *page, bool napi_safe)
-{
- struct napi_struct *napi;
- struct page_pool *pp;
- bool allow_direct;
-
- page = compound_head(page);
-
- /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
- * in order to preserve any existing bits, such as bit 0 for the
- * head page of compound page and bit 1 for pfmemalloc page, so
- * mask those bits for freeing side when doing below checking,
- * and page_is_pfmemalloc() is checked in __page_pool_put_page()
- * to avoid recycling the pfmemalloc page.
- */
- if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
- return false;
-
- pp = page->pp;
-
- /* Allow direct recycle if we have reasons to believe that we are
- * in the same context as the consumer would run, so there's
- * no possible race.
- */
- napi = READ_ONCE(pp->p.napi);
- allow_direct = napi_safe && napi &&
- READ_ONCE(napi->list_owner) == smp_processor_id();
-
- /* Driver set this to memory recycling info. Reset it on recycle.
- * This will *not* work for NIC using a split-page memory model.
- * The page will be returned to the pool here regardless of the
- * 'flipped' fragment being in use or not.
- */
- page_pool_put_full_page(pp, page, allow_direct);
-
- return true;
-}
-EXPORT_SYMBOL(page_pool_return_skb_page);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ac8f421f8ab3..3084ef59400b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,7 +73,7 @@
#include <net/mpls.h>
#include <net/mptcp.h>
#include <net/mctp.h>
-#include <net/page_pool/types.h>
+#include <net/page_pool/helpers.h>
#include <net/dropreason.h>

#include <linux/uaccess.h>
@@ -879,6 +879,45 @@ static void skb_clone_fraglist(struct sk_buff *skb)
skb_get(list);
}

+bool page_pool_return_skb_page(struct page *page, bool napi_safe)
+{
+ struct napi_struct *napi;
+ struct page_pool *pp;
+ bool allow_direct;
+
+ page = compound_head(page);
+
+ /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
+ * in order to preserve any existing bits, such as bit 0 for the
+ * head page of compound page and bit 1 for pfmemalloc page, so
+ * mask those bits for freeing side when doing below checking,
+ * and page_is_pfmemalloc() is checked in __page_pool_put_page()
+ * to avoid recycling the pfmemalloc page.
+ */
+ if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
+ return false;
+
+ pp = page->pp;
+
+ /* Allow direct recycle if we have reasons to believe that we are
+ * in the same context as the consumer would run, so there's
+ * no possible race.
+ */
+ napi = READ_ONCE(pp->p.napi);
+ allow_direct = napi_safe && napi &&
+ READ_ONCE(napi->list_owner) == smp_processor_id();
+
+ /* Driver set this to memory recycling info. Reset it on recycle.
+ * This will *not* work for NIC using a split-page memory model.
+ * The page will be returned to the pool here regardless of the
+ * 'flipped' fragment being in use or not.
+ */
+ page_pool_put_full_page(pp, page, allow_direct);
+
+ return true;
+}
+EXPORT_SYMBOL(page_pool_return_skb_page);
+
static bool skb_pp_recycle(struct sk_buff *skb, void *data, bool napi_safe)
{
if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
--
2.41.0


2023-07-27 16:00:57

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH net-next 3/9] page_pool: place frag_* fields in one cacheline

On x86_64, frag_* fields of struct page_pool are scattered across two
cachelines despite the summary size of 24 bytes. All three fields are
used in pretty much the same places, but the last field, ::frag_users,
is pushed out to the next CL, provoking unwanted false-sharing on
hotpath (frags allocation code).
There are some holes and cold members to move around. Move frag_* one
block up, placing them right after &page_pool_params perfectly at the
beginning of CL2. This doesn't do any meaningful to the second block, as
those are some destroy-path cold structures, and doesn't do anything to
::alloc_stats, which still starts at 200-byte offset, 8 bytes after CL3
(still fitting into 1 cacheline).
On my setup, this yields 1-2% of Mpps when using PP frags actively.
When it comes to 32-bit architectures with 32-byte CL: &page_pool_params
plus ::pad is 44 bytes, the block taken care of is 16 bytes within one
CL, so there should be at least no regressions from the actual change.
::pages_state_hold_cnt is not related directly to that triple, but is
paired currently with ::frags_offset and decoupling them would mean
either two 4-byte holes or more invasive layout changes.

Signed-off-by: Alexander Lobakin <[email protected]>
---
include/net/page_pool/types.h | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index c7aef6c75935..664a787948e1 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -94,16 +94,16 @@ struct page_pool_stats {
struct page_pool {
struct page_pool_params p;

+ long frag_users;
+ struct page *frag_page;
+ unsigned int frag_offset;
+ u32 pages_state_hold_cnt;
+
struct delayed_work release_dw;
void (*disconnect)(void *);
unsigned long defer_start;
unsigned long defer_warn;

- u32 pages_state_hold_cnt;
- unsigned int frag_offset;
- struct page *frag_page;
- long frag_users;
-
#ifdef CONFIG_PAGE_POOL_STATS
/* these stats are incremented while in softirq context */
struct page_pool_alloc_stats alloc_stats;
--
2.41.0


2023-07-27 16:19:39

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH net-next 7/9] net: skbuff: avoid accessing page_pool if !napi_safe when returning page

Currently, pp->p.napi is always read, but the actual variable it gets
assigned to is read-only when @napi_safe is true. For the !napi_safe
cases, which yet is still a pack, it's an unneeded operation.
Moreover, it can lead to premature or even redundant page_pool
cacheline access. For example, when page_pool_is_last_frag() returns
false (with the recent frag improvements).
Thus, read it only when @napi_safe is true. This also allows moving
@napi inside the condition block itself. Constify it while we are
here, because why not.

Signed-off-by: Alexander Lobakin <[email protected]>
---
net/core/skbuff.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3084ef59400b..e701401092d7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -881,9 +881,8 @@ static void skb_clone_fraglist(struct sk_buff *skb)

bool page_pool_return_skb_page(struct page *page, bool napi_safe)
{
- struct napi_struct *napi;
+ bool allow_direct = false;
struct page_pool *pp;
- bool allow_direct;

page = compound_head(page);

@@ -903,9 +902,12 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
* in the same context as the consumer would run, so there's
* no possible race.
*/
- napi = READ_ONCE(pp->p.napi);
- allow_direct = napi_safe && napi &&
- READ_ONCE(napi->list_owner) == smp_processor_id();
+ if (napi_safe) {
+ const struct napi_struct *napi = READ_ONCE(pp->p.napi);
+
+ allow_direct = napi &&
+ READ_ONCE(napi->list_owner) == smp_processor_id();
+ }

/* Driver set this to memory recycling info. Reset it on recycle.
* This will *not* work for NIC using a split-page memory model.
--
2.41.0


2023-07-28 10:00:46

by Jesper Dangaard Brouer

[permalink] [raw]
Subject: Re: [PATCH net-next 9/9] net: skbuff: always try to recycle PP pages directly when in softirq



On 27/07/2023 16.43, Alexander Lobakin wrote:
> Commit 8c48eea3adf3 ("page_pool: allow caching from safely localized
> NAPI") allowed direct recycling of skb pages to their PP for some cases,
> but unfortunately missed a couple of other majors.
> For example, %XDP_DROP in skb mode. The netstack just calls kfree_skb(),
> which unconditionally passes `false` as @napi_safe. Thus, all pages go
> through ptr_ring and locks, although most of time we're actually inside
> the NAPI polling this PP is linked with, so that it would be perfectly
> safe to recycle pages directly.

The commit messages is hard to read. It would help me as the reader if
you used a empty line between paragraphs, like in this location (same
goes for other commit descs).

> Let's address such. If @napi_safe is true, we're fine, don't change
> anything for this path. But if it's false, check whether we are in the
> softirq context. It will most likely be so and then if ->list_owner
> is our current CPU, we're good to use direct recycling, even though
> @napi_safe is false -- concurrent access is excluded. in_softirq()
> protection is needed mostly due to we can hit this place in the
> process context (not the hardirq though).

This patch make me a little nervous, as it can create hard-to-debug bugs
if this isn't 100% correct. (Thanks for previous patch that exclude
hardirq via lockdep).

> For the mentioned xdp-drop-skb-mode case, the improvement I got is
> 3-4% in Mpps. As for page_pool stats, recycle_ring is now 0 and
> alloc_slow counter doesn't change most of time, which means the
> MM layer is not even called to allocate any new pages.
>
> Suggested-by: Jakub Kicinski <[email protected]> # in_softirq()
> Signed-off-by: Alexander Lobakin <[email protected]>
> ---
> net/core/skbuff.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index e701401092d7..5ba3948cceed 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -901,8 +901,10 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
> /* Allow direct recycle if we have reasons to believe that we are
> * in the same context as the consumer would run, so there's
> * no possible race.
> + * __page_pool_put_page() makes sure we're not in hardirq context
> + * and interrupts are enabled prior to accessing the cache.
> */
> - if (napi_safe) {
> + if (napi_safe || in_softirq()) {

I used to have in_serving_softirq() in PP to exclude process context
that just disabled BH to do direct recycling (into a lockless array).
This changed in kernel v6.3 commit 542bcea4be86 ("net: page_pool: use
in_softirq() instead") to help threaded NAPI. I guess, nothing blew up
so I guess this was okay to relax this.

> const struct napi_struct *napi = READ_ONCE(pp->p.napi);
>
> allow_direct = napi &&

AFAIK this in_softirq() will allow process context with disabled BH to
also recycle directly into the PP lockless array. With the additional
checks (that are just outside above diff-context) that I assume makes
sure CPU (smp_processor_id()) also match. Is this safe?

--Jesper


2023-07-28 12:25:42

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [PATCH net-next 1/9] page_pool: split types and declarations from page_pool.h

On 2023/7/27 22:43, Alexander Lobakin wrote:

>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index d0553ad37865..30037d39b82d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16015,8 +16015,7 @@ M: Ilias Apalodimas <[email protected]>
> L: [email protected]
> S: Supported
> F: Documentation/networking/page_pool.rst
> -F: include/net/page_pool.h
> -F: include/trace/events/page_pool.h

Is there any reason to remove the above?

> +F: include/net/page_pool/*.h

It seems more common to use 'include/net/page_pool/' in
MAINTAINERS.

> F: net/core/page_pool.c
>

2023-07-28 12:44:51

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [PATCH net-next 2/9] net: skbuff: don't include <net/page_pool/types.h> to <linux/skbuff.h>

On 2023/7/27 22:43, Alexander Lobakin wrote:
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c

...

> +bool page_pool_return_skb_page(struct page *page, bool napi_safe)

Still having the 'page_pool_' prefix seems odd here when it is in the
skbuff.c where most have skb_ or napi_ prefix, is it better to rename
it to something like napi_return_page_pool_page()?

> +{
> + struct napi_struct *napi;
> + struct page_pool *pp;
> + bool allow_direct;
> +
> + page = compound_head(page);
> +
> + /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> + * in order to preserve any existing bits, such as bit 0 for the
> + * head page of compound page and bit 1 for pfmemalloc page, so
> + * mask those bits for freeing side when doing below checking,
> + * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> + * to avoid recycling the pfmemalloc page.
> + */
> + if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> + return false;
> +
> + pp = page->pp;
> +
> + /* Allow direct recycle if we have reasons to believe that we are
> + * in the same context as the consumer would run, so there's
> + * no possible race.
> + */
> + napi = READ_ONCE(pp->p.napi);
> + allow_direct = napi_safe && napi &&
> + READ_ONCE(napi->list_owner) == smp_processor_id();
> +
> + /* Driver set this to memory recycling info. Reset it on recycle.
> + * This will *not* work for NIC using a split-page memory model.
> + * The page will be returned to the pool here regardless of the
> + * 'flipped' fragment being in use or not.
> + */
> + page_pool_put_full_page(pp, page, allow_direct);
> +
> + return true;
> +}
> +EXPORT_SYMBOL(page_pool_return_skb_page);
> +
> static bool skb_pp_recycle(struct sk_buff *skb, void *data, bool napi_safe)
> {
> if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
>

2023-07-28 14:20:04

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [PATCH net-next 5/9] page_pool: don't use driver-set flags field directly

On 2023/7/27 22:43, Alexander Lobakin wrote:

>
> struct page_pool {
> struct page_pool_params p;
> - long pad;
> +
> + bool dma_map:1; /* Perform DMA mapping */
> + enum {
> + PP_DMA_SYNC_ACT_DISABLED = 0, /* Driver didn't ask to sync */
> + PP_DMA_SYNC_ACT_DO, /* Perform DMA sync ops */
> + } dma_sync_act:1;
> + bool page_frag:1; /* Allow page fragments */
>

Isn't it more common or better to just remove the flags field in
'struct page_pool_params' and pass the flags by parameter like
below, so that patch 4 is not needed?

struct page_pool *page_pool_create(const struct page_pool_params *params,
unsigned int flags);

2023-07-28 15:03:20

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next 1/9] page_pool: split types and declarations from page_pool.h

From: Yunsheng Lin <[email protected]>
Date: Fri, 28 Jul 2023 19:58:01 +0800

> On 2023/7/27 22:43, Alexander Lobakin wrote:
>
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index d0553ad37865..30037d39b82d 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -16015,8 +16015,7 @@ M: Ilias Apalodimas <[email protected]>
>> L: [email protected]
>> S: Supported
>> F: Documentation/networking/page_pool.rst
>> -F: include/net/page_pool.h
>> -F: include/trace/events/page_pool.h
>
> Is there any reason to remove the above?

Breh, that was accidental >_<

>
>> +F: include/net/page_pool/*.h
>
> It seems more common to use 'include/net/page_pool/' in
> MAINTAINERS.

I see now, looks like it. Will fix ._.

>
>> F: net/core/page_pool.c
>>

Thanks,
Olek

2023-07-28 15:12:36

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next 9/9] net: skbuff: always try to recycle PP pages directly when in softirq

From: Jesper Dangaard Brouer <[email protected]>
Date: Fri, 28 Jul 2023 11:32:01 +0200

>
>
> On 27/07/2023 16.43, Alexander Lobakin wrote:
>> Commit 8c48eea3adf3 ("page_pool: allow caching from safely localized
>> NAPI") allowed direct recycling of skb pages to their PP for some cases,
>> but unfortunately missed a couple of other majors.
>> For example, %XDP_DROP in skb mode. The netstack just calls kfree_skb(),
>> which unconditionally passes `false` as @napi_safe. Thus, all pages go
>> through ptr_ring and locks, although most of time we're actually inside
>> the NAPI polling this PP is linked with, so that it would be perfectly
>> safe to recycle pages directly.
>
> The commit messages is hard to read. It would help me as the reader if
> you used a empty line between paragraphs, like in this location (same
> goes for other commit descs).

O_o
I paste empty line basing on logics. These two don't have it, as the
second paragraph is the continuation of the first: it expands what I
mean by "a couple of other majors".
Do you want to have empty newlines between each paragraph instead?

>
>> Let's address such. If @napi_safe is true, we're fine, don't change
>> anything for this path. But if it's false, check whether we are in the
>> softirq context. It will most likely be so and then if ->list_owner
>> is our current CPU, we're good to use direct recycling, even though
>> @napi_safe is false -- concurrent access is excluded. in_softirq()
>> protection is needed mostly due to we can hit this place in the
>> process context (not the hardirq though).
>
> This patch make me a little nervous, as it can create hard-to-debug bugs
> if this isn't 100% correct.  (Thanks for previous patch that exclude
> hardirq via lockdep).

Pretty much any -next patch can create "hard-to-debug" bugs. Not a
reason to avoid any improvements, tho?
Speaking of this particular patch, can you give an example of situation
where this wouldn't be correct?

>
>> For the mentioned xdp-drop-skb-mode case, the improvement I got is
>> 3-4% in Mpps. As for page_pool stats, recycle_ring is now 0 and
>> alloc_slow counter doesn't change most of time, which means the
>> MM layer is not even called to allocate any new pages.
>>
>> Suggested-by: Jakub Kicinski <[email protected]> # in_softirq()
>> Signed-off-by: Alexander Lobakin <[email protected]>
>> ---
>>   net/core/skbuff.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index e701401092d7..5ba3948cceed 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -901,8 +901,10 @@ bool page_pool_return_skb_page(struct page *page,
>> bool napi_safe)
>>       /* Allow direct recycle if we have reasons to believe that we are
>>        * in the same context as the consumer would run, so there's
>>        * no possible race.
>> +     * __page_pool_put_page() makes sure we're not in hardirq context
>> +     * and interrupts are enabled prior to accessing the cache.
>>        */
>> -    if (napi_safe) {
>> +    if (napi_safe || in_softirq()) {
>
> I used to have in_serving_softirq() in PP to exclude process context
> that just disabled BH to do direct recycling (into a lockless array).
> This changed in kernel v6.3 commit 542bcea4be86 ("net: page_pool: use
> in_softirq() instead") to help threaded NAPI.  I guess, nothing blew up
> so I guess this was okay to relax this.

(below)

>
>>           const struct napi_struct *napi = READ_ONCE(pp->p.napi);
>>             allow_direct = napi &&
>
> AFAIK this in_softirq() will allow process context with disabled BH to
> also recycle directly into the PP lockless array.  With the additional
> checks (that are just outside above diff-context) that I assume makes
> sure CPU (smp_processor_id()) also match.  Is this safe?

Disabling BH also disables preemption. smp_processor_id() can give wrong
values only when preemption is enabled (see get_cpu()/put_cpu()).
Also look at how threaded NAPI and busy polling call NAPI polling
callbacks. They just disable BH. And nobody ever said that it's not safe
to call smp_processor_id() in the NAPI polling callbacks.

When your context matches and the processor ID matches, how could you
provoke concurrent access?

>
> --Jesper
>

Thanks,
Olek

2023-07-28 15:32:27

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next 5/9] page_pool: don't use driver-set flags field directly

From: Yunsheng Lin <[email protected]>
Date: Fri, 28 Jul 2023 20:36:50 +0800

> On 2023/7/27 22:43, Alexander Lobakin wrote:
>
>>
>> struct page_pool {
>> struct page_pool_params p;
>> - long pad;
>> +
>> + bool dma_map:1; /* Perform DMA mapping */
>> + enum {
>> + PP_DMA_SYNC_ACT_DISABLED = 0, /* Driver didn't ask to sync */
>> + PP_DMA_SYNC_ACT_DO, /* Perform DMA sync ops */
>> + } dma_sync_act:1;
>> + bool page_frag:1; /* Allow page fragments */
>>
>
> Isn't it more common or better to just remove the flags field in
> 'struct page_pool_params' and pass the flags by parameter like
> below, so that patch 4 is not needed?
>
> struct page_pool *page_pool_create(const struct page_pool_params *params,
> unsigned int flags);

You would need a separate patch to convert all the page_pool_create()
users then either way.
And it doesn't look really natural to me to pass both driver-set params
and driver-set flags as separate function arguments. Someone may then
think "why aren't flags just put in the params itself". The fact that
Page Pool copies the whole params in the page_pool struct after
allocating it is internals, page_pool_create() prototype however isn't.
Thoughts?

Thanks,
Olek

2023-07-28 16:10:25

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next 5/9] page_pool: don't use driver-set flags field directly

On Fri, 28 Jul 2023 16:03:53 +0200 Alexander Lobakin wrote:
> And it doesn't look really natural to me to pass both driver-set params
> and driver-set flags as separate function arguments.

+1

2023-07-28 16:44:05

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next 2/9] net: skbuff: don't include <net/page_pool/types.h> to <linux/skbuff.h>

From: Yunsheng Lin <[email protected]>
Date: Fri, 28 Jul 2023 20:02:51 +0800

> On 2023/7/27 22:43, Alexander Lobakin wrote:
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>
> ...
>
>> +bool page_pool_return_skb_page(struct page *page, bool napi_safe)
>
> Still having the 'page_pool_' prefix seems odd here when it is in the
> skbuff.c where most have skb_ or napi_ prefix, is it better to rename
> it to something like napi_return_page_pool_page()?

Given that how the function that goes next is named, maybe
skb_pp_return_page() (or skb_pp_put_page())?

>
>> +{
>> + struct napi_struct *napi;
>> + struct page_pool *pp;
>> + bool allow_direct;
>> +
>> + page = compound_head(page);
>> +
>> + /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>> + * in order to preserve any existing bits, such as bit 0 for the
>> + * head page of compound page and bit 1 for pfmemalloc page, so
>> + * mask those bits for freeing side when doing below checking,
>> + * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>> + * to avoid recycling the pfmemalloc page.
>> + */
>> + if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
>> + return false;
>> +
>> + pp = page->pp;
>> +
>> + /* Allow direct recycle if we have reasons to believe that we are
>> + * in the same context as the consumer would run, so there's
>> + * no possible race.
>> + */
>> + napi = READ_ONCE(pp->p.napi);
>> + allow_direct = napi_safe && napi &&
>> + READ_ONCE(napi->list_owner) == smp_processor_id();
>> +
>> + /* Driver set this to memory recycling info. Reset it on recycle.
>> + * This will *not* work for NIC using a split-page memory model.
>> + * The page will be returned to the pool here regardless of the
>> + * 'flipped' fragment being in use or not.
>> + */
>> + page_pool_put_full_page(pp, page, allow_direct);
>> +
>> + return true;
>> +}
>> +EXPORT_SYMBOL(page_pool_return_skb_page);
>> +
>> static bool skb_pp_recycle(struct sk_buff *skb, void *data, bool napi_safe)

(this one)

>> {
>> if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
>>

Thanks,
Olek

2023-07-29 12:14:13

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [PATCH net-next 2/9] net: skbuff: don't include <net/page_pool/types.h> to <linux/skbuff.h>

On 2023/7/28 21:58, Alexander Lobakin wrote:
> From: Yunsheng Lin <[email protected]>
> Date: Fri, 28 Jul 2023 20:02:51 +0800
>
>> On 2023/7/27 22:43, Alexander Lobakin wrote:
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>
>> ...
>>
>>> +bool page_pool_return_skb_page(struct page *page, bool napi_safe)
>>
>> Still having the 'page_pool_' prefix seems odd here when it is in the
>> skbuff.c where most have skb_ or napi_ prefix, is it better to rename
>> it to something like napi_return_page_pool_page()?
>
> Given that how the function that goes next is named, maybe
> skb_pp_return_page() (or skb_pp_put_page())?

skb_pp_put_page() seems better.

And I like napi_pp_put_page() with 'napi_' prefix better as
it does not take a skb as parameter and the naming is aligned
with the 'napi_safe' parameter.

>
>>
>>> +{
>>> + struct napi_struct *napi;
>>> + struct page_pool *pp;
>>> + bool allow_direct;
>>> +
>>> + page = compound_head(page);
>>> +
>>> + /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>>> + * in order to preserve any existing bits, such as bit 0 for the
>>> + * head page of compound page and bit 1 for pfmemalloc page, so
>>> + * mask those bits for freeing side when doing below checking,
>>> + * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>>> + * to avoid recycling the pfmemalloc page.
>>> + */
>>> + if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
>>> + return false;
>>> +
>>> + pp = page->pp;
>>> +
>>> + /* Allow direct recycle if we have reasons to believe that we are
>>> + * in the same context as the consumer would run, so there's
>>> + * no possible race.
>>> + */
>>> + napi = READ_ONCE(pp->p.napi);
>>> + allow_direct = napi_safe && napi &&
>>> + READ_ONCE(napi->list_owner) == smp_processor_id();
>>> +
>>> + /* Driver set this to memory recycling info. Reset it on recycle.
>>> + * This will *not* work for NIC using a split-page memory model.
>>> + * The page will be returned to the pool here regardless of the
>>> + * 'flipped' fragment being in use or not.
>>> + */
>>> + page_pool_put_full_page(pp, page, allow_direct);
>>> +
>>> + return true;
>>> +}
>>> +EXPORT_SYMBOL(page_pool_return_skb_page);
>>> +
>>> static bool skb_pp_recycle(struct sk_buff *skb, void *data, bool napi_safe)
>
> (this one)
>
>>> {
>>> if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)

We may need the 'IS_ENABLED(CONFIG_PAGE_POOL' checking in the newly
moved function too.

>>>
>
> Thanks,
> Olek
>
> .
>

2023-07-29 12:18:18

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [PATCH net-next 5/9] page_pool: don't use driver-set flags field directly

On 2023/7/28 22:03, Alexander Lobakin wrote:
> From: Yunsheng Lin <[email protected]>
> Date: Fri, 28 Jul 2023 20:36:50 +0800
>
>> On 2023/7/27 22:43, Alexander Lobakin wrote:
>>
>>>
>>> struct page_pool {
>>> struct page_pool_params p;
>>> - long pad;
>>> +
>>> + bool dma_map:1; /* Perform DMA mapping */
>>> + enum {
>>> + PP_DMA_SYNC_ACT_DISABLED = 0, /* Driver didn't ask to sync */
>>> + PP_DMA_SYNC_ACT_DO, /* Perform DMA sync ops */
>>> + } dma_sync_act:1;
>>> + bool page_frag:1; /* Allow page fragments */
>>>
>>
>> Isn't it more common or better to just remove the flags field in
>> 'struct page_pool_params' and pass the flags by parameter like
>> below, so that patch 4 is not needed?
>>
>> struct page_pool *page_pool_create(const struct page_pool_params *params,
>> unsigned int flags);
>
> You would need a separate patch to convert all the page_pool_create()
> users then either way.
> And it doesn't look really natural to me to pass both driver-set params
> and driver-set flags as separate function arguments. Someone may then
> think "why aren't flags just put in the params itself". The fact that
> Page Pool copies the whole params in the page_pool struct after
> allocating it is internals, page_pool_create() prototype however isn't.
> Thoughts?

It just seems odd to me that dma_map and page_frag is duplicated as we
seems to have the same info in the page_pool->p.flags.

What about:
In [PATCH net-next 4/9] page_pool: shrink &page_pool_params a tiny bit,
'flags' is bit-field'ed with 'dma_dir', what about changing 'dma_dir'
to be bit-field'ed with 'dma_sync_act', so that page_pool->p.flags stays
the same as before, and 'dma_map' & 'page_frag' do not seems be really
needed as we have the same info in page_pool->p.flags?


>
> Thanks,
> Olek
>
> .
>

2023-08-01 14:12:24

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next 5/9] page_pool: don't use driver-set flags field directly

From: Yunsheng Lin <[email protected]>
Date: Sat, 29 Jul 2023 19:40:32 +0800

> On 2023/7/28 22:03, Alexander Lobakin wrote:
>> From: Yunsheng Lin <[email protected]>
>> Date: Fri, 28 Jul 2023 20:36:50 +0800
>>
>>> On 2023/7/27 22:43, Alexander Lobakin wrote:
>>>
>>>>
>>>> struct page_pool {
>>>> struct page_pool_params p;
>>>> - long pad;
>>>> +
>>>> + bool dma_map:1; /* Perform DMA mapping */
>>>> + enum {
>>>> + PP_DMA_SYNC_ACT_DISABLED = 0, /* Driver didn't ask to sync */
>>>> + PP_DMA_SYNC_ACT_DO, /* Perform DMA sync ops */
>>>> + } dma_sync_act:1;
>>>> + bool page_frag:1; /* Allow page fragments */
>>>>
>>>
>>> Isn't it more common or better to just remove the flags field in
>>> 'struct page_pool_params' and pass the flags by parameter like
>>> below, so that patch 4 is not needed?
>>>
>>> struct page_pool *page_pool_create(const struct page_pool_params *params,
>>> unsigned int flags);
>>
>> You would need a separate patch to convert all the page_pool_create()
>> users then either way.
>> And it doesn't look really natural to me to pass both driver-set params
>> and driver-set flags as separate function arguments. Someone may then
>> think "why aren't flags just put in the params itself". The fact that
>> Page Pool copies the whole params in the page_pool struct after
>> allocating it is internals, page_pool_create() prototype however isn't.
>> Thoughts?
>
> It just seems odd to me that dma_map and page_frag is duplicated as we
> seems to have the same info in the page_pool->p.flags.

It's just because we copy the whole &page_pool_params passed by the
driver. It doesn't look good to me to define a new structure and copy
the values field-by-field just to avoid duplicating 3 bits :s

>
> What about:
> In [PATCH net-next 4/9] page_pool: shrink &page_pool_params a tiny bit,
> 'flags' is bit-field'ed with 'dma_dir', what about changing 'dma_dir'
> to be bit-field'ed with 'dma_sync_act', so that page_pool->p.flags stays
> the same as before, and 'dma_map' & 'page_frag' do not seems be really
> needed as we have the same info in page_pool->p.flags?

Not sure I follow :z ::dma_dir is also passed by the driver, so we can't
drop it from the params struct.

>
>
>>
>> Thanks,
>> Olek
>>
>> .
>>

Thanks,
Olek

2023-08-01 15:37:52

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next 2/9] net: skbuff: don't include <net/page_pool/types.h> to <linux/skbuff.h>

From: Yunsheng Lin <[email protected]>
Date: Sat, 29 Jul 2023 19:40:19 +0800

> On 2023/7/28 21:58, Alexander Lobakin wrote:
>> From: Yunsheng Lin <[email protected]>
>> Date: Fri, 28 Jul 2023 20:02:51 +0800
>>
>>> On 2023/7/27 22:43, Alexander Lobakin wrote:
>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>
>>> ...
>>>
>>>> +bool page_pool_return_skb_page(struct page *page, bool napi_safe)
>>>
>>> Still having the 'page_pool_' prefix seems odd here when it is in the
>>> skbuff.c where most have skb_ or napi_ prefix, is it better to rename
>>> it to something like napi_return_page_pool_page()?
>>
>> Given that how the function that goes next is named, maybe
>> skb_pp_return_page() (or skb_pp_put_page())?
>
> skb_pp_put_page() seems better.
>
> And I like napi_pp_put_page() with 'napi_' prefix better as
> it does not take a skb as parameter and the naming is aligned
> with the 'napi_safe' parameter.

Ah, I see. Sounds reasonable. I'll pick napi_pp_put_page() for the next
round, I like this one.

>
>>
>>>
>>>> +{
>>>> + struct napi_struct *napi;
>>>> + struct page_pool *pp;
>>>> + bool allow_direct;
>>>> +
>>>> + page = compound_head(page);
>>>> +
>>>> + /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>>>> + * in order to preserve any existing bits, such as bit 0 for the
>>>> + * head page of compound page and bit 1 for pfmemalloc page, so
>>>> + * mask those bits for freeing side when doing below checking,
>>>> + * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>>>> + * to avoid recycling the pfmemalloc page.
>>>> + */
>>>> + if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
>>>> + return false;
>>>> +
>>>> + pp = page->pp;
>>>> +
>>>> + /* Allow direct recycle if we have reasons to believe that we are
>>>> + * in the same context as the consumer would run, so there's
>>>> + * no possible race.
>>>> + */
>>>> + napi = READ_ONCE(pp->p.napi);
>>>> + allow_direct = napi_safe && napi &&
>>>> + READ_ONCE(napi->list_owner) == smp_processor_id();
>>>> +
>>>> + /* Driver set this to memory recycling info. Reset it on recycle.
>>>> + * This will *not* work for NIC using a split-page memory model.
>>>> + * The page will be returned to the pool here regardless of the
>>>> + * 'flipped' fragment being in use or not.
>>>> + */
>>>> + page_pool_put_full_page(pp, page, allow_direct);
>>>> +
>>>> + return true;
>>>> +}
>>>> +EXPORT_SYMBOL(page_pool_return_skb_page);
>>>> +
>>>> static bool skb_pp_recycle(struct sk_buff *skb, void *data, bool napi_safe)
>>
>> (this one)
>>
>>>> {
>>>> if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
>
> We may need the 'IS_ENABLED(CONFIG_PAGE_POOL' checking in the newly
> moved function too.

The first person who noticed this, for sure we should have it!

>
>>>>
>>
>> Thanks,
>> Olek
>>
>> .
>>

Thanks,
Olek

2023-08-02 12:13:06

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [PATCH net-next 5/9] page_pool: don't use driver-set flags field directly

On 2023/8/1 21:36, Alexander Lobakin wrote:

>>
>> It just seems odd to me that dma_map and page_frag is duplicated as we
>> seems to have the same info in the page_pool->p.flags.
>
> It's just because we copy the whole &page_pool_params passed by the
> driver. It doesn't look good to me to define a new structure and copy
> the values field-by-field just to avoid duplicating 3 bits :s
>
>>
>> What about:
>> In [PATCH net-next 4/9] page_pool: shrink &page_pool_params a tiny bit,
>> 'flags' is bit-field'ed with 'dma_dir', what about changing 'dma_dir'
>> to be bit-field'ed with 'dma_sync_act', so that page_pool->p.flags stays
>> the same as before, and 'dma_map' & 'page_frag' do not seems be really
>> needed as we have the same info in page_pool->p.flags?
>
> Not sure I follow :z ::dma_dir is also passed by the driver, so we can't
> drop it from the params struct.

My bad, I missed that dma_dir is in the params struct.

2023-08-02 21:44:22

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next 5/9] page_pool: don't use driver-set flags field directly

On Tue, 1 Aug 2023 15:36:33 +0200 Alexander Lobakin wrote:
> >> You would need a separate patch to convert all the page_pool_create()
> >> users then either way.
> >> And it doesn't look really natural to me to pass both driver-set params
> >> and driver-set flags as separate function arguments. Someone may then
> >> think "why aren't flags just put in the params itself". The fact that
> >> Page Pool copies the whole params in the page_pool struct after
> >> allocating it is internals, page_pool_create() prototype however isn't.
> >> Thoughts?
> >
> > It just seems odd to me that dma_map and page_frag is duplicated as we
> > seems to have the same info in the page_pool->p.flags.
>
> It's just because we copy the whole &page_pool_params passed by the
> driver. It doesn't look good to me to define a new structure and copy
> the values field-by-field just to avoid duplicating 3 bits :s

FWIW I'm tempted to do something like the patch below (an obvious move,
I suspect). I want to add another pointer (netdev) to the params and
I don't want it to eat up bytes in the first cache line.
The patch is incomplete, we need to stash a one-bit indication in
the first cache line to know init_callback is not present without
having to look at @slow. I'll defer doing that cleanly until your
patches land.
With this in place we can move flags outside of @fast, and interpret
it manually while copying all the other members in one go.

--->8-------------------------------

From c1290e74c3ec54090a49d0c88ca9d56c3bede825 Mon Sep 17 00:00:00 2001
From: Jakub Kicinski <[email protected]>
Date: Wed, 2 Aug 2023 14:16:51 -0700
Subject: [PATCH] net: page_pool: split the page_pool_params into fast and slow

struct page_pool is rather performance critical and we use
16B of the first cache line to store 2 pointers used only
by test code. Future patches will add more informational
(non-fast path) attributes.

It's convenient for the user of the API to not have to worry
which fields are fast and which are slow path. Use struct
groups to split the params into the two categories internally.

Signed-off-by: Jakub Kicinski <[email protected]>
---
include/net/page_pool.h | 31 +++++++++++++++++++------------
net/core/page_pool.c | 7 ++++---
2 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 73d4f786418d..f0267279a8cd 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -83,18 +83,22 @@ struct pp_alloc_cache {
* @offset: DMA sync address offset for PP_FLAG_DMA_SYNC_DEV
*/
struct page_pool_params {
- unsigned int flags;
- unsigned int order;
- unsigned int pool_size;
- int nid;
- struct device *dev;
- struct napi_struct *napi;
- enum dma_data_direction dma_dir;
- unsigned int max_len;
- unsigned int offset;
+ struct_group_tagged(page_pool_params_fast, fast,
+ unsigned int flags;
+ unsigned int order;
+ unsigned int pool_size;
+ int nid;
+ struct device *dev;
+ struct napi_struct *napi;
+ enum dma_data_direction dma_dir;
+ unsigned int max_len;
+ unsigned int offset;
+ );
+ struct_group_tagged(page_pool_params_slow, slow,
/* private: used by test code only */
- void (*init_callback)(struct page *page, void *arg);
- void *init_arg;
+ void (*init_callback)(struct page *page, void *arg);
+ void *init_arg;
+ );
};

#ifdef CONFIG_PAGE_POOL_STATS
@@ -177,7 +181,7 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
#endif

struct page_pool {
- struct page_pool_params p;
+ struct page_pool_params_fast p;

struct delayed_work release_dw;
void (*disconnect)(void *);
@@ -236,6 +240,9 @@ struct page_pool {
refcount_t user_cnt;

u64 destroy_cnt;
+
+ /* Slow/Control-path information follows */
+ struct page_pool_params_slow slow;
};

struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 5d615a169718..fc3f6878a002 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -173,7 +173,8 @@ static int page_pool_init(struct page_pool *pool,
{
unsigned int ring_qsize = 1024; /* Default */

- memcpy(&pool->p, params, sizeof(pool->p));
+ memcpy(&pool->p, &params->fast, sizeof(pool->p));
+ memcpy(&pool->slow, &params->slow, sizeof(pool->slow));

/* Validate only known flags were used */
if (pool->p.flags & ~(PP_FLAG_ALL))
@@ -372,8 +373,8 @@ static void page_pool_set_pp_info(struct page_pool *pool,
{
page->pp = pool;
page->pp_magic |= PP_SIGNATURE;
- if (pool->p.init_callback)
- pool->p.init_callback(page, pool->p.init_arg);
+ if (pool->slow.init_callback)
+ pool->slow.init_callback(page, pool->slow.init_arg);
}

static void page_pool_clear_pp_info(struct page *page)
--
2.41.0


2023-08-03 16:55:55

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next 5/9] page_pool: don't use driver-set flags field directly

From: Jakub Kicinski <[email protected]>
Date: Thu, 3 Aug 2023 09:00:29 -0700

> On Thu, 3 Aug 2023 16:56:22 +0200 Alexander Lobakin wrote:
>>> FWIW I'm tempted to do something like the patch below (an obvious move,
>>> I suspect). I want to add another pointer (netdev) to the params and

[...]

>> I would propose to include it in the series, but it has grown a bunch
>> already and it's better to do that later separately :s
>
> Yeah.. I'd be trying to split your series up a little to make progress
> rather than add more things :( I was going to suggest that you post
> just the first 3 patches for instance. Should be an easy merge.

One minute before I was going to post v2 :>
Sounds good. AFACS only #4-6 are still under question (not for me tho
:D), let me move that piece out.

Thanks,
Olek

2023-08-03 16:58:41

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next 5/9] page_pool: don't use driver-set flags field directly

From: Jakub Kicinski <[email protected]>
Date: Wed, 2 Aug 2023 14:29:20 -0700

> On Tue, 1 Aug 2023 15:36:33 +0200 Alexander Lobakin wrote:
>>>> You would need a separate patch to convert all the page_pool_create()
>>>> users then either way.
>>>> And it doesn't look really natural to me to pass both driver-set params
>>>> and driver-set flags as separate function arguments. Someone may then
>>>> think "why aren't flags just put in the params itself". The fact that
>>>> Page Pool copies the whole params in the page_pool struct after
>>>> allocating it is internals, page_pool_create() prototype however isn't.
>>>> Thoughts?
>>>
>>> It just seems odd to me that dma_map and page_frag is duplicated as we
>>> seems to have the same info in the page_pool->p.flags.
>>
>> It's just because we copy the whole &page_pool_params passed by the
>> driver. It doesn't look good to me to define a new structure and copy
>> the values field-by-field just to avoid duplicating 3 bits :s
>
> FWIW I'm tempted to do something like the patch below (an obvious move,
> I suspect). I want to add another pointer (netdev) to the params and

Just take napi->dev as I do in libie :)

> I don't want it to eat up bytes in the first cache line.
> The patch is incomplete, we need to stash a one-bit indication in
> the first cache line to know init_callback is not present without
> having to look at @slow. I'll defer doing that cleanly until your
> patches land.

I would propose to include it in the series, but it has grown a bunch
already and it's better to do that later separately :s

> With this in place we can move flags outside of @fast, and interpret

Oh, really nice. We could avoid copying them at all.

> it manually while copying all the other members in one go.

[...]

Thanks,
Olek

2023-08-03 17:11:30

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next 5/9] page_pool: don't use driver-set flags field directly

On Thu, 3 Aug 2023 18:07:23 +0200 Alexander Lobakin wrote:
> >> I would propose to include it in the series, but it has grown a bunch
> >> already and it's better to do that later separately :s
> >
> > Yeah.. I'd be trying to split your series up a little to make progress
> > rather than add more things :( I was going to suggest that you post
> > just the first 3 patches for instance. Should be an easy merge.
>
> One minute before I was going to post v2 :>
> Sounds good. AFACS only #4-6 are still under question (not for me tho
> :D), let me move that piece out.

FWIW I'll apply my doc changes in the next 5min if nobody objects.
Can you rebase on top of that?

2023-08-03 17:13:25

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next 5/9] page_pool: don't use driver-set flags field directly

On Thu, 3 Aug 2023 16:56:22 +0200 Alexander Lobakin wrote:
> > FWIW I'm tempted to do something like the patch below (an obvious move,
> > I suspect). I want to add another pointer (netdev) to the params and
>
> Just take napi->dev as I do in libie :)

:) The fields have extra semantics, like napi implies that recycling
is allowed, and netdev implies that there is only _one_ netdev eating
from the PP. There's also a way to get the pp <> netdev from the memory
model registration. But I feel like explicit field is cleanest.

Anyone, conversation for a later time :)

> > I don't want it to eat up bytes in the first cache line.
> > The patch is incomplete, we need to stash a one-bit indication in
> > the first cache line to know init_callback is not present without
> > having to look at @slow. I'll defer doing that cleanly until your
> > patches land.
>
> I would propose to include it in the series, but it has grown a bunch
> already and it's better to do that later separately :s

Yeah.. I'd be trying to split your series up a little to make progress
rather than add more things :( I was going to suggest that you post
just the first 3 patches for instance. Should be an easy merge.

2023-08-03 17:53:04

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next 5/9] page_pool: don't use driver-set flags field directly

From: Jakub Kicinski <[email protected]>
Date: Thu, 3 Aug 2023 09:40:43 -0700

> On Thu, 3 Aug 2023 18:07:23 +0200 Alexander Lobakin wrote:
>>>> I would propose to include it in the series, but it has grown a bunch
>>>> already and it's better to do that later separately :s
>>>
>>> Yeah.. I'd be trying to split your series up a little to make progress
>>> rather than add more things :( I was going to suggest that you post
>>> just the first 3 patches for instance. Should be an easy merge.
>>
>> One minute before I was going to post v2 :>
>> Sounds good. AFACS only #4-6 are still under question (not for me tho
>> :D), let me move that piece out.
>
> FWIW I'll apply my doc changes in the next 5min if nobody objects.
> Can you rebase on top of that?

Breh, sent already >_<

What do I do then? Rebase and send v3?

Thanks,
Olek