Here's a spin-off of the IAVF PP series[0], with 1 compile-time and several
runtime (hotpath) optimizations. They're based on and tested on top of the
hybrid PP allocation series[1], but don't require it to work and are
in general independent of it and of each other.
Per-patch breakdown:
#1: Was already on the lists, but this time it's done the other way,
the one that Alex Duyck proposed during the review of the previous
series. Slightly reduce the amount of C preprocessing by no longer
including <net/page_pool.h> in <linux/skbuff.h> (which is included
in half of the kernel sources). Especially useful with the
abovementioned series applied, as it makes page_pool.h heavier;
#2: New. Group frag_* fields of &page_pool together to reduce cache
misses;
#3-4: New, prereqs for #5. Free 4 bytes in &page_pool_params and combine them
with the already existing hole to get a free slot in the same CL
where the params reside inside &page_pool. Use it to store the internal
PP flags, as opposed to the driver-set ones;
#5: Don't call DMA sync externals when they won't do anything anyway,
by applying a heuristic a bit earlier (when allocating a new page).
Was also on the lists;
#6-7: New. In addition to recycling skb PP pages directly when @napi_safe
is set, check which context we're in and always try to recycle
directly when in softirq (on the same CPU the consumer runs on).
This allows using direct recycling any time we're inside a NAPI
polling loop or the GRO processing right after it, covering way more
cases than it does right now.
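To illustrate #6-7, here's a condensed sketch of the resulting check in
page_pool_return_skb_page() (the complete diffs are in the respective
patches below):

	/* Sketch only: allow direct recycling either when the caller says
	 * it's safe (@napi_safe) or when we're in softirq on the CPU the
	 * pool's NAPI runs on, so no concurrent access is possible.
	 */
	if (napi_safe || in_softirq()) {
		const struct napi_struct *napi = READ_ONCE(pp->p.napi);

		allow_direct = napi &&
			       READ_ONCE(napi->list_owner) == smp_processor_id();
	}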
(complete tree with [1] + this + [0] is available here: [2])
[0] https://lore.kernel.org/netdev/[email protected]
[1] https://lore.kernel.org/netdev/[email protected]
[2] https://github.com/alobakin/linux/commits/iavf-pp-frag
Alexander Lobakin (7):
net: skbuff: don't include <net/page_pool.h> to <linux/skbuff.h>
net: page_pool: place frag_* fields in one cacheline
net: page_pool: shrink &page_pool_params a tiny bit
net: page_pool: don't use driver-set flags field directly
net: page_pool: avoid calling no-op externals when possible
net: skbuff: avoid accessing page_pool if !napi_safe when returning
page
net: skbuff: always try to recycle PP pages directly when in softirq
drivers/net/ethernet/engleder/tsnep_main.c | 1 +
drivers/net/ethernet/freescale/fec_main.c | 1 +
.../marvell/octeontx2/nic/otx2_common.c | 1 +
.../ethernet/marvell/octeontx2/nic/otx2_pf.c | 1 +
.../ethernet/mellanox/mlx5/core/en/params.c | 1 +
.../net/ethernet/mellanox/mlx5/core/en/xdp.c | 1 +
drivers/net/wireless/mediatek/mt76/mt76.h | 1 +
include/linux/skbuff.h | 3 +-
include/net/page_pool.h | 23 +++---
net/core/page_pool.c | 70 +++++--------------
net/core/skbuff.c | 41 +++++++++++
11 files changed, 83 insertions(+), 61 deletions(-)
---
From RFC v1[3]:
* #1: move the entire function to skbuff.c, don't try to split it (Alex);
* #2-4: new;
* #5: use the internal flags field added in #4 and don't modify the
driver-defined structure (Alex, Jakub);
* #6: new;
* drop "add new NAPI state" as a redundant complication;
* #7: replace the check for the new NAPI state with just in_softirq(); should
be fine (Jakub).
[3] https://lore.kernel.org/netdev/[email protected]
--
2.41.0
Currently, touching <net/page_pool.h> triggers a rebuild of more than
half of the kernel. That's because it's included in <linux/skbuff.h>.
And each new include in page_pool.h adds more [useless] data for the
toolchain to process for each source file from that pile.
In commit 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB
recycling"), Matteo included it to be able to call a couple of functions
defined there. Then, in commit 57f05bc2ab24 ("page_pool: keep pp info as
long as page pool owns the page") one of the calls was removed, so only
one was left. It's the call to page_pool_return_skb_page() in
napi_frag_unref(). The function is external and doesn't have any
dependencies. Having the very niche page_pool.h included only for that
looks like overkill.
As %PP_SIGNATURE is not local to page_pool.c (it was only in the
early submissions), nothing holds this function there. Teleport
page_pool_return_skb_page() to skbuff.c, just next to its main
consumer, skb_pp_recycle(). It's also used in napi_frag_unref() ->
{__,}skb_frag_unref(), so no `static`, unfortunately. Maybe next time.
Now, after a few include fixes in the drivers, touching page_pool.h
only triggers a rebuild of the drivers using it and a couple of
core networking files.
Suggested-by: Jakub Kicinski <[email protected]> # make skbuff.h less heavy
Suggested-by: Alexander Duyck <[email protected]> # move to skbuff.c
Signed-off-by: Alexander Lobakin <[email protected]>
---
drivers/net/ethernet/engleder/tsnep_main.c | 1 +
drivers/net/ethernet/freescale/fec_main.c | 1 +
.../marvell/octeontx2/nic/otx2_common.c | 1 +
.../ethernet/marvell/octeontx2/nic/otx2_pf.c | 1 +
.../ethernet/mellanox/mlx5/core/en/params.c | 1 +
.../net/ethernet/mellanox/mlx5/core/en/xdp.c | 1 +
drivers/net/wireless/mediatek/mt76/mt76.h | 1 +
include/linux/skbuff.h | 3 +-
include/net/page_pool.h | 2 -
net/core/page_pool.c | 39 -------------------
net/core/skbuff.c | 39 +++++++++++++++++++
11 files changed, 48 insertions(+), 42 deletions(-)
diff --git a/drivers/net/ethernet/engleder/tsnep_main.c b/drivers/net/ethernet/engleder/tsnep_main.c
index 84751bb303a6..6222aaa5157f 100644
--- a/drivers/net/ethernet/engleder/tsnep_main.c
+++ b/drivers/net/ethernet/engleder/tsnep_main.c
@@ -28,6 +28,7 @@
#include <linux/iopoll.h>
#include <linux/bpf.h>
#include <linux/bpf_trace.h>
+#include <net/page_pool.h>
#include <net/xdp_sock_drv.h>
#define TSNEP_RX_OFFSET (max(NET_SKB_PAD, XDP_PACKET_HEADROOM) + NET_IP_ALIGN)
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 1b990a486059..cfc07f012254 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -38,6 +38,7 @@
#include <linux/in.h>
#include <linux/ip.h>
#include <net/ip.h>
+#include <net/page_pool.h>
#include <net/selftests.h>
#include <net/tso.h>
#include <linux/tcp.h>
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
index a5d03583bf79..d17a0ebc9036 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
@@ -7,6 +7,7 @@
#include <linux/interrupt.h>
#include <linux/pci.h>
+#include <net/page_pool.h>
#include <net/tso.h>
#include <linux/bitfield.h>
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
index fe8ea4e531b7..7eca434a0550 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
@@ -16,6 +16,7 @@
#include <linux/bpf.h>
#include <linux/bpf_trace.h>
#include <linux/bitfield.h>
+#include <net/page_pool.h>
#include "otx2_reg.h"
#include "otx2_common.h"
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
index 5ce28ff7685f..0f152f14165b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
@@ -6,6 +6,7 @@
#include "en/port.h"
#include "en_accel/en_accel.h"
#include "en_accel/ipsec.h"
+#include <net/page_pool.h>
#include <net/xdp_sock_drv.h>
static u8 mlx5e_mpwrq_min_page_shift(struct mlx5_core_dev *mdev)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 40589cebb773..16038c23b7d8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -35,6 +35,7 @@
#include "en/xdp.h"
#include "en/params.h"
#include <linux/bitfield.h>
+#include <net/page_pool.h>
int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk)
{
diff --git a/drivers/net/wireless/mediatek/mt76/mt76.h b/drivers/net/wireless/mediatek/mt76/mt76.h
index 6b07b8fafec2..95c16f11d156 100644
--- a/drivers/net/wireless/mediatek/mt76/mt76.h
+++ b/drivers/net/wireless/mediatek/mt76/mt76.h
@@ -15,6 +15,7 @@
#include <linux/average.h>
#include <linux/soc/mediatek/mtk_wed.h>
#include <net/mac80211.h>
+#include <net/page_pool.h>
#include "util.h"
#include "testmode.h"
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 91ed66952580..f76d172ed262 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -32,7 +32,6 @@
#include <linux/if_packet.h>
#include <linux/llist.h>
#include <net/flow.h>
-#include <net/page_pool.h>
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
#include <linux/netfilter/nf_conntrack_common.h>
#endif
@@ -3423,6 +3422,8 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
__skb_frag_ref(&skb_shinfo(skb)->frags[f]);
}
+bool page_pool_return_skb_page(struct page *page, bool napi_safe);
+
static inline void
napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe)
{
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 2b7db9992fc0..829dc1f8ba6b 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -307,8 +307,6 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
return pool->p.dma_dir;
}
-bool page_pool_return_skb_page(struct page *page, bool napi_safe);
-
struct page_pool *page_pool_create(const struct page_pool_params *params);
struct xdp_mem_info;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 985ccaffc06a..09d76e37ac69 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -913,42 +913,3 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
}
}
EXPORT_SYMBOL(page_pool_update_nid);
-
-bool page_pool_return_skb_page(struct page *page, bool napi_safe)
-{
- struct napi_struct *napi;
- struct page_pool *pp;
- bool allow_direct;
-
- page = compound_head(page);
-
- /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
- * in order to preserve any existing bits, such as bit 0 for the
- * head page of compound page and bit 1 for pfmemalloc page, so
- * mask those bits for freeing side when doing below checking,
- * and page_is_pfmemalloc() is checked in __page_pool_put_page()
- * to avoid recycling the pfmemalloc page.
- */
- if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
- return false;
-
- pp = page->pp;
-
- /* Allow direct recycle if we have reasons to believe that we are
- * in the same context as the consumer would run, so there's
- * no possible race.
- */
- napi = READ_ONCE(pp->p.napi);
- allow_direct = napi_safe && napi &&
- READ_ONCE(napi->list_owner) == smp_processor_id();
-
- /* Driver set this to memory recycling info. Reset it on recycle.
- * This will *not* work for NIC using a split-page memory model.
- * The page will be returned to the pool here regardless of the
- * 'flipped' fragment being in use or not.
- */
- page_pool_put_full_page(pp, page, allow_direct);
-
- return true;
-}
-EXPORT_SYMBOL(page_pool_return_skb_page);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ea9c09b4b927..4de280c454e4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -879,6 +879,45 @@ static void skb_clone_fraglist(struct sk_buff *skb)
skb_get(list);
}
+bool page_pool_return_skb_page(struct page *page, bool napi_safe)
+{
+ struct napi_struct *napi;
+ struct page_pool *pp;
+ bool allow_direct;
+
+ page = compound_head(page);
+
+ /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
+ * in order to preserve any existing bits, such as bit 0 for the
+ * head page of compound page and bit 1 for pfmemalloc page, so
+ * mask those bits for freeing side when doing below checking,
+ * and page_is_pfmemalloc() is checked in __page_pool_put_page()
+ * to avoid recycling the pfmemalloc page.
+ */
+ if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
+ return false;
+
+ pp = page->pp;
+
+ /* Allow direct recycle if we have reasons to believe that we are
+ * in the same context as the consumer would run, so there's
+ * no possible race.
+ */
+ napi = READ_ONCE(pp->p.napi);
+ allow_direct = napi_safe && napi &&
+ READ_ONCE(napi->list_owner) == smp_processor_id();
+
+ /* Driver set this to memory recycling info. Reset it on recycle.
+ * This will *not* work for NIC using a split-page memory model.
+ * The page will be returned to the pool here regardless of the
+ * 'flipped' fragment being in use or not.
+ */
+ page_pool_put_full_page(pp, page, allow_direct);
+
+ return true;
+}
+EXPORT_SYMBOL(page_pool_return_skb_page);
+
static bool skb_pp_recycle(struct sk_buff *skb, void *data, bool napi_safe)
{
if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
--
2.41.0
On x86_64, the frag_* fields of struct page_pool are scattered across two
cachelines despite their total size of 24 bytes. The last field,
::frag_users, is pushed out to the next one, sharing it with
::alloc_stats.
All three fields are used in pretty much the same places. There are some
holes and cold members to move around. Move frag_* one block up, placing
them right after &page_pool_params, perfectly at the beginning of CL2.
This doesn't do anything meaningful to the second block, as those are some
cold destroy-path structures, and doesn't do anything to ::alloc_stats,
which still starts at a 200-byte offset, 8 bytes after the end of CL3
(still fitting into 1 cacheline).
On my setup, this yields 1-2% more Mpps when using PP frags actively.
When it comes to 32-bit architectures with a 32-byte CL: &page_pool_params
plus ::pad is 44 bytes, and the block taken care of is 16 bytes within one
CL, so there should at least be no regressions from the actual change.
Signed-off-by: Alexander Lobakin <[email protected]>
---
include/net/page_pool.h | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 829dc1f8ba6b..212d72b5cfec 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -130,16 +130,16 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
struct page_pool {
struct page_pool_params p;
+ long frag_users;
+ struct page *frag_page;
+ unsigned int frag_offset;
+ u32 pages_state_hold_cnt;
+
struct delayed_work release_dw;
void (*disconnect)(void *);
unsigned long defer_start;
unsigned long defer_warn;
- u32 pages_state_hold_cnt;
- unsigned int frag_offset;
- struct page *frag_page;
- long frag_users;
-
#ifdef CONFIG_PAGE_POOL_STATS
/* these stats are incremented while in softirq context */
struct page_pool_alloc_stats alloc_stats;
--
2.41.0
It turned out that page_pool_put{,_full}_page() can burn quite a bunch of
cycles even on DMA-coherent platforms (like x86) with no active IOMMU or
swiotlb, just due to the call ladder.
Indeed, it's
page_pool_put_page()
page_pool_put_defragged_page() <- external
__page_pool_put_page()
page_pool_dma_sync_for_device() <- non-inline
dma_sync_single_range_for_device()
dma_sync_single_for_device() <- external
dma_direct_sync_single_for_device()
dev_is_dma_coherent() <- exit
For the inline functions, there are no guarantees the compiler won't
uninline them (they're clearly not one-liners, and sometimes compilers
uninline even 2 + 2). The first external call is necessary, but the
remaining 2+ are done for nothing each time, plus a bunch of checks here
and there.
Since Page Pool mappings are long-term and, for one "device + addr" pair,
dma_need_sync() will always return the same value (basically, whether it
belongs to a swiotlb pool), addresses can be tested once right after
they're obtained and the result can be reused until the page is unmapped.
Define a new PP DMA sync operation type, which means "do DMA syncs
for the device, but only when needed", and turn it on by default when the
driver asks to sync pages. When a page is mapped, check whether it needs
syncs and, if so, switch that "sync when needed" back to "always do
syncs" globally for the whole pool (better safe than sorry). As long as
the pool has no pages requiring DMA syncs, this cuts off a good chunk
of calls and checks. Once at least one page requires it, the pool
conservatively falls back to "always call the sync functions", with no
per-page verdicts. It's a fairly rare case anyway that only a few pages
would require syncing.
On my x86_64, this gives a 2% to 5% performance benefit, with no
negative impact in cases when the IOMMU is on and the shortcut can't be
used.
Signed-off-by: Alexander Lobakin <[email protected]>
---
include/net/page_pool.h | 1 +
net/core/page_pool.c | 7 ++++++-
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index aefcb76baf0e..3a2e1b29ff3e 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -133,6 +133,7 @@ struct page_pool {
bool dma_map:1; /* Perform DMA mapping */
enum {
PP_DMA_SYNC_ACT_DISABLED = 0, /* Driver didn't ask to sync */
+ PP_DMA_SYNC_ACT_SKIP, /* Syncs can be skipped */
PP_DMA_SYNC_ACT_DO, /* Perform DMA sync ops */
} dma_sync_act:2;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 67957e74e1f5..ee6ce6786e29 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -196,7 +196,8 @@ static int page_pool_init(struct page_pool *pool,
if (!pool->p.max_len)
return -EINVAL;
- pool->dma_sync_act = PP_DMA_SYNC_ACT_DO;
+ /* Try to avoid calling no-op syncs */
+ pool->dma_sync_act = PP_DMA_SYNC_ACT_SKIP;
/* pool->p.offset has to be set according to the address
* offset used by the DMA engine to start copying rx data
@@ -345,6 +346,10 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
page_pool_set_dma_addr(page, dma);
+ if (pool->dma_sync_act == PP_DMA_SYNC_ACT_SKIP &&
+ dma_need_sync(pool->p.dev, dma))
+ pool->dma_sync_act = PP_DMA_SYNC_ACT_DO;
+
if (pool->dma_sync_act == PP_DMA_SYNC_ACT_DO)
page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
--
2.41.0
Currently, pp->p.napi is always read, but the variable it gets
assigned to is only read when @napi_safe is true. For the !napi_safe
cases, which are still plenty, it's an unneeded operation.
Moreover, it can lead to premature or even redundant page_pool
cacheline accesses, for example when page_pool_is_last_frag() returns
false (with the recent frag improvements).
Thus, read it only when @napi_safe is true. This also allows moving
@napi inside the condition block itself. Constify it while we are
here, because why not.
Signed-off-by: Alexander Lobakin <[email protected]>
---
net/core/skbuff.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4de280c454e4..fc1470aab5cf 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -881,9 +881,8 @@ static void skb_clone_fraglist(struct sk_buff *skb)
bool page_pool_return_skb_page(struct page *page, bool napi_safe)
{
- struct napi_struct *napi;
+ bool allow_direct = false;
struct page_pool *pp;
- bool allow_direct;
page = compound_head(page);
@@ -903,9 +902,12 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
* in the same context as the consumer would run, so there's
* no possible race.
*/
- napi = READ_ONCE(pp->p.napi);
- allow_direct = napi_safe && napi &&
- READ_ONCE(napi->list_owner) == smp_processor_id();
+ if (napi_safe) {
+ const struct napi_struct *napi = READ_ONCE(pp->p.napi);
+
+ allow_direct = napi &&
+ READ_ONCE(napi->list_owner) == smp_processor_id();
+ }
/* Driver set this to memory recycling info. Reset it on recycle.
* This will *not* work for NIC using a split-page memory model.
--
2.41.0
Commit 8c48eea3adf3 ("page_pool: allow caching from safely localized
NAPI") allowed direct recycling of skb pages to their PP in some cases,
but unfortunately missed a couple of other major ones.
For example, %XDP_DROP in skb mode. The netstack just calls kfree_skb(),
which unconditionally passes `false` as @napi_safe. Thus, all pages go
through the ptr_ring and locks, although most of the time we're actually
inside the NAPI poll this PP is linked with, so it would be perfectly
safe to recycle pages directly.
Let's address these cases. If @napi_safe is true, we're fine; don't change
anything for this path. But if it's false, check whether we are in
softirq context. It will most likely be so, and then, if ->list_owner
is our current CPU, we're good to use direct recycling even though
@napi_safe is false -- concurrent access is excluded. The in_softirq()
protection is needed mostly because we can hit this place in process
context (though not in hardirq).
For the mentioned xdp-drop-skb-mode case, the improvement I got is
3-4% in Mpps. As for the page_pool stats, recycle_ring is now 0 and the
alloc_slow counter doesn't change most of the time, which means the
MM layer is not even called to allocate any new pages.
Suggested-by: Jakub Kicinski <[email protected]> # in_softirq()
Signed-off-by: Alexander Lobakin <[email protected]>
---
net/core/skbuff.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fc1470aab5cf..1c22fd33be6c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -902,7 +902,7 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
* in the same context as the consumer would run, so there's
* no possible race.
*/
- if (napi_safe) {
+ if (napi_safe || in_softirq()) {
const struct napi_struct *napi = READ_ONCE(pp->p.napi);
allow_direct = napi &&
--
2.41.0
From: Alexander Lobakin <[email protected]>
Date: Fri, 14 Jul 2023 19:08:45 +0200
> On x86_64, frag_* fields of struct page_pool are scattered across two
> cachelines despite the summary size of 24 bytes. The last field,
> ::frag_users, is pushed out to the next one, sharing it with
> ::alloc_stats.
There are now two copies of #2 and #3, as I reordered them and forgot to
remove the obsolete files ._. It's an RFC anyway, so please just ignore
the duplicates. Sorry.
[...]
Thanks,
Olek
On x86_64, the frag_* fields of struct page_pool are scattered across two
cachelines despite their total size of 24 bytes. The last field,
::frag_users, is pushed out to the next one, sharing it with
::alloc_stats.
All three fields are used in pretty much the same places. There are some
holes and cold members to move around. Move frag_* one block up, placing
them right after &page_pool_params, perfectly at the beginning of CL2.
This doesn't do anything meaningful to the second block, as those are some
cold destroy-path structures, and doesn't do anything to ::alloc_stats,
which still starts at a 200-byte offset, 8 bytes after the end of CL3
(still fitting into 1 cacheline).
On my setup, this yields 1-2% more Mpps when using PP frags actively.
When it comes to 32-bit architectures with a 32-byte CL: &page_pool_params
plus ::pad is 44 bytes, and the block taken care of is 16 bytes within one
CL, so there should at least be no regressions from the actual change.
Signed-off-by: Alexander Lobakin <[email protected]>
---
include/net/page_pool.h | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 69e822021d95..68937deea4b1 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -131,16 +131,16 @@ struct page_pool {
struct page_pool_params p;
long pad;
+ long frag_users;
+ struct page *frag_page;
+ unsigned int frag_offset;
+ u32 pages_state_hold_cnt;
+
struct delayed_work release_dw;
void (*disconnect)(void *);
unsigned long defer_start;
unsigned long defer_warn;
- u32 pages_state_hold_cnt;
- unsigned int frag_offset;
- struct page *frag_page;
- long frag_users;
-
#ifdef CONFIG_PAGE_POOL_STATS
/* these stats are incremented while in softirq context */
struct page_pool_alloc_stats alloc_stats;
--
2.41.0
For now, this structure takes a whole 64-byte cacheline on x86_64.
But in fact, it has a 4-byte hole before ::init_callback() (though not
sufficient to change its sizeof()).
::dma_dir takes a whole 4 bytes, although its values can only be 0 and 2.
Merge it into ::flags, so that its slot gets freed and the structure's
size is reduced to 56 bytes. This adds an instruction when reading
that field, but the upcoming change will make those reads happen way
less often.
Pad the freed slot explicitly in &page_pool to avoid altering the
cacheline layout while it's not used.
Signed-off-by: Alexander Lobakin <[email protected]>
---
include/net/page_pool.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 212d72b5cfec..68937deea4b1 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -51,13 +51,13 @@ struct pp_alloc_cache {
};
struct page_pool_params {
- unsigned int flags;
+ unsigned int flags:30;
+ enum dma_data_direction dma_dir:2; /* DMA mapping direction */
unsigned int order;
unsigned int pool_size;
int nid; /* Numa node id to allocate from pages from */
struct device *dev; /* device, for DMA pre-mapping purposes */
struct napi_struct *napi; /* Sole consumer of pages, otherwise NULL */
- enum dma_data_direction dma_dir; /* DMA mapping direction */
unsigned int max_len; /* max DMA sync memory size */
unsigned int offset; /* DMA addr offset */
void (*init_callback)(struct page *page, void *arg);
@@ -129,6 +129,7 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
struct page_pool {
struct page_pool_params p;
+ long pad;
long frag_users;
struct page *frag_page;
--
2.41.0
For now, this structure takes a whole 64-byte cacheline on x86_64.
But in fact, it has a 4-byte hole before ::init_callback() (though not
sufficient to change its sizeof()).
::dma_dir takes a whole 4 bytes, although its values can only be 0 and 2.
Merge it into ::flags, so that its slot gets freed and the structure's
size is reduced to 56 bytes. This adds an instruction when reading
that field, but the upcoming change will make those reads happen way
less often.
Pad the freed slot explicitly in &page_pool to avoid altering the
cacheline layout while it's not used.
Signed-off-by: Alexander Lobakin <[email protected]>
---
include/net/page_pool.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 829dc1f8ba6b..69e822021d95 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -51,13 +51,13 @@ struct pp_alloc_cache {
};
struct page_pool_params {
- unsigned int flags;
+ unsigned int flags:30;
+ enum dma_data_direction dma_dir:2; /* DMA mapping direction */
unsigned int order;
unsigned int pool_size;
int nid; /* Numa node id to allocate from pages from */
struct device *dev; /* device, for DMA pre-mapping purposes */
struct napi_struct *napi; /* Sole consumer of pages, otherwise NULL */
- enum dma_data_direction dma_dir; /* DMA mapping direction */
unsigned int max_len; /* max DMA sync memory size */
unsigned int offset; /* DMA addr offset */
void (*init_callback)(struct page *page, void *arg);
@@ -129,6 +129,7 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
struct page_pool {
struct page_pool_params p;
+ long pad;
struct delayed_work release_dw;
void (*disconnect)(void *);
--
2.41.0
page_pool::p holds the driver-defined params, copied directly from the
structure passed via page_pool_create(). The structure isn't meant
to be modified by the Page Pool core code, and modifying it might even
look confusing[0][1].
In order to be able to alter some flags, let's define our own, internal
fields. Use the slot freed earlier to stay within the same cacheline as
before (or almost, if it's shorter than 64 bytes).
The flag indicating whether to perform DMA mapping can be a bool; as for
DMA sync, define it as an enum to be able to extend it later on. They
are defined as bits in the driver-set params, so keep them as bitfields
here as well, to avoid wasting a byte per bit or so. Now there are 29 free
bits left in those 4 bytes + 4 more free bytes before the cacheline
boundary.
We could've defined only the new flags here, or only the ones we may need
to alter, but checking some flags in one place and others in another
doesn't sound convenient or intuitive.
Suggested-by: Jakub Kicinski <[email protected]>
Link[0]: https://lore.kernel.org/netdev/[email protected]
Suggested-by: Alexander Duyck <[email protected]>
Link[1]: https://lore.kernel.org/netdev/CAKgT0UfZCGnWgOH96E4GV3ZP6LLbROHM7SHE8NKwq+exX+Gk_Q@mail.gmail.com
Signed-off-by: Alexander Lobakin <[email protected]>
---
include/net/page_pool.h | 7 ++++++-
net/core/page_pool.c | 26 ++++++++++++++------------
2 files changed, 20 insertions(+), 13 deletions(-)
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 68937deea4b1..aefcb76baf0e 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -129,7 +129,12 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
struct page_pool {
struct page_pool_params p;
- long pad;
+
+ bool dma_map:1; /* Perform DMA mapping */
+ enum {
+ PP_DMA_SYNC_ACT_DISABLED = 0, /* Driver didn't ask to sync */
+ PP_DMA_SYNC_ACT_DO, /* Perform DMA sync ops */
+ } dma_sync_act:2;
long frag_users;
struct page *frag_page;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 09d76e37ac69..67957e74e1f5 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -182,6 +182,8 @@ static int page_pool_init(struct page_pool *pool,
if ((pool->p.dma_dir != DMA_FROM_DEVICE) &&
(pool->p.dma_dir != DMA_BIDIRECTIONAL))
return -EINVAL;
+
+ pool->dma_map = true;
}
if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) {
@@ -194,6 +196,8 @@ static int page_pool_init(struct page_pool *pool,
if (!pool->p.max_len)
return -EINVAL;
+ pool->dma_sync_act = PP_DMA_SYNC_ACT_DO;
+
/* pool->p.offset has to be set according to the address
* offset used by the DMA engine to start copying rx data
*/
@@ -213,7 +217,7 @@ static int page_pool_init(struct page_pool *pool,
/* Driver calling page_pool_create() also call page_pool_destroy() */
refcount_set(&pool->user_cnt, 1);
- if (pool->p.flags & PP_FLAG_DMA_MAP)
+ if (pool->dma_map)
get_device(pool->p.dev);
return 0;
@@ -341,7 +345,7 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
page_pool_set_dma_addr(page, dma);
- if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+ if (pool->dma_sync_act == PP_DMA_SYNC_ACT_DO)
page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
return true;
@@ -380,8 +384,7 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
if (unlikely(!page))
return NULL;
- if ((pool->p.flags & PP_FLAG_DMA_MAP) &&
- unlikely(!page_pool_dma_map(pool, page))) {
+ if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page))) {
put_page(page);
return NULL;
}
@@ -401,8 +404,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
gfp_t gfp)
{
const int bulk = PP_ALLOC_CACHE_REFILL;
- unsigned int pp_flags = pool->p.flags;
unsigned int pp_order = pool->p.order;
+ bool dma_map = pool->dma_map;
struct page *page;
int i, nr_pages;
@@ -427,8 +430,7 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
*/
for (i = 0; i < nr_pages; i++) {
page = pool->alloc.cache[i];
- if ((pp_flags & PP_FLAG_DMA_MAP) &&
- unlikely(!page_pool_dma_map(pool, page))) {
+ if (dma_map && unlikely(!page_pool_dma_map(pool, page))) {
put_page(page);
continue;
}
@@ -500,7 +502,7 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
dma_addr_t dma;
int count;
- if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+ if (!pool->dma_map)
/* Always account for inflight pages, even if we didn't
* map them
*/
@@ -573,7 +575,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
}
/* If the page refcnt == 1, this will try to recycle the page.
- * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
+ * if pool->dma_sync_act is set, we'll try to sync the DMA area for
* the configured size min(dma_sync_size, pool->max_len).
* If the page refcnt != 1, then the page will be returned to memory
* subsystem.
@@ -594,7 +596,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
/* Read barrier done in page_ref_count / READ_ONCE */
- if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+ if (pool->dma_sync_act == PP_DMA_SYNC_ACT_DO)
page_pool_dma_sync_for_device(pool, page,
dma_sync_size);
@@ -695,7 +697,7 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
return NULL;
if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
- if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+ if (pool->dma_sync_act == PP_DMA_SYNC_ACT_DO)
page_pool_dma_sync_for_device(pool, page, -1);
return page;
@@ -781,7 +783,7 @@ static void __page_pool_destroy(struct page_pool *pool)
ptr_ring_cleanup(&pool->ring, NULL);
- if (pool->p.flags & PP_FLAG_DMA_MAP)
+ if (pool->dma_map)
put_device(pool->p.dev);
#ifdef CONFIG_PAGE_POOL_STATS
--
2.41.0
On 14/07/2023 19.08, Alexander Lobakin wrote:
> On x86_64, frag_* fields of struct page_pool are scattered across two
> cachelines despite the summary size of 24 bytes. The last field,
> ::frag_users, is pushed out to the next one, sharing it with
> ::alloc_stats.
> All three fields are used in pretty much the same places. There are some
> holes and cold members to move around. Move frag_* one block up, placing
> them right after &page_pool_params perfectly at the beginning of CL2.
> This doesn't do any meaningful to the second block, as those are some
> destroy-path cold structures, and doesn't do anything to ::alloc_stats,
> which still starts at 200-byte offset, 8 bytes after CL3 (still fitting
> into 1 cacheline).
> On my setup, this yields 1-2% of Mpps when using PP frags actively.
> When it comes to 32-bit architectures with 32-byte CL: &page_pool_params
> plus ::pad is 44 bytes, the block taken care of is 16 bytes within one
> CL, so there should be at least no regressions from the actual change.
>
> Signed-off-by: Alexander Lobakin <[email protected]>
> ---
> include/net/page_pool.h | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 829dc1f8ba6b..212d72b5cfec 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -130,16 +130,16 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
> struct page_pool {
> struct page_pool_params p;
>
> + long frag_users;
> + struct page *frag_page;
> + unsigned int frag_offset;
> + u32 pages_state_hold_cnt;
I think this is okay, but I want to highlight that:
- pages_state_hold_cnt and pages_state_release_cnt
need to be kept on separate cache-lines.
> +
> struct delayed_work release_dw;
> void (*disconnect)(void *);
> unsigned long defer_start;
> unsigned long defer_warn;
>
> - u32 pages_state_hold_cnt;
> - unsigned int frag_offset;
> - struct page *frag_page;
> - long frag_users;
> -
> #ifdef CONFIG_PAGE_POOL_STATS
> /* these stats are incremented while in softirq context */
> struct page_pool_alloc_stats alloc_stats;
From: Jesper Dangaard Brouer <[email protected]>
Date: Fri, 14 Jul 2023 20:37:39 +0200
>
>
> On 14/07/2023 19.08, Alexander Lobakin wrote:
>> On x86_64, frag_* fields of struct page_pool are scattered across two
>> cachelines despite the summary size of 24 bytes. The last field,
>> ::frag_users, is pushed out to the next one, sharing it with
>> ::alloc_stats.
>> All three fields are used in pretty much the same places. There are some
>> holes and cold members to move around. Move frag_* one block up, placing
>> them right after &page_pool_params perfectly at the beginning of CL2.
>> This doesn't do any meaningful to the second block, as those are some
>> destroy-path cold structures, and doesn't do anything to ::alloc_stats,
>> which still starts at 200-byte offset, 8 bytes after CL3 (still fitting
>> into 1 cacheline).
>> On my setup, this yields 1-2% of Mpps when using PP frags actively.
>> When it comes to 32-bit architectures with 32-byte CL: &page_pool_params
>> plus ::pad is 44 bytes, the block taken care of is 16 bytes within one
>> CL, so there should be at least no regressions from the actual change.
>>
>> Signed-off-by: Alexander Lobakin <[email protected]>
>> ---
>> include/net/page_pool.h | 10 +++++-----
>> 1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>> index 829dc1f8ba6b..212d72b5cfec 100644
>> --- a/include/net/page_pool.h
>> +++ b/include/net/page_pool.h
>> @@ -130,16 +130,16 @@ static inline u64
>> *page_pool_ethtool_stats_get(u64 *data, void *stats)
>> struct page_pool {
>> struct page_pool_params p;
>> + long frag_users;
>> + struct page *frag_page;
>> + unsigned int frag_offset;
>> + u32 pages_state_hold_cnt;
>
> I think this is okay, but I want to highlight that:
> - pages_state_hold_cnt and pages_state_release_cnt
> need to be kept on separate cache-lines.
They're pretty far away from each other. I moved hold_cnt here as well
to keep it stacked with frag_offset and avoid introducing 32-bit holes.
>
>
>> +
>> struct delayed_work release_dw;
>> void (*disconnect)(void *);
>> unsigned long defer_start;
>> unsigned long defer_warn;
>> - u32 pages_state_hold_cnt;
>> - unsigned int frag_offset;
>> - struct page *frag_page;
>> - long frag_users;
>> -
>> #ifdef CONFIG_PAGE_POOL_STATS
>> /* these stats are incremented while in softirq context */
>> struct page_pool_alloc_stats alloc_stats;
>
Thanks,
Olek
On Fri, 14 Jul 2023 19:08:52 +0200 Alexander Lobakin wrote:
> Suggested-by: Jakub Kicinski <[email protected]> # in_softirq()
I thought I said something along the lines of "if this is safe you can
as well" which falls short of a suggestion, cause I don't think it is
safe :)
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index fc1470aab5cf..1c22fd33be6c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -902,7 +902,7 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
> * in the same context as the consumer would run, so there's
> * no possible race.
> */
> - if (napi_safe) {
> + if (napi_safe || in_softirq()) {
> const struct napi_struct *napi = READ_ONCE(pp->p.napi);
>
> allow_direct = napi &&
What if we got here from netpoll? napi budget was 0, so napi_safe is
false, but in_softirq() can be true or false.
XDP SKB is a toy, I really don't think 3-4% in XDP SKB warrants the
risk here.
From: Jakub Kicinski <[email protected]>
Date: Tue, 18 Jul 2023 17:40:42 -0700
> On Fri, 14 Jul 2023 19:08:52 +0200 Alexander Lobakin wrote:
>> Suggested-by: Jakub Kicinski <[email protected]> # in_softirq()
>
> I thought I said something along the lines as "if this is safe you can
> as well" which falls short of a suggestion, cause I don't think it is
> safe :)
Ok, I'll drop you if you insist :D
>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index fc1470aab5cf..1c22fd33be6c 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -902,7 +902,7 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
>> * in the same context as the consumer would run, so there's
>> * no possible race.
>> */
>> - if (napi_safe) {
>> + if (napi_safe || in_softirq()) {
>> const struct napi_struct *napi = READ_ONCE(pp->p.napi);
>>
>> allow_direct = napi &&
>
> What if we got here from netpoll? napi budget was 0, so napi_safe is
> false, but in_softirq() can be true or false.
If we're on the same CPU where the NAPI would run and in the same
context, i.e. softirq, in which the NAPI would run, what is the problem?
If there really is a good one, I can handle it here.
>
> XDP SKB is a toy, I really don't think 3-4% in XDP SKB warrants the
> risk here.
It affects not only XDP skb, I told ya already :D @napi_safe is very
conservative and misses a good bunch of stuff.
I really don't think that very rare corner cases (which can be handled
if needed) warrant the perf miss here.
Thanks,
Olek
On Wed, 19 Jul 2023 18:34:46 +0200 Alexander Lobakin wrote:
> > What if we got here from netpoll? napi budget was 0, so napi_safe is
> > false, but in_softirq() can be true or false.
>
> If we're on the same CPU where the NAPI would run and in the same
> context, i.e. softirq, in which the NAPI would run, what is the problem?
> If there really is a good one, I can handle it here.
#define SOFTIRQ_BITS 8
#define SOFTIRQ_MASK (__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)
# define softirq_count() (preempt_count() & SOFTIRQ_MASK)
#define in_softirq() (softirq_count())
I don't know what else to add beyond that and the earlier explanation.
AFAIK pages allocated by page pool do not benefit from the usual
KASAN / KMSAN checkers, so if we were to double-recycle a page once
a day because of a netcons race - it's going to be a month-long debug
for those of us using Linux in production.
On Thu, 20 Jul 2023 18:46:02 +0200 Alexander Lobakin wrote:
> From: Jakub Kicinski <[email protected]>
> Date: Wed, 19 Jul 2023 13:51:50 -0700
>
> > On Wed, 19 Jul 2023 18:34:46 +0200 Alexander Lobakin wrote:
> [...]
> >>
> >> If we're on the same CPU where the NAPI would run and in the same
> >> context, i.e. softirq, in which the NAPI would run, what is the problem?
> >> If there really is a good one, I can handle it here.
> >
> > #define SOFTIRQ_BITS 8
> > #define SOFTIRQ_MASK (__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)
> > # define softirq_count() (preempt_count() & SOFTIRQ_MASK)
> > #define in_softirq() (softirq_count())
>
> I do remember those, don't worry :)
>
> > I don't know what else to add beyond that and the earlier explanation.
>
> My question was "how can two things race on one CPU in one context if it
> implies they won't ever happen simultaneously", but maybe my zero
> knowledge of netcons hides something from me.
One of them is in hardirq.
> > AFAIK pages as allocated by page pool do not benefit from the usual
> > KASAN / KMSAN checkers, so if we were to double-recycle a page once
> > a day because of a netcons race - it's going to be a month long debug
> > for those of us using Linux in production.
>
> if (!test_bit(&napi->state, NPSVC))
if you have to, the right check is !in_hardirq()
> ? It would mean we're not netpolling.
> Otherwise, if this still is not enough, I'do go back to my v1 approach
> with having a NAPI flag, which would tell for sure we're good to go. I
> got confused by your "wouldn't just checking for softirq be enough"! T.T
> Joking :D
I guess the problem I'm concerned about can already happen.
I'll send a lockdep annotation shortly.
From: Jakub Kicinski <[email protected]>
Date: Wed, 19 Jul 2023 13:51:50 -0700
> On Wed, 19 Jul 2023 18:34:46 +0200 Alexander Lobakin wrote:
>>> What if we got here from netpoll? napi budget was 0, so napi_safe is
>>> false, but in_softirq() can be true or false.
>>
>> If we're on the same CPU where the NAPI would run and in the same
>> context, i.e. softirq, in which the NAPI would run, what is the problem?
>> If there really is a good one, I can handle it here.
>
> #define SOFTIRQ_BITS 8
> #define SOFTIRQ_MASK (__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)
> # define softirq_count() (preempt_count() & SOFTIRQ_MASK)
> #define in_softirq() (softirq_count())
I do remember those, don't worry :)
>
> I don't know what else to add beyond that and the earlier explanation.
My question was "how can two things race on one CPU in one context if it
implies they won't ever happen simultaneously", but maybe my zero
knowledge of netcons hides something from me.
>
> AFAIK pages as allocated by page pool do not benefit from the usual
> KASAN / KMSAN checkers, so if we were to double-recycle a page once
> a day because of a netcons race - it's going to be a month long debug
> for those of us using Linux in production.
if (!test_bit(&napi->state, NPSVC))
? It would mean we're not netpolling.
Otherwise, if this still is not enough, I'd go back to my v1 approach
with having a NAPI flag, which would tell for sure we're good to go. I
got confused by your "wouldn't just checking for softirq be enough"! T.T
Joking :D
Thanks,
Olek
From: Jakub Kicinski <[email protected]>
Date: Thu, 20 Jul 2023 10:12:31 -0700
> On Thu, 20 Jul 2023 18:46:02 +0200 Alexander Lobakin wrote:
>> From: Jakub Kicinski <[email protected]>
>> Date: Wed, 19 Jul 2023 13:51:50 -0700
>>
>>> On Wed, 19 Jul 2023 18:34:46 +0200 Alexander Lobakin wrote:
>> [...]
>>>>
>>>> If we're on the same CPU where the NAPI would run and in the same
>>>> context, i.e. softirq, in which the NAPI would run, what is the problem?
>>>> If there really is a good one, I can handle it here.
>>>
>>> #define SOFTIRQ_BITS 8
>>> #define SOFTIRQ_MASK (__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)
>>> # define softirq_count() (preempt_count() & SOFTIRQ_MASK)
>>> #define in_softirq() (softirq_count())
>>
>> I do remember those, don't worry :)
>>
>>> I don't know what else to add beyond that and the earlier explanation.
>>
>> My question was "how can two things race on one CPU in one context if it
>> implies they won't ever happen simultaneously", but maybe my zero
>> knowledge of netcons hides something from me.
>
> One of them is in hardirq.
If I got your message correctly, that means softirq_count() can return
`true` even if we're in hardirq context, but there are some softirqs
pending? I.e. if I call local_irq_save() inside NAPI poll loop,
in_softirq() will still return `true`? (I'll check it myself in a bit,
but why not ask).
Isn't checking for `interrupt_context_level() == 1` more appropriate
then? Page Pool core code also uses in_softirq(), as well as a hellaton
of other networking-related places.
>
>>> AFAIK pages as allocated by page pool do not benefit from the usual
>>> KASAN / KMSAN checkers, so if we were to double-recycle a page once
>>> a day because of a netcons race - it's going to be a month long debug
>>> for those of us using Linux in production.
>>
>> if (!test_bit(&napi->state, NPSVC))
>
> if you have to the right check is !in_hardirq()
>
>> ? It would mean we're not netpolling.
>> Otherwise, if this still is not enough, I'do go back to my v1 approach
>> with having a NAPI flag, which would tell for sure we're good to go. I
>> got confused by your "wouldn't just checking for softirq be enough"! T.T
>> Joking :D
>
> I guess the problem I'm concerned about can already happen.
> I'll send a lockdep annotation shortly.
Interesten.
Thanks,
Olek
From: Alexander Lobakin <[email protected]>
Date: Thu, 20 Jul 2023 20:01:28 +0200
> From: Jakub Kicinski <[email protected]>
> Date: Thu, 20 Jul 2023 11:00:27 -0700
>
>> On Thu, 20 Jul 2023 19:48:06 +0200 Alexander Lobakin wrote:
>>>>> My question was "how can two things race on one CPU in one context if it
>>>>> implies they won't ever happen simultaneously", but maybe my zero
>>>>> knowledge of netcons hides something from me.
>>>>
>>>> One of them is in hardirq.
>>>
>>> If I got your message correctly, that means softirq_count() can return
>>> `true` even if we're in hardirq context, but there are some softirqs
>>> pending?
>>
>> Not pending, being executed. Hardirq can come during softirq.
>>
>>> I.e. if I call local_irq_save() inside NAPI poll loop,
>>> in_softirq() will still return `true`? (I'll check it myself in a bit,
>>> but why not ask).
>>
>> Yes.
My code, run from the NAPI poll loop:
pr_info("BH. irq_count(): %lu", irq_count());
pr_info("BH. in_hardirq(): %lu", in_hardirq());
pr_info("BH. in_softirq(): %lu", in_softirq());
pr_info("BH. in_serving_softirq(): %lu", in_serving_softirq());
pr_info("BH. interrupt_context_level(): %u",
interrupt_context_level());
local_irq_save(flags);
pr_info("TH. irq_count(): %lu", irq_count());
pr_info("TH. in_hardirq(): %lu", in_hardirq());
pr_info("TH. in_softirq(): %lu", in_softirq());
pr_info("TH. in_serving_softirq(): %lu", in_serving_softirq());
pr_info("TH. interrupt_context_level(): %u",
interrupt_context_level());
local_irq_restore(flags);
The result:
BH. irq_count(): 256
BH. in_hardirq(): 0
BH. in_softirq(): 256
BH. in_serving_softirq(): 256
BH. interrupt_context_level(): 1
TH. irq_count(): 256
TH. in_hardirq(): 0
TH. in_softirq(): 256
TH. in_serving_softirq(): 256
TH. interrupt_context_level(): 1
IOW, it reports we're in softirq no bloody matter if interrupts are
enabled or not. Either I did something wrong or the entire in_*irq()
family, including interrupt_context_level(), doesn't protect from
anything at all and doesn't work the way that most devs expect it to work?
(or was it just me? :D)
I guess the only way to be sure is to always check irqs_disabled() when
in_softirq() returns true.
>>
>>> Isn't checking for `interrupt_context_level() == 1` more appropriate
>>> then? Page Pool core code also uses in_softirq(), as well as a hellaton
>>> of other networking-related places.
>>
>> Right now page pool only supports BH and process contexts. IOW the
>> "else" branch of if (in_softirq()) in page pool is expecting to be
>> in process context.
>>
>> Supporting hard irq would mean we need to switch to _irqsave() locking.
>> That's likely way too costly.
>>
>> Or stash the freed pages away and free them lazily.
>>
>> Or add a lockdep warning and hope nobody will ever free a page-pool
>> backed skb from hard IRQ context :)
>
> I told you under the previous version that this function is not supposed
> to be called under hardirq context, so we don't need to check for it :D
> But I was assuming nobody would try to do that. Seems like not really
> (netcons) if you want to sanitize this...
>
> Thanks,
> Olek
Thanks,
Olek
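For reference, interrupt_context_level() derives its verdict purely from
preempt_count(), which is why manually disabling hardirqs doesn't change
the result above. As of recent kernels it is roughly:

	/* include/linux/preempt.h (roughly): 0 - task, 1 - softirq,
	 * 2 - hardirq, 3 - NMI. Only preempt_count() is consulted, so
	 * local_irq_save() has no effect on the returned level.
	 */
	static __always_inline unsigned char interrupt_context_level(void)
	{
		unsigned long pc = preempt_count();
		unsigned char level = 0;

		level += !!(pc & (NMI_MASK));
		level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
		level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));

		return level;
	}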
From: Jakub Kicinski <[email protected]>
Date: Thu, 20 Jul 2023 11:00:27 -0700
> On Thu, 20 Jul 2023 19:48:06 +0200 Alexander Lobakin wrote:
>>>> My question was "how can two things race on one CPU in one context if it
>>>> implies they won't ever happen simultaneously", but maybe my zero
>>>> knowledge of netcons hides something from me.
>>>
>>> One of them is in hardirq.
>>
>> If I got your message correctly, that means softirq_count() can return
>> `true` even if we're in hardirq context, but there are some softirqs
>> pending?
>
> Not pending, being executed. Hardirq can come during softirq.
>
>> I.e. if I call local_irq_save() inside NAPI poll loop,
>> in_softirq() will still return `true`? (I'll check it myself in a bit,
>> but why not ask).
>
> Yes.
>
>> Isn't checking for `interrupt_context_level() == 1` more appropriate
>> then? Page Pool core code also uses in_softirq(), as well as a hellaton
>> of other networking-related places.
>
> Right now page pool only supports BH and process contexts. IOW the
> "else" branch of if (in_softirq()) in page pool is expecting to be
> in process context.
>
> Supporting hard irq would mean we need to switch to _irqsave() locking.
> That's likely way too costly.
>
> Or stash the freed pages away and free them lazily.
>
> Or add a lockdep warning and hope nobody will ever free a page-pool
> backed skb from hard IRQ context :)
I told you under the previous version that this function is not supposed
to be called under hardirq context, so we don't need to check for it :D
But I was assuming nobody would try to do that. Seems like not really
(netcons) if you want to sanitize this...
Thanks,
Olek
On Thu, 20 Jul 2023 19:48:06 +0200 Alexander Lobakin wrote:
> >> My question was "how can two things race on one CPU in one context if it
> >> implies they won't ever happen simultaneously", but maybe my zero
> >> knowledge of netcons hides something from me.
> >
> > One of them is in hardirq.
>
> If I got your message correctly, that means softirq_count() can return
> `true` even if we're in hardirq context, but there are some softirqs
> pending?
Not pending, being executed. Hardirq can come during softirq.
> I.e. if I call local_irq_save() inside NAPI poll loop,
> in_softirq() will still return `true`? (I'll check it myself in a bit,
> but why not ask).
Yes.
> Isn't checking for `interrupt_context_level() == 1` more appropriate
> then? Page Pool core code also uses in_softirq(), as well as a hellaton
> of other networking-related places.
Right now page pool only supports BH and process contexts. IOW the
"else" branch of if (in_softirq()) in page pool is expecting to be
in process context.
Supporting hard irq would mean we need to switch to _irqsave() locking.
That's likely way too costly.
Or stash the freed pages away and free them lazily.
Or add a lockdep warning and hope nobody will ever free a page-pool
backed skb from hard IRQ context :)
On Thu, 20 Jul 2023 20:13:07 +0200 Alexander Lobakin wrote:
> IOW, it reports we're in softirq no bloody matter if interrupts are
> enabled or not. Either I did something wrong or the entire in_*irq()
> family, including interrupt_context_level(), doesn't protect from
> anything at all and doesn't work the way that most devs expect it to work?
>
> (or was it just me? :D)
>
> I guess the only way to be sure is to always check irqs_disabled() when
> in_softirq() returns true.
We can as well check
(in_softirq() && !irqs_disabled() && !in_hardirq())
?
The interrupt_context_level() thing is fairly new, I think.
Who knows what happens to it going forward...
> >> Right now page pool only supports BH and process contexts. IOW the
> >> "else" branch of if (in_softirq()) in page pool is expecting to be
> >> in process context.
> >>
> >> Supporting hard irq would mean we need to switch to _irqsave() locking.
> >> That's likely way too costly.
> >>
> >> Or stash the freed pages away and free them lazily.
> >>
> >> Or add a lockdep warning and hope nobody will ever free a page-pool
> >> backed skb from hard IRQ context :)
> >
> > I told you under the previous version that this function is not supposed
> > to be called under hardirq context, so we don't need to check for it :D
> > But I was assuming nobody would try to do that. Seems like not really
> > (netcons) if you want to sanitize this...
netcons or anyone who freed socket-less skbs from hardirq.
Until pp recycling was added freeing an skb from hardirq was legal,
AFAICT.
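For clarity, a hypothetical variant of the check with the guards discussed
above folded in (a sketch only, not what the posted patch does):

	/* Assumed extra guards: never recycle directly from hardirq or with
	 * IRQs disabled (e.g. netpoll), only when servicing softirq on the
	 * CPU the pool's NAPI runs on.
	 */
	if (napi_safe ||
	    (in_softirq() && !in_hardirq() && !irqs_disabled())) {
		const struct napi_struct *napi = READ_ONCE(pp->p.napi);

		allow_direct = napi &&
			       READ_ONCE(napi->list_owner) == smp_processor_id();
	}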
On Thu, 20 Jul 2023 21:33:40 +0200 Alexander Lobakin wrote:
> > We can as well check
> > (in_softirq() && !irqs_disabled() && !in_hardirq())
> > ?
>
> Yes, something like that. Messy, but I see no other options...
>
> So, I guess you want to add an assertion to make sure that we're *not*
> in this:
>
> in_hardirq() || irqs_disabled()
>
> Does this mean that after it's added, my patch is sane? :p
Well... it's acceptable. Make sure you add a good, informative
but concise comment :)
> > The interrupt_context_level() thing is fairly new, I think.
> > Who knows what happens to it going forward...
>
> Well, it counts the number of active hard interrupts, but doesn't take
> into account that if there are no hardirqs we can still disable them
> manually. Meh.
> Should I try to send a patch for it? :D
Depends on how you like to spend your time :)
> > netcons or anyone who freed socket-less skbs from hardirq.
> > Until pp recycling was added freeing an skb from hardirq was legal,
> > AFAICT.
>
> I don't think so. Why do we have dev_kfree_skb_any() then? It checks for
>
> in_hardirq() || irqs_disabled()
>
> and if it's true, defers the skb to process it by backlog task.
> "Regular" skb freeing functions don't do that. The _any() variant lives
> here for a long time IIRC, so it's not something recent.
Drivers (or any other users of dev_kfree_skb_any()) should be fine.
I'm only paranoid about some unknown bits of code which thought they
can be clever and call kfree_skb() directly, as long as !skb->sk.
But if you add the hard irq checks to your patch then you're strictly
safer than the existing code. Hopefully the checks are not too
expensive.
From: Jakub Kicinski <[email protected]>
Date: Thu, 20 Jul 2023 12:20:15 -0700
> On Thu, 20 Jul 2023 20:13:07 +0200 Alexander Lobakin wrote:
>> IOW, it reports we're in softirq no bloody matter if interrupts are
>> enabled or not. Either I did something wrong or the entire in_*irq()
>> family, including interrupt_context_level(), doesn't protect from
>> anything at all and doesn't work the way that most devs expect it to work?
>>
>> (or was it just me? :D)
>>
>> I guess the only way to be sure is to always check irqs_disabled() when
>> in_softirq() returns true.
>
> We can as well check
> (in_softirq() && !irqs_disabled() && !in_hardirq())
> ?
Yes, something like that. Messy, but I see no other options...
So, I guess you want to add an assertion to make sure that we're *not*
in this:
in_hardirq() || irqs_disabled()
Does this mean that after it's added, my patch is sane? :p
>
> The interrupt_context_level() thing is fairly new, I think.
> Who knows what happens to it going forward...
Well, it counts the number of active hard interrupts, but doesn't take
into account that if there are no hardirqs we can still disable them
manually. Meh.
Should I try to send a patch for it? :D
>
>>>> Right now page pool only supports BH and process contexts. IOW the
>>>> "else" branch of if (in_softirq()) in page pool is expecting to be
>>>> in process context.
>>>>
>>>> Supporting hard irq would mean we need to switch to _irqsave() locking.
>>>> That's likely way too costly.
>>>>
>>>> Or stash the freed pages away and free them lazily.
>>>>
>>>> Or add a lockdep warning and hope nobody will ever free a page-pool
>>>> backed skb from hard IRQ context :)
>>>
>>> I told you under the previous version that this function is not supposed
>>> to be called under hardirq context, so we don't need to check for it :D
>>> But I was assuming nobody would try to do that. Seems like not really
>>> (netcons) if you want to sanitize this...
>
> netcons or anyone who freed socket-less skbs from hardirq.
> Until pp recycling was added freeing an skb from hardirq was legal,
> AFAICT.
I don't think so. Why do we have dev_kfree_skb_any() then? It checks for
in_hardirq() || irqs_disabled()
and if it's true, defers the skb to be processed by the backlog task.
"Regular" skb freeing functions don't do that. The _any() variant has lived
here for a long time IIRC, so it's not something recent.
Thanks,
Olek
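For reference, the deferral pattern described above, condensed into a
sketch (the real dev_kfree_skb_any() path also carries a drop reason; only
the context dispatch is shown here):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Condensed sketch of the _any() behaviour: freeing is deferred to the
 * NET_TX softirq when called from hard IRQ context or with IRQs disabled,
 * and done in place otherwise.
 */
static inline void kfree_skb_any_sketch(struct sk_buff *skb)
{
	if (in_hardirq() || irqs_disabled())
		dev_kfree_skb_irq(skb);	/* queued, freed later from softirq */
	else
		dev_kfree_skb(skb);	/* safe to free right here */
}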
On 2023/7/21 3:46, Jakub Kicinski wrote:
> On Thu, 20 Jul 2023 21:33:40 +0200 Alexander Lobakin wrote:
>>> We can as well check
>>> (in_softirq() && !irqs_disabled() && !in_hardirq())
>>> ?
>>
>> Yes, something like that. Messy, but I see no other options...
>>
>> So, I guess you want to add an assertion to make sure that we're *not*
>> in this:
>>
>> in_hardirq() || irqs_disabled()
>>
>> Does this mean that after it's added, my patch is sane? :p
>
> Well... it's acceptable. Make sure you add a good, informative
> but concise comment :)
>
Does it mean ptr_ring_produce_any() is needed in
page_pool_recycle_in_ring() too?
As it is assumed that the page pool API can be called in a context with
irqs_disabled() || in_hardirq(), and the forced recycling happens in the
ptr_ring.
Doesn't it conflict with the below patch? As the below patch assumes the page
pool API cannot be called in a context with irqs_disabled() || in_hardirq():
[PATCH net-next] page_pool: add a lockdep check for recycling in hardirq
Or am I missing something obvious here?
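For context, roughly what the producer side in question looks like at this
point, condensed from net/core/page_pool.c with stats and the ring-full
fallback trimmed; note there is no _irqsave/_any locking in play:

#include <linux/ptr_ring.h>
#include <net/page_pool.h>

/* Condensed: only the BH-vs-process distinction is handled on the producer
 * side, so callers in hardirq or with IRQs disabled are outside the
 * supported contexts, which is what the separate lockdep patch targets.
 */
static bool page_pool_recycle_in_ring(struct page_pool *pool, struct page *page)
{
	int ret;

	if (in_softirq())
		ret = ptr_ring_produce(&pool->ring, page);
	else
		ret = ptr_ring_produce_bh(&pool->ring, page);

	return ret == 0;
}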
On Fri, 21 Jul 2023 17:37:57 +0200 Alexander Lobakin wrote:
> > Does it mean ptr_ring_produce_any() is needed in
> > page_pool_recycle_in_ring() too?
> >
> > As it is assumed that the page pool API can be called in a context with
> > irqs_disabled() || in_hardirq(), and the forced recycling happens in the
> > ptr_ring.
> >
> > Doesn't it conflict with the below patch? As the below patch assumes the page
> > pool API cannot be called in a context with irqs_disabled() || in_hardirq():
> > [PATCH net-next] page_pool: add a lockdep check for recycling in hardirq
> >
> > Or am I missing something obvious here?
>
> No, Page Pool is *not* intended to be called when IRQs are disabled,
> hence the fix Jakub's posted in the separate thread.
Yeah, it's just a bandaid / compromise, since Olek really wants his
optimization and I really don't want to spend a month debugging
potential production crashes :)
On the ptr ring the use may still be incorrect, but there is a spin lock
so things will explode in a much more obvious way, if they do.
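For completeness, the _any() ptr_ring variant brought up above takes the
producer lock with _irqsave locking, roughly as below (paraphrased from
include/linux/ptr_ring.h as I remember it), which is the cost that makes
supporting hard IRQ callers unattractive:

#include <linux/ptr_ring.h>
#include <linux/spinlock.h>

/* Roughly what ptr_ring_produce_any() does: callable from any context,
 * but every producer then pays for disabling and restoring hard IRQs.
 */
static inline int ptr_ring_produce_any_sketch(struct ptr_ring *r, void *ptr)
{
	unsigned long flags;
	int ret;

	spin_lock_irqsave(&r->producer_lock, flags);
	ret = __ptr_ring_produce(r, ptr);
	spin_unlock_irqrestore(&r->producer_lock, flags);

	return ret;
}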
From: Yunsheng Lin <[email protected]>
Date: Fri, 21 Jul 2023 19:53:33 +0800
> On 2023/7/21 3:46, Jakub Kicinski wrote:
>> On Thu, 20 Jul 2023 21:33:40 +0200 Alexander Lobakin wrote:
>>>> We can as well check
>>>> (in_softirq() && !irqs_disabled() && !in_hardirq())
>>>> ?
>>>
>>> Yes, something like that. Messy, but I see no other options...
>>>
>>> So, I guess you want to add an assertion to make sure that we're *not*
>>> in this:
>>>
>>> in_hardirq() || irqs_disabled()
>>>
>>> Does this mean that after it's added, my patch is sane? :p
>>
>> Well... it's acceptable. Make sure you add a good, informative
>> but concise comment :)
>>
>
> Does it mean ptr_ring_produce_any() is needed in
> page_pool_recycle_in_ring() too?
>
> As it is assumed that the page pool API can be called in a context with
> irqs_disabled() || in_hardirq(), and the forced recycling happens in the
> ptr_ring.
>
> Doesn't it conflict with the below patch? As the below patch assumes the page
> pool API cannot be called in a context with irqs_disabled() || in_hardirq():
> [PATCH net-next] page_pool: add a lockdep check for recycling in hardirq
>
> Or am I missing something obvious here?
No, Page Pool is *not* intended to be called when IRQs are disabled,
hence the fix Jakub's posted in the separate thread.
Thanks,
Olek
Apologies for the late reply, I was on vacation and have just started going
through my email piles...
On Tue, 18 Jul 2023 at 16:52, Alexander Lobakin
<[email protected]> wrote:
>
> From: Jesper Dangaard Brouer <[email protected]>
> Date: Fri, 14 Jul 2023 20:37:39 +0200
>
> >
> >
> > On 14/07/2023 19.08, Alexander Lobakin wrote:
> >> On x86_64, frag_* fields of struct page_pool are scattered across two
> >> cachelines despite the summary size of 24 bytes. The last field,
> >> ::frag_users, is pushed out to the next one, sharing it with
> >> ::alloc_stats.
> >> All three fields are used in pretty much the same places. There are some
> >> holes and cold members to move around. Move frag_* one block up, placing
> >> them right after &page_pool_params perfectly at the beginning of CL2.
> >> This doesn't do anything meaningful to the second block, as those are some
> >> destroy-path cold structures, and doesn't do anything to ::alloc_stats,
> >> which still starts at 200-byte offset, 8 bytes after CL3 (still fitting
> >> into 1 cacheline).
> >> On my setup, this yields 1-2% of Mpps when using PP frags actively.
> >> When it comes to 32-bit architectures with 32-byte CL: &page_pool_params
> >> plus ::pad is 44 bytes, the block taken care of is 16 bytes within one
> >> CL, so there should be at least no regressions from the actual change.
> >>
> >> Signed-off-by: Alexander Lobakin <[email protected]>
> >> ---
> >> include/net/page_pool.h | 10 +++++-----
> >> 1 file changed, 5 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> >> index 829dc1f8ba6b..212d72b5cfec 100644
> >> --- a/include/net/page_pool.h
> >> +++ b/include/net/page_pool.h
> >> @@ -130,16 +130,16 @@ static inline u64
> >> *page_pool_ethtool_stats_get(u64 *data, void *stats)
> >> struct page_pool {
> >> struct page_pool_params p;
> >> + long frag_users;
> >> + struct page *frag_page;
> >> + unsigned int frag_offset;
> >> + u32 pages_state_hold_cnt;
> >
> > I think this is okay, but I want to highlight that:
> > - pages_state_hold_cnt and pages_state_release_cnt
> > need to be kept on separate cache-lines.
>
> They're pretty far away from each other. I moved hold_cnt here as well
> to keep it stacked with frag_offset and avoid introducing 32-bit holes.
This is to prevent cache line bouncing and/or false sharing right?
The change seems fine to me as well but mind adding a comment about
this when you resend?
Thanks
/Ilias
>
> >
> >
> >> +
> >> struct delayed_work release_dw;
> >> void (*disconnect)(void *);
> >> unsigned long defer_start;
> >> unsigned long defer_warn;
> >> - u32 pages_state_hold_cnt;
> >> - unsigned int frag_offset;
> >> - struct page *frag_page;
> >> - long frag_users;
> >> -
> >> #ifdef CONFIG_PAGE_POOL_STATS
> >> /* these stats are incremented while in softirq context */
> >> struct page_pool_alloc_stats alloc_stats;
> >
>
> Thanks,
> Olek
From: Ilias Apalodimas <[email protected]>
Date: Wed, 26 Jul 2023 11:13:26 +0300
> Apologies for the late reply, I was on vacation and have just started going
> through my email piles...
No worries. I remember having to grind through hundreds of mails after
each vacation :s :D
>
> On Tue, 18 Jul 2023 at 16:52, Alexander Lobakin
> <[email protected]> wrote:
>>
>> From: Jesper Dangaard Brouer <[email protected]>
>> Date: Fri, 14 Jul 2023 20:37:39 +0200
>>
>>>
>>>
>>> On 14/07/2023 19.08, Alexander Lobakin wrote:
>>>> On x86_64, frag_* fields of struct page_pool are scattered across two
>>>> cachelines despite the summary size of 24 bytes. The last field,
>>>> ::frag_users, is pushed out to the next one, sharing it with
>>>> ::alloc_stats.
>>>> All three fields are used in pretty much the same places. There are some
>>>> holes and cold members to move around. Move frag_* one block up, placing
>>>> them right after &page_pool_params perfectly at the beginning of CL2.
>>>> This doesn't do anything meaningful to the second block, as those are some
>>>> destroy-path cold structures, and doesn't do anything to ::alloc_stats,
>>>> which still starts at 200-byte offset, 8 bytes after CL3 (still fitting
>>>> into 1 cacheline).
>>>> On my setup, this yields 1-2% of Mpps when using PP frags actively.
>>>> When it comes to 32-bit architectures with 32-byte CL: &page_pool_params
>>>> plus ::pad is 44 bytes, the block taken care of is 16 bytes within one
>>>> CL, so there should be at least no regressions from the actual change.
>>>>
>>>> Signed-off-by: Alexander Lobakin <[email protected]>
>>>> ---
>>>> include/net/page_pool.h | 10 +++++-----
>>>> 1 file changed, 5 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>>>> index 829dc1f8ba6b..212d72b5cfec 100644
>>>> --- a/include/net/page_pool.h
>>>> +++ b/include/net/page_pool.h
>>>> @@ -130,16 +130,16 @@ static inline u64
>>>> *page_pool_ethtool_stats_get(u64 *data, void *stats)
>>>> struct page_pool {
>>>> struct page_pool_params p;
>>>> + long frag_users;
>>>> + struct page *frag_page;
>>>> + unsigned int frag_offset;
>>>> + u32 pages_state_hold_cnt;
>>>
>>> I think this is okay, but I want to highlight that:
>>> - pages_state_hold_cnt and pages_state_release_cnt
>>> need to be kept on separate cache-lines.
>>
>> They're pretty far away from each other. I moved hold_cnt here as well
>> to keep it stacked with frag_offset and avoid introducing 32-bit holes.
>
> This is to prevent cache line bouncing and/or false sharing right?
> The change seems fine to me as well but mind adding a comment about
> this when you resend?
Right. Sure, why not.
>
> Thanks
> /Ilias
>>
>>>
>>>
>>>> +
>>>> struct delayed_work release_dw;
>>>> void (*disconnect)(void *);
>>>> unsigned long defer_start;
>>>> unsigned long defer_warn;
>>>> - u32 pages_state_hold_cnt;
>>>> - unsigned int frag_offset;
>>>> - struct page *frag_page;
>>>> - long frag_users;
>>>> -
>>>> #ifdef CONFIG_PAGE_POOL_STATS
>>>> /* these stats are incremented while in softirq context */
>>>> struct page_pool_alloc_stats alloc_stats;
>>>
>>
>> Thanks,
>> Olek
Thanks,
Olek
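A minimal sketch of what the requested placement comment could look like
on top of the reordered hot block; the wording below is illustrative, not
taken from the posted patch:

struct page_pool {
	struct page_pool_params p;

	/* Hot frag-allocation fields, grouped at the start of the second
	 * cacheline. ::pages_state_hold_cnt is stacked here both to avoid
	 * a 32-bit hole and to keep it on a different cacheline than
	 * ::pages_state_release_cnt, so that the alloc and recycle paths
	 * don't false-share / bounce the same line.
	 */
	long frag_users;
	struct page *frag_page;
	unsigned int frag_offset;
	u32 pages_state_hold_cnt;

	struct delayed_work release_dw;
	void (*disconnect)(void *);
	unsigned long defer_start;
	unsigned long defer_warn;

	/* remaining members unchanged */
};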