2023-06-29 15:54:23

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH RFC net-next 0/4] net: page_pool: a couple assorted optimizations

Here's spin off the IAVF PP series[0], with 2 runtime (hotpath) and 1
compile-time optimizations. They're based and tested on top of the
hybrid PP allocation series[1], but don't require it to work and
in general independent of it and each other.

Per-patch breakdown:
#1: already was on the lists, but this time it's done the other way, the
one that Alex Duyck proposed during the review of the previous series.
Slightly reduce amount of C preprocessing by stopping including
<net/page_pool.h> to <linux/skbuff.h> (which is included in the
half of the kernel sources). Especially useful with the abovementioned
series applied, as it makes page_pool.h heavier;
#2: don't call to DMA sync externals when they won't do anything anyway
by doing some heuristics a bit earlier (when allocating a new page),
also was on the lists;
#3: new, prereq to #4. Add NAPI state flag, which would indicate
napi->poll() is running right now, so that napi->list_owner would
point to the CPU where it's being run, not just scheduled;
#4: new. In addition to recycling skb PP pages directly when @napi_safe
is set, check for the flag from #3, which will mean the same if
->list_owner is pointing to us. This allows to use direct recycling
anytime we're inside a NAPI polling loop or GRO stuff going right
after it, covering way more cases than is right now.

(complete tree with [1] + this + [0] is available here: [2])

[0] https://lore.kernel.org/netdev/[email protected]
[1] https://lore.kernel.org/netdev/[email protected]
[2] https://github.com/alobakin/linux/commits/iavf-pp-frag

Alexander Lobakin (4):
net: skbuff: don't include <net/page_pool.h> to <linux/skbuff.h>
net: page_pool: avoid calling no-op externals when possible
net: add flag to indicate NAPI/GRO is running right now
net: skbuff: always recycle PP pages directly when inside a NAPI loop

drivers/net/ethernet/engleder/tsnep_main.c | 1 +
drivers/net/ethernet/freescale/fec_main.c | 1 +
.../marvell/octeontx2/nic/otx2_common.c | 1 +
.../ethernet/marvell/octeontx2/nic/otx2_pf.c | 1 +
.../ethernet/mellanox/mlx5/core/en/params.c | 1 +
.../net/ethernet/mellanox/mlx5/core/en/xdp.c | 1 +
drivers/net/wireless/mediatek/mt76/mt76.h | 1 +
include/linux/netdevice.h | 2 +
include/linux/skbuff.h | 3 +-
include/net/page_pool.h | 5 +-
net/core/dev.c | 23 +++++--
net/core/page_pool.c | 62 +++++++------------
net/core/skbuff.c | 29 +++++++++
13 files changed, 83 insertions(+), 48 deletions(-)

---
Really curious about #3. Implementing the idea correctly (this or other
way) potentially unblocks a lot more interesting stuff (besides #4).
--
2.41.0



2023-06-29 15:56:56

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH RFC net-next 4/4] net: skbuff: always recycle PP pages directly when inside a NAPI loop

Commit 8c48eea3adf3 ("page_pool: allow caching from safely localized
NAPI") allowed direct recycling of skb pages to their PP for some cases,
but unfortunately missed a couple other majors.
For example, %XDP_DROP in skb mode. The netstack just calls kfree_skb(),
which unconditionally passes `false` as @napi_safe. Thus, all pages go
through ptr_ring and locks, although most of times we're actually inside
the NAPI polling this PP is linked with, so that it would be perfectly
safe to recycle pages directly.
Let's address such. If @napi_safe is true, we're fine, don't change
anything for this path. But if it's false, test the introduced
%NAPI_STATE_RUNNING. There's good probability it will be set and, if
->list_owner is our current CPU, we're good to use direct recycling,
even though @napi_safe is false.
For the mentioned xdp-drop-skb-mode case, the improvement I got is
3-4% in Mpps. As for page_pool stats, recycle_ring is now 0 and
alloc_slow counter doesn't change most of times, which means the
MM layer is not even called to allocate any new pages.

Signed-off-by: Alexander Lobakin <[email protected]>
---
net/core/skbuff.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4b7d00d5b5d7..931c83d7b251 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -893,7 +893,8 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
* no possible race.
*/
napi = READ_ONCE(pp->p.napi);
- allow_direct = napi_safe && napi &&
+ allow_direct = napi &&
+ (napi_safe || test_bit(NAPI_STATE_RUNNING, &napi->state)) &&
READ_ONCE(napi->list_owner) == smp_processor_id();

/* Driver set this to memory recycling info. Reset it on recycle.
--
2.41.0


2023-06-29 15:57:29

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH RFC net-next 3/4] net: add flag to indicate NAPI/GRO is running right now

Currently, there's no easy way to check if a NAPI polling cycle is
running and on which CPU, although this might come very handy in
several cases.
Commit 8c48eea3adf3 ("page_pool: allow caching from safely localized
NAPI") added napi_struct::list_owner, BUT it's set when the NAPI is
*scheduled*. `->list_owner == smp_processor_id()` doesn't mean we're
inside the corresponding polling loop.
Introduce new NAPI state flag, NAPI{,F}_STATE_RUNNING. Set it right
before calling to ->poll() and clear after all napi_gro_flush() and
gro_normal_list() are done. They are run in the same context and, in
fact, are part of the receive routine.
When `packets == budget`, napi_complete_done() is not called, so in
that case it's safe to clear the flag after ->poll() ends. Otherwise,
however, napi_complete_done() can lead to reenabling interrupts and
scheduling the NAPI already on another CPU. In that case, clear the
flag in napi_complete_done() itself.
The main usecase for the flag is as follows:

if (test_bit(NAPI_STATE_RUNNING, &n->state) &&
READ_ONCE(n->list_owner) == smp_processor_id())
/* you're inside n->poll() or the following GRO
* processing context
*/

IOW, when the condition is true, feel free to use resources protected
by this NAPI, such as page_pools covered by it, percpu NAPI caches etc.
Just make sure interrupts are enabled.

Signed-off-by: Alexander Lobakin <[email protected]>
---
include/linux/netdevice.h | 2 ++
net/core/dev.c | 23 +++++++++++++++++------
2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b828c7a75be2..db3aea863ea9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -389,6 +389,7 @@ enum {
NAPI_STATE_PREFER_BUSY_POLL, /* prefer busy-polling over softirq processing*/
NAPI_STATE_THREADED, /* The poll is performed inside its own thread*/
NAPI_STATE_SCHED_THREADED, /* Napi is currently scheduled in threaded mode */
+ NAPI_STATE_RUNNING, /* This NAPI or GRO is running on ::list_owner */
};

enum {
@@ -402,6 +403,7 @@ enum {
NAPIF_STATE_PREFER_BUSY_POLL = BIT(NAPI_STATE_PREFER_BUSY_POLL),
NAPIF_STATE_THREADED = BIT(NAPI_STATE_THREADED),
NAPIF_STATE_SCHED_THREADED = BIT(NAPI_STATE_SCHED_THREADED),
+ NAPIF_STATE_RUNNING = BIT(NAPI_STATE_RUNNING),
};

enum gro_result {
diff --git a/net/core/dev.c b/net/core/dev.c
index 69a3e544676c..7f0d23c9e25e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6039,7 +6039,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)

new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED |
NAPIF_STATE_SCHED_THREADED |
- NAPIF_STATE_PREFER_BUSY_POLL);
+ NAPIF_STATE_PREFER_BUSY_POLL |
+ NAPIF_STATE_RUNNING);

/* If STATE_MISSED was set, leave STATE_SCHED set,
* because we will call napi->poll() one more time.
@@ -6128,6 +6129,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock, bool
/* All we really want here is to re-enable device interrupts.
* Ideally, a new ndo_busy_poll_stop() could avoid another round.
*/
+ set_bit(NAPI_STATE_RUNNING, &napi->state);
rc = napi->poll(napi, budget);
/* We can't gro_normal_list() here, because napi->poll() might have
* rearmed the napi (napi_complete_done()) in which case it could
@@ -6135,8 +6137,10 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock, bool
*/
trace_napi_poll(napi, rc, budget);
netpoll_poll_unlock(have_poll_lock);
- if (rc == budget)
+ if (rc == budget) {
__busy_poll_stop(napi, skip_schedule);
+ clear_bit(NAPI_STATE_RUNNING, &napi->state);
+ }
local_bh_enable();
}

@@ -6186,9 +6190,11 @@ void napi_busy_loop(unsigned int napi_id,
have_poll_lock = netpoll_poll_lock(napi);
napi_poll = napi->poll;
}
+ set_bit(NAPI_STATE_RUNNING, &napi->state);
work = napi_poll(napi, budget);
trace_napi_poll(napi, work, budget);
gro_normal_list(napi);
+ clear_bit(NAPI_STATE_RUNNING, &napi->state);
count:
if (work > 0)
__NET_ADD_STATS(dev_net(napi->dev),
@@ -6457,6 +6463,7 @@ static int __napi_poll(struct napi_struct *n, bool *repoll)
*/
work = 0;
if (test_bit(NAPI_STATE_SCHED, &n->state)) {
+ set_bit(NAPI_STATE_RUNNING, &n->state);
work = n->poll(n, weight);
trace_napi_poll(n, work, weight);
}
@@ -6466,7 +6473,7 @@ static int __napi_poll(struct napi_struct *n, bool *repoll)
n->poll, work, weight);

if (likely(work < weight))
- return work;
+ goto out;

/* Drivers must not modify the NAPI state if they
* consume the entire weight. In such cases this code
@@ -6475,7 +6482,7 @@ static int __napi_poll(struct napi_struct *n, bool *repoll)
*/
if (unlikely(napi_disable_pending(n))) {
napi_complete(n);
- return work;
+ goto out;
}

/* The NAPI context has more processing work, but busy-polling
@@ -6488,7 +6495,7 @@ static int __napi_poll(struct napi_struct *n, bool *repoll)
*/
napi_schedule(n);
}
- return work;
+ goto out;
}

if (n->gro_bitmask) {
@@ -6506,11 +6513,15 @@ static int __napi_poll(struct napi_struct *n, bool *repoll)
if (unlikely(!list_empty(&n->poll_list))) {
pr_warn_once("%s: Budget exhausted after napi rescheduled\n",
n->dev ? n->dev->name : "backlog");
- return work;
+ goto out;
}

*repoll = true;

+out:
+ if (READ_ONCE(n->list_owner) == smp_processor_id())
+ clear_bit(NAPI_STATE_RUNNING, &n->state);
+
return work;
}

--
2.41.0


2023-06-29 15:57:54

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH RFC net-next 1/4] net: skbuff: don't include <net/page_pool.h> to <linux/skbuff.h>

Currently, touching <net/page_pool.h> triggers a rebuild of more than
a half of the kernel. That's because it's included in <linux/skbuff.h>.
And each new include to page_pool.h adds more [useless] data for the
toolchain to process per each source file from that pile.

In commit 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB
recycling"), Matteo included it to be able to call a couple functions
defined there. Then, in commit 57f05bc2ab24 ("page_pool: keep pp info
as long as page pool owns the page") one of the calls was removed, so
only one left. It's the call to page_pool_return_skb_page() in
napi_frag_unref(). The function is external and doesn't have any
dependencies. Having include of very niche page_pool.h only for that
looks like an overkill.
As Alex noticed, the only thing that holds this function in page_pool.c
is %PP_SIGNATURE. By moving the check for magic a couple functions up,
the whole page_pool_return_skb_page() can be moved to skbuff.c.
The arguments for moving the check are the following:

1) It checks for a corner case that shouldn't ever happen when the code
is sane. And execution speed doesn't matter on exception path, thus
doing more calls before bailing out doesn't make any weather.
2) There are 2 users of the internal __page_pool_put_page(), where this
check is moved: page_pool_put_defragged_page() and
page_pool_put_page_bulk(). Both are exported and can be called from
the drivers, but previously only the skb recycling path of the former
was protected with the magic check. So this also makes the code a bit
more reliable.

After the check is moved, teleport page_pool_return_skb_page() to
skbuff.c, just next to the main consumer, skb_pp_recycle(). It's used
also in napi_frag_unref() -> {__,}skb_frag_unref(), so no `static`
unfortunately. Maybe next time.
Now, after a few include fixes in the drivers, touching page_pool.h
only triggers rebuilding of the drivers using it and a couple core
networking files.

Suggested-by: Jakub Kicinski <[email protected]> # make skbuff.h less heavy
Suggested-by: Alexander Duyck <[email protected]> # move to skbuff.c
Signed-off-by: Alexander Lobakin <[email protected]>
---
drivers/net/ethernet/engleder/tsnep_main.c | 1 +
drivers/net/ethernet/freescale/fec_main.c | 1 +
.../marvell/octeontx2/nic/otx2_common.c | 1 +
.../ethernet/marvell/octeontx2/nic/otx2_pf.c | 1 +
.../ethernet/mellanox/mlx5/core/en/params.c | 1 +
.../net/ethernet/mellanox/mlx5/core/en/xdp.c | 1 +
drivers/net/wireless/mediatek/mt76/mt76.h | 1 +
include/linux/skbuff.h | 3 +-
include/net/page_pool.h | 2 -
net/core/page_pool.c | 52 +++++--------------
net/core/skbuff.c | 28 ++++++++++
11 files changed, 50 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/engleder/tsnep_main.c b/drivers/net/ethernet/engleder/tsnep_main.c
index 84751bb303a6..6222aaa5157f 100644
--- a/drivers/net/ethernet/engleder/tsnep_main.c
+++ b/drivers/net/ethernet/engleder/tsnep_main.c
@@ -28,6 +28,7 @@
#include <linux/iopoll.h>
#include <linux/bpf.h>
#include <linux/bpf_trace.h>
+#include <net/page_pool.h>
#include <net/xdp_sock_drv.h>

#define TSNEP_RX_OFFSET (max(NET_SKB_PAD, XDP_PACKET_HEADROOM) + NET_IP_ALIGN)
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 8fbe47703d47..99b3b4f79603 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -38,6 +38,7 @@
#include <linux/in.h>
#include <linux/ip.h>
#include <net/ip.h>
+#include <net/page_pool.h>
#include <net/selftests.h>
#include <net/tso.h>
#include <linux/tcp.h>
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
index a5d03583bf79..d17a0ebc9036 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
@@ -7,6 +7,7 @@

#include <linux/interrupt.h>
#include <linux/pci.h>
+#include <net/page_pool.h>
#include <net/tso.h>
#include <linux/bitfield.h>

diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
index fe8ea4e531b7..7eca434a0550 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
@@ -16,6 +16,7 @@
#include <linux/bpf.h>
#include <linux/bpf_trace.h>
#include <linux/bitfield.h>
+#include <net/page_pool.h>

#include "otx2_reg.h"
#include "otx2_common.h"
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
index 5ce28ff7685f..0f152f14165b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
@@ -6,6 +6,7 @@
#include "en/port.h"
#include "en_accel/en_accel.h"
#include "en_accel/ipsec.h"
+#include <net/page_pool.h>
#include <net/xdp_sock_drv.h>

static u8 mlx5e_mpwrq_min_page_shift(struct mlx5_core_dev *mdev)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index f0e6095809fa..1bd91bc09eb8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -35,6 +35,7 @@
#include "en/xdp.h"
#include "en/params.h"
#include <linux/bitfield.h>
+#include <net/page_pool.h>

int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk)
{
diff --git a/drivers/net/wireless/mediatek/mt76/mt76.h b/drivers/net/wireless/mediatek/mt76/mt76.h
index 6b07b8fafec2..95c16f11d156 100644
--- a/drivers/net/wireless/mediatek/mt76/mt76.h
+++ b/drivers/net/wireless/mediatek/mt76/mt76.h
@@ -15,6 +15,7 @@
#include <linux/average.h>
#include <linux/soc/mediatek/mtk_wed.h>
#include <net/mac80211.h>
+#include <net/page_pool.h>
#include "util.h"
#include "testmode.h"

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 91ed66952580..f76d172ed262 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -32,7 +32,6 @@
#include <linux/if_packet.h>
#include <linux/llist.h>
#include <net/flow.h>
-#include <net/page_pool.h>
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
#include <linux/netfilter/nf_conntrack_common.h>
#endif
@@ -3423,6 +3422,8 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
__skb_frag_ref(&skb_shinfo(skb)->frags[f]);
}

+bool page_pool_return_skb_page(struct page *page, bool napi_safe);
+
static inline void
napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe)
{
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 2b7db9992fc0..829dc1f8ba6b 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -307,8 +307,6 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
return pool->p.dma_dir;
}

-bool page_pool_return_skb_page(struct page *page, bool napi_safe);
-
struct page_pool *page_pool_create(const struct page_pool_params *params);

struct xdp_mem_info;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 985ccaffc06a..dff0b4fa2316 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -582,6 +582,19 @@ static __always_inline struct page *
__page_pool_put_page(struct page_pool *pool, struct page *page,
unsigned int dma_sync_size, bool allow_direct)
{
+ /* Avoid recycling non-PP pages, give them back to the page allocator.
+ * page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
+ * in order to preserve any existing bits, such as bit 0 for the
+ * head page of compound page and bit 1 for pfmemalloc page, so
+ * mask those bits for freeing side when doing below checking,
+ * and page_is_pfmemalloc() is checked in __page_pool_put_page()
+ * to avoid recycling the pfmemalloc page.
+ */
+ if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE)) {
+ put_page(page);
+ return NULL;
+ }
+
/* This allocator is optimized for the XDP mode that uses
* one-frame-per-page, but have fallbacks that act like the
* regular page allocator APIs.
@@ -913,42 +926,3 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
}
}
EXPORT_SYMBOL(page_pool_update_nid);
-
-bool page_pool_return_skb_page(struct page *page, bool napi_safe)
-{
- struct napi_struct *napi;
- struct page_pool *pp;
- bool allow_direct;
-
- page = compound_head(page);
-
- /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
- * in order to preserve any existing bits, such as bit 0 for the
- * head page of compound page and bit 1 for pfmemalloc page, so
- * mask those bits for freeing side when doing below checking,
- * and page_is_pfmemalloc() is checked in __page_pool_put_page()
- * to avoid recycling the pfmemalloc page.
- */
- if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
- return false;
-
- pp = page->pp;
-
- /* Allow direct recycle if we have reasons to believe that we are
- * in the same context as the consumer would run, so there's
- * no possible race.
- */
- napi = READ_ONCE(pp->p.napi);
- allow_direct = napi_safe && napi &&
- READ_ONCE(napi->list_owner) == smp_processor_id();
-
- /* Driver set this to memory recycling info. Reset it on recycle.
- * This will *not* work for NIC using a split-page memory model.
- * The page will be returned to the pool here regardless of the
- * 'flipped' fragment being in use or not.
- */
- page_pool_put_full_page(pp, page, allow_direct);
-
- return true;
-}
-EXPORT_SYMBOL(page_pool_return_skb_page);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7edabf17988a..4b7d00d5b5d7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -879,6 +879,34 @@ static void skb_clone_fraglist(struct sk_buff *skb)
skb_get(list);
}

+bool page_pool_return_skb_page(struct page *page, bool napi_safe)
+{
+ struct napi_struct *napi;
+ struct page_pool *pp;
+ bool allow_direct;
+
+ page = compound_head(page);
+ pp = page->pp;
+
+ /* Allow direct recycle if we have reasons to believe that we are
+ * in the same context as the consumer would run, so there's
+ * no possible race.
+ */
+ napi = READ_ONCE(pp->p.napi);
+ allow_direct = napi_safe && napi &&
+ READ_ONCE(napi->list_owner) == smp_processor_id();
+
+ /* Driver set this to memory recycling info. Reset it on recycle.
+ * This will *not* work for NIC using a split-page memory model.
+ * The page will be returned to the pool here regardless of the
+ * 'flipped' fragment being in use or not.
+ */
+ page_pool_put_full_page(pp, page, allow_direct);
+
+ return true;
+}
+EXPORT_SYMBOL(page_pool_return_skb_page);
+
static bool skb_pp_recycle(struct sk_buff *skb, void *data, bool napi_safe)
{
if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
--
2.41.0


2023-06-29 17:31:41

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH RFC net-next 1/4] net: skbuff: don't include <net/page_pool.h> to <linux/skbuff.h>

On Thu, 2023-06-29 at 17:23 +0200, Alexander Lobakin wrote:
> Currently, touching <net/page_pool.h> triggers a rebuild of more than
> a half of the kernel. That's because it's included in <linux/skbuff.h>.
> And each new include to page_pool.h adds more [useless] data for the
> toolchain to process per each source file from that pile.
>
> In commit 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB
> recycling"), Matteo included it to be able to call a couple functions
> defined there. Then, in commit 57f05bc2ab24 ("page_pool: keep pp info
> as long as page pool owns the page") one of the calls was removed, so
> only one left. It's the call to page_pool_return_skb_page() in
> napi_frag_unref(). The function is external and doesn't have any
> dependencies. Having include of very niche page_pool.h only for that
> looks like an overkill.
> As Alex noticed, the only thing that holds this function in page_pool.c
> is %PP_SIGNATURE. By moving the check for magic a couple functions up,
> the whole page_pool_return_skb_page() can be moved to skbuff.c.
> The arguments for moving the check are the following:
>
> 1) It checks for a corner case that shouldn't ever happen when the code
> is sane. And execution speed doesn't matter on exception path, thus
> doing more calls before bailing out doesn't make any weather.
> 2) There are 2 users of the internal __page_pool_put_page(), where this
> check is moved: page_pool_put_defragged_page() and
> page_pool_put_page_bulk(). Both are exported and can be called from
> the drivers, but previously only the skb recycling path of the former
> was protected with the magic check. So this also makes the code a bit
> more reliable.
>
> After the check is moved, teleport page_pool_return_skb_page() to
> skbuff.c, just next to the main consumer, skb_pp_recycle(). It's used
> also in napi_frag_unref() -> {__,}skb_frag_unref(), so no `static`
> unfortunately. Maybe next time.
> Now, after a few include fixes in the drivers, touching page_pool.h
> only triggers rebuilding of the drivers using it and a couple core
> networking files.
>
> Suggested-by: Jakub Kicinski <[email protected]> # make skbuff.h less heavy
> Suggested-by: Alexander Duyck <[email protected]> # move to skbuff.c
> Signed-off-by: Alexander Lobakin <[email protected]>
> ---
> drivers/net/ethernet/engleder/tsnep_main.c | 1 +
> drivers/net/ethernet/freescale/fec_main.c | 1 +
> .../marvell/octeontx2/nic/otx2_common.c | 1 +
> .../ethernet/marvell/octeontx2/nic/otx2_pf.c | 1 +
> .../ethernet/mellanox/mlx5/core/en/params.c | 1 +
> .../net/ethernet/mellanox/mlx5/core/en/xdp.c | 1 +
> drivers/net/wireless/mediatek/mt76/mt76.h | 1 +
> include/linux/skbuff.h | 3 +-
> include/net/page_pool.h | 2 -
> net/core/page_pool.c | 52 +++++--------------
> net/core/skbuff.c | 28 ++++++++++
> 11 files changed, 50 insertions(+), 42 deletions(-)
>
> diff --git a/drivers/net/ethernet/engleder/tsnep_main.c b/drivers/net/ethernet/engleder/tsnep_main.c
> index 84751bb303a6..6222aaa5157f 100644
> --- a/drivers/net/ethernet/engleder/tsnep_main.c
> +++ b/drivers/net/ethernet/engleder/tsnep_main.c
> @@ -28,6 +28,7 @@
> #include <linux/iopoll.h>
> #include <linux/bpf.h>
> #include <linux/bpf_trace.h>
> +#include <net/page_pool.h>
> #include <net/xdp_sock_drv.h>
>
> #define TSNEP_RX_OFFSET (max(NET_SKB_PAD, XDP_PACKET_HEADROOM) + NET_IP_ALIGN)
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index 8fbe47703d47..99b3b4f79603 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -38,6 +38,7 @@
> #include <linux/in.h>
> #include <linux/ip.h>
> #include <net/ip.h>
> +#include <net/page_pool.h>
> #include <net/selftests.h>
> #include <net/tso.h>
> #include <linux/tcp.h>
> diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
> index a5d03583bf79..d17a0ebc9036 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
> +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
> @@ -7,6 +7,7 @@
>
> #include <linux/interrupt.h>
> #include <linux/pci.h>
> +#include <net/page_pool.h>
> #include <net/tso.h>
> #include <linux/bitfield.h>
>
> diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
> index fe8ea4e531b7..7eca434a0550 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
> +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
> @@ -16,6 +16,7 @@
> #include <linux/bpf.h>
> #include <linux/bpf_trace.h>
> #include <linux/bitfield.h>
> +#include <net/page_pool.h>
>
> #include "otx2_reg.h"
> #include "otx2_common.h"
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> index 5ce28ff7685f..0f152f14165b 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> @@ -6,6 +6,7 @@
> #include "en/port.h"
> #include "en_accel/en_accel.h"
> #include "en_accel/ipsec.h"
> +#include <net/page_pool.h>
> #include <net/xdp_sock_drv.h>
>
> static u8 mlx5e_mpwrq_min_page_shift(struct mlx5_core_dev *mdev)
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> index f0e6095809fa..1bd91bc09eb8 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> @@ -35,6 +35,7 @@
> #include "en/xdp.h"
> #include "en/params.h"
> #include <linux/bitfield.h>
> +#include <net/page_pool.h>
>
> int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk)
> {
> diff --git a/drivers/net/wireless/mediatek/mt76/mt76.h b/drivers/net/wireless/mediatek/mt76/mt76.h
> index 6b07b8fafec2..95c16f11d156 100644
> --- a/drivers/net/wireless/mediatek/mt76/mt76.h
> +++ b/drivers/net/wireless/mediatek/mt76/mt76.h
> @@ -15,6 +15,7 @@
> #include <linux/average.h>
> #include <linux/soc/mediatek/mtk_wed.h>
> #include <net/mac80211.h>
> +#include <net/page_pool.h>
> #include "util.h"
> #include "testmode.h"
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 91ed66952580..f76d172ed262 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -32,7 +32,6 @@
> #include <linux/if_packet.h>
> #include <linux/llist.h>
> #include <net/flow.h>
> -#include <net/page_pool.h>
> #if IS_ENABLED(CONFIG_NF_CONNTRACK)
> #include <linux/netfilter/nf_conntrack_common.h>
> #endif
> @@ -3423,6 +3422,8 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
> __skb_frag_ref(&skb_shinfo(skb)->frags[f]);
> }
>
> +bool page_pool_return_skb_page(struct page *page, bool napi_safe);
> +
> static inline void
> napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe)
> {
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 2b7db9992fc0..829dc1f8ba6b 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -307,8 +307,6 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
> return pool->p.dma_dir;
> }
>
> -bool page_pool_return_skb_page(struct page *page, bool napi_safe);
> -
> struct page_pool *page_pool_create(const struct page_pool_params *params);
>
> struct xdp_mem_info;
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 985ccaffc06a..dff0b4fa2316 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -582,6 +582,19 @@ static __always_inline struct page *
> __page_pool_put_page(struct page_pool *pool, struct page *page,
> unsigned int dma_sync_size, bool allow_direct)
> {
> + /* Avoid recycling non-PP pages, give them back to the page allocator.
> + * page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> + * in order to preserve any existing bits, such as bit 0 for the
> + * head page of compound page and bit 1 for pfmemalloc page, so
> + * mask those bits for freeing side when doing below checking,
> + * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> + * to avoid recycling the pfmemalloc page.
> + */
> + if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE)) {
> + put_page(page);
> + return NULL;
> + }
> +

Rather than moving this block of code down into here. You may just want
to look at creating an inline function that would act as a accessor for
retrieving the page pool for pages with the signature, and for those
without just returning NULL.

> /* This allocator is optimized for the XDP mode that uses
> * one-frame-per-page, but have fallbacks that act like the
> * regular page allocator APIs.
> @@ -913,42 +926,3 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
> }
> }
> EXPORT_SYMBOL(page_pool_update_nid);
> -
> -bool page_pool_return_skb_page(struct page *page, bool napi_safe)
> -{
> - struct napi_struct *napi;
> - struct page_pool *pp;
> - bool allow_direct;
> -
> - page = compound_head(page);
> -
> - /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> - * in order to preserve any existing bits, such as bit 0 for the
> - * head page of compound page and bit 1 for pfmemalloc page, so
> - * mask those bits for freeing side when doing below checking,
> - * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> - * to avoid recycling the pfmemalloc page.
> - */
> - if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> - return false;
> -
> - pp = page->pp;
> -
> - /* Allow direct recycle if we have reasons to believe that we are
> - * in the same context as the consumer would run, so there's
> - * no possible race.
> - */
> - napi = READ_ONCE(pp->p.napi);
> - allow_direct = napi_safe && napi &&
> - READ_ONCE(napi->list_owner) == smp_processor_id();
> -
> - /* Driver set this to memory recycling info. Reset it on recycle.
> - * This will *not* work for NIC using a split-page memory model.
> - * The page will be returned to the pool here regardless of the
> - * 'flipped' fragment being in use or not.
> - */
> - page_pool_put_full_page(pp, page, allow_direct);
> -
> - return true;
> -}
> -EXPORT_SYMBOL(page_pool_return_skb_page);
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 7edabf17988a..4b7d00d5b5d7 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -879,6 +879,34 @@ static void skb_clone_fraglist(struct sk_buff *skb)
> skb_get(list);
> }
>
> +bool page_pool_return_skb_page(struct page *page, bool napi_safe)
> +{
> + struct napi_struct *napi;
> + struct page_pool *pp;
> + bool allow_direct;
> +
> + page = compound_head(page);
> + pp = page->pp;

So this is just assuming that any page we pass thru is a page pool
page. The problem is there may be some other pointer stored here that
could cause issues.

I would suggest creating an accessor as mentioned above to verify it is
a page pool page before you access page->pp.

> +
> + /* Allow direct recycle if we have reasons to believe that we are
> + * in the same context as the consumer would run, so there's
> + * no possible race.
> + */
> + napi = READ_ONCE(pp->p.napi);
> + allow_direct = napi_safe && napi &&
> + READ_ONCE(napi->list_owner) == smp_processor_id();
> +
> + /* Driver set this to memory recycling info. Reset it on recycle.
> + * This will *not* work for NIC using a split-page memory model.
> + * The page will be returned to the pool here regardless of the
> + * 'flipped' fragment being in use or not.
> + */
> + page_pool_put_full_page(pp, page, allow_direct);
> +
> + return true;
> +}
> +EXPORT_SYMBOL(page_pool_return_skb_page);
> +
> static bool skb_pp_recycle(struct sk_buff *skb, void *data, bool napi_safe)
> {
> if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)


2023-06-30 13:10:20

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH RFC net-next 1/4] net: skbuff: don't include <net/page_pool.h> to <linux/skbuff.h>

From: Alexander H Duyck <[email protected]>
Date: Thu, 29 Jun 2023 09:55:15 -0700

> On Thu, 2023-06-29 at 17:23 +0200, Alexander Lobakin wrote:
>> Currently, touching <net/page_pool.h> triggers a rebuild of more than
>> a half of the kernel. That's because it's included in <linux/skbuff.h>.
>> And each new include to page_pool.h adds more [useless] data for the
>> toolchain to process per each source file from that pile.

[...]

>> +bool page_pool_return_skb_page(struct page *page, bool napi_safe)
>> +{
>> + struct napi_struct *napi;
>> + struct page_pool *pp;
>> + bool allow_direct;
>> +
>> + page = compound_head(page);
>> + pp = page->pp;
>
> So this is just assuming that any page we pass thru is a page pool
> page. The problem is there may be some other pointer stored here that
> could cause issues.

But that is exactly what you suggested in the previous revision's
thread... Hey! :D

"I suspect we could look at pulling parts of it out as well. The
pp_magic check should always be succeeding unless we have pages getting
routed the wrong way somewhere. So maybe we should look at pulling it
out and moving it to another part of the path such as
__page_pool_put_page() and making it a bit more visible to catch those
cases".

Anyway, since some drivers still mix PP pages with non-PP ones (mt76
IIRC, maybe more), I feel the check for magic is still relevant. I just
believed you and forgot about that T.T

>
> I would suggest creating an accessor as mentioned above to verify it is
> a page pool page before you access page->pp.
>
>> +
>> + /* Allow direct recycle if we have reasons to believe that we are
[...]

Thanks,
Olek

2023-06-30 15:18:44

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH RFC net-next 1/4] net: skbuff: don't include <net/page_pool.h> to <linux/skbuff.h>

On Fri, Jun 30, 2023 at 5:39 AM Alexander Lobakin
<[email protected]> wrote:
>
> From: Alexander H Duyck <[email protected]>
> Date: Thu, 29 Jun 2023 09:55:15 -0700
>
> > On Thu, 2023-06-29 at 17:23 +0200, Alexander Lobakin wrote:
> >> Currently, touching <net/page_pool.h> triggers a rebuild of more than
> >> a half of the kernel. That's because it's included in <linux/skbuff.h>.
> >> And each new include to page_pool.h adds more [useless] data for the
> >> toolchain to process per each source file from that pile.
>
> [...]
>
> >> +bool page_pool_return_skb_page(struct page *page, bool napi_safe)
> >> +{
> >> + struct napi_struct *napi;
> >> + struct page_pool *pp;
> >> + bool allow_direct;
> >> +
> >> + page = compound_head(page);
> >> + pp = page->pp;
> >
> > So this is just assuming that any page we pass thru is a page pool
> > page. The problem is there may be some other pointer stored here that
> > could cause issues.
>
> But that is exactly what you suggested in the previous revision's
> thread... Hey! :D
>
> "I suspect we could look at pulling parts of it out as well. The
> pp_magic check should always be succeeding unless we have pages getting
> routed the wrong way somewhere. So maybe we should look at pulling it
> out and moving it to another part of the path such as
> __page_pool_put_page() and making it a bit more visible to catch those
> cases".

Yeah, I have had a few off days recently, amazing what happens when
you are running on only a few hours of sleep.. :-/

Anyway, as I was saying it might make sense to wrap the whole thing up
as a page pool accessor that would return NULL if the MAGIC value
isn't present. Alternatively one other possibility would be to look at
creating an inline function here that could check to see if the
skb_frag is page pool and then just keep that in sk_buff.h since it
looks like it should only need to rely on the page struct and
PP_SIGNATURE which is a poison.h value. With that napi_frag_unref
could avoid an unnecessary trip into the page_pool_return_skb_page
function entirely if it isn't a page pool page and we could look at
dropping the return value from page_pool_return_skb_page entirely.

> Anyway, since some drivers still mix PP pages with non-PP ones (mt76
> IIRC, maybe more), I feel the check for magic is still relevant. I just
> believed you and forgot about that T.T

Yeah, sorry about that, my bad. I was a bit too focused on the main
drivers we use and not thinking outside the box enough.

2023-06-30 16:41:15

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH RFC net-next 1/4] net: skbuff: don't include <net/page_pool.h> to <linux/skbuff.h>

From: Alexander Duyck <[email protected]>
Date: Fri, 30 Jun 2023 08:11:02 -0700

> On Fri, Jun 30, 2023 at 5:39 AM Alexander Lobakin
> <[email protected]> wrote:
>>
>> From: Alexander H Duyck <[email protected]>
>> Date: Thu, 29 Jun 2023 09:55:15 -0700
>>
>>> On Thu, 2023-06-29 at 17:23 +0200, Alexander Lobakin wrote:
>>>> Currently, touching <net/page_pool.h> triggers a rebuild of more than
>>>> a half of the kernel. That's because it's included in <linux/skbuff.h>.
>>>> And each new include to page_pool.h adds more [useless] data for the
>>>> toolchain to process per each source file from that pile.
>>
>> [...]
>>
>>>> +bool page_pool_return_skb_page(struct page *page, bool napi_safe)
>>>> +{
>>>> + struct napi_struct *napi;
>>>> + struct page_pool *pp;
>>>> + bool allow_direct;
>>>> +
>>>> + page = compound_head(page);
>>>> + pp = page->pp;
>>>
>>> So this is just assuming that any page we pass thru is a page pool
>>> page. The problem is there may be some other pointer stored here that
>>> could cause issues.
>>
>> But that is exactly what you suggested in the previous revision's
>> thread... Hey! :D
>>
>> "I suspect we could look at pulling parts of it out as well. The
>> pp_magic check should always be succeeding unless we have pages getting
>> routed the wrong way somewhere. So maybe we should look at pulling it
>> out and moving it to another part of the path such as
>> __page_pool_put_page() and making it a bit more visible to catch those
>> cases".
>
> Yeah, I have had a few off days recently, amazing what happens when
> you are running on only a few hours of sleep.. :-/

Aaah I see :D

>
> Anyway, as I was saying it might make sense to wrap the whole thing up
> as a page pool accessor that would return NULL if the MAGIC value
> isn't present. Alternatively one other possibility would be to look at
> creating an inline function here that could check to see if the
> skb_frag is page pool and then just keep that in sk_buff.h since it
> looks like it should only need to rely on the page struct and
> PP_SIGNATURE which is a poison.h value. With that napi_frag_unref
> could avoid an unnecessary trip into the page_pool_return_skb_page
> function entirely if it isn't a page pool page and we could look at
> dropping the return value from page_pool_return_skb_page entirely.

I like this one. This is rather simple check and shouldn't cause code
size inflating, but I'll recheck with bloat-o-meter just in case.

>
>> Anyway, since some drivers still mix PP pages with non-PP ones (mt76
>> IIRC, maybe more), I feel the check for magic is still relevant. I just
>> believed you and forgot about that T.T
>
> Yeah, sorry about that, my bad. I was a bit too focused on the main
> drivers we use and not thinking outside the box enough.

Not a prob!

Thanks,
Olek

2023-07-02 00:23:38

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH RFC net-next 0/4] net: page_pool: a couple assorted optimizations

On Thu, 29 Jun 2023 17:23:01 +0200 Alexander Lobakin wrote:
> #3: new, prereq to #4. Add NAPI state flag, which would indicate
> napi->poll() is running right now, so that napi->list_owner would
> point to the CPU where it's being run, not just scheduled;
> #4: new. In addition to recycling skb PP pages directly when @napi_safe
> is set, check for the flag from #3, which will mean the same if
> ->list_owner is pointing to us. This allows to use direct recycling
> anytime we're inside a NAPI polling loop or GRO stuff going right
> after it, covering way more cases than is right now.

You know NAPI pretty well so I'm worried I'm missing something.
I don't think the new flag adds any value. NAPI does not have to
be running, you can drop patch 3 and use in_softirq() instead of
the new flag, AFAIU.

The reason I did not do that is that I wasn't sure if there is no
weird (netcons?) case where skb gets freed from an IRQ :(

2023-07-03 14:23:29

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH RFC net-next 0/4] net: page_pool: a couple assorted optimizations

From: Jakub Kicinski <[email protected]>
Date: Sat, 1 Jul 2023 17:01:55 -0700

> On Thu, 29 Jun 2023 17:23:01 +0200 Alexander Lobakin wrote:
>> #3: new, prereq to #4. Add NAPI state flag, which would indicate
>> napi->poll() is running right now, so that napi->list_owner would
>> point to the CPU where it's being run, not just scheduled;
>> #4: new. In addition to recycling skb PP pages directly when @napi_safe
>> is set, check for the flag from #3, which will mean the same if
>> ->list_owner is pointing to us. This allows to use direct recycling
>> anytime we're inside a NAPI polling loop or GRO stuff going right
>> after it, covering way more cases than is right now.
>
> You know NAPI pretty well so I'm worried I'm missing something.

I wouldn't say I know it well :D

> I don't think the new flag adds any value. NAPI does not have to
> be running, you can drop patch 3 and use in_softirq() instead of
> the new flag, AFAIU.

That's most likely true for the patch 4 case, but I wanted to add some
flag for wider usage.
For example, busy polling relies on whether ->poll() returned whole
budget to decide whether interrupts were reenabled to avoid possible
concurrent access, but I wouldn't say it's precise enough.
napi_complete_done() doesn't always return true.
OTOH, the new flag or, more precisely, flag + list_owner combo would
tell for sure.

>
> The reason I did not do that is that I wasn't sure if there is no
> weird (netcons?) case where skb gets freed from an IRQ :(

Shouldn't they use dev_kfree_skb_any() or _irq()? Usage of plain
kfree_skb() is not allowed in the TH :s

Anyway, if the flag really makes no sense, I can replace it with
in_softirq(), it's my hobby to break weird drivers :D

Thanks,
Olek

2023-07-03 19:09:06

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH RFC net-next 0/4] net: page_pool: a couple assorted optimizations

On Mon, 3 Jul 2023 15:50:55 +0200 Alexander Lobakin wrote:
> > The reason I did not do that is that I wasn't sure if there is no
> > weird (netcons?) case where skb gets freed from an IRQ :(
>
> Shouldn't they use dev_kfree_skb_any() or _irq()? Usage of plain
> kfree_skb() is not allowed in the TH :s

I haven't looked at the code so I could be lying but I thought that
the only thing that can't run in hard IRQ context is the destructor,
so if the caller knows there's no destructor they can free the skb.

I'd ask you the inverse question. If the main use case is skb xdp
(which eh, uh, okay..) then why not make it use napi_consume_skb()?
I don't think skb XDP can run in hard IRQ context, can it?

> Anyway, if the flag really makes no sense, I can replace it with
> in_softirq(), it's my hobby to break weird drivers :D

2023-07-05 13:08:21

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH RFC net-next 0/4] net: page_pool: a couple assorted optimizations

From: Jakub Kicinski <[email protected]>
Date: Mon, 3 Jul 2023 11:57:34 -0700

> On Mon, 3 Jul 2023 15:50:55 +0200 Alexander Lobakin wrote:
>>> The reason I did not do that is that I wasn't sure if there is no
>>> weird (netcons?) case where skb gets freed from an IRQ :(
>>
>> Shouldn't they use dev_kfree_skb_any() or _irq()? Usage of plain
>> kfree_skb() is not allowed in the TH :s
>
> I haven't looked at the code so I could be lying but I thought that
> the only thing that can't run in hard IRQ context is the destructor,
> so if the caller knows there's no destructor they can free the skb.
>
> I'd ask you the inverse question. If the main use case is skb xdp
> (which eh, uh, okay..) then why not make it use napi_consume_skb()?

Remember about Wi-Fi, DSA, and other poor citizens with no native XDP! :D
That was mostly a joke, but I thought of this, too. At the end my
thought was "let's try making it cover more usecases" and I found this
approach. I'm not saying it's optimal or even much needed, that's why I
sent it to discuss basically.

(e.g. I wanted to try speed up xdp_return_frame{,_bulk}() using it)

> I don't think skb XDP can run in hard IRQ context, can it?

skb XDP can't happen in the TH and I think we could assume it's safe to
use napi_consume_skb() there (with a fake non-zero budget :p).

>
>> Anyway, if the flag really makes no sense, I can replace it with
>> in_softirq(), it's my hobby to break weird drivers :D

Thanks,
Olek