Here's a two-shot: introduce the Intel Ethernet common library (libie) and
switch iavf to Page Pool. Details are in the commit messages; here's a
summary:
It's no secret that there's a ton of code duplication between two or more
Intel Ethernet modules. Before introducing new changes, which would need to
be copied over again, start decoupling the already existing duplicate
functionality into a new module, which will be shared between several
Intel Ethernet drivers.
The first thing that came to my mind was "libie" -- "Intel Ethernet
common library". It also sounds like "lovelie" and can be expanded as
"lib Internet Explorer" :P I'm open to anything else (as long as it's
justified).
The series is only the beginning. From now on, adding any new feature
or doing any good driver refactoring will remove far more lines than it
adds for quite some time. There's a basic roadmap with some deduplications
planned already, not to mention that touching any line now raises the
question: "can I share this?".
The PP conversion for iavf lands within the same series, as the two are
closely tied. libie will support the Page Pool model only, so a driver
can't use much of the lib until it's converted. iavf is only the first
example, the rest will eventually be converted on a per-driver basis. That
is when it gets really interesting. Stay tech.
Alexander Lobakin (12):
net: intel: introduce Intel Ethernet common library
iavf: kill "legacy-rx" for good
iavf: optimize Rx buffer allocation a bunch
iavf: remove page splitting/recycling
iavf: always use a full order-0 page
net: skbuff: don't include <net/page_pool.h> into <linux/skbuff.h>
net: page_pool: avoid calling no-op externals when possible
net: page_pool: add DMA-sync-for-CPU inline helpers
iavf: switch to Page Pool
libie: add common queue stats
libie: add per-queue Page Pool stats
iavf: switch queue stats to libie
MAINTAINERS | 3 +-
drivers/net/ethernet/engleder/tsnep_main.c | 1 +
drivers/net/ethernet/freescale/fec_main.c | 1 +
drivers/net/ethernet/intel/Kconfig | 12 +-
drivers/net/ethernet/intel/Makefile | 1 +
drivers/net/ethernet/intel/i40e/i40e_common.c | 253 -------
drivers/net/ethernet/intel/i40e/i40e_main.c | 1 +
.../net/ethernet/intel/i40e/i40e_prototype.h | 7 -
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 74 +-
drivers/net/ethernet/intel/i40e/i40e_type.h | 88 ---
drivers/net/ethernet/intel/iavf/iavf.h | 2 +-
drivers/net/ethernet/intel/iavf/iavf_common.c | 253 -------
.../net/ethernet/intel/iavf/iavf_ethtool.c | 227 +-----
drivers/net/ethernet/intel/iavf/iavf_main.c | 45 +-
.../net/ethernet/intel/iavf/iavf_prototype.h | 7 -
drivers/net/ethernet/intel/iavf/iavf_txrx.c | 715 +++++-------------
drivers/net/ethernet/intel/iavf/iavf_txrx.h | 185 +----
drivers/net/ethernet/intel/iavf/iavf_type.h | 90 ---
.../net/ethernet/intel/iavf/iavf_virtchnl.c | 16 +-
.../net/ethernet/intel/ice/ice_lan_tx_rx.h | 316 --------
drivers/net/ethernet/intel/ice/ice_main.c | 1 +
drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 74 +-
drivers/net/ethernet/intel/libie/Makefile | 7 +
drivers/net/ethernet/intel/libie/internal.h | 23 +
drivers/net/ethernet/intel/libie/rx.c | 158 ++++
drivers/net/ethernet/intel/libie/stats.c | 190 +++++
.../marvell/octeontx2/nic/otx2_common.c | 1 +
.../ethernet/marvell/octeontx2/nic/otx2_pf.c | 1 +
.../ethernet/mellanox/mlx5/core/en/params.c | 1 +
.../net/ethernet/mellanox/mlx5/core/en/xdp.c | 1 +
drivers/net/wireless/mediatek/mt76/mt76.h | 1 +
include/linux/net/intel/libie/rx.h | 170 +++++
include/linux/net/intel/libie/stats.h | 214 ++++++
include/linux/skbuff.h | 4 +-
include/net/page_pool.h | 69 +-
net/core/page_pool.c | 10 +
36 files changed, 1141 insertions(+), 2081 deletions(-)
create mode 100644 drivers/net/ethernet/intel/libie/Makefile
create mode 100644 drivers/net/ethernet/intel/libie/internal.h
create mode 100644 drivers/net/ethernet/intel/libie/rx.c
create mode 100644 drivers/net/ethernet/intel/libie/stats.c
create mode 100644 include/linux/net/intel/libie/rx.h
create mode 100644 include/linux/net/intel/libie/stats.h
---
Directly to net-next, has non-Intel code changes (0006-0008).
From v2[0]:
* 0006: fix page_pool.h include in OcteonTX2 files (Jakub, Patchwork);
* no functional changes.
From v1[1]:
* 0006: new (me, Jakub);
* 0008: give the helpers more intuitive names (Jakub, Ilias);
* -^-: also expand their kdoc a bit for the same reason;
* -^-: fix kdoc copy-paste issue (Patchwork, Jakub);
* 0011: drop `inline` from C file (Patchwork, Jakub).
[0] https://lore.kernel.org/netdev/[email protected]
[1] https://lore.kernel.org/netdev/[email protected]
--
2.40.1
It turned out that page_pool_put{,_full}_page() can burn quite a bunch of
cycles even on DMA-coherent platforms (like x86) with no active IOMMU or
swiotlb, just for the call ladder.
Indeed, it's

page_pool_put_page()
  page_pool_put_defragged_page()                  <- external
    __page_pool_put_page()
      page_pool_dma_sync_for_device()             <- non-inline
        dma_sync_single_range_for_device()
          dma_sync_single_for_device()            <- external
            dma_direct_sync_single_for_device()
              dev_is_dma_coherent()               <- exit

For the inline functions, there's no guarantee the compiler won't uninline
them (they're clearly not one-liners and sometimes compilers uninline even
2 + 2). The first external call is necessary, but the remaining two or more
are made for nothing each time, plus a bunch of checks here and there.
Since Page Pool mappings are long-term and for one "device + addr" pair
dma_need_sync() will always return the same value (basically, whether it
belongs to an swiotlb pool), addresses can be tested once right after
they're obtained and the result can be reused until the page is unmapped.
Define a new PP flag, which means "do DMA syncs for device, but only
when needed", and turn it on by default when the driver asks to sync
pages. When a page is mapped, check whether it needs syncs and if so,
switch that "sync when needed" back to "always do syncs" globally for
the whole pool (better safe than sorry). As long as a pool has no pages
requiring DMA syncs, this cuts off a good piece of calls and checks.
On my x86_64, this gives a 2% to 5% performance benefit, with no
negative impact in cases where an IOMMU is on and the shortcut can't be
used.
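To make the flag transitions easier to follow, here is a minimal userspace
sketch of the logic described above. It is not the kernel code: struct
toy_pool, toy_pool_init(), toy_pool_map() and toy_pool_put() are
hypothetical names, the needs_sync argument stands in for the result of
dma_need_sync(), and a counter replaces the actual sync-for-device call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT(n)			(1U << (n))
#define PP_FLAG_DMA_MAP		BIT(0)
#define PP_FLAG_DMA_SYNC_DEV	BIT(1)
#define PP_FLAG_DMA_MAYBE_SYNC	BIT(3)	/* internal, as in the patch */

/* Toy stand-in for struct page_pool, only what the sketch needs */
struct toy_pool {
	uint32_t flags;
	unsigned int syncs_done;	/* simulated sync-for-device calls */
};

/* Models the page_pool_init() hunk: demote "always sync" to "maybe sync" */
static void toy_pool_init(struct toy_pool *pool, uint32_t flags)
{
	assert(pool);

	pool->flags = flags;
	pool->syncs_done = 0;

	if (pool->flags & PP_FLAG_DMA_SYNC_DEV) {
		pool->flags |= PP_FLAG_DMA_MAYBE_SYNC;
		pool->flags &= ~PP_FLAG_DMA_SYNC_DEV;
	}
}

/* Models the page_pool_dma_map() hunk: one page that needs syncs flips
 * the whole pool back to "always sync"
 */
static void toy_pool_map(struct toy_pool *pool, bool needs_sync)
{
	if ((pool->flags & PP_FLAG_DMA_MAYBE_SYNC) && needs_sync) {
		pool->flags |= PP_FLAG_DMA_SYNC_DEV;
		pool->flags &= ~PP_FLAG_DMA_MAYBE_SYNC;
	}

	if (pool->flags & PP_FLAG_DMA_SYNC_DEV)
		pool->syncs_done++;
}

/* Models the recycling path: the sync is skipped unless the flag is set */
static void toy_pool_put(struct toy_pool *pool)
{
	if (pool->flags & PP_FLAG_DMA_SYNC_DEV)
		pool->syncs_done++;
}
```

In this model, a pool that only ever maps pages with needs_sync == false
never increments syncs_done, which corresponds to the shortcut above. The
first page mapped with needs_sync == true restores the old behaviour for
every later map and put.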
Signed-off-by: Alexander Lobakin <[email protected]>
---
include/net/page_pool.h | 3 +++
net/core/page_pool.c | 10 ++++++++++
2 files changed, 13 insertions(+)
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 2a9ce2aa6eb2..ee895376270e 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -46,6 +46,9 @@
* device driver responsibility
*/
#define PP_FLAG_PAGE_FRAG BIT(2) /* for page frag feature */
+#define PP_FLAG_DMA_MAYBE_SYNC BIT(3) /* Internal, should not be used in
+ * drivers
+ */
#define PP_FLAG_ALL (PP_FLAG_DMA_MAP |\
PP_FLAG_DMA_SYNC_DEV |\
PP_FLAG_PAGE_FRAG)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a3e12a61d456..102b5e3718c2 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -198,6 +198,10 @@ static int page_pool_init(struct page_pool *pool,
/* pool->p.offset has to be set according to the address
* offset used by the DMA engine to start copying rx data
*/
+
+ /* Try to avoid calling no-op syncs */
+ pool->p.flags |= PP_FLAG_DMA_MAYBE_SYNC;
+ pool->p.flags &= ~PP_FLAG_DMA_SYNC_DEV;
}
if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
@@ -346,6 +350,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
page_pool_set_dma_addr(page, dma);
+ if ((pool->p.flags & PP_FLAG_DMA_MAYBE_SYNC) &&
+ dma_need_sync(pool->p.dev, dma)) {
+ pool->p.flags |= PP_FLAG_DMA_SYNC_DEV;
+ pool->p.flags &= ~PP_FLAG_DMA_MAYBE_SYNC;
+ }
+
if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
--
2.40.1
Ever since build_skb() became stable, the old way of allocating a separate
skb to store the headers, which then get copied manually, has been slower,
less flexible and thus obsolete.
* it puts higher pressure on MM since it actually allocates new pages,
which then get split and refcount-biased (NAPI page cache);
* it implies a memcpy() of the packet headers (40+ bytes per frame);
* the actual header length is calculated via eth_get_headlen(), which
invokes the Flow Dissector and thus wastes a bunch of CPU cycles;
* XDP makes it even weirder, since it has required headroom for a long
time and also tailroom for some time now (since mbuf landed). Take a look
at the ice driver, which is built around workarounds to make XDP work
with it.
Even on some quite low-end hardware (not a common case for 100G NICs) it
was performing worse.
The only advantage "legacy-rx" had is that it didn't require any
reserved headroom and tailroom. But iavf didn't make use of this, as it
always splits pages into two halves of 2k, while that saving would only be
useful when striding. And again, XDP effectively removes that sole pro.
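As a rough illustration of that trade-off, the sketch below computes how
much of a fixed per-frame buffer is left for packet data once headroom and
tailroom are reserved. Everything in it is hypothetical: struct
toy_rx_layout and toy_max_payload() are made-up names, and the 64 and 320
bytes used in the usage note are placeholder numbers, not the real
IAVF_SKB_PAD or skb_shared_info sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one Rx buffer */
struct toy_rx_layout {
	size_t buf_size;	/* bytes reserved per frame, e.g. half a page */
	size_t headroom;	/* reserved before the data, 0 for "legacy-rx" */
	size_t tailroom;	/* reserved after the data, 0 for "legacy-rx" */
};

/* Bytes left for the packet data inside one buffer */
static size_t toy_max_payload(const struct toy_rx_layout *layout)
{
	assert(layout->headroom + layout->tailroom <= layout->buf_size);

	return layout->buf_size - layout->headroom - layout->tailroom;
}
```

With a 2048-byte half-page and the placeholder 64 + 320 bytes reserved,
1664 bytes remain for data, versus the full 2048 without any reservation.
Since the buffer stays a fixed half-page either way, the extra 384 bytes
don't let more frames fit into a page, which is the point made above.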
There's a train of features to land in IAVF soon: Page Pool, XDP, XSk,
multi-buffer etc. Each new one would require adding more and more Danse
Macabre for absolutely no reason, besides making the hotpath less and less
efficient.
Remove the "feature" with all the related code. This includes at least
one very hot branch (typically hit on each new frame), which was either
always-true or always-false at least for a complete NAPI bulk of 64
frames, the whole private flags cruft and so on. Some stats:
Function: add/remove: 0/2 grow/shrink: 0/7 up/down: 0/-774 (-774)
RO Data: add/remove: 0/1 grow/shrink: 0/0 up/down: 0/-40 (-40)
Signed-off-by: Alexander Lobakin <[email protected]>
---
drivers/net/ethernet/intel/iavf/iavf.h | 2 +-
.../net/ethernet/intel/iavf/iavf_ethtool.c | 140 ------------------
drivers/net/ethernet/intel/iavf/iavf_main.c | 10 +-
drivers/net/ethernet/intel/iavf/iavf_txrx.c | 84 +----------
drivers/net/ethernet/intel/iavf/iavf_txrx.h | 18 +--
.../net/ethernet/intel/iavf/iavf_virtchnl.c | 3 +-
6 files changed, 8 insertions(+), 249 deletions(-)
diff --git a/drivers/net/ethernet/intel/iavf/iavf.h b/drivers/net/ethernet/intel/iavf/iavf.h
index 9abaff1f2aff..a780e7aa1c2f 100644
--- a/drivers/net/ethernet/intel/iavf/iavf.h
+++ b/drivers/net/ethernet/intel/iavf/iavf.h
@@ -298,7 +298,7 @@ struct iavf_adapter {
#define IAVF_FLAG_CLIENT_NEEDS_L2_PARAMS BIT(12)
#define IAVF_FLAG_PROMISC_ON BIT(13)
#define IAVF_FLAG_ALLMULTI_ON BIT(14)
-#define IAVF_FLAG_LEGACY_RX BIT(15)
+/* BIT(15) is free, was IAVF_FLAG_LEGACY_RX */
#define IAVF_FLAG_REINIT_ITR_NEEDED BIT(16)
#define IAVF_FLAG_QUEUES_DISABLED BIT(17)
#define IAVF_FLAG_SETUP_NETDEV_FEATURES BIT(18)
diff --git a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
index 6f171d1d85b7..de3050c02b6f 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
@@ -239,29 +239,6 @@ static const struct iavf_stats iavf_gstrings_stats[] = {
#define IAVF_QUEUE_STATS_LEN ARRAY_SIZE(iavf_gstrings_queue_stats)
-/* For now we have one and only one private flag and it is only defined
- * when we have support for the SKIP_CPU_SYNC DMA attribute. Instead
- * of leaving all this code sitting around empty we will strip it unless
- * our one private flag is actually available.
- */
-struct iavf_priv_flags {
- char flag_string[ETH_GSTRING_LEN];
- u32 flag;
- bool read_only;
-};
-
-#define IAVF_PRIV_FLAG(_name, _flag, _read_only) { \
- .flag_string = _name, \
- .flag = _flag, \
- .read_only = _read_only, \
-}
-
-static const struct iavf_priv_flags iavf_gstrings_priv_flags[] = {
- IAVF_PRIV_FLAG("legacy-rx", IAVF_FLAG_LEGACY_RX, 0),
-};
-
-#define IAVF_PRIV_FLAGS_STR_LEN ARRAY_SIZE(iavf_gstrings_priv_flags)
-
/**
* iavf_get_link_ksettings - Get Link Speed and Duplex settings
* @netdev: network interface device structure
@@ -341,8 +318,6 @@ static int iavf_get_sset_count(struct net_device *netdev, int sset)
return IAVF_STATS_LEN +
(IAVF_QUEUE_STATS_LEN * 2 *
netdev->real_num_tx_queues);
- else if (sset == ETH_SS_PRIV_FLAGS)
- return IAVF_PRIV_FLAGS_STR_LEN;
else
return -EINVAL;
}
@@ -384,24 +359,6 @@ static void iavf_get_ethtool_stats(struct net_device *netdev,
rcu_read_unlock();
}
-/**
- * iavf_get_priv_flag_strings - Get private flag strings
- * @netdev: network interface device structure
- * @data: buffer for string data
- *
- * Builds the private flags string table
- **/
-static void iavf_get_priv_flag_strings(struct net_device *netdev, u8 *data)
-{
- unsigned int i;
-
- for (i = 0; i < IAVF_PRIV_FLAGS_STR_LEN; i++) {
- snprintf(data, ETH_GSTRING_LEN, "%s",
- iavf_gstrings_priv_flags[i].flag_string);
- data += ETH_GSTRING_LEN;
- }
-}
-
/**
* iavf_get_stat_strings - Get stat strings
* @netdev: network interface device structure
@@ -440,105 +397,11 @@ static void iavf_get_strings(struct net_device *netdev, u32 sset, u8 *data)
case ETH_SS_STATS:
iavf_get_stat_strings(netdev, data);
break;
- case ETH_SS_PRIV_FLAGS:
- iavf_get_priv_flag_strings(netdev, data);
- break;
default:
break;
}
}
-/**
- * iavf_get_priv_flags - report device private flags
- * @netdev: network interface device structure
- *
- * The get string set count and the string set should be matched for each
- * flag returned. Add new strings for each flag to the iavf_gstrings_priv_flags
- * array.
- *
- * Returns a u32 bitmap of flags.
- **/
-static u32 iavf_get_priv_flags(struct net_device *netdev)
-{
- struct iavf_adapter *adapter = netdev_priv(netdev);
- u32 i, ret_flags = 0;
-
- for (i = 0; i < IAVF_PRIV_FLAGS_STR_LEN; i++) {
- const struct iavf_priv_flags *priv_flags;
-
- priv_flags = &iavf_gstrings_priv_flags[i];
-
- if (priv_flags->flag & adapter->flags)
- ret_flags |= BIT(i);
- }
-
- return ret_flags;
-}
-
-/**
- * iavf_set_priv_flags - set private flags
- * @netdev: network interface device structure
- * @flags: bit flags to be set
- **/
-static int iavf_set_priv_flags(struct net_device *netdev, u32 flags)
-{
- struct iavf_adapter *adapter = netdev_priv(netdev);
- u32 orig_flags, new_flags, changed_flags;
- u32 i;
-
- orig_flags = READ_ONCE(adapter->flags);
- new_flags = orig_flags;
-
- for (i = 0; i < IAVF_PRIV_FLAGS_STR_LEN; i++) {
- const struct iavf_priv_flags *priv_flags;
-
- priv_flags = &iavf_gstrings_priv_flags[i];
-
- if (flags & BIT(i))
- new_flags |= priv_flags->flag;
- else
- new_flags &= ~(priv_flags->flag);
-
- if (priv_flags->read_only &&
- ((orig_flags ^ new_flags) & ~BIT(i)))
- return -EOPNOTSUPP;
- }
-
- /* Before we finalize any flag changes, any checks which we need to
- * perform to determine if the new flags will be supported should go
- * here...
- */
-
- /* Compare and exchange the new flags into place. If we failed, that
- * is if cmpxchg returns anything but the old value, this means
- * something else must have modified the flags variable since we
- * copied it. We'll just punt with an error and log something in the
- * message buffer.
- */
- if (cmpxchg(&adapter->flags, orig_flags, new_flags) != orig_flags) {
- dev_warn(&adapter->pdev->dev,
- "Unable to update adapter->flags as it was modified by another thread...\n");
- return -EAGAIN;
- }
-
- changed_flags = orig_flags ^ new_flags;
-
- /* Process any additional changes needed as a result of flag changes.
- * The changed_flags value reflects the list of bits that were changed
- * in the code above.
- */
-
- /* issue a reset to force legacy-rx change to take effect */
- if (changed_flags & IAVF_FLAG_LEGACY_RX) {
- if (netif_running(netdev)) {
- adapter->flags |= IAVF_FLAG_RESET_NEEDED;
- queue_work(adapter->wq, &adapter->reset_task);
- }
- }
-
- return 0;
-}
-
/**
* iavf_get_msglevel - Get debug message level
* @netdev: network interface device structure
@@ -584,7 +447,6 @@ static void iavf_get_drvinfo(struct net_device *netdev,
strscpy(drvinfo->driver, iavf_driver_name, 32);
strscpy(drvinfo->fw_version, "N/A", 4);
strscpy(drvinfo->bus_info, pci_name(adapter->pdev), 32);
- drvinfo->n_priv_flags = IAVF_PRIV_FLAGS_STR_LEN;
}
/**
@@ -1969,8 +1831,6 @@ static const struct ethtool_ops iavf_ethtool_ops = {
.get_strings = iavf_get_strings,
.get_ethtool_stats = iavf_get_ethtool_stats,
.get_sset_count = iavf_get_sset_count,
- .get_priv_flags = iavf_get_priv_flags,
- .set_priv_flags = iavf_set_priv_flags,
.get_msglevel = iavf_get_msglevel,
.set_msglevel = iavf_set_msglevel,
.get_coalesce = iavf_get_coalesce,
diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index c17e909d3ff0..a5a6c9861a93 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -713,9 +713,7 @@ static void iavf_configure_rx(struct iavf_adapter *adapter)
struct iavf_hw *hw = &adapter->hw;
int i;
- /* Legacy Rx will always default to a 2048 buffer size. */
-#if (PAGE_SIZE < 8192)
- if (!(adapter->flags & IAVF_FLAG_LEGACY_RX)) {
+ if (PAGE_SIZE < 8192) {
struct net_device *netdev = adapter->netdev;
/* For jumbo frames on systems with 4K pages we have to use
@@ -732,16 +730,10 @@ static void iavf_configure_rx(struct iavf_adapter *adapter)
(netdev->mtu <= ETH_DATA_LEN))
rx_buf_len = IAVF_RXBUFFER_1536 - NET_IP_ALIGN;
}
-#endif
for (i = 0; i < adapter->num_active_queues; i++) {
adapter->rx_rings[i].tail = hw->hw_addr + IAVF_QRX_TAIL1(i);
adapter->rx_rings[i].rx_buf_len = rx_buf_len;
-
- if (adapter->flags & IAVF_FLAG_LEGACY_RX)
- clear_ring_build_skb_enabled(&adapter->rx_rings[i]);
- else
- set_ring_build_skb_enabled(&adapter->rx_rings[i]);
}
}
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index a83b96e9b6fc..a7121dc5c32b 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -824,17 +824,6 @@ static inline void iavf_release_rx_desc(struct iavf_ring *rx_ring, u32 val)
writel(val, rx_ring->tail);
}
-/**
- * iavf_rx_offset - Return expected offset into page to access data
- * @rx_ring: Ring we are requesting offset of
- *
- * Returns the offset value for ring into the data buffer.
- */
-static inline unsigned int iavf_rx_offset(struct iavf_ring *rx_ring)
-{
- return ring_uses_build_skb(rx_ring) ? IAVF_SKB_PAD : 0;
-}
-
/**
* iavf_alloc_mapped_page - recycle or make a new page
* @rx_ring: ring to use
@@ -879,7 +868,7 @@ static bool iavf_alloc_mapped_page(struct iavf_ring *rx_ring,
bi->dma = dma;
bi->page = page;
- bi->page_offset = iavf_rx_offset(rx_ring);
+ bi->page_offset = IAVF_SKB_PAD;
/* initialize pagecnt_bias to 1 representing we fully own page */
bi->pagecnt_bias = 1;
@@ -1220,7 +1209,7 @@ static void iavf_add_rx_frag(struct iavf_ring *rx_ring,
#if (PAGE_SIZE < 8192)
unsigned int truesize = iavf_rx_pg_size(rx_ring) / 2;
#else
- unsigned int truesize = SKB_DATA_ALIGN(size + iavf_rx_offset(rx_ring));
+ unsigned int truesize = SKB_DATA_ALIGN(size + IAVF_SKB_PAD);
#endif
if (!size)
@@ -1268,71 +1257,6 @@ static struct iavf_rx_buffer *iavf_get_rx_buffer(struct iavf_ring *rx_ring,
return rx_buffer;
}
-/**
- * iavf_construct_skb - Allocate skb and populate it
- * @rx_ring: rx descriptor ring to transact packets on
- * @rx_buffer: rx buffer to pull data from
- * @size: size of buffer to add to skb
- *
- * This function allocates an skb. It then populates it with the page
- * data from the current receive descriptor, taking care to set up the
- * skb correctly.
- */
-static struct sk_buff *iavf_construct_skb(struct iavf_ring *rx_ring,
- struct iavf_rx_buffer *rx_buffer,
- unsigned int size)
-{
- void *va;
-#if (PAGE_SIZE < 8192)
- unsigned int truesize = iavf_rx_pg_size(rx_ring) / 2;
-#else
- unsigned int truesize = SKB_DATA_ALIGN(size);
-#endif
- unsigned int headlen;
- struct sk_buff *skb;
-
- if (!rx_buffer)
- return NULL;
- /* prefetch first cache line of first page */
- va = page_address(rx_buffer->page) + rx_buffer->page_offset;
- net_prefetch(va);
-
- /* allocate a skb to store the frags */
- skb = __napi_alloc_skb(&rx_ring->q_vector->napi,
- IAVF_RX_HDR_SIZE,
- GFP_ATOMIC | __GFP_NOWARN);
- if (unlikely(!skb))
- return NULL;
-
- /* Determine available headroom for copy */
- headlen = size;
- if (headlen > IAVF_RX_HDR_SIZE)
- headlen = eth_get_headlen(skb->dev, va, IAVF_RX_HDR_SIZE);
-
- /* align pull length to size of long to optimize memcpy performance */
- memcpy(__skb_put(skb, headlen), va, ALIGN(headlen, sizeof(long)));
-
- /* update all of the pointers */
- size -= headlen;
- if (size) {
- skb_add_rx_frag(skb, 0, rx_buffer->page,
- rx_buffer->page_offset + headlen,
- size, truesize);
-
- /* buffer is used by skb, update page_offset */
-#if (PAGE_SIZE < 8192)
- rx_buffer->page_offset ^= truesize;
-#else
- rx_buffer->page_offset += truesize;
-#endif
- } else {
- /* buffer is unused, reset bias back to rx_buffer */
- rx_buffer->pagecnt_bias++;
- }
-
- return skb;
-}
-
/**
* iavf_build_skb - Build skb around an existing buffer
* @rx_ring: Rx descriptor ring to transact packets on
@@ -1505,10 +1429,8 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
/* retrieve a buffer from the ring */
if (skb)
iavf_add_rx_frag(rx_ring, rx_buffer, skb, size);
- else if (ring_uses_build_skb(rx_ring))
- skb = iavf_build_skb(rx_ring, rx_buffer, size);
else
- skb = iavf_construct_skb(rx_ring, rx_buffer, size);
+ skb = iavf_build_skb(rx_ring, rx_buffer, size);
/* exit if we failed to retrieve a buffer */
if (!skb) {
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.h b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
index 2624bf6d009e..234e189c1987 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.h
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
@@ -362,7 +362,8 @@ struct iavf_ring {
u16 flags;
#define IAVF_TXR_FLAGS_WB_ON_ITR BIT(0)
-#define IAVF_RXR_FLAGS_BUILD_SKB_ENABLED BIT(1)
+/* BIT(1) is free, was IAVF_RXR_FLAGS_BUILD_SKB_ENABLED */
+/* BIT(2) is free */
#define IAVF_TXRX_FLAGS_VLAN_TAG_LOC_L2TAG1 BIT(3)
#define IAVF_TXR_FLAGS_VLAN_TAG_LOC_L2TAG2 BIT(4)
#define IAVF_RXR_FLAGS_VLAN_TAG_LOC_L2TAG2_2 BIT(5)
@@ -393,21 +394,6 @@ struct iavf_ring {
*/
} ____cacheline_internodealigned_in_smp;
-static inline bool ring_uses_build_skb(struct iavf_ring *ring)
-{
- return !!(ring->flags & IAVF_RXR_FLAGS_BUILD_SKB_ENABLED);
-}
-
-static inline void set_ring_build_skb_enabled(struct iavf_ring *ring)
-{
- ring->flags |= IAVF_RXR_FLAGS_BUILD_SKB_ENABLED;
-}
-
-static inline void clear_ring_build_skb_enabled(struct iavf_ring *ring)
-{
- ring->flags &= ~IAVF_RXR_FLAGS_BUILD_SKB_ENABLED;
-}
-
#define IAVF_ITR_ADAPTIVE_MIN_INC 0x0002
#define IAVF_ITR_ADAPTIVE_MIN_USECS 0x0002
#define IAVF_ITR_ADAPTIVE_MAX_USECS 0x007e
diff --git a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
index 7c0578b5457b..fdddc3588487 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
@@ -290,8 +290,7 @@ void iavf_configure_queues(struct iavf_adapter *adapter)
return;
/* Limit maximum frame size when jumbo frames is not enabled */
- if (!(adapter->flags & IAVF_FLAG_LEGACY_RX) &&
- (adapter->netdev->mtu <= ETH_DATA_LEN))
+ if (adapter->netdev->mtu <= ETH_DATA_LEN)
max_frame = IAVF_RXBUFFER_1536 - NET_IP_ALIGN;
vqci->vsi_id = adapter->vsi_res->vsi_id;
--
2.40.1
Now that the IAVF driver simply uses dev_alloc_page() + free_page() with
no custom recycling logic and one whole page per frame, it can easily
be switched to the Page Pool API instead.
Introduce libie_rx_page_pool_create(), a wrapper for creating a PP with
the default libie settings applicable to all Intel hardware, and replace
the alloc/free calls with the corresponding PP functions, including the
newly added sync-for-CPU helpers. Use skb_mark_for_recycle() to bring
back the recycling and restore the initial performance.
Among the important object code changes, it's worth mentioning that
__iavf_alloc_rx_pages() is now inlined due to its greatly reduced size.
The resulting driver is on par with the pre-series code and 1-2% slower
than the "optimized" version right before the recycling removal.
But the number of LoCs and object code bytes slaughtered is much more
important here after all, not to mention that there's still plenty of
room for optimization and improvements.
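For readers less familiar with the recycling idea, here is a toy userspace
model of it. It only illustrates the principle (returned pages go into a
small cache and get handed out again instead of hitting the allocator) and
does not reflect how Page Pool is implemented. All names (struct
toy_rx_cache, toy_rx_alloc(), toy_rx_put(), toy_rx_destroy()) and the
cache size are hypothetical, and malloc()/free() stand in for the page
allocator.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_CACHE_SIZE	4
#define TOY_PAGE_SIZE	4096

/* Hypothetical recycling cache, zero-initialize before use */
struct toy_rx_cache {
	void *cache[TOY_CACHE_SIZE];
	unsigned int cached;		/* pages currently sitting in the cache */
	unsigned int fresh_allocs;	/* times the allocator had to be called */
	unsigned int recycled;		/* times a cached page was reused */
};

/* Hand out a cached page if there is one, allocate a new one otherwise */
static void *toy_rx_alloc(struct toy_rx_cache *pool)
{
	if (pool->cached) {
		pool->recycled++;
		return pool->cache[--pool->cached];
	}

	pool->fresh_allocs++;
	return malloc(TOY_PAGE_SIZE);
}

/* Return a page: keep it for reuse if there's room, free it otherwise */
static void toy_rx_put(struct toy_rx_cache *pool, void *page)
{
	assert(page);

	if (pool->cached < TOY_CACHE_SIZE) {
		pool->cache[pool->cached++] = page;
		return;
	}

	free(page);
}

/* Release everything still sitting in the cache */
static void toy_rx_destroy(struct toy_rx_cache *pool)
{
	while (pool->cached)
		free(pool->cache[--pool->cached]);
}
```

In this model, an alloc -> put -> alloc sequence calls the allocator only
once, and the second alloc returns the very same page. That is the effect
skb_mark_for_recycle() is meant to bring back for pages attached to skbs.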
Signed-off-by: Alexander Lobakin <[email protected]>
---
drivers/net/ethernet/intel/Kconfig | 1 +
drivers/net/ethernet/intel/iavf/iavf_txrx.c | 126 +++++---------------
drivers/net/ethernet/intel/iavf/iavf_txrx.h | 8 +-
drivers/net/ethernet/intel/libie/rx.c | 28 +++++
include/linux/net/intel/libie/rx.h | 5 +-
5 files changed, 69 insertions(+), 99 deletions(-)
diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
index cec4a938fbd0..a368afc42b8d 100644
--- a/drivers/net/ethernet/intel/Kconfig
+++ b/drivers/net/ethernet/intel/Kconfig
@@ -86,6 +86,7 @@ config E1000E_HWTS
config LIBIE
tristate
+ select PAGE_POOL
help
libie (Intel Ethernet library) is a common library containing
routines shared by several Intel Ethernet drivers.
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index c33a3d681c83..1de67a70f045 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -3,7 +3,6 @@
#include <linux/net/intel/libie/rx.h>
#include <linux/prefetch.h>
-#include <net/page_pool.h>
#include "iavf.h"
#include "iavf_trace.h"
@@ -691,8 +690,6 @@ int iavf_setup_tx_descriptors(struct iavf_ring *tx_ring)
**/
void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
{
- u16 i;
-
/* ring already cleared, nothing to do */
if (!rx_ring->rx_pages)
return;
@@ -703,28 +700,17 @@ void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
}
/* Free all the Rx ring sk_buffs */
- for (i = 0; i < rx_ring->count; i++) {
+ for (u32 i = 0; i < rx_ring->count; i++) {
struct page *page = rx_ring->rx_pages[i];
- dma_addr_t dma;
if (!page)
continue;
- dma = page_pool_get_dma_addr(page);
-
/* Invalidate cache lines that may have been written to by
* device so that we avoid corrupting memory.
*/
- dma_sync_single_range_for_cpu(rx_ring->dev, dma,
- LIBIE_SKB_HEADROOM,
- LIBIE_RX_BUF_LEN,
- DMA_FROM_DEVICE);
-
- /* free resources associated with mapping */
- dma_unmap_page_attrs(rx_ring->dev, dma, LIBIE_RX_TRUESIZE,
- DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
-
- __free_page(page);
+ page_pool_dma_sync_full_for_cpu(rx_ring->pool, page);
+ page_pool_put_full_page(rx_ring->pool, page, false);
}
rx_ring->next_to_clean = 0;
@@ -739,10 +725,15 @@ void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
**/
void iavf_free_rx_resources(struct iavf_ring *rx_ring)
{
+ struct device *dev = rx_ring->pool->p.dev;
+
iavf_clean_rx_ring(rx_ring);
kfree(rx_ring->rx_pages);
rx_ring->rx_pages = NULL;
+ page_pool_destroy(rx_ring->pool);
+ rx_ring->dev = dev;
+
if (rx_ring->desc) {
dma_free_coherent(rx_ring->dev, rx_ring->size,
rx_ring->desc, rx_ring->dma);
@@ -759,13 +750,15 @@ void iavf_free_rx_resources(struct iavf_ring *rx_ring)
int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
{
struct device *dev = rx_ring->dev;
+ struct page_pool *pool;
+ int ret = -ENOMEM;
/* warn if we are about to overwrite the pointer */
WARN_ON(rx_ring->rx_pages);
rx_ring->rx_pages = kcalloc(rx_ring->count, sizeof(*rx_ring->rx_pages),
GFP_KERNEL);
if (!rx_ring->rx_pages)
- return -ENOMEM;
+ return ret;
u64_stats_init(&rx_ring->syncp);
@@ -781,15 +774,27 @@ int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
goto err;
}
+ pool = libie_rx_page_pool_create(&rx_ring->q_vector->napi,
+ rx_ring->count);
+ if (IS_ERR(pool)) {
+ ret = PTR_ERR(pool);
+ goto err_free_dma;
+ }
+
+ rx_ring->pool = pool;
+
rx_ring->next_to_clean = 0;
rx_ring->next_to_use = 0;
return 0;
+
+err_free_dma:
+ dma_free_coherent(dev, rx_ring->size, rx_ring->desc, rx_ring->dma);
err:
kfree(rx_ring->rx_pages);
rx_ring->rx_pages = NULL;
- return -ENOMEM;
+ return ret;
}
/**
@@ -810,40 +815,6 @@ static inline void iavf_release_rx_desc(struct iavf_ring *rx_ring, u32 val)
writel(val, rx_ring->tail);
}
-/**
- * iavf_alloc_mapped_page - allocate and map a new page
- * @dev: device used for DMA mapping
- * @gfp: GFP mask to allocate page
- *
- * Returns a new &page if the it was successfully allocated, %NULL otherwise.
- **/
-static struct page *iavf_alloc_mapped_page(struct device *dev, gfp_t gfp)
-{
- struct page *page;
- dma_addr_t dma;
-
- /* alloc new page for storage */
- page = __dev_alloc_page(gfp);
- if (unlikely(!page))
- return NULL;
-
- /* map page for use */
- dma = dma_map_page_attrs(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE,
- IAVF_RX_DMA_ATTR);
-
- /* if mapping failed free memory back to system since
- * there isn't much point in holding memory we can't use
- */
- if (dma_mapping_error(dev, dma)) {
- __free_page(page);
- return NULL;
- }
-
- page_pool_set_dma_addr(page, dma);
-
- return page;
-}
-
/**
* iavf_receive_skb - Send a completed packet up the stack
* @rx_ring: rx ring in play
@@ -877,7 +848,7 @@ static void iavf_receive_skb(struct iavf_ring *rx_ring,
static u32 __iavf_alloc_rx_pages(struct iavf_ring *rx_ring, u32 to_refill,
gfp_t gfp)
{
- struct device *dev = rx_ring->dev;
+ struct page_pool *pool = rx_ring->pool;
u32 ntu = rx_ring->next_to_use;
union iavf_rx_desc *rx_desc;
@@ -891,7 +862,7 @@ static u32 __iavf_alloc_rx_pages(struct iavf_ring *rx_ring, u32 to_refill,
struct page *page;
dma_addr_t dma;
- page = iavf_alloc_mapped_page(dev, gfp);
+ page = page_pool_alloc_pages(pool, gfp);
if (!page) {
rx_ring->rx_stats.alloc_page_failed++;
break;
@@ -900,11 +871,6 @@ static u32 __iavf_alloc_rx_pages(struct iavf_ring *rx_ring, u32 to_refill,
rx_ring->rx_pages[ntu] = page;
dma = page_pool_get_dma_addr(page);
- /* sync the buffer for use by the device */
- dma_sync_single_range_for_device(dev, dma, LIBIE_SKB_HEADROOM,
- LIBIE_RX_BUF_LEN,
- DMA_FROM_DEVICE);
-
/* Refresh the desc even if buffer_addrs didn't change
* because each write-back erases this info.
*/
@@ -1091,21 +1057,6 @@ static void iavf_add_rx_frag(struct sk_buff *skb, struct page *page, u32 size)
LIBIE_SKB_HEADROOM, size, LIBIE_RX_TRUESIZE);
}
-/**
- * iavf_sync_rx_page - Synchronize received data for use
- * @dev: device used for DMA mapping
- * @page: Rx page containing the data
- * @size: size of the received data
- *
- * This function will synchronize the Rx buffer for use by the CPU.
- */
-static void iavf_sync_rx_page(struct device *dev, struct page *page, u32 size)
-{
- dma_sync_single_range_for_cpu(dev, page_pool_get_dma_addr(page),
- LIBIE_SKB_HEADROOM, size,
- DMA_FROM_DEVICE);
-}
-
/**
* iavf_build_skb - Build skb around an existing buffer
* @page: Rx page to with the data
@@ -1128,6 +1079,8 @@ static struct sk_buff *iavf_build_skb(struct page *page, u32 size)
if (unlikely(!skb))
return NULL;
+ skb_mark_for_recycle(skb);
+
/* update pointers within the skb to store the data */
skb_reserve(skb, LIBIE_SKB_HEADROOM);
__skb_put(skb, size);
@@ -1135,19 +1088,6 @@ static struct sk_buff *iavf_build_skb(struct page *page, u32 size)
return skb;
}
-/**
- * iavf_unmap_rx_page - Unmap used page
- * @dev: device used for DMA mapping
- * @page: page to release
- */
-static void iavf_unmap_rx_page(struct device *dev, struct page *page)
-{
- dma_unmap_page_attrs(dev, page_pool_get_dma_addr(page),
- LIBIE_RX_TRUESIZE, DMA_FROM_DEVICE,
- IAVF_RX_DMA_ATTR);
- page_pool_set_dma_addr(page, 0);
-}
-
/**
* iavf_is_non_eop - process handling of non-EOP buffers
* @rx_ring: Rx ring being processed
@@ -1190,8 +1130,8 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
unsigned int total_rx_bytes = 0, total_rx_packets = 0;
const gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN;
u32 to_refill = IAVF_DESC_UNUSED(rx_ring);
+ struct page_pool *pool = rx_ring->pool;
struct sk_buff *skb = rx_ring->skb;
- struct device *dev = rx_ring->dev;
u32 ntc = rx_ring->next_to_clean;
u32 ring_size = rx_ring->count;
u32 cleaned_count = 0;
@@ -1240,13 +1180,11 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
* stripped by the HW.
*/
if (unlikely(!size)) {
- iavf_unmap_rx_page(dev, page);
- __free_page(page);
+ page_pool_recycle_direct(pool, page);
goto skip_data;
}
- iavf_sync_rx_page(dev, page, size);
- iavf_unmap_rx_page(dev, page);
+ page_pool_dma_sync_for_cpu(pool, page, size);
/* retrieve a buffer from the ring */
if (skb)
@@ -1256,7 +1194,7 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
/* exit if we failed to retrieve a buffer */
if (!skb) {
- __free_page(page);
+ page_pool_put_page(pool, page, size, true);
rx_ring->rx_stats.alloc_buff_failed++;
break;
}
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.h b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
index 1421e90c7c4e..8fbe549ce6a5 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.h
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
@@ -83,9 +83,6 @@ enum iavf_dyn_idx_t {
#define iavf_rx_desc iavf_32byte_rx_desc
-#define IAVF_RX_DMA_ATTR \
- (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING)
-
/**
* iavf_test_staterr - tests bits in Rx descriptor status and error fields
* @rx_desc: pointer to receive descriptor (in le64 format)
@@ -240,7 +237,10 @@ struct iavf_rx_queue_stats {
struct iavf_ring {
struct iavf_ring *next; /* pointer to next ring in q_vector */
void *desc; /* Descriptor ring memory */
- struct device *dev; /* Used for DMA mapping */
+ union {
+ struct page_pool *pool; /* Used for Rx page management */
+ struct device *dev; /* Used for DMA mapping on Tx */
+ };
struct net_device *netdev; /* netdev ring maps to */
union {
struct iavf_tx_buffer *tx_bi;
diff --git a/drivers/net/ethernet/intel/libie/rx.c b/drivers/net/ethernet/intel/libie/rx.c
index f503476d8eef..d68eab76593c 100644
--- a/drivers/net/ethernet/intel/libie/rx.c
+++ b/drivers/net/ethernet/intel/libie/rx.c
@@ -105,6 +105,34 @@ const struct libie_rx_ptype_parsed libie_rx_ptype_lut[LIBIE_RX_PTYPE_NUM] = {
};
EXPORT_SYMBOL_NS_GPL(libie_rx_ptype_lut, LIBIE);
+/* Page Pool */
+
+/**
+ * libie_rx_page_pool_create - create a PP with the default libie settings
+ * @napi: &napi_struct covering this PP (no usage outside its poll loops)
+ * @size: size of the PP, usually simply Rx queue len
+ *
+ * Returns &page_pool on success, casted -errno on failure.
+ */
+struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
+ u32 size)
+{
+ const struct page_pool_params pp = {
+ .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
+ .order = LIBIE_RX_PAGE_ORDER,
+ .pool_size = size,
+ .nid = NUMA_NO_NODE,
+ .dev = napi->dev->dev.parent,
+ .napi = napi,
+ .dma_dir = DMA_FROM_DEVICE,
+ .max_len = LIBIE_RX_BUF_LEN,
+ .offset = LIBIE_SKB_HEADROOM,
+ };
+
+ return page_pool_create(&pp);
+}
+EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);
+
MODULE_AUTHOR("Intel Corporation");
MODULE_DESCRIPTION("Intel(R) Ethernet common library");
MODULE_LICENSE("GPL");
diff --git a/include/linux/net/intel/libie/rx.h b/include/linux/net/intel/libie/rx.h
index 3e8d0d5206e1..b86cadd281f1 100644
--- a/include/linux/net/intel/libie/rx.h
+++ b/include/linux/net/intel/libie/rx.h
@@ -5,7 +5,7 @@
#define __LIBIE_RX_H
#include <linux/if_vlan.h>
-#include <linux/netdevice.h>
+#include <net/page_pool.h>
/* O(1) converting i40e/ice/iavf's 8/10-bit hardware packet type to a parsed
* bitfield struct.
@@ -160,4 +160,7 @@ static inline void libie_skb_set_hash(struct sk_buff *skb, u32 hash,
/* Maximum frame size minus LL overhead */
#define LIBIE_MAX_MTU (LIBIE_MAX_RX_FRM_LEN - LIBIE_RX_LL_LEN)
+struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
+ u32 size);
+
#endif /* __LIBIE_RX_H */
--
2.40.1
On Tue, 2023-05-30 at 17:00 +0200, Alexander Lobakin wrote:
> Now that the IAVF driver simply uses dev_alloc_page() + free_page() with
> no custom recycling logics and one whole page per frame, it can easily
> be switched to using Page Pool API instead.
> Introduce libie_rx_page_pool_create(), a wrapper for creating a PP with
> the default libie settings applicable to all Intel hardware, and replace
> the alloc/free calls with the corresponding PP functions, including the
> newly added sync-for-CPU helpers. Use skb_mark_for_recycle() to bring
> back the recycling and restore the initial performance.
>
> From the important object code changes, worth mentioning that
> __iavf_alloc_rx_pages() is now inlined due to the greatly reduced size.
> The resulting driver is on par with the pre-series code and 1-2% slower
> than the "optimized" version right before the recycling removal.
> But the number of locs and object code bytes slaughtered is much more
> important here after all, not speaking of that there's still a vast
> space for optimization and improvements.
>
> Signed-off-by: Alexander Lobakin <[email protected]>
> ---
> drivers/net/ethernet/intel/Kconfig | 1 +
> drivers/net/ethernet/intel/iavf/iavf_txrx.c | 126 +++++---------------
> drivers/net/ethernet/intel/iavf/iavf_txrx.h | 8 +-
> drivers/net/ethernet/intel/libie/rx.c | 28 +++++
> include/linux/net/intel/libie/rx.h | 5 +-
> 5 files changed, 69 insertions(+), 99 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
> index cec4a938fbd0..a368afc42b8d 100644
> --- a/drivers/net/ethernet/intel/Kconfig
> +++ b/drivers/net/ethernet/intel/Kconfig
> @@ -86,6 +86,7 @@ config E1000E_HWTS
>
> config LIBIE
> tristate
> + select PAGE_POOL
> help
> libie (Intel Ethernet library) is a common library containing
> routines shared by several Intel Ethernet drivers.
> diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
> index c33a3d681c83..1de67a70f045 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
> +++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
> @@ -3,7 +3,6 @@
>
> #include <linux/net/intel/libie/rx.h>
> #include <linux/prefetch.h>
> -#include <net/page_pool.h>
>
> #include "iavf.h"
> #include "iavf_trace.h"
> @@ -691,8 +690,6 @@ int iavf_setup_tx_descriptors(struct iavf_ring *tx_ring)
> **/
> void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
> {
> - u16 i;
> -
> /* ring already cleared, nothing to do */
> if (!rx_ring->rx_pages)
> return;
> @@ -703,28 +700,17 @@ void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
> }
>
> /* Free all the Rx ring sk_buffs */
> - for (i = 0; i < rx_ring->count; i++) {
> + for (u32 i = 0; i < rx_ring->count; i++) {
Did we make a change to our coding style to allow declaration of
variables inside of for statements? Just wondering if this is a change
since the recent updates to the ISO C standard, or if this doesn't
match up with what we would expect per the coding standard.
> struct page *page = rx_ring->rx_pages[i];
> - dma_addr_t dma;
>
> if (!page)
> continue;
>
> - dma = page_pool_get_dma_addr(page);
> -
> /* Invalidate cache lines that may have been written to by
> * device so that we avoid corrupting memory.
> */
> - dma_sync_single_range_for_cpu(rx_ring->dev, dma,
> - LIBIE_SKB_HEADROOM,
> - LIBIE_RX_BUF_LEN,
> - DMA_FROM_DEVICE);
> -
> - /* free resources associated with mapping */
> - dma_unmap_page_attrs(rx_ring->dev, dma, LIBIE_RX_TRUESIZE,
> - DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
> -
> - __free_page(page);
> + page_pool_dma_sync_full_for_cpu(rx_ring->pool, page);
> + page_pool_put_full_page(rx_ring->pool, page, false);
> }
>
> rx_ring->next_to_clean = 0;
> @@ -739,10 +725,15 @@ void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
> **/
> void iavf_free_rx_resources(struct iavf_ring *rx_ring)
> {
> + struct device *dev = rx_ring->pool->p.dev;
> +
> iavf_clean_rx_ring(rx_ring);
> kfree(rx_ring->rx_pages);
> rx_ring->rx_pages = NULL;
>
> + page_pool_destroy(rx_ring->pool);
> + rx_ring->dev = dev;
> +
> if (rx_ring->desc) {
> dma_free_coherent(rx_ring->dev, rx_ring->size,
> rx_ring->desc, rx_ring->dma);
Not a fan of this switching back and forth between being a page pool
pointer and a dev pointer. It seems problematic, as it is easily
misinterpreted. At a minimum, I would stick to one meaning per ring
type: page_pool for Rx, dev for Tx.
> @@ -759,13 +750,15 @@ void iavf_free_rx_resources(struct iavf_ring *rx_ring)
> int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
> {
> struct device *dev = rx_ring->dev;
> + struct page_pool *pool;
> + int ret = -ENOMEM;
>
> /* warn if we are about to overwrite the pointer */
> WARN_ON(rx_ring->rx_pages);
> rx_ring->rx_pages = kcalloc(rx_ring->count, sizeof(*rx_ring->rx_pages),
> GFP_KERNEL);
> if (!rx_ring->rx_pages)
> - return -ENOMEM;
> + return ret;
>
> u64_stats_init(&rx_ring->syncp);
>
> @@ -781,15 +774,27 @@ int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
> goto err;
> }
>
> + pool = libie_rx_page_pool_create(&rx_ring->q_vector->napi,
> + rx_ring->count);
> + if (IS_ERR(pool)) {
> + ret = PTR_ERR(pool);
> + goto err_free_dma;
> + }
> +
> + rx_ring->pool = pool;
> +
> rx_ring->next_to_clean = 0;
> rx_ring->next_to_use = 0;
>
> return 0;
> +
> +err_free_dma:
> + dma_free_coherent(dev, rx_ring->size, rx_ring->desc, rx_ring->dma);
> err:
> kfree(rx_ring->rx_pages);
> rx_ring->rx_pages = NULL;
>
> - return -ENOMEM;
> + return ret;
> }
>
> /**
This setup works for iavf. However, for i40e/ice you may run into issues:
as I recall, the setup_rx_descriptors call is also used to set up the
ethtool loopback test without a napi struct, so there may not be a
q_vector.
> @@ -810,40 +815,6 @@ static inline void iavf_release_rx_desc(struct iavf_ring *rx_ring, u32 val)
> writel(val, rx_ring->tail);
> }
>
> -/**
> - * iavf_alloc_mapped_page - allocate and map a new page
> - * @dev: device used for DMA mapping
> - * @gfp: GFP mask to allocate page
> - *
> - * Returns a new &page if the it was successfully allocated, %NULL otherwise.
> - **/
> -static struct page *iavf_alloc_mapped_page(struct device *dev, gfp_t gfp)
> -{
> - struct page *page;
> - dma_addr_t dma;
> -
> - /* alloc new page for storage */
> - page = __dev_alloc_page(gfp);
> - if (unlikely(!page))
> - return NULL;
> -
> - /* map page for use */
> - dma = dma_map_page_attrs(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE,
> - IAVF_RX_DMA_ATTR);
> -
> - /* if mapping failed free memory back to system since
> - * there isn't much point in holding memory we can't use
> - */
> - if (dma_mapping_error(dev, dma)) {
> - __free_page(page);
> - return NULL;
> - }
> -
> - page_pool_set_dma_addr(page, dma);
> -
> - return page;
> -}
> -
> /**
> * iavf_receive_skb - Send a completed packet up the stack
> * @rx_ring: rx ring in play
> @@ -877,7 +848,7 @@ static void iavf_receive_skb(struct iavf_ring *rx_ring,
> static u32 __iavf_alloc_rx_pages(struct iavf_ring *rx_ring, u32 to_refill,
> gfp_t gfp)
> {
> - struct device *dev = rx_ring->dev;
> + struct page_pool *pool = rx_ring->pool;
> u32 ntu = rx_ring->next_to_use;
> union iavf_rx_desc *rx_desc;
>
> @@ -891,7 +862,7 @@ static u32 __iavf_alloc_rx_pages(struct iavf_ring *rx_ring, u32 to_refill,
> struct page *page;
> dma_addr_t dma;
>
> - page = iavf_alloc_mapped_page(dev, gfp);
> + page = page_pool_alloc_pages(pool, gfp);
> if (!page) {
> rx_ring->rx_stats.alloc_page_failed++;
> break;
> @@ -900,11 +871,6 @@ static u32 __iavf_alloc_rx_pages(struct iavf_ring *rx_ring, u32 to_refill,
> rx_ring->rx_pages[ntu] = page;
> dma = page_pool_get_dma_addr(page);
>
> - /* sync the buffer for use by the device */
> - dma_sync_single_range_for_device(dev, dma, LIBIE_SKB_HEADROOM,
> - LIBIE_RX_BUF_LEN,
> - DMA_FROM_DEVICE);
> -
> /* Refresh the desc even if buffer_addrs didn't change
> * because each write-back erases this info.
> */
> @@ -1091,21 +1057,6 @@ static void iavf_add_rx_frag(struct sk_buff *skb, struct page *page, u32 size)
> LIBIE_SKB_HEADROOM, size, LIBIE_RX_TRUESIZE);
> }
>
> -/**
> - * iavf_sync_rx_page - Synchronize received data for use
> - * @dev: device used for DMA mapping
> - * @page: Rx page containing the data
> - * @size: size of the received data
> - *
> - * This function will synchronize the Rx buffer for use by the CPU.
> - */
> -static void iavf_sync_rx_page(struct device *dev, struct page *page, u32 size)
> -{
> - dma_sync_single_range_for_cpu(dev, page_pool_get_dma_addr(page),
> - LIBIE_SKB_HEADROOM, size,
> - DMA_FROM_DEVICE);
> -}
> -
> /**
> * iavf_build_skb - Build skb around an existing buffer
> * @page: Rx page to with the data
> @@ -1128,6 +1079,8 @@ static struct sk_buff *iavf_build_skb(struct page *page, u32 size)
> if (unlikely(!skb))
> return NULL;
>
> + skb_mark_for_recycle(skb);
> +
> /* update pointers within the skb to store the data */
> skb_reserve(skb, LIBIE_SKB_HEADROOM);
> __skb_put(skb, size);
> @@ -1135,19 +1088,6 @@ static struct sk_buff *iavf_build_skb(struct page *page, u32 size)
> return skb;
> }
>
> -/**
> - * iavf_unmap_rx_page - Unmap used page
> - * @dev: device used for DMA mapping
> - * @page: page to release
> - */
> -static void iavf_unmap_rx_page(struct device *dev, struct page *page)
> -{
> - dma_unmap_page_attrs(dev, page_pool_get_dma_addr(page),
> - LIBIE_RX_TRUESIZE, DMA_FROM_DEVICE,
> - IAVF_RX_DMA_ATTR);
> - page_pool_set_dma_addr(page, 0);
> -}
> -
> /**
> * iavf_is_non_eop - process handling of non-EOP buffers
> * @rx_ring: Rx ring being processed
> @@ -1190,8 +1130,8 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
> unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> const gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN;
> u32 to_refill = IAVF_DESC_UNUSED(rx_ring);
> + struct page_pool *pool = rx_ring->pool;
> struct sk_buff *skb = rx_ring->skb;
> - struct device *dev = rx_ring->dev;
> u32 ntc = rx_ring->next_to_clean;
> u32 ring_size = rx_ring->count;
> u32 cleaned_count = 0;
> @@ -1240,13 +1180,11 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
> * stripped by the HW.
> */
> if (unlikely(!size)) {
> - iavf_unmap_rx_page(dev, page);
> - __free_page(page);
> + page_pool_recycle_direct(pool, page);
> goto skip_data;
> }
>
> - iavf_sync_rx_page(dev, page, size);
> - iavf_unmap_rx_page(dev, page);
> + page_pool_dma_sync_for_cpu(pool, page, size);
>
> /* retrieve a buffer from the ring */
> if (skb)
> @@ -1256,7 +1194,7 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
>
> /* exit if we failed to retrieve a buffer */
> if (!skb) {
> - __free_page(page);
> + page_pool_put_page(pool, page, size, true);
> rx_ring->rx_stats.alloc_buff_failed++;
> break;
> }
> diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.h b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
> index 1421e90c7c4e..8fbe549ce6a5 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_txrx.h
> +++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
> @@ -83,9 +83,6 @@ enum iavf_dyn_idx_t {
>
> #define iavf_rx_desc iavf_32byte_rx_desc
>
> -#define IAVF_RX_DMA_ATTR \
> - (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING)
> -
> /**
> * iavf_test_staterr - tests bits in Rx descriptor status and error fields
> * @rx_desc: pointer to receive descriptor (in le64 format)
> @@ -240,7 +237,10 @@ struct iavf_rx_queue_stats {
> struct iavf_ring {
> struct iavf_ring *next; /* pointer to next ring in q_vector */
> void *desc; /* Descriptor ring memory */
> - struct device *dev; /* Used for DMA mapping */
> + union {
> + struct page_pool *pool; /* Used for Rx page management */
> + struct device *dev; /* Used for DMA mapping on Tx */
> + };
> struct net_device *netdev; /* netdev ring maps to */
> union {
> struct iavf_tx_buffer *tx_bi;
Would it make more sense to have the page pool in the q_vector rather
than the ring? Essentially the page pool is associated per napi
instance so it seems like it would make more sense to store it with the
napi struct rather than potentially have multiple instances per napi.
> diff --git a/drivers/net/ethernet/intel/libie/rx.c b/drivers/net/ethernet/intel/libie/rx.c
> index f503476d8eef..d68eab76593c 100644
> --- a/drivers/net/ethernet/intel/libie/rx.c
> +++ b/drivers/net/ethernet/intel/libie/rx.c
> @@ -105,6 +105,34 @@ const struct libie_rx_ptype_parsed libie_rx_ptype_lut[LIBIE_RX_PTYPE_NUM] = {
> };
> EXPORT_SYMBOL_NS_GPL(libie_rx_ptype_lut, LIBIE);
>
> +/* Page Pool */
> +
> +/**
> + * libie_rx_page_pool_create - create a PP with the default libie settings
> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
> + * @size: size of the PP, usually simply Rx queue len
> + *
> + * Returns &page_pool on success, casted -errno on failure.
> + */
> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
> + u32 size)
> +{
> + const struct page_pool_params pp = {
> + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
> + .order = LIBIE_RX_PAGE_ORDER,
> + .pool_size = size,
> + .nid = NUMA_NO_NODE,
> + .dev = napi->dev->dev.parent,
> + .napi = napi,
> + .dma_dir = DMA_FROM_DEVICE,
> + .max_len = LIBIE_RX_BUF_LEN,
> + .offset = LIBIE_SKB_HEADROOM,
> + };
> +
> + return page_pool_create(&pp);
> +}
> +EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);
> +
> MODULE_AUTHOR("Intel Corporation");
> MODULE_DESCRIPTION("Intel(R) Ethernet common library");
> MODULE_LICENSE("GPL");
> diff --git a/include/linux/net/intel/libie/rx.h b/include/linux/net/intel/libie/rx.h
> index 3e8d0d5206e1..b86cadd281f1 100644
> --- a/include/linux/net/intel/libie/rx.h
> +++ b/include/linux/net/intel/libie/rx.h
> @@ -5,7 +5,7 @@
> #define __LIBIE_RX_H
>
> #include <linux/if_vlan.h>
> -#include <linux/netdevice.h>
> +#include <net/page_pool.h>
>
> /* O(1) converting i40e/ice/iavf's 8/10-bit hardware packet type to a parsed
> * bitfield struct.
> @@ -160,4 +160,7 @@ static inline void libie_skb_set_hash(struct sk_buff *skb, u32 hash,
> /* Maximum frame size minus LL overhead */
> #define LIBIE_MAX_MTU (LIBIE_MAX_RX_FRM_LEN - LIBIE_RX_LL_LEN)
>
> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
> + u32 size);
> +
> #endif /* __LIBIE_RX_H */
From: Alexander H Duyck <[email protected]>
Date: Wed, 31 May 2023 09:19:06 -0700
> On Tue, 2023-05-30 at 17:00 +0200, Alexander Lobakin wrote:
>> Now that the IAVF driver simply uses dev_alloc_page() + free_page() with
>> no custom recycling logics and one whole page per frame, it can easily
>> be switched to using Page Pool API instead.
[...]
>> @@ -691,8 +690,6 @@ int iavf_setup_tx_descriptors(struct iavf_ring *tx_ring)
>> **/
>> void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
>> {
>> - u16 i;
>> -
>> /* ring already cleared, nothing to do */
>> if (!rx_ring->rx_pages)
>> return;
>> @@ -703,28 +700,17 @@ void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
>> }
>>
>> /* Free all the Rx ring sk_buffs */
>> - for (i = 0; i < rx_ring->count; i++) {
>> + for (u32 i = 0; i < rx_ring->count; i++) {
>
> Did we make a change to our coding style to allow declaration of
> variables inside of for statements? Just wondering if this is a change
> since the recent updates to the ISO C standard, or if this doesn't
> match up with what we would expect per the coding standard.
It's optional right now; nobody would object to declaring it either way.
Doing it inside is allowed since we switched to C11, right.
Here I did that because my heart was breaking to see this little u16
alone (and yeah, u16 on the stack).
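For reference, a minimal userspace sketch of the two styles, assuming a
C99-or-later compiler (the kernel itself builds with -std=gnu11). The
struct demo_ring type and both count_filled_*() helpers are made-up names
for illustration only, not driver code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a ring; not the real struct iavf_ring. */
struct demo_ring {
	uint32_t count;		/* number of slots */
	const void **slots;	/* NULL means "empty slot" */
};

/* Older style: the iterator is declared at the top of the function. */
static uint32_t count_filled_old(const struct demo_ring *ring)
{
	uint32_t filled = 0;
	uint16_t i;

	/* A 16-bit iterator only terminates while count fits into 16 bits. */
	assert(ring->count <= UINT16_MAX);

	for (i = 0; i < ring->count; i++)
		if (ring->slots[i])
			filled++;

	return filled;
}

/* C99+ style: the iterator is scoped to the loop and as wide as count. */
static uint32_t count_filled_new(const struct demo_ring *ring)
{
	uint32_t filled = 0;

	for (uint32_t i = 0; i < ring->count; i++)
		if (ring->slots[i])
			filled++;

	return filled;
}
```

Both variants return the same result; the second one just keeps the
iterator out of the function scope and avoids the narrow type.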
>
>> struct page *page = rx_ring->rx_pages[i];
>> - dma_addr_t dma;
>>
>> if (!page)
>> continue;
>>
>> - dma = page_pool_get_dma_addr(page);
>> -
>> /* Invalidate cache lines that may have been written to by
>> * device so that we avoid corrupting memory.
>> */
>> - dma_sync_single_range_for_cpu(rx_ring->dev, dma,
>> - LIBIE_SKB_HEADROOM,
>> - LIBIE_RX_BUF_LEN,
>> - DMA_FROM_DEVICE);
>> -
>> - /* free resources associated with mapping */
>> - dma_unmap_page_attrs(rx_ring->dev, dma, LIBIE_RX_TRUESIZE,
>> - DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
>> -
>> - __free_page(page);
>> + page_pool_dma_sync_full_for_cpu(rx_ring->pool, page);
>> + page_pool_put_full_page(rx_ring->pool, page, false);
>> }
>>
>> rx_ring->next_to_clean = 0;
>> @@ -739,10 +725,15 @@ void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
>> **/
>> void iavf_free_rx_resources(struct iavf_ring *rx_ring)
>> {
>> + struct device *dev = rx_ring->pool->p.dev;
>> +
>> iavf_clean_rx_ring(rx_ring);
>> kfree(rx_ring->rx_pages);
>> rx_ring->rx_pages = NULL;
>>
>> + page_pool_destroy(rx_ring->pool);
>> + rx_ring->dev = dev;
>> +
>> if (rx_ring->desc) {
>> dma_free_coherent(rx_ring->dev, rx_ring->size,
>> rx_ring->desc, rx_ring->dma);
>
> Not a fan of this switching back and forth between being a page pool
> pointer and a dev pointer. It seems problematic, as it is easily
> misinterpreted. At a minimum, I would stick to one meaning per ring
> type: page_pool for Rx, dev for Tx.
The problem is that page_pool has lifetime from ifup to ifdown, while
its ring lives longer. So I had to do something with this, but also I
didn't want to have 2 pointers at the same time since it's redundant and
+8 bytes to the ring for nothing.
[...]
> This setup works for iavf. However, for i40e/ice you may run into issues:
> as I recall, the setup_rx_descriptors call is also used to set up the
> ethtool loopback test without a napi struct, so there may not be a
> q_vector.
I'll handle that. Somehow :D Thanks for noticing, I'll check whether
I should do something right now or whether it can be done later when
switching the actual mentioned drivers.
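One conceivable shape for this, purely as a userspace sketch and not what
the series does: pass the DMA device explicitly and treat the NAPI pointer
as optional, so that a ring without a q_vector (such as a loopback test
ring) can still get a pool. As far as I can tell, Page Pool accepts a NULL
napi in its params. All demo_* names below are hypothetical stand-ins, not
libie or Page Pool API:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct device and struct napi_struct. */
struct demo_device { int id; };
struct demo_napi { struct demo_device *parent; };

/* Hypothetical subset of the pool creation parameters. */
struct demo_pool_params {
	struct demo_device *dev;	/* device used for DMA mapping */
	struct demo_napi *napi;		/* optional: NULL when no q_vector */
	unsigned int pool_size;
};

/*
 * Fill the parameters for a ring. @napi may be NULL; if it is set, it
 * must belong to @dev, otherwise -EINVAL is returned.
 */
static int demo_fill_pool_params(struct demo_pool_params *pp,
				 struct demo_device *dev,
				 struct demo_napi *napi,
				 unsigned int size)
{
	/* Passing no params or no device is a caller bug in this sketch. */
	assert(pp && dev);

	if (napi && napi->parent != dev)
		return -EINVAL;

	pp->dev = dev;
	pp->napi = napi;
	pp->pool_size = size;

	return 0;
}
```

A regular queue would pass its q_vector's NAPI, while a loopback ring
would pass NULL and lose only the NAPI-bound recycling shortcut.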
[...]
>> @@ -240,7 +237,10 @@ struct iavf_rx_queue_stats {
>> struct iavf_ring {
>> struct iavf_ring *next; /* pointer to next ring in q_vector */
>> void *desc; /* Descriptor ring memory */
>> - struct device *dev; /* Used for DMA mapping */
>> + union {
>> + struct page_pool *pool; /* Used for Rx page management */
>> + struct device *dev; /* Used for DMA mapping on Tx */
>> + };
>> struct net_device *netdev; /* netdev ring maps to */
>> union {
>> struct iavf_tx_buffer *tx_bi;
>
> Would it make more sense to have the page pool in the q_vector rather
> than the ring? Essentially the page pool is associated per napi
> instance so it seems like it would make more sense to store it with the
> napi struct rather than potentially have multiple instances per napi.
As per Page Pool design, you should have it per ring. Plus you have
rxq_info (XDP-related structure), which is also per-ring and
participates in recycling in some cases. So I wouldn't complicate.
I went down the chain and haven't found any place where having more than
1 PP per NAPI would break anything. If I got it correctly, Jakub's
optimization discourages having 1 PP per several NAPIs (or scheduling
one NAPI on different CPUs), but not the other way around. The goal was
to exclude concurrent access to one PP from different threads, and here
it's impossible.
Lemme know. I can always disable NAPI optimization for cases when one
vector is shared by several queues -- and it's not a usual case for
these NICs anyway -- but I haven't found a reason for that.
[...]
Thanks,
Olek
On Fri, Jun 2, 2023 at 9:31 AM Alexander Lobakin
<[email protected]> wrote:
>
> From: Alexander H Duyck <[email protected]>
> Date: Wed, 31 May 2023 09:19:06 -0700
>
> > On Tue, 2023-05-30 at 17:00 +0200, Alexander Lobakin wrote:
> >> Now that the IAVF driver simply uses dev_alloc_page() + free_page() with
> >> no custom recycling logics and one whole page per frame, it can easily
> >> be switched to using Page Pool API instead.
>
> [...]
>
> >> @@ -691,8 +690,6 @@ int iavf_setup_tx_descriptors(struct iavf_ring *tx_ring)
> >> **/
> >> void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
> >> {
> >> - u16 i;
> >> -
> >> /* ring already cleared, nothing to do */
> >> if (!rx_ring->rx_pages)
> >> return;
> >> @@ -703,28 +700,17 @@ void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
> >> }
> >>
> >> /* Free all the Rx ring sk_buffs */
> >> - for (i = 0; i < rx_ring->count; i++) {
> >> + for (u32 i = 0; i < rx_ring->count; i++) {
> >
> > Did we make a change to our coding style to allow declaration of
> > variables inside of for statements? Just wondering if this is a change
> > since the recent updates to the ISO C standard, or if this doesn't
> > match up with what we would expect per the coding standard.
>
> It's optional right now; nobody would object to declaring it either way.
> Doing it inside is allowed since we switched to C11, right.
> Here I did that because my heart was breaking to see this little u16
> alone (and yeah, u16 on the stack).
Yeah, that was back when I was declaring stack variables with the exact
same size as the ring parameters. So the u16 should match the size of
rx_ring->count, not that it matters. It was just a quirk I had at the
time.
> >
> >> struct page *page = rx_ring->rx_pages[i];
> >> - dma_addr_t dma;
> >>
> >> if (!page)
> >> continue;
> >>
> >> - dma = page_pool_get_dma_addr(page);
> >> -
> >> /* Invalidate cache lines that may have been written to by
> >> * device so that we avoid corrupting memory.
> >> */
> >> - dma_sync_single_range_for_cpu(rx_ring->dev, dma,
> >> - LIBIE_SKB_HEADROOM,
> >> - LIBIE_RX_BUF_LEN,
> >> - DMA_FROM_DEVICE);
> >> -
> >> - /* free resources associated with mapping */
> >> - dma_unmap_page_attrs(rx_ring->dev, dma, LIBIE_RX_TRUESIZE,
> >> - DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
> >> -
> >> - __free_page(page);
> >> + page_pool_dma_sync_full_for_cpu(rx_ring->pool, page);
> >> + page_pool_put_full_page(rx_ring->pool, page, false);
> >> }
> >>
> >> rx_ring->next_to_clean = 0;
> >> @@ -739,10 +725,15 @@ void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
> >> **/
> >> void iavf_free_rx_resources(struct iavf_ring *rx_ring)
> >> {
> >> + struct device *dev = rx_ring->pool->p.dev;
> >> +
> >> iavf_clean_rx_ring(rx_ring);
> >> kfree(rx_ring->rx_pages);
> >> rx_ring->rx_pages = NULL;
> >>
> >> + page_pool_destroy(rx_ring->pool);
> >> + rx_ring->dev = dev;
> >> +
> >> if (rx_ring->desc) {
> >> dma_free_coherent(rx_ring->dev, rx_ring->size,
> >> rx_ring->desc, rx_ring->dma);
> >
> > Not a fan of this switching back and forth between being a page pool
> > pointer and a dev pointer. It seems problematic, as it is easily
> > misinterpreted. At a minimum, I would stick to one meaning per ring
> > type: page_pool for Rx, dev for Tx.
>
> The problem is that page_pool has lifetime from ifup to ifdown, while
> its ring lives longer. So I had to do something with this, but also I
> didn't want to have 2 pointers at the same time since it's redundant and
> +8 bytes to the ring for nothing.
It might be better to just go with NULL rather than populating it w/
two different possible values. Then at least you know if it is an
rx_ring it is a page_pool and if it is a tx_ring it is dev. You can
reset to the page pool when you repopulate the rest of the ring.
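A userspace sketch of that suggestion, not driver code: on an Rx ring the
shared slot only ever holds a pool or NULL, and the device is taken from
somewhere else. Every demo_* name is made up, and the `parent` member is
an assumption standing in for a device the ring can already reach through
another pointer it carries (its netdev, for example):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct device and struct page_pool. */
struct demo_device { int id; };
struct demo_pool { struct demo_device *dev; };

struct demo_ring {
	union {
		struct demo_pool *pool;		/* Rx rings: pool or NULL */
		struct demo_device *dev;	/* Tx rings: always the device */
	};
	struct demo_device *parent;	/* assumed reachable via netdev */
};

/* ifup: create the pool; the slot must be empty at this point. */
static int demo_rx_up(struct demo_ring *ring)
{
	struct demo_pool *pool;

	assert(!ring->pool);

	pool = malloc(sizeof(*pool));
	if (!pool)
		return -1;

	pool->dev = ring->parent;
	ring->pool = pool;

	return 0;
}

/* ifdown: destroy the pool and leave NULL behind, never a device. */
static void demo_rx_down(struct demo_ring *ring)
{
	free(ring->pool);
	ring->pool = NULL;
}
```

With this shape, a reader never has to guess which union member is live
on an Rx ring: it is the pool while the interface is up and NULL otherwise.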
> > This setup works for iavf. However, for i40e/ice you may run into issues:
> > as I recall, the setup_rx_descriptors call is also used to set up the
> > ethtool loopback test without a napi struct, so there may not be a
> > q_vector.
>
> I'll handle that. Somehow :D Thanks for noticing, I'll check whether
> I should do something right now or whether it can be done later when
> switching the actual mentioned drivers.
>
> [...]
>
> >> @@ -240,7 +237,10 @@ struct iavf_rx_queue_stats {
> >> struct iavf_ring {
> >> struct iavf_ring *next; /* pointer to next ring in q_vector */
> >> void *desc; /* Descriptor ring memory */
> >> - struct device *dev; /* Used for DMA mapping */
> >> + union {
> >> + struct page_pool *pool; /* Used for Rx page management */
> >> + struct device *dev; /* Used for DMA mapping on Tx */
> >> + };
> >> struct net_device *netdev; /* netdev ring maps to */
> >> union {
> >> struct iavf_tx_buffer *tx_bi;
> >
> > Would it make more sense to have the page pool in the q_vector rather
> > than the ring? Essentially the page pool is associated per napi
> > instance so it seems like it would make more sense to store it with the
> > napi struct rather than potentially have multiple instances per napi.
>
> As per Page Pool design, you should have it per ring. Plus you have
> rxq_info (XDP-related structure), which is also per-ring and
> participates in recycling in some cases. So I wouldn't complicate.
> I went down the chain and haven't found any place where having more than
> 1 PP per NAPI would break anything. If I got it correctly, Jakub's
> optimization discourages having 1 PP per several NAPIs (or scheduling
> one NAPI on different CPUs), but not the other way around. The goal was
> to exclude concurrent access to one PP from different threads, and here
> it's impossible.
The xdp_rxq can be mapped many:1 to the page pool if I am not mistaken.
The only reason why I am a fan of trying to keep the page_pool tightly
associated with the napi instance is because the napi instance is what
essentially is guaranteeing the page_pool is consistent as it is only
accessed by that one napi instance.
> Lemme know. I can always disable NAPI optimization for cases when one
> vector is shared by several queues -- and it's not a usual case for
> these NICs anyway -- but I haven't found a reason for that.
I suppose we should be fine if we have a many-to-one mapping, though.
As you said, the issue would be if multiple NAPI instances were
accessing the same page pool.
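The mapping discussed here can be modelled in userspace as follows. This
is only a sketch under the assumption that each ring owns exactly one pool
and each pool records the single NAPI context allowed to touch it; all
demo_* types and helpers are hypothetical, not kernel structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are kernel types. */
struct demo_napi { int id; };

struct demo_pool {
	const struct demo_napi *napi;	/* sole context allowed to use it */
};

struct demo_ring {
	struct demo_ring *next;		/* next ring on the same vector */
	struct demo_pool pool;		/* one pool per ring */
};

struct demo_q_vector {
	struct demo_napi napi;
	struct demo_ring *rings;	/* several rings may share a vector */
};

/* Attach a ring to a vector and bind its pool to that vector's NAPI. */
static void demo_attach_ring(struct demo_q_vector *qv, struct demo_ring *ring)
{
	/* A pool must never end up shared between two NAPI instances. */
	assert(!ring->pool.napi || ring->pool.napi == &qv->napi);

	ring->pool.napi = &qv->napi;
	ring->next = qv->rings;
	qv->rings = ring;
}

/* True if every pool on this vector is bound to this vector's NAPI. */
static bool demo_pools_owned_by(const struct demo_q_vector *qv)
{
	for (const struct demo_ring *r = qv->rings; r; r = r->next)
		if (r->pool.napi != &qv->napi)
			return false;

	return true;
}
```

Many pools per NAPI keep the invariant intact, since each pool still has
one owner; only the reverse direction (one pool, several NAPIs) would
trip the assertion.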
From: Alexander Duyck <[email protected]>
Date: Fri, 2 Jun 2023 11:00:07 -0700
> On Fri, Jun 2, 2023 at 9:31 AM Alexander Lobakin
> <[email protected]> wrote:
[...]
>>> Not a fan of this switching back and forth between being a page pool
>>> pointer and a dev pointer. It seems problematic, as it is easily
>>> misinterpreted. At a minimum, I would stick to one meaning per ring
>>> type: page_pool for Rx, dev for Tx.
>>
>> The problem is that page_pool has lifetime from ifup to ifdown, while
>> its ring lives longer. So I had to do something with this, but also I
>> didn't want to have 2 pointers at the same time since it's redundant and
>> +8 bytes to the ring for nothing.
>
> It might be better to just go with NULL rather than populating it w/
> two different possible values. Then at least you know if it is an
> rx_ring it is a page_pool and if it is a tx_ring it is dev. You can
> reset to the page pool when you repopulate the rest of the ring.
IIRC I did that to have the struct device pointer available at the moment
of creating the page_pools. But that sounds reasonable, I'll take a look.
>
>>> This setup works for iavf. However, for i40e/ice you may run into issues:
>>> as I recall, the setup_rx_descriptors call is also used to set up the
>>> ethtool loopback test without a napi struct, so there may not be a
>>> q_vector.
>>
>> I'll handle that. Somehow :D Thanks for noticing, I'll check whether
>> I should do something right now or whether it can be done later when
>> switching the actual mentioned drivers.
>>
>> [...]
>>
>>>> @@ -240,7 +237,10 @@ struct iavf_rx_queue_stats {
>>>> struct iavf_ring {
>>>> struct iavf_ring *next; /* pointer to next ring in q_vector */
>>>> void *desc; /* Descriptor ring memory */
>>>> - struct device *dev; /* Used for DMA mapping */
>>>> + union {
>>>> + struct page_pool *pool; /* Used for Rx page management */
>>>> + struct device *dev; /* Used for DMA mapping on Tx */
>>>> + };
>>>> struct net_device *netdev; /* netdev ring maps to */
>>>> union {
>>>> struct iavf_tx_buffer *tx_bi;
>>>
>>> Would it make more sense to have the page pool in the q_vector rather
>>> than the ring? Essentially the page pool is associated per napi
>>> instance so it seems like it would make more sense to store it with the
>>> napi struct rather than potentially have multiple instances per napi.
>>
>> As per Page Pool design, you should have it per ring. Plus you have
>> rxq_info (XDP-related structure), which is also per-ring and
>> participates in recycling in some cases. So I wouldn't complicate.
>> I went down the chain and haven't found any place where having more than
>> 1 PP per NAPI would break anything. If I got it correctly, Jakub's
>> optimization discourages having 1 PP per several NAPIs (or scheduling
>> one NAPI on different CPUs), but not the other way around. The goal was
>> to exclude concurrent access to one PP from different threads, and here
>> it's impossible.
>
> The xdp_rxq can be mapped many:1 to the page pool if I am not mistaken.
>
> The only reason why I am a fan of trying to keep the page_pool tightly
> associated with the napi instance is because the napi instance is what
> essentially is guaranteeing the page_pool is consistent as it is only
> accessed by that one napi instance.
Here we can't have more than one NAPI instance accessing one page_pool,
so I did that unconditionally. I'm a fan of what you've said, too :p
>
>> Lemme know. I can always disable NAPI optimization for cases when one
>> vector is shared by several queues -- and it's not a usual case for
>> these NICs anyway -- but I haven't found a reason for that.
>
> I suppose we should be fine if we have a many-to-one mapping, though.
> As you said, the issue would be if multiple NAPI instances were
> accessing the same page pool.
Thanks,
Olek