2024-04-04 15:48:49

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH net-next v9 0/9] net: intel: start The Great Code Dedup + Page Pool for iavf

Here's a two-shot: introduce {,Intel} Ethernet common library (libeth and
libie) and switch iavf to Page Pool. Details are in the commit messages;
here's a summary:

Not a secret there's a ton of code duplication between two and more Intel
ethernet modules. Before introducing new changes, which would need to be
copied over again, start decoupling the already existing duplicate
functionality into a new module, which will be shared between several
Intel Ethernet drivers. The first name that came to my mind was
"libie" -- "Intel Ethernet common library". Also this sounds like
"lovelie" (-> one word, no "lib I E" pls) and can be expanded as
"lib Internet Explorer" :P
The "generic", pure-software part is placed separately, so that it can be
easily reused in any driver by any vendor without linking to the Intel
pre-200G guts. In a few words, it's something any modern driver does the
same way, but nobody moved it level up (yet).
The series is only the beginning. From now on, adding every new feature
or doing any good driver refactoring will remove much more lines than add
for quite some time. There's a basic roadmap with some deduplications
planned already, not speaking of that touching every line now asks:
"can I share this?". The final destination is very ambitious: have only
one unified driver for at least i40e, ice, iavf, and idpf with a struct
ops for each generation. That's never gonna happen, right? But you still
can at least try.
PP conversion for iavf lands within the same series as these two are tied
closely. libie will support Page Pool model only, so that a driver can't
use much of the lib until it's converted. iavf is only the example, the
rest will eventually be converted soon on a per-driver basis. That is
when it gets really interesting. Stay tech.

Alexander Lobakin (9):
net: intel: introduce {,Intel} Ethernet common library
iavf: kill "legacy-rx" for good
iavf: drop page splitting and recycling
slab: introduce kvmalloc_array_node() and kvcalloc_node()
page_pool: constify some read-only function arguments
page_pool: add DMA-sync-for-CPU inline helper
libeth: add Rx buffer management
iavf: pack iavf_ring more efficiently
iavf: switch to Page Pool

MAINTAINERS | 4 +-
drivers/net/ethernet/intel/Kconfig | 7 +
drivers/net/ethernet/intel/libeth/Kconfig | 9 +
drivers/net/ethernet/intel/libie/Kconfig | 10 +
drivers/net/ethernet/intel/Makefile | 3 +
drivers/net/ethernet/intel/libeth/Makefile | 6 +
drivers/net/ethernet/intel/libie/Makefile | 6 +
include/net/page_pool/types.h | 4 +-
.../net/ethernet/intel/i40e/i40e_prototype.h | 7 -
drivers/net/ethernet/intel/i40e/i40e_type.h | 88 ---
drivers/net/ethernet/intel/iavf/iavf.h | 2 +-
.../net/ethernet/intel/iavf/iavf_prototype.h | 7 -
drivers/net/ethernet/intel/iavf/iavf_txrx.h | 146 +----
drivers/net/ethernet/intel/iavf/iavf_type.h | 90 ---
.../net/ethernet/intel/ice/ice_lan_tx_rx.h | 320 ----------
include/linux/net/intel/libie/rx.h | 50 ++
include/linux/slab.h | 17 +-
include/net/libeth/rx.h | 242 ++++++++
include/net/page_pool/helpers.h | 34 +-
drivers/net/ethernet/intel/i40e/i40e_common.c | 253 --------
drivers/net/ethernet/intel/i40e/i40e_main.c | 1 +
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 72 +--
drivers/net/ethernet/intel/iavf/iavf_common.c | 253 --------
.../net/ethernet/intel/iavf/iavf_ethtool.c | 140 -----
drivers/net/ethernet/intel/iavf/iavf_main.c | 40 +-
drivers/net/ethernet/intel/iavf/iavf_txrx.c | 551 +++---------------
.../net/ethernet/intel/iavf/iavf_virtchnl.c | 17 +-
drivers/net/ethernet/intel/ice/ice_main.c | 1 +
drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 111 +---
drivers/net/ethernet/intel/libeth/rx.c | 150 +++++
drivers/net/ethernet/intel/libie/rx.c | 124 ++++
net/core/page_pool.c | 10 +-
32 files changed, 819 insertions(+), 1956 deletions(-)
create mode 100644 drivers/net/ethernet/intel/libeth/Kconfig
create mode 100644 drivers/net/ethernet/intel/libie/Kconfig
create mode 100644 drivers/net/ethernet/intel/libeth/Makefile
create mode 100644 drivers/net/ethernet/intel/libie/Makefile
create mode 100644 include/linux/net/intel/libie/rx.h
create mode 100644 include/net/libeth/rx.h
create mode 100644 drivers/net/ethernet/intel/libeth/rx.c
create mode 100644 drivers/net/ethernet/intel/libie/rx.c

---
libeth has way more generic functionality and code in the idpf XDP
tree[0], take a look if you want to have more complete picture of
what this really is about.

From v8[1]:
* rebase on top of net-next (6.9-rc1+);
* introduce kvmalloc_array_node() and kvcalloc_node();
* make Rx buffer management NUMA-aware;
* resolve kdoc issues (Jakub, me).

From v7[2]:
* drop Page Pool optimization prereqs;
* drop generic stats part: will redo to the new per-queue stats later;
* split libie into "generic" and "fnic" (i40e, ice, iavf) parts;
* use shorter and modern struct names;
* #1: allow to compile-out hotpath IPv6 code when !CONFIG_IPV6;
* #1: generate XDP RSS hash directly in the lookup table;
* #8: fix rare skb nullptr deref bug.

From v6[3]:
* #04: resolve ethtool_puts() Git conflict (Jakub);
* #06: pick RB from Ilias;
* no functional changes.

From v5[4]:
* drop Page Pool DMA shortcut: will pick up Eric's more global DMA sync
optimization[5] and expand it to cover both IOMMU and direct DMA a bit
later (Yunsheng);
* drop per-queue Page Pool Ethtool stats: they are now exported via
generic Netlink interface (Jakub);
* #01: leave a comment why exactly this alignment (Jakub, Yunsheng);
* #08: make use of page_pool_params::netdev when calculating PP params;
* #08: rename ``libie_rx_queue`` -> ``libie_buf_queue``.

From v4[6]:
* make use of Jakub's &page_pool_params split;
* #01: prevent frag fields from spanning into 2 cachelines after
splitting &page_pool_params into fast and slow;
* #02-03: bring back the DMA sync shortcut, now as a per-page flag
(me, Yunsheng);
* #04: let libie have its own Kconfig to stop further bloating of poor
intel/Kconfig;
* #06: merge page split-reuse-recycle drop into one commit (Alex);
* #07: decouple constifying of several Page Pool function arguments
into a separate commit, constify some more;
* #09: stop abusing internal PP fields in the driver code (Yunsheng);
* #09: calculate DMA sync size (::max_len) correctly: within one page,
not one buffer (Yunsheng);
* #10: decouple rearranging &iavf_ring into separate commit, optimize
it even more;
* #11: let the driver get back to the last descriptor to process after
an skb allocation fail, don't drop it (Alex);
* #11: stop touching unrelated stuff like watchdog timeout etc. (Alex);
* fix "Return:" in the kdoc (now `W=12 C=1` is clean), misc typos.

From v3[7]:
* base on the latest net-next, update bloat-o-meter and perf stats;
* split generic PP optimizations into a separate series;
* drop "optimize hotpath a bunch" commit: a lot of [controversial]
changes in one place, worth own series (Alex);
* 02: pick Rev-by (Alex);
* 03: move in-place recycling removal here from the dropped patch;
* 05: new, add libie Rx buffer API separatelly from IAVF changes;
* 05-06: use new "hybrid" allocation API from[8] to reduce memory usage
when a page can fit more than 1 truesize (also asked by David);
* 06: merge with "always use order-0 page" commit to reduce diffs and
simplify things (Alex);
* 09: fix page_alloc_fail counter.

From v2[9]:
* 0006: fix page_pool.h include in OcteonTX2 files (Jakub, Patchwork);
* no functional changes.

From v1[10]:
* 0006: new (me, Jakub);
* 0008: give the helpers more intuitive names (Jakub, Ilias);
* -^-: also expand their kdoc a bit for the same reason;
* -^-: fix kdoc copy-paste issue (Patchwork, Jakub);
* 0011: drop `inline` from C file (Patchwork, Jakub).

[0] https://github.com/alobakin/linux/commits/idpf-libie-new
[1] https://lore.kernel.org/netdev/[email protected]
[2] https://lore.kernel.org/netdev/[email protected]
[3] https://lore.kernel.org/netdev/[email protected]
[4] https://lore.kernel.org/netdev/[email protected]
[5] https://lore.kernel.org/netdev/[email protected]
[6] https://lore.kernel.org/netdev/[email protected]
[7] https://lore.kernel.org/netdev/[email protected]
[8] https://lore.kernel.org/netdev/[email protected]
[9] https://lore.kernel.org/netdev/[email protected]
[10] https://lore.kernel.org/netdev/[email protected]

--
2.44.0



2024-04-04 15:49:01

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH net-next v9 2/9] iavf: kill "legacy-rx" for good

Ever since build_skb() became stable, the old way with allocating an skb
for storing the headers separately, which will be then copied manually,
was slower, less flexible, and thus obsolete.

* It had higher pressure on MM since it actually allocates new pages,
which then get split and refcount-biased (NAPI page cache);
* It implies memcpy() of packet headers (40+ bytes per each frame);
* the actual header length was calculated via eth_get_headlen(), which
invokes Flow Dissector and thus wastes a bunch of CPU cycles;
* XDP makes it even more weird since it requires headroom for long and
also tailroom for some time (since mbuf landed). Take a look at the
ice driver, which is built around work-arounds to make XDP work with
it.

Even on some quite low-end hardware (not a common case for 100G NICs) it
was performing worse.
The only advantage "legacy-rx" had is that it didn't require any
reserved headroom and tailroom. But iavf didn't use this, as it always
splits pages into two halves of 2k, while that save would only be useful
when striding. And again, XDP effectively removes that sole pro.

There's a train of features to land in IAVF soon: Page Pool, XDP, XSk,
multi-buffer etc. Each new would require adding more and more Danse
Macabre for absolutely no reason, besides making hotpath less and less
effective.
Remove the "feature" with all the related code. This includes at least
one very hot branch (typically hit on each new frame), which was either
always-true or always-false at least for a complete NAPI bulk of 64
frames, the whole private flags cruft, and so on. Some stats:

Function: add/remove: 0/4 grow/shrink: 0/7 up/down: 0/-721 (-721)
RO Data: add/remove: 0/1 grow/shrink: 0/0 up/down: 0/-40 (-40)

Reviewed-by: Alexander Duyck <[email protected]>
Signed-off-by: Alexander Lobakin <[email protected]>
---
drivers/net/ethernet/intel/iavf/iavf.h | 2 +-
drivers/net/ethernet/intel/iavf/iavf_txrx.h | 27 +---
.../net/ethernet/intel/iavf/iavf_ethtool.c | 140 ------------------
drivers/net/ethernet/intel/iavf/iavf_main.c | 10 +-
drivers/net/ethernet/intel/iavf/iavf_txrx.c | 82 +---------
.../net/ethernet/intel/iavf/iavf_virtchnl.c | 3 +-
6 files changed, 8 insertions(+), 256 deletions(-)

diff --git a/drivers/net/ethernet/intel/iavf/iavf.h b/drivers/net/ethernet/intel/iavf/iavf.h
index db8188c7ac4b..23a6557fc3db 100644
--- a/drivers/net/ethernet/intel/iavf/iavf.h
+++ b/drivers/net/ethernet/intel/iavf/iavf.h
@@ -287,7 +287,7 @@ struct iavf_adapter {
#define IAVF_FLAG_RESET_PENDING BIT(4)
#define IAVF_FLAG_RESET_NEEDED BIT(5)
#define IAVF_FLAG_WB_ON_ITR_CAPABLE BIT(6)
-#define IAVF_FLAG_LEGACY_RX BIT(15)
+/* BIT(15) is free, was IAVF_FLAG_LEGACY_RX */
#define IAVF_FLAG_REINIT_ITR_NEEDED BIT(16)
#define IAVF_FLAG_QUEUES_DISABLED BIT(17)
#define IAVF_FLAG_SETUP_NETDEV_FEATURES BIT(18)
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.h b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
index 10ba36602c0c..68543efdd29b 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.h
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
@@ -81,20 +81,11 @@ enum iavf_dyn_idx_t {
BIT_ULL(IAVF_FILTER_PCTYPE_NONF_MULTICAST_IPV6_UDP))

/* Supported Rx Buffer Sizes (a multiple of 128) */
-#define IAVF_RXBUFFER_256 256
#define IAVF_RXBUFFER_1536 1536 /* 128B aligned standard Ethernet frame */
#define IAVF_RXBUFFER_2048 2048
#define IAVF_RXBUFFER_3072 3072 /* Used for large frames w/ padding */
#define IAVF_MAX_RXBUFFER 9728 /* largest size for single descriptor */

-/* NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN means we
- * reserve 2 more, and skb_shared_info adds an additional 384 bytes more,
- * this adds up to 512 bytes of extra data meaning the smallest allocation
- * we could have is 1K.
- * i.e. RXBUFFER_256 --> 960 byte skb (size-1024 slab)
- * i.e. RXBUFFER_512 --> 1216 byte skb (size-2048 slab)
- */
-#define IAVF_RX_HDR_SIZE IAVF_RXBUFFER_256
#define IAVF_PACKET_HDR_PAD (ETH_HLEN + ETH_FCS_LEN + (VLAN_HLEN * 2))
#define iavf_rx_desc iavf_32byte_rx_desc

@@ -361,7 +352,8 @@ struct iavf_ring {

u16 flags;
#define IAVF_TXR_FLAGS_WB_ON_ITR BIT(0)
-#define IAVF_RXR_FLAGS_BUILD_SKB_ENABLED BIT(1)
+/* BIT(1) is free, was IAVF_RXR_FLAGS_BUILD_SKB_ENABLED */
+/* BIT(2) is free */
#define IAVF_TXRX_FLAGS_VLAN_TAG_LOC_L2TAG1 BIT(3)
#define IAVF_TXR_FLAGS_VLAN_TAG_LOC_L2TAG2 BIT(4)
#define IAVF_RXR_FLAGS_VLAN_TAG_LOC_L2TAG2_2 BIT(5)
@@ -392,21 +384,6 @@ struct iavf_ring {
*/
} ____cacheline_internodealigned_in_smp;

-static inline bool ring_uses_build_skb(struct iavf_ring *ring)
-{
- return !!(ring->flags & IAVF_RXR_FLAGS_BUILD_SKB_ENABLED);
-}
-
-static inline void set_ring_build_skb_enabled(struct iavf_ring *ring)
-{
- ring->flags |= IAVF_RXR_FLAGS_BUILD_SKB_ENABLED;
-}
-
-static inline void clear_ring_build_skb_enabled(struct iavf_ring *ring)
-{
- ring->flags &= ~IAVF_RXR_FLAGS_BUILD_SKB_ENABLED;
-}
-
#define IAVF_ITR_ADAPTIVE_MIN_INC 0x0002
#define IAVF_ITR_ADAPTIVE_MIN_USECS 0x0002
#define IAVF_ITR_ADAPTIVE_MAX_USECS 0x007e
diff --git a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
index 378c3e9ddf9d..52273f7eab2c 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
@@ -240,29 +240,6 @@ static const struct iavf_stats iavf_gstrings_stats[] = {

#define IAVF_QUEUE_STATS_LEN ARRAY_SIZE(iavf_gstrings_queue_stats)

-/* For now we have one and only one private flag and it is only defined
- * when we have support for the SKIP_CPU_SYNC DMA attribute. Instead
- * of leaving all this code sitting around empty we will strip it unless
- * our one private flag is actually available.
- */
-struct iavf_priv_flags {
- char flag_string[ETH_GSTRING_LEN];
- u32 flag;
- bool read_only;
-};
-
-#define IAVF_PRIV_FLAG(_name, _flag, _read_only) { \
- .flag_string = _name, \
- .flag = _flag, \
- .read_only = _read_only, \
-}
-
-static const struct iavf_priv_flags iavf_gstrings_priv_flags[] = {
- IAVF_PRIV_FLAG("legacy-rx", IAVF_FLAG_LEGACY_RX, 0),
-};
-
-#define IAVF_PRIV_FLAGS_STR_LEN ARRAY_SIZE(iavf_gstrings_priv_flags)
-
/**
* iavf_get_link_ksettings - Get Link Speed and Duplex settings
* @netdev: network interface device structure
@@ -342,8 +319,6 @@ static int iavf_get_sset_count(struct net_device *netdev, int sset)
return IAVF_STATS_LEN +
(IAVF_QUEUE_STATS_LEN * 2 *
netdev->real_num_tx_queues);
- else if (sset == ETH_SS_PRIV_FLAGS)
- return IAVF_PRIV_FLAGS_STR_LEN;
else
return -EINVAL;
}
@@ -385,21 +360,6 @@ static void iavf_get_ethtool_stats(struct net_device *netdev,
rcu_read_unlock();
}

-/**
- * iavf_get_priv_flag_strings - Get private flag strings
- * @netdev: network interface device structure
- * @data: buffer for string data
- *
- * Builds the private flags string table
- **/
-static void iavf_get_priv_flag_strings(struct net_device *netdev, u8 *data)
-{
- unsigned int i;
-
- for (i = 0; i < IAVF_PRIV_FLAGS_STR_LEN; i++)
- ethtool_puts(&data, iavf_gstrings_priv_flags[i].flag_string);
-}
-
/**
* iavf_get_stat_strings - Get stat strings
* @netdev: network interface device structure
@@ -438,108 +398,11 @@ static void iavf_get_strings(struct net_device *netdev, u32 sset, u8 *data)
case ETH_SS_STATS:
iavf_get_stat_strings(netdev, data);
break;
- case ETH_SS_PRIV_FLAGS:
- iavf_get_priv_flag_strings(netdev, data);
- break;
default:
break;
}
}

-/**
- * iavf_get_priv_flags - report device private flags
- * @netdev: network interface device structure
- *
- * The get string set count and the string set should be matched for each
- * flag returned. Add new strings for each flag to the iavf_gstrings_priv_flags
- * array.
- *
- * Returns a u32 bitmap of flags.
- **/
-static u32 iavf_get_priv_flags(struct net_device *netdev)
-{
- struct iavf_adapter *adapter = netdev_priv(netdev);
- u32 i, ret_flags = 0;
-
- for (i = 0; i < IAVF_PRIV_FLAGS_STR_LEN; i++) {
- const struct iavf_priv_flags *priv_flags;
-
- priv_flags = &iavf_gstrings_priv_flags[i];
-
- if (priv_flags->flag & adapter->flags)
- ret_flags |= BIT(i);
- }
-
- return ret_flags;
-}
-
-/**
- * iavf_set_priv_flags - set private flags
- * @netdev: network interface device structure
- * @flags: bit flags to be set
- **/
-static int iavf_set_priv_flags(struct net_device *netdev, u32 flags)
-{
- struct iavf_adapter *adapter = netdev_priv(netdev);
- u32 orig_flags, new_flags, changed_flags;
- int ret = 0;
- u32 i;
-
- orig_flags = READ_ONCE(adapter->flags);
- new_flags = orig_flags;
-
- for (i = 0; i < IAVF_PRIV_FLAGS_STR_LEN; i++) {
- const struct iavf_priv_flags *priv_flags;
-
- priv_flags = &iavf_gstrings_priv_flags[i];
-
- if (flags & BIT(i))
- new_flags |= priv_flags->flag;
- else
- new_flags &= ~(priv_flags->flag);
-
- if (priv_flags->read_only &&
- ((orig_flags ^ new_flags) & ~BIT(i)))
- return -EOPNOTSUPP;
- }
-
- /* Before we finalize any flag changes, any checks which we need to
- * perform to determine if the new flags will be supported should go
- * here...
- */
-
- /* Compare and exchange the new flags into place. If we failed, that
- * is if cmpxchg returns anything but the old value, this means
- * something else must have modified the flags variable since we
- * copied it. We'll just punt with an error and log something in the
- * message buffer.
- */
- if (cmpxchg(&adapter->flags, orig_flags, new_flags) != orig_flags) {
- dev_warn(&adapter->pdev->dev,
- "Unable to update adapter->flags as it was modified by another thread...\n");
- return -EAGAIN;
- }
-
- changed_flags = orig_flags ^ new_flags;
-
- /* Process any additional changes needed as a result of flag changes.
- * The changed_flags value reflects the list of bits that were changed
- * in the code above.
- */
-
- /* issue a reset to force legacy-rx change to take effect */
- if (changed_flags & IAVF_FLAG_LEGACY_RX) {
- if (netif_running(netdev)) {
- iavf_schedule_reset(adapter, IAVF_FLAG_RESET_NEEDED);
- ret = iavf_wait_for_reset(adapter);
- if (ret)
- netdev_warn(netdev, "Changing private flags timeout or interrupted waiting for reset");
- }
- }
-
- return ret;
-}
-
/**
* iavf_get_msglevel - Get debug message level
* @netdev: network interface device structure
@@ -585,7 +448,6 @@ static void iavf_get_drvinfo(struct net_device *netdev,
strscpy(drvinfo->driver, iavf_driver_name, 32);
strscpy(drvinfo->fw_version, "N/A", 4);
strscpy(drvinfo->bus_info, pci_name(adapter->pdev), 32);
- drvinfo->n_priv_flags = IAVF_PRIV_FLAGS_STR_LEN;
}

/**
@@ -1995,8 +1857,6 @@ static const struct ethtool_ops iavf_ethtool_ops = {
.get_strings = iavf_get_strings,
.get_ethtool_stats = iavf_get_ethtool_stats,
.get_sset_count = iavf_get_sset_count,
- .get_priv_flags = iavf_get_priv_flags,
- .set_priv_flags = iavf_set_priv_flags,
.get_msglevel = iavf_get_msglevel,
.set_msglevel = iavf_set_msglevel,
.get_coalesce = iavf_get_coalesce,
diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index d6cbe5022815..5eb7379956e4 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -719,9 +719,7 @@ static void iavf_configure_rx(struct iavf_adapter *adapter)
struct iavf_hw *hw = &adapter->hw;
int i;

- /* Legacy Rx will always default to a 2048 buffer size. */
-#if (PAGE_SIZE < 8192)
- if (!(adapter->flags & IAVF_FLAG_LEGACY_RX)) {
+ if (PAGE_SIZE < 8192) {
struct net_device *netdev = adapter->netdev;

/* For jumbo frames on systems with 4K pages we have to use
@@ -738,16 +736,10 @@ static void iavf_configure_rx(struct iavf_adapter *adapter)
(netdev->mtu <= ETH_DATA_LEN))
rx_buf_len = IAVF_RXBUFFER_1536 - NET_IP_ALIGN;
}
-#endif

for (i = 0; i < adapter->num_active_queues; i++) {
adapter->rx_rings[i].tail = hw->hw_addr + IAVF_QRX_TAIL1(i);
adapter->rx_rings[i].rx_buf_len = rx_buf_len;
-
- if (adapter->flags & IAVF_FLAG_LEGACY_RX)
- clear_ring_build_skb_enabled(&adapter->rx_rings[i]);
- else
- set_ring_build_skb_enabled(&adapter->rx_rings[i]);
}
}

diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index ab8420719a9a..4b61675b4548 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -824,17 +824,6 @@ static void iavf_release_rx_desc(struct iavf_ring *rx_ring, u32 val)
writel(val, rx_ring->tail);
}

-/**
- * iavf_rx_offset - Return expected offset into page to access data
- * @rx_ring: Ring we are requesting offset of
- *
- * Returns the offset value for ring into the data buffer.
- */
-static unsigned int iavf_rx_offset(struct iavf_ring *rx_ring)
-{
- return ring_uses_build_skb(rx_ring) ? IAVF_SKB_PAD : 0;
-}
-
/**
* iavf_alloc_mapped_page - recycle or make a new page
* @rx_ring: ring to use
@@ -879,7 +868,7 @@ static bool iavf_alloc_mapped_page(struct iavf_ring *rx_ring,

bi->dma = dma;
bi->page = page;
- bi->page_offset = iavf_rx_offset(rx_ring);
+ bi->page_offset = IAVF_SKB_PAD;

/* initialize pagecnt_bias to 1 representing we fully own page */
bi->pagecnt_bias = 1;
@@ -1218,7 +1207,7 @@ static void iavf_add_rx_frag(struct iavf_ring *rx_ring,
#if (PAGE_SIZE < 8192)
unsigned int truesize = iavf_rx_pg_size(rx_ring) / 2;
#else
- unsigned int truesize = SKB_DATA_ALIGN(size + iavf_rx_offset(rx_ring));
+ unsigned int truesize = SKB_DATA_ALIGN(size + IAVF_SKB_PAD);
#endif

if (!size)
@@ -1266,69 +1255,6 @@ static struct iavf_rx_buffer *iavf_get_rx_buffer(struct iavf_ring *rx_ring,
return rx_buffer;
}

-/**
- * iavf_construct_skb - Allocate skb and populate it
- * @rx_ring: rx descriptor ring to transact packets on
- * @rx_buffer: rx buffer to pull data from
- * @size: size of buffer to add to skb
- *
- * This function allocates an skb. It then populates it with the page
- * data from the current receive descriptor, taking care to set up the
- * skb correctly.
- */
-static struct sk_buff *iavf_construct_skb(struct iavf_ring *rx_ring,
- struct iavf_rx_buffer *rx_buffer,
- unsigned int size)
-{
- void *va;
-#if (PAGE_SIZE < 8192)
- unsigned int truesize = iavf_rx_pg_size(rx_ring) / 2;
-#else
- unsigned int truesize = SKB_DATA_ALIGN(size);
-#endif
- unsigned int headlen;
- struct sk_buff *skb;
-
- if (!rx_buffer)
- return NULL;
- /* prefetch first cache line of first page */
- va = page_address(rx_buffer->page) + rx_buffer->page_offset;
- net_prefetch(va);
-
- /* allocate a skb to store the frags */
- skb = napi_alloc_skb(&rx_ring->q_vector->napi, IAVF_RX_HDR_SIZE);
- if (unlikely(!skb))
- return NULL;
-
- /* Determine available headroom for copy */
- headlen = size;
- if (headlen > IAVF_RX_HDR_SIZE)
- headlen = eth_get_headlen(skb->dev, va, IAVF_RX_HDR_SIZE);
-
- /* align pull length to size of long to optimize memcpy performance */
- memcpy(__skb_put(skb, headlen), va, ALIGN(headlen, sizeof(long)));
-
- /* update all of the pointers */
- size -= headlen;
- if (size) {
- skb_add_rx_frag(skb, 0, rx_buffer->page,
- rx_buffer->page_offset + headlen,
- size, truesize);
-
- /* buffer is used by skb, update page_offset */
-#if (PAGE_SIZE < 8192)
- rx_buffer->page_offset ^= truesize;
-#else
- rx_buffer->page_offset += truesize;
-#endif
- } else {
- /* buffer is unused, reset bias back to rx_buffer */
- rx_buffer->pagecnt_bias++;
- }
-
- return skb;
-}
-
/**
* iavf_build_skb - Build skb around an existing buffer
* @rx_ring: Rx descriptor ring to transact packets on
@@ -1500,10 +1426,8 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
/* retrieve a buffer from the ring */
if (skb)
iavf_add_rx_frag(rx_ring, rx_buffer, skb, size);
- else if (ring_uses_build_skb(rx_ring))
- skb = iavf_build_skb(rx_ring, rx_buffer, size);
else
- skb = iavf_construct_skb(rx_ring, rx_buffer, size);
+ skb = iavf_build_skb(rx_ring, rx_buffer, size);

/* exit if we failed to retrieve a buffer */
if (!skb) {
diff --git a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
index 22f2df7c460b..a31df5af0473 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
@@ -289,8 +289,7 @@ void iavf_configure_queues(struct iavf_adapter *adapter)
return;

/* Limit maximum frame size when jumbo frames is not enabled */
- if (!(adapter->flags & IAVF_FLAG_LEGACY_RX) &&
- (adapter->netdev->mtu <= ETH_DATA_LEN))
+ if (adapter->netdev->mtu <= ETH_DATA_LEN)
max_frame = IAVF_RXBUFFER_1536 - NET_IP_ALIGN;

vqci->vsi_id = adapter->vsi_res->vsi_id;
--
2.44.0


2024-04-04 15:49:05

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH net-next v9 4/9] slab: introduce kvmalloc_array_node() and kvcalloc_node()

Add NUMA-aware counterparts for kvmalloc_array() and kvcalloc() to be
able to flexibly allocate arrays for a particular node.
Rewrite kvmalloc_array() to kvmalloc_array_node(NUMA_NO_NODE) call.

Signed-off-by: Alexander Lobakin <[email protected]>
---
include/linux/slab.h | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index e53cbfa18325..d1d1fa5e7983 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -774,14 +774,27 @@ static inline __alloc_size(1) void *kvzalloc(size_t size, gfp_t flags)
return kvmalloc(size, flags | __GFP_ZERO);
}

-static inline __alloc_size(1, 2) void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
+static inline __alloc_size(1, 2) void *
+kvmalloc_array_node(size_t n, size_t size, gfp_t flags, int node)
{
size_t bytes;

if (unlikely(check_mul_overflow(n, size, &bytes)))
return NULL;

- return kvmalloc(bytes, flags);
+ return kvmalloc_node(bytes, flags, node);
+}
+
+static inline __alloc_size(1, 2) void *
+kvmalloc_array(size_t n, size_t size, gfp_t flags)
+{
+ return kvmalloc_array_node(n, size, flags, NUMA_NO_NODE);
+}
+
+static inline __alloc_size(1, 2) void *
+kvcalloc_node(size_t n, size_t size, gfp_t flags, int node)
+{
+ return kvmalloc_array_node(n, size, flags | __GFP_ZERO, node);
}

static inline __alloc_size(1, 2) void *kvcalloc(size_t n, size_t size, gfp_t flags)
--
2.44.0


2024-04-04 15:49:16

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH net-next v9 3/9] iavf: drop page splitting and recycling

As an intermediate step, remove all page splitting/recycling code. Just
always allocate a new page and don't touch its refcount, so that it gets
freed by the core stack later.
Same for the "in-place" recycling, i.e. when an unused buffer gets
assigned to a first needs-refilling descriptor. In some cases, this
was leading to moving up to 63 &iavf_rx_buf structures around the ring
on a per-field basis -- not something wanted on hotpath.
The change allows to greatly simplify certain parts of the code:

Function: add/remove: 0/2 grow/shrink: 0/7 up/down: 0/-744 (-744)

Although the array of &iavf_rx_buf is barely used now and could be
replaced with just page pointer array, don't touch it now to not
complicate replacing it with libie Rx buffer struct later on.
No surprise perf loses up to 30% here, but that regression will
go away once PP lands.
Note that iavf_rx_pg_*() definitions are left to reduce diffstat.
They will be removed with the conversion to Page Pool.

Signed-off-by: Alexander Lobakin <[email protected]>
---
drivers/net/ethernet/intel/iavf/iavf_txrx.h | 65 --------
drivers/net/ethernet/intel/iavf/iavf_type.h | 2 -
drivers/net/ethernet/intel/iavf/iavf_main.c | 24 +--
drivers/net/ethernet/intel/iavf/iavf_txrx.c | 152 +-----------------
.../net/ethernet/intel/iavf/iavf_virtchnl.c | 8 +-
5 files changed, 10 insertions(+), 241 deletions(-)

diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.h b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
index 68543efdd29b..e01777531635 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.h
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
@@ -81,8 +81,6 @@ enum iavf_dyn_idx_t {
BIT_ULL(IAVF_FILTER_PCTYPE_NONF_MULTICAST_IPV6_UDP))

/* Supported Rx Buffer Sizes (a multiple of 128) */
-#define IAVF_RXBUFFER_1536 1536 /* 128B aligned standard Ethernet frame */
-#define IAVF_RXBUFFER_2048 2048
#define IAVF_RXBUFFER_3072 3072 /* Used for large frames w/ padding */
#define IAVF_MAX_RXBUFFER 9728 /* largest size for single descriptor */

@@ -92,57 +90,7 @@ enum iavf_dyn_idx_t {
#define IAVF_RX_DMA_ATTR \
(DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING)

-/* Attempt to maximize the headroom available for incoming frames. We
- * use a 2K buffer for receives and need 1536/1534 to store the data for
- * the frame. This leaves us with 512 bytes of room. From that we need
- * to deduct the space needed for the shared info and the padding needed
- * to IP align the frame.
- *
- * Note: For cache line sizes 256 or larger this value is going to end
- * up negative. In these cases we should fall back to the legacy
- * receive path.
- */
-#if (PAGE_SIZE < 8192)
-#define IAVF_2K_TOO_SMALL_WITH_PADDING \
-((NET_SKB_PAD + IAVF_RXBUFFER_1536) > SKB_WITH_OVERHEAD(IAVF_RXBUFFER_2048))
-
-static inline int iavf_compute_pad(int rx_buf_len)
-{
- int page_size, pad_size;
-
- page_size = ALIGN(rx_buf_len, PAGE_SIZE / 2);
- pad_size = SKB_WITH_OVERHEAD(page_size) - rx_buf_len;
-
- return pad_size;
-}
-
-static inline int iavf_skb_pad(void)
-{
- int rx_buf_len;
-
- /* If a 2K buffer cannot handle a standard Ethernet frame then
- * optimize padding for a 3K buffer instead of a 1.5K buffer.
- *
- * For a 3K buffer we need to add enough padding to allow for
- * tailroom due to NET_IP_ALIGN possibly shifting us out of
- * cache-line alignment.
- */
- if (IAVF_2K_TOO_SMALL_WITH_PADDING)
- rx_buf_len = IAVF_RXBUFFER_3072 + SKB_DATA_ALIGN(NET_IP_ALIGN);
- else
- rx_buf_len = IAVF_RXBUFFER_1536;
-
- /* if needed make room for NET_IP_ALIGN */
- rx_buf_len -= NET_IP_ALIGN;
-
- return iavf_compute_pad(rx_buf_len);
-}
-
-#define IAVF_SKB_PAD iavf_skb_pad()
-#else
-#define IAVF_2K_TOO_SMALL_WITH_PADDING false
#define IAVF_SKB_PAD (NET_SKB_PAD + NET_IP_ALIGN)
-#endif

/**
* iavf_test_staterr - tests bits in Rx descriptor status and error fields
@@ -265,12 +213,7 @@ struct iavf_tx_buffer {
struct iavf_rx_buffer {
dma_addr_t dma;
struct page *page;
-#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
__u32 page_offset;
-#else
- __u16 page_offset;
-#endif
- __u16 pagecnt_bias;
};

struct iavf_queue_stats {
@@ -292,8 +235,6 @@ struct iavf_rx_queue_stats {
u64 non_eop_descs;
u64 alloc_page_failed;
u64 alloc_buff_failed;
- u64 page_reuse_count;
- u64 realloc_count;
};

enum iavf_ring_state_t {
@@ -337,7 +278,6 @@ struct iavf_ring {

u16 count; /* Number of descriptors */
u16 reg_idx; /* HW register index of the ring */
- u16 rx_buf_len;

/* used in interrupt processing */
u16 next_to_use;
@@ -373,7 +313,6 @@ struct iavf_ring {
struct iavf_q_vector *q_vector; /* Backreference to associated vector */

struct rcu_head rcu; /* to avoid race on free */
- u16 next_to_alloc;
struct sk_buff *skb; /* When iavf_clean_rx_ring_irq() must
* return before it sees the EOP for
* the current packet, we save that skb
@@ -407,10 +346,6 @@ struct iavf_ring_container {

static inline unsigned int iavf_rx_pg_order(struct iavf_ring *ring)
{
-#if (PAGE_SIZE < 8192)
- if (ring->rx_buf_len > (PAGE_SIZE / 2))
- return 1;
-#endif
return 0;
}

diff --git a/drivers/net/ethernet/intel/iavf/iavf_type.h b/drivers/net/ethernet/intel/iavf/iavf_type.h
index 23ded4fcd94f..f6b09e57abce 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_type.h
+++ b/drivers/net/ethernet/intel/iavf/iavf_type.h
@@ -10,8 +10,6 @@
#include "iavf_adminq.h"
#include "iavf_devids.h"

-#define IAVF_RXQ_CTX_DBUFF_SHIFT 7
-
/* IAVF_MASK is a macro used on 32 bit registers */
#define IAVF_MASK(mask, shift) ((u32)(mask) << (shift))

diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index 5eb7379956e4..ffb71a62b105 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -715,32 +715,10 @@ static void iavf_configure_tx(struct iavf_adapter *adapter)
**/
static void iavf_configure_rx(struct iavf_adapter *adapter)
{
- unsigned int rx_buf_len = IAVF_RXBUFFER_2048;
struct iavf_hw *hw = &adapter->hw;
- int i;
-
- if (PAGE_SIZE < 8192) {
- struct net_device *netdev = adapter->netdev;

- /* For jumbo frames on systems with 4K pages we have to use
- * an order 1 page, so we might as well increase the size
- * of our Rx buffer to make better use of the available space
- */
- rx_buf_len = IAVF_RXBUFFER_3072;
-
- /* We use a 1536 buffer size for configurations with
- * standard Ethernet mtu. On x86 this gives us enough room
- * for shared info and 192 bytes of padding.
- */
- if (!IAVF_2K_TOO_SMALL_WITH_PADDING &&
- (netdev->mtu <= ETH_DATA_LEN))
- rx_buf_len = IAVF_RXBUFFER_1536 - NET_IP_ALIGN;
- }
-
- for (i = 0; i < adapter->num_active_queues; i++) {
+ for (u32 i = 0; i < adapter->num_active_queues; i++)
adapter->rx_rings[i].tail = hw->hw_addr + IAVF_QRX_TAIL1(i);
- adapter->rx_rings[i].rx_buf_len = rx_buf_len;
- }
}

/**
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index 4b61675b4548..a14f7f211150 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -715,7 +715,7 @@ static void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
dma_sync_single_range_for_cpu(rx_ring->dev,
rx_bi->dma,
rx_bi->page_offset,
- rx_ring->rx_buf_len,
+ IAVF_RXBUFFER_3072,
DMA_FROM_DEVICE);

/* free resources associated with mapping */
@@ -724,7 +724,7 @@ static void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
DMA_FROM_DEVICE,
IAVF_RX_DMA_ATTR);

- __page_frag_cache_drain(rx_bi->page, rx_bi->pagecnt_bias);
+ __free_page(rx_bi->page);

rx_bi->page = NULL;
rx_bi->page_offset = 0;
@@ -736,7 +736,6 @@ static void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
/* Zero out the descriptor ring */
memset(rx_ring->desc, 0, rx_ring->size);

- rx_ring->next_to_alloc = 0;
rx_ring->next_to_clean = 0;
rx_ring->next_to_use = 0;
}
@@ -792,7 +791,6 @@ int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
goto err;
}

- rx_ring->next_to_alloc = 0;
rx_ring->next_to_clean = 0;
rx_ring->next_to_use = 0;

@@ -812,9 +810,6 @@ static void iavf_release_rx_desc(struct iavf_ring *rx_ring, u32 val)
{
rx_ring->next_to_use = val;

- /* update next to alloc since we have filled the ring */
- rx_ring->next_to_alloc = val;
-
/* Force memory writes to complete before letting h/w
* know there are new descriptors to fetch. (Only
* applicable for weak-ordered memory model archs,
@@ -838,12 +833,6 @@ static bool iavf_alloc_mapped_page(struct iavf_ring *rx_ring,
struct page *page = bi->page;
dma_addr_t dma;

- /* since we are recycling buffers we should seldom need to alloc */
- if (likely(page)) {
- rx_ring->rx_stats.page_reuse_count++;
- return true;
- }
-
/* alloc new page for storage */
page = dev_alloc_pages(iavf_rx_pg_order(rx_ring));
if (unlikely(!page)) {
@@ -870,9 +859,6 @@ static bool iavf_alloc_mapped_page(struct iavf_ring *rx_ring,
bi->page = page;
bi->page_offset = IAVF_SKB_PAD;

- /* initialize pagecnt_bias to 1 representing we fully own page */
- bi->pagecnt_bias = 1;
-
return true;
}

@@ -924,7 +910,7 @@ bool iavf_alloc_rx_buffers(struct iavf_ring *rx_ring, u16 cleaned_count)
/* sync the buffer for use by the device */
dma_sync_single_range_for_device(rx_ring->dev, bi->dma,
bi->page_offset,
- rx_ring->rx_buf_len,
+ IAVF_RXBUFFER_3072,
DMA_FROM_DEVICE);

/* Refresh the desc even if buffer_addrs didn't change
@@ -1102,91 +1088,6 @@ static bool iavf_cleanup_headers(struct iavf_ring *rx_ring, struct sk_buff *skb)
return false;
}

-/**
- * iavf_reuse_rx_page - page flip buffer and store it back on the ring
- * @rx_ring: rx descriptor ring to store buffers on
- * @old_buff: donor buffer to have page reused
- *
- * Synchronizes page for reuse by the adapter
- **/
-static void iavf_reuse_rx_page(struct iavf_ring *rx_ring,
- struct iavf_rx_buffer *old_buff)
-{
- struct iavf_rx_buffer *new_buff;
- u16 nta = rx_ring->next_to_alloc;
-
- new_buff = &rx_ring->rx_bi[nta];
-
- /* update, and store next to alloc */
- nta++;
- rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
-
- /* transfer page from old buffer to new buffer */
- new_buff->dma = old_buff->dma;
- new_buff->page = old_buff->page;
- new_buff->page_offset = old_buff->page_offset;
- new_buff->pagecnt_bias = old_buff->pagecnt_bias;
-}
-
-/**
- * iavf_can_reuse_rx_page - Determine if this page can be reused by
- * the adapter for another receive
- *
- * @rx_buffer: buffer containing the page
- *
- * If page is reusable, rx_buffer->page_offset is adjusted to point to
- * an unused region in the page.
- *
- * For small pages, @truesize will be a constant value, half the size
- * of the memory at page. We'll attempt to alternate between high and
- * low halves of the page, with one half ready for use by the hardware
- * and the other half being consumed by the stack. We use the page
- * ref count to determine whether the stack has finished consuming the
- * portion of this page that was passed up with a previous packet. If
- * the page ref count is >1, we'll assume the "other" half page is
- * still busy, and this page cannot be reused.
- *
- * For larger pages, @truesize will be the actual space used by the
- * received packet (adjusted upward to an even multiple of the cache
- * line size). This will advance through the page by the amount
- * actually consumed by the received packets while there is still
- * space for a buffer. Each region of larger pages will be used at
- * most once, after which the page will not be reused.
- *
- * In either case, if the page is reusable its refcount is increased.
- **/
-static bool iavf_can_reuse_rx_page(struct iavf_rx_buffer *rx_buffer)
-{
- unsigned int pagecnt_bias = rx_buffer->pagecnt_bias;
- struct page *page = rx_buffer->page;
-
- /* Is any reuse possible? */
- if (!dev_page_is_reusable(page))
- return false;
-
-#if (PAGE_SIZE < 8192)
- /* if we are only owner of page we can reuse it */
- if (unlikely((page_count(page) - pagecnt_bias) > 1))
- return false;
-#else
-#define IAVF_LAST_OFFSET \
- (SKB_WITH_OVERHEAD(PAGE_SIZE) - IAVF_RXBUFFER_2048)
- if (rx_buffer->page_offset > IAVF_LAST_OFFSET)
- return false;
-#endif
-
- /* If we have drained the page fragment pool we need to update
- * the pagecnt_bias and page count so that we fully restock the
- * number of references the driver holds.
- */
- if (unlikely(!pagecnt_bias)) {
- page_ref_add(page, USHRT_MAX);
- rx_buffer->pagecnt_bias = USHRT_MAX;
- }
-
- return true;
-}
-
/**
* iavf_add_rx_frag - Add contents of Rx buffer to sk_buff
* @rx_ring: rx descriptor ring to transact packets on
@@ -1204,24 +1105,13 @@ static void iavf_add_rx_frag(struct iavf_ring *rx_ring,
struct sk_buff *skb,
unsigned int size)
{
-#if (PAGE_SIZE < 8192)
- unsigned int truesize = iavf_rx_pg_size(rx_ring) / 2;
-#else
unsigned int truesize = SKB_DATA_ALIGN(size + IAVF_SKB_PAD);
-#endif

if (!size)
return;

skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page,
rx_buffer->page_offset, size, truesize);
-
- /* page is being used so we must update the page offset */
-#if (PAGE_SIZE < 8192)
- rx_buffer->page_offset ^= truesize;
-#else
- rx_buffer->page_offset += truesize;
-#endif
}

/**
@@ -1249,9 +1139,6 @@ static struct iavf_rx_buffer *iavf_get_rx_buffer(struct iavf_ring *rx_ring,
size,
DMA_FROM_DEVICE);

- /* We have pulled a buffer for use, so decrement pagecnt_bias */
- rx_buffer->pagecnt_bias--;
-
return rx_buffer;
}

@@ -1269,12 +1156,8 @@ static struct sk_buff *iavf_build_skb(struct iavf_ring *rx_ring,
unsigned int size)
{
void *va;
-#if (PAGE_SIZE < 8192)
- unsigned int truesize = iavf_rx_pg_size(rx_ring) / 2;
-#else
unsigned int truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
SKB_DATA_ALIGN(IAVF_SKB_PAD + size);
-#endif
struct sk_buff *skb;

if (!rx_buffer || !size)
@@ -1292,23 +1175,15 @@ static struct sk_buff *iavf_build_skb(struct iavf_ring *rx_ring,
skb_reserve(skb, IAVF_SKB_PAD);
__skb_put(skb, size);

- /* buffer is used by skb, update page_offset */
-#if (PAGE_SIZE < 8192)
- rx_buffer->page_offset ^= truesize;
-#else
- rx_buffer->page_offset += truesize;
-#endif
-
return skb;
}

/**
- * iavf_put_rx_buffer - Clean up used buffer and either recycle or free
+ * iavf_put_rx_buffer - Unmap used buffer
* @rx_ring: rx descriptor ring to transact packets on
* @rx_buffer: rx buffer to pull data from
*
- * This function will clean up the contents of the rx_buffer. It will
- * either recycle the buffer or unmap it and free the associated resources.
+ * This function will unmap the buffer after it's written by HW.
*/
static void iavf_put_rx_buffer(struct iavf_ring *rx_ring,
struct iavf_rx_buffer *rx_buffer)
@@ -1316,18 +1191,9 @@ static void iavf_put_rx_buffer(struct iavf_ring *rx_ring,
if (!rx_buffer)
return;

- if (iavf_can_reuse_rx_page(rx_buffer)) {
- /* hand second half of page back to the ring */
- iavf_reuse_rx_page(rx_ring, rx_buffer);
- rx_ring->rx_stats.page_reuse_count++;
- } else {
- /* we are not reusing the buffer so unmap it */
- dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
- iavf_rx_pg_size(rx_ring),
- DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
- __page_frag_cache_drain(rx_buffer->page,
- rx_buffer->pagecnt_bias);
- }
+ /* we are not reusing the buffer so unmap it */
+ dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma, PAGE_SIZE,
+ DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);

/* clear contents of buffer_info */
rx_buffer->page = NULL;
@@ -1432,8 +1298,6 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
/* exit if we failed to retrieve a buffer */
if (!skb) {
rx_ring->rx_stats.alloc_buff_failed++;
- if (rx_buffer && size)
- rx_buffer->pagecnt_bias++;
break;
}

diff --git a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
index a31df5af0473..f8e9f859a4f1 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
@@ -288,10 +288,6 @@ void iavf_configure_queues(struct iavf_adapter *adapter)
if (!vqci)
return;

- /* Limit maximum frame size when jumbo frames is not enabled */
- if (adapter->netdev->mtu <= ETH_DATA_LEN)
- max_frame = IAVF_RXBUFFER_1536 - NET_IP_ALIGN;
-
vqci->vsi_id = adapter->vsi_res->vsi_id;
vqci->num_queue_pairs = pairs;
vqpi = vqci->qpair;
@@ -308,9 +304,7 @@ void iavf_configure_queues(struct iavf_adapter *adapter)
vqpi->rxq.ring_len = adapter->rx_rings[i].count;
vqpi->rxq.dma_ring_addr = adapter->rx_rings[i].dma;
vqpi->rxq.max_pkt_size = max_frame;
- vqpi->rxq.databuffer_size =
- ALIGN(adapter->rx_rings[i].rx_buf_len,
- BIT_ULL(IAVF_RXQ_CTX_DBUFF_SHIFT));
+ vqpi->rxq.databuffer_size = IAVF_RXBUFFER_3072;
if (CRC_OFFLOAD_ALLOWED(adapter))
vqpi->rxq.crc_disable = !!(adapter->netdev->features &
NETIF_F_RXFCS);
--
2.44.0


2024-04-04 15:50:15

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH net-next v9 8/9] iavf: pack iavf_ring more efficiently

Before replacing the Rx buffer management with libie, clean up
&iavf_ring a bit.
There are several fields not used anywhere in the code -- simply remove
them. Move ::tail up to remove a hole. Replace ::arm_wb boolean with
1-bit flag in ::flags to free 1 more byte. Finally, move ::prev_pkt_ctr
out of &iavf_tx_queue_stats -- it doesn't belong there (used for Tx
stall detection). Place it next to the stats on the ring itself to fill
the 4-byte slot.
The result: no holes and all the hot fields fit into the first 64-byte
cacheline.

Signed-off-by: Alexander Lobakin <[email protected]>
---
drivers/net/ethernet/intel/iavf/iavf_txrx.h | 22 +++------------------
drivers/net/ethernet/intel/iavf/iavf_txrx.c | 12 +++++------
2 files changed, 9 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.h b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
index e01777531635..ed559fa6f214 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.h
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
@@ -227,7 +227,6 @@ struct iavf_tx_queue_stats {
u64 tx_done_old;
u64 tx_linearize;
u64 tx_force_wb;
- int prev_pkt_ctr;
u64 tx_lost_interrupt;
};

@@ -237,12 +236,6 @@ struct iavf_rx_queue_stats {
u64 alloc_buff_failed;
};

-enum iavf_ring_state_t {
- __IAVF_TX_FDIR_INIT_DONE,
- __IAVF_TX_XPS_INIT_DONE,
- __IAVF_RING_STATE_NBITS /* must be last */
-};
-
/* some useful defines for virtchannel interface, which
* is the only remaining user of header split
*/
@@ -264,10 +257,8 @@ struct iavf_ring {
struct iavf_tx_buffer *tx_bi;
struct iavf_rx_buffer *rx_bi;
};
- DECLARE_BITMAP(state, __IAVF_RING_STATE_NBITS);
- u16 queue_index; /* Queue number of ring */
- u8 dcb_tc; /* Traffic class of ring */
u8 __iomem *tail;
+ u16 queue_index; /* Queue number of ring */

/* high bit set means dynamic, use accessors routines to read/write.
* hardware only supports 2us resolution for the ITR registers.
@@ -277,22 +268,14 @@ struct iavf_ring {
u16 itr_setting;

u16 count; /* Number of descriptors */
- u16 reg_idx; /* HW register index of the ring */

/* used in interrupt processing */
u16 next_to_use;
u16 next_to_clean;

- u8 atr_sample_rate;
- u8 atr_count;
-
- bool ring_active; /* is ring online or not */
- bool arm_wb; /* do something to arm write back */
- u8 packet_stride;
-
u16 flags;
#define IAVF_TXR_FLAGS_WB_ON_ITR BIT(0)
-/* BIT(1) is free, was IAVF_RXR_FLAGS_BUILD_SKB_ENABLED */
+#define IAVF_TXR_FLAGS_ARM_WB BIT(1)
/* BIT(2) is free */
#define IAVF_TXRX_FLAGS_VLAN_TAG_LOC_L2TAG1 BIT(3)
#define IAVF_TXR_FLAGS_VLAN_TAG_LOC_L2TAG2 BIT(4)
@@ -306,6 +289,7 @@ struct iavf_ring {
struct iavf_rx_queue_stats rx_stats;
};

+ int prev_pkt_ctr; /* For Tx stall detection */
unsigned int size; /* length of descriptor ring in bytes */
dma_addr_t dma; /* physical address of ring */

diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index a14f7f211150..1a27fa613f6d 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -185,7 +185,7 @@ void iavf_detect_recover_hung(struct iavf_vsi *vsi)
* pending work.
*/
packets = tx_ring->stats.packets & INT_MAX;
- if (tx_ring->tx_stats.prev_pkt_ctr == packets) {
+ if (tx_ring->prev_pkt_ctr == packets) {
iavf_force_wb(vsi, tx_ring->q_vector);
continue;
}
@@ -194,7 +194,7 @@ void iavf_detect_recover_hung(struct iavf_vsi *vsi)
* to iavf_get_tx_pending()
*/
smp_rmb();
- tx_ring->tx_stats.prev_pkt_ctr =
+ tx_ring->prev_pkt_ctr =
iavf_get_tx_pending(tx_ring, true) ? packets : -1;
}
}
@@ -320,7 +320,7 @@ static bool iavf_clean_tx_irq(struct iavf_vsi *vsi,
((j / WB_STRIDE) == 0) && (j > 0) &&
!test_bit(__IAVF_VSI_DOWN, vsi->state) &&
(IAVF_DESC_UNUSED(tx_ring) != tx_ring->count))
- tx_ring->arm_wb = true;
+ tx_ring->flags |= IAVF_TXR_FLAGS_ARM_WB;
}

/* notify netdev of completed buffers */
@@ -675,7 +675,7 @@ int iavf_setup_tx_descriptors(struct iavf_ring *tx_ring)

tx_ring->next_to_use = 0;
tx_ring->next_to_clean = 0;
- tx_ring->tx_stats.prev_pkt_ctr = -1;
+ tx_ring->prev_pkt_ctr = -1;
return 0;

err:
@@ -1491,8 +1491,8 @@ int iavf_napi_poll(struct napi_struct *napi, int budget)
clean_complete = false;
continue;
}
- arm_wb |= ring->arm_wb;
- ring->arm_wb = false;
+ arm_wb |= !!(ring->flags & IAVF_TXR_FLAGS_ARM_WB);
+ ring->flags &= ~IAVF_TXR_FLAGS_ARM_WB;
}

/* Handle case where we are called by netpoll with a budget of 0 */
--
2.44.0


2024-04-04 16:01:06

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH net-next v9 9/9] iavf: switch to Page Pool

Now that the IAVF driver simply uses dev_alloc_page() + free_page() with
no custom recycling logics, it can easily be switched to using Page
Pool / libeth API instead.
This allows to removing the whole dancing around headroom, HW buffer
size, and page order. All DMA-for-device is now done in the PP core,
for-CPU -- in the libeth helper.
Use skb_mark_for_recycle() to bring back the recycling and restore the
performance. Speaking of performance: on par with the baseline and
faster with the PP optimization series applied. But the memory usage for
1500b MTU is now almost 2x lower (x86_64) thanks to allocating a page
every second descriptor.

Signed-off-by: Alexander Lobakin <[email protected]>
---
drivers/net/ethernet/intel/iavf/iavf_txrx.h | 34 +--
include/linux/net/intel/libie/rx.h | 17 ++
drivers/net/ethernet/intel/iavf/iavf_main.c | 7 +-
drivers/net/ethernet/intel/iavf/iavf_txrx.c | 257 +++++-------------
.../net/ethernet/intel/iavf/iavf_virtchnl.c | 10 +-
5 files changed, 111 insertions(+), 214 deletions(-)

diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.h b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
index ed559fa6f214..d7b5587aeb8e 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.h
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
@@ -80,18 +80,8 @@ enum iavf_dyn_idx_t {
BIT_ULL(IAVF_FILTER_PCTYPE_NONF_UNICAST_IPV6_UDP) | \
BIT_ULL(IAVF_FILTER_PCTYPE_NONF_MULTICAST_IPV6_UDP))

-/* Supported Rx Buffer Sizes (a multiple of 128) */
-#define IAVF_RXBUFFER_3072 3072 /* Used for large frames w/ padding */
-#define IAVF_MAX_RXBUFFER 9728 /* largest size for single descriptor */
-
-#define IAVF_PACKET_HDR_PAD (ETH_HLEN + ETH_FCS_LEN + (VLAN_HLEN * 2))
#define iavf_rx_desc iavf_32byte_rx_desc

-#define IAVF_RX_DMA_ATTR \
- (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING)
-
-#define IAVF_SKB_PAD (NET_SKB_PAD + NET_IP_ALIGN)
-
/**
* iavf_test_staterr - tests bits in Rx descriptor status and error fields
* @rx_desc: pointer to receive descriptor (in le64 format)
@@ -210,12 +200,6 @@ struct iavf_tx_buffer {
u32 tx_flags;
};

-struct iavf_rx_buffer {
- dma_addr_t dma;
- struct page *page;
- __u32 page_offset;
-};
-
struct iavf_queue_stats {
u64 packets;
u64 bytes;
@@ -251,13 +235,18 @@ struct iavf_rx_queue_stats {
struct iavf_ring {
struct iavf_ring *next; /* pointer to next ring in q_vector */
void *desc; /* Descriptor ring memory */
- struct device *dev; /* Used for DMA mapping */
+ union {
+ struct page_pool *pp; /* Used on Rx for buffer management */
+ struct device *dev; /* Used on Tx for DMA mapping */
+ };
struct net_device *netdev; /* netdev ring maps to */
union {
+ struct libeth_fqe *rx_fqes;
struct iavf_tx_buffer *tx_bi;
- struct iavf_rx_buffer *rx_bi;
};
u8 __iomem *tail;
+ u32 truesize;
+
u16 queue_index; /* Queue number of ring */

/* high bit set means dynamic, use accessors routines to read/write.
@@ -305,6 +294,8 @@ struct iavf_ring {
* iavf_clean_rx_ring_irq() is called
* for this ring.
*/
+
+ u32 rx_buf_len;
} ____cacheline_internodealigned_in_smp;

#define IAVF_ITR_ADAPTIVE_MIN_INC 0x0002
@@ -328,13 +319,6 @@ struct iavf_ring_container {
#define iavf_for_each_ring(pos, head) \
for (pos = (head).ring; pos != NULL; pos = pos->next)

-static inline unsigned int iavf_rx_pg_order(struct iavf_ring *ring)
-{
- return 0;
-}
-
-#define iavf_rx_pg_size(_ring) (PAGE_SIZE << iavf_rx_pg_order(_ring))
-
bool iavf_alloc_rx_buffers(struct iavf_ring *rxr, u16 cleaned_count);
netdev_tx_t iavf_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
int iavf_setup_tx_descriptors(struct iavf_ring *tx_ring);
diff --git a/include/linux/net/intel/libie/rx.h b/include/linux/net/intel/libie/rx.h
index ab9ffe1e93d8..be6cfb59bc6c 100644
--- a/include/linux/net/intel/libie/rx.h
+++ b/include/linux/net/intel/libie/rx.h
@@ -6,6 +6,23 @@

#include <net/libeth/rx.h>

+/* Rx buffer management */
+
+/* The largest size for a single descriptor as per HW */
+#define LIBIE_MAX_RX_BUF_LEN 9728U
+/* "True" HW-writeable space: minimum from SW and HW values */
+#define LIBIE_RX_BUF_LEN(hr) min_t(u32, LIBETH_RX_PAGE_LEN(hr), \
+ LIBIE_MAX_RX_BUF_LEN)
+
+/* The maximum frame size as per HW (S/G) */
+#define __LIBIE_MAX_RX_FRM_LEN 16382U
+/* ATST, HW can chain up to 5 Rx descriptors */
+#define LIBIE_MAX_RX_FRM_LEN(hr) \
+ min_t(u32, __LIBIE_MAX_RX_FRM_LEN, LIBIE_RX_BUF_LEN(hr) * 5)
+/* Maximum frame size minus LL overhead */
+#define LIBIE_MAX_MTU \
+ (LIBIE_MAX_RX_FRM_LEN(LIBETH_MAX_HEADROOM) - LIBETH_RX_LL_LEN)
+
/* O(1) converting i40e/ice/iavf's 8/10-bit hardware packet type to a parsed
* bitfield struct.
*/
diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index ffb71a62b105..7e0de0a9b883 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -1,6 +1,8 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright(c) 2013 - 2018 Intel Corporation. */

+#include <linux/net/intel/libie/rx.h>
+
#include "iavf.h"
#include "iavf_prototype.h"
/* All iavf tracepoints are defined by the include below, which must
@@ -45,6 +47,7 @@ MODULE_DEVICE_TABLE(pci, iavf_pci_tbl);
MODULE_ALIAS("i40evf");
MODULE_AUTHOR("Intel Corporation, <[email protected]>");
MODULE_DESCRIPTION("Intel(R) Ethernet Adaptive Virtual Function Network Driver");
+MODULE_IMPORT_NS(LIBETH);
MODULE_IMPORT_NS(LIBIE);
MODULE_LICENSE("GPL v2");

@@ -1586,7 +1589,6 @@ static int iavf_alloc_queues(struct iavf_adapter *adapter)
rx_ring = &adapter->rx_rings[i];
rx_ring->queue_index = i;
rx_ring->netdev = adapter->netdev;
- rx_ring->dev = &adapter->pdev->dev;
rx_ring->count = adapter->rx_desc_count;
rx_ring->itr_setting = IAVF_ITR_RX_DEF;
}
@@ -2613,9 +2615,8 @@ static void iavf_init_config_adapter(struct iavf_adapter *adapter)
iavf_set_ethtool_ops(netdev);
netdev->watchdog_timeo = 5 * HZ;

- /* MTU range: 68 - 9710 */
netdev->min_mtu = ETH_MIN_MTU;
- netdev->max_mtu = IAVF_MAX_RXBUFFER - IAVF_PACKET_HDR_PAD;
+ netdev->max_mtu = LIBIE_MAX_MTU;

if (!is_valid_ether_addr(adapter->hw.mac.addr)) {
dev_info(&pdev->dev, "Invalid MAC address %pM, using random\n",
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index 1a27fa613f6d..02cda84bcb9a 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -690,11 +690,8 @@ int iavf_setup_tx_descriptors(struct iavf_ring *tx_ring)
**/
static void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
{
- unsigned long bi_size;
- u16 i;
-
/* ring already cleared, nothing to do */
- if (!rx_ring->rx_bi)
+ if (!rx_ring->rx_fqes)
return;

if (rx_ring->skb) {
@@ -702,40 +699,16 @@ static void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
rx_ring->skb = NULL;
}

- /* Free all the Rx ring sk_buffs */
- for (i = 0; i < rx_ring->count; i++) {
- struct iavf_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
+ /* Free all the Rx ring buffers */
+ for (u32 i = rx_ring->next_to_clean; i != rx_ring->next_to_use; ) {
+ const struct libeth_fqe *rx_fqes = &rx_ring->rx_fqes[i];

- if (!rx_bi->page)
- continue;
+ page_pool_put_full_page(rx_ring->pp, rx_fqes->page, false);

- /* Invalidate cache lines that may have been written to by
- * device so that we avoid corrupting memory.
- */
- dma_sync_single_range_for_cpu(rx_ring->dev,
- rx_bi->dma,
- rx_bi->page_offset,
- IAVF_RXBUFFER_3072,
- DMA_FROM_DEVICE);
-
- /* free resources associated with mapping */
- dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
- iavf_rx_pg_size(rx_ring),
- DMA_FROM_DEVICE,
- IAVF_RX_DMA_ATTR);
-
- __free_page(rx_bi->page);
-
- rx_bi->page = NULL;
- rx_bi->page_offset = 0;
+ if (unlikely(++i == rx_ring->count))
+ i = 0;
}

- bi_size = sizeof(struct iavf_rx_buffer) * rx_ring->count;
- memset(rx_ring->rx_bi, 0, bi_size);
-
- /* Zero out the descriptor ring */
- memset(rx_ring->desc, 0, rx_ring->size);
-
rx_ring->next_to_clean = 0;
rx_ring->next_to_use = 0;
}
@@ -748,15 +721,22 @@ static void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
**/
void iavf_free_rx_resources(struct iavf_ring *rx_ring)
{
+ struct libeth_fq fq = {
+ .fqes = rx_ring->rx_fqes,
+ .pp = rx_ring->pp,
+ };
+
iavf_clean_rx_ring(rx_ring);
- kfree(rx_ring->rx_bi);
- rx_ring->rx_bi = NULL;

if (rx_ring->desc) {
- dma_free_coherent(rx_ring->dev, rx_ring->size,
+ dma_free_coherent(rx_ring->pp->p.dev, rx_ring->size,
rx_ring->desc, rx_ring->dma);
rx_ring->desc = NULL;
}
+
+ libeth_rx_fq_destroy(&fq);
+ rx_ring->rx_fqes = NULL;
+ rx_ring->pp = NULL;
}

/**
@@ -767,26 +747,32 @@ void iavf_free_rx_resources(struct iavf_ring *rx_ring)
**/
int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
{
- struct device *dev = rx_ring->dev;
- int bi_size;
-
- /* warn if we are about to overwrite the pointer */
- WARN_ON(rx_ring->rx_bi);
- bi_size = sizeof(struct iavf_rx_buffer) * rx_ring->count;
- rx_ring->rx_bi = kzalloc(bi_size, GFP_KERNEL);
- if (!rx_ring->rx_bi)
- goto err;
+ struct libeth_fq fq = {
+ .count = rx_ring->count,
+ .buf_len = LIBIE_MAX_RX_BUF_LEN,
+ .nid = NUMA_NO_NODE,
+ };
+ int ret;
+
+ ret = libeth_rx_fq_create(&fq, &rx_ring->q_vector->napi);
+ if (ret)
+ return ret;
+
+ rx_ring->pp = fq.pp;
+ rx_ring->rx_fqes = fq.fqes;
+ rx_ring->truesize = fq.truesize;
+ rx_ring->rx_buf_len = fq.buf_len;

u64_stats_init(&rx_ring->syncp);

/* Round up to nearest 4K */
rx_ring->size = rx_ring->count * sizeof(union iavf_32byte_rx_desc);
rx_ring->size = ALIGN(rx_ring->size, 4096);
- rx_ring->desc = dma_alloc_coherent(dev, rx_ring->size,
+ rx_ring->desc = dma_alloc_coherent(fq.pp->p.dev, rx_ring->size,
&rx_ring->dma, GFP_KERNEL);

if (!rx_ring->desc) {
- dev_info(dev, "Unable to allocate memory for the Rx descriptor ring, size=%d\n",
+ dev_info(fq.pp->p.dev, "Unable to allocate memory for the Rx descriptor ring, size=%d\n",
rx_ring->size);
goto err;
}
@@ -795,9 +781,12 @@ int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
rx_ring->next_to_use = 0;

return 0;
+
err:
- kfree(rx_ring->rx_bi);
- rx_ring->rx_bi = NULL;
+ libeth_rx_fq_destroy(&fq);
+ rx_ring->rx_fqes = NULL;
+ rx_ring->pp = NULL;
+
return -ENOMEM;
}

@@ -819,49 +808,6 @@ static void iavf_release_rx_desc(struct iavf_ring *rx_ring, u32 val)
writel(val, rx_ring->tail);
}

-/**
- * iavf_alloc_mapped_page - recycle or make a new page
- * @rx_ring: ring to use
- * @bi: rx_buffer struct to modify
- *
- * Returns true if the page was successfully allocated or
- * reused.
- **/
-static bool iavf_alloc_mapped_page(struct iavf_ring *rx_ring,
- struct iavf_rx_buffer *bi)
-{
- struct page *page = bi->page;
- dma_addr_t dma;
-
- /* alloc new page for storage */
- page = dev_alloc_pages(iavf_rx_pg_order(rx_ring));
- if (unlikely(!page)) {
- rx_ring->rx_stats.alloc_page_failed++;
- return false;
- }
-
- /* map page for use */
- dma = dma_map_page_attrs(rx_ring->dev, page, 0,
- iavf_rx_pg_size(rx_ring),
- DMA_FROM_DEVICE,
- IAVF_RX_DMA_ATTR);
-
- /* if mapping failed free memory back to system since
- * there isn't much point in holding memory we can't use
- */
- if (dma_mapping_error(rx_ring->dev, dma)) {
- __free_pages(page, iavf_rx_pg_order(rx_ring));
- rx_ring->rx_stats.alloc_page_failed++;
- return false;
- }
-
- bi->dma = dma;
- bi->page = page;
- bi->page_offset = IAVF_SKB_PAD;
-
- return true;
-}
-
/**
* iavf_receive_skb - Send a completed packet up the stack
* @rx_ring: rx ring in play
@@ -892,38 +838,37 @@ static void iavf_receive_skb(struct iavf_ring *rx_ring,
**/
bool iavf_alloc_rx_buffers(struct iavf_ring *rx_ring, u16 cleaned_count)
{
+ const struct libeth_fq_fp fq = {
+ .pp = rx_ring->pp,
+ .fqes = rx_ring->rx_fqes,
+ .truesize = rx_ring->truesize,
+ .count = rx_ring->count,
+ };
u16 ntu = rx_ring->next_to_use;
union iavf_rx_desc *rx_desc;
- struct iavf_rx_buffer *bi;

/* do nothing if no valid netdev defined */
if (!rx_ring->netdev || !cleaned_count)
return false;

rx_desc = IAVF_RX_DESC(rx_ring, ntu);
- bi = &rx_ring->rx_bi[ntu];

do {
- if (!iavf_alloc_mapped_page(rx_ring, bi))
- goto no_buffers;
+ dma_addr_t addr;

- /* sync the buffer for use by the device */
- dma_sync_single_range_for_device(rx_ring->dev, bi->dma,
- bi->page_offset,
- IAVF_RXBUFFER_3072,
- DMA_FROM_DEVICE);
+ addr = libeth_rx_alloc(&fq, ntu);
+ if (addr == DMA_MAPPING_ERROR)
+ goto no_buffers;

/* Refresh the desc even if buffer_addrs didn't change
* because each write-back erases this info.
*/
- rx_desc->read.pkt_addr = cpu_to_le64(bi->dma + bi->page_offset);
+ rx_desc->read.pkt_addr = cpu_to_le64(addr);

rx_desc++;
- bi++;
ntu++;
if (unlikely(ntu == rx_ring->count)) {
rx_desc = IAVF_RX_DESC(rx_ring, 0);
- bi = rx_ring->rx_bi;
ntu = 0;
}

@@ -942,6 +887,8 @@ bool iavf_alloc_rx_buffers(struct iavf_ring *rx_ring, u16 cleaned_count)
if (rx_ring->next_to_use != ntu)
iavf_release_rx_desc(rx_ring, ntu);

+ rx_ring->rx_stats.alloc_page_failed++;
+
/* make sure to come back via polling to try again after
* allocation failure
*/
@@ -1090,9 +1037,8 @@ static bool iavf_cleanup_headers(struct iavf_ring *rx_ring, struct sk_buff *skb)

/**
* iavf_add_rx_frag - Add contents of Rx buffer to sk_buff
- * @rx_ring: rx descriptor ring to transact packets on
- * @rx_buffer: buffer containing page to add
* @skb: sk_buff to place the data into
+ * @rx_buffer: buffer containing page to add
* @size: packet length from rx_desc
*
* This function will add the data contained in rx_buffer->page to the skb.
@@ -1100,105 +1046,49 @@ static bool iavf_cleanup_headers(struct iavf_ring *rx_ring, struct sk_buff *skb)
*
* The function will then update the page offset.
**/
-static void iavf_add_rx_frag(struct iavf_ring *rx_ring,
- struct iavf_rx_buffer *rx_buffer,
- struct sk_buff *skb,
+static void iavf_add_rx_frag(struct sk_buff *skb,
+ const struct libeth_fqe *rx_buffer,
unsigned int size)
{
- unsigned int truesize = SKB_DATA_ALIGN(size + IAVF_SKB_PAD);
-
- if (!size)
- return;
+ u32 hr = rx_buffer->page->pp->p.offset;

skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page,
- rx_buffer->page_offset, size, truesize);
-}
-
-/**
- * iavf_get_rx_buffer - Fetch Rx buffer and synchronize data for use
- * @rx_ring: rx descriptor ring to transact packets on
- * @size: size of buffer to add to skb
- *
- * This function will pull an Rx buffer from the ring and synchronize it
- * for use by the CPU.
- */
-static struct iavf_rx_buffer *iavf_get_rx_buffer(struct iavf_ring *rx_ring,
- const unsigned int size)
-{
- struct iavf_rx_buffer *rx_buffer;
-
- rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
- prefetchw(rx_buffer->page);
- if (!size)
- return rx_buffer;
-
- /* we are reusing so sync this buffer for CPU use */
- dma_sync_single_range_for_cpu(rx_ring->dev,
- rx_buffer->dma,
- rx_buffer->page_offset,
- size,
- DMA_FROM_DEVICE);
-
- return rx_buffer;
+ rx_buffer->offset + hr, size, rx_buffer->truesize);
}

/**
* iavf_build_skb - Build skb around an existing buffer
- * @rx_ring: Rx descriptor ring to transact packets on
* @rx_buffer: Rx buffer to pull data from
* @size: size of buffer to add to skb
*
* This function builds an skb around an existing Rx buffer, taking care
* to set up the skb correctly and avoid any memcpy overhead.
*/
-static struct sk_buff *iavf_build_skb(struct iavf_ring *rx_ring,
- struct iavf_rx_buffer *rx_buffer,
+static struct sk_buff *iavf_build_skb(const struct libeth_fqe *rx_buffer,
unsigned int size)
{
- void *va;
- unsigned int truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
- SKB_DATA_ALIGN(IAVF_SKB_PAD + size);
+ u32 hr = rx_buffer->page->pp->p.offset;
struct sk_buff *skb;
+ void *va;

- if (!rx_buffer || !size)
- return NULL;
/* prefetch first cache line of first page */
- va = page_address(rx_buffer->page) + rx_buffer->page_offset;
- net_prefetch(va);
+ va = page_address(rx_buffer->page) + rx_buffer->offset;
+ net_prefetch(va + hr);

/* build an skb around the page buffer */
- skb = napi_build_skb(va - IAVF_SKB_PAD, truesize);
+ skb = napi_build_skb(va, rx_buffer->truesize);
if (unlikely(!skb))
return NULL;

+ skb_mark_for_recycle(skb);
+
/* update pointers within the skb to store the data */
- skb_reserve(skb, IAVF_SKB_PAD);
+ skb_reserve(skb, hr);
__skb_put(skb, size);

return skb;
}

-/**
- * iavf_put_rx_buffer - Unmap used buffer
- * @rx_ring: rx descriptor ring to transact packets on
- * @rx_buffer: rx buffer to pull data from
- *
- * This function will unmap the buffer after it's written by HW.
- */
-static void iavf_put_rx_buffer(struct iavf_ring *rx_ring,
- struct iavf_rx_buffer *rx_buffer)
-{
- if (!rx_buffer)
- return;
-
- /* we are not reusing the buffer so unmap it */
- dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma, PAGE_SIZE,
- DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
-
- /* clear contents of buffer_info */
- rx_buffer->page = NULL;
-}
-
/**
* iavf_is_non_eop - process handling of non-EOP buffers
* @rx_ring: Rx ring being processed
@@ -1252,7 +1142,7 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
bool failure = false;

while (likely(total_rx_packets < (unsigned int)budget)) {
- struct iavf_rx_buffer *rx_buffer;
+ struct libeth_fqe *rx_buffer;
union iavf_rx_desc *rx_desc;
unsigned int size;
u16 vlan_tag = 0;
@@ -1287,13 +1177,16 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
size = FIELD_GET(IAVF_RXD_QW1_LENGTH_PBUF_MASK, qword);

iavf_trace(clean_rx_irq, rx_ring, rx_desc, skb);
- rx_buffer = iavf_get_rx_buffer(rx_ring, size);
+
+ rx_buffer = &rx_ring->rx_fqes[rx_ring->next_to_clean];
+ if (!libeth_rx_sync_for_cpu(rx_buffer, size))
+ goto skip_data;

/* retrieve a buffer from the ring */
if (skb)
- iavf_add_rx_frag(rx_ring, rx_buffer, skb, size);
+ iavf_add_rx_frag(skb, rx_buffer, size);
else
- skb = iavf_build_skb(rx_ring, rx_buffer, size);
+ skb = iavf_build_skb(rx_buffer, size);

/* exit if we failed to retrieve a buffer */
if (!skb) {
@@ -1301,10 +1194,10 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
break;
}

- iavf_put_rx_buffer(rx_ring, rx_buffer);
+skip_data:
cleaned_count++;

- if (iavf_is_non_eop(rx_ring, rx_desc, skb))
+ if (iavf_is_non_eop(rx_ring, rx_desc, skb) || unlikely(!skb))
continue;

/* ERR_MASK will only have valid bits if EOP set, and
diff --git a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
index f8e9f859a4f1..1e543f6a7c30 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
@@ -1,6 +1,8 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright(c) 2013 - 2018 Intel Corporation. */

+#include <linux/net/intel/libie/rx.h>
+
#include "iavf.h"
#include "iavf_prototype.h"

@@ -268,13 +270,13 @@ int iavf_get_vf_vlan_v2_caps(struct iavf_adapter *adapter)
void iavf_configure_queues(struct iavf_adapter *adapter)
{
struct virtchnl_vsi_queue_config_info *vqci;
- int i, max_frame = adapter->vf_res->max_mtu;
int pairs = adapter->num_active_queues;
struct virtchnl_queue_pair_info *vqpi;
+ u32 i, max_frame;
size_t len;

- if (max_frame > IAVF_MAX_RXBUFFER || !max_frame)
- max_frame = IAVF_MAX_RXBUFFER;
+ max_frame = LIBIE_MAX_RX_FRM_LEN(adapter->rx_rings->pp->p.offset);
+ max_frame = min_not_zero(adapter->vf_res->max_mtu, max_frame);

if (adapter->current_op != VIRTCHNL_OP_UNKNOWN) {
/* bail because we already have a command pending */
@@ -304,7 +306,7 @@ void iavf_configure_queues(struct iavf_adapter *adapter)
vqpi->rxq.ring_len = adapter->rx_rings[i].count;
vqpi->rxq.dma_ring_addr = adapter->rx_rings[i].dma;
vqpi->rxq.max_pkt_size = max_frame;
- vqpi->rxq.databuffer_size = IAVF_RXBUFFER_3072;
+ vqpi->rxq.databuffer_size = adapter->rx_rings[i].rx_buf_len;
if (CRC_OFFLOAD_ALLOWED(adapter))
vqpi->rxq.crc_disable = !!(adapter->netdev->features &
NETIF_F_RXFCS);
--
2.44.0


2024-04-04 16:01:34

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH net-next v9 7/9] libeth: add Rx buffer management

Add a couple intuitive helpers to hide Rx buffer implementation details
in the library and not multiplicate it between drivers. The settings are
sorta optimized for 100G+ NICs, but nothing really HW-specific here.
Use the new page_pool_dev_alloc() to dynamically switch between
split-page and full-page modes depending on MTU, page size, required
headroom etc. For example, on x86_64 with the default driver settings
each page is shared between 2 buffers. Turning on XDP (not in this
series) -> increasing headroom requirement pushes truesize out of 2048
boundary, leading to that each buffer starts getting a full page.
The "ceiling" limit is %PAGE_SIZE, as only order-0 pages are used to
avoid compound overhead. For the above architecture, this means maximum
linear frame size of 3712 w/o XDP.
Not that &libeth_buf_queue is not a complete queue/ring structure for
now, rather a shim, but eventually the libeth-enabled drivers will move
to it, with iavf being the first one.

Signed-off-by: Alexander Lobakin <[email protected]>
---
drivers/net/ethernet/intel/libeth/Kconfig | 1 +
include/net/libeth/rx.h | 117 ++++++++++++++++++++++
drivers/net/ethernet/intel/libeth/rx.c | 98 ++++++++++++++++++
3 files changed, 216 insertions(+)

diff --git a/drivers/net/ethernet/intel/libeth/Kconfig b/drivers/net/ethernet/intel/libeth/Kconfig
index af970a63c227..480293b71dbc 100644
--- a/drivers/net/ethernet/intel/libeth/Kconfig
+++ b/drivers/net/ethernet/intel/libeth/Kconfig
@@ -3,6 +3,7 @@

config LIBETH
tristate
+ select PAGE_POOL
help
libeth is a common library containing routines shared between several
drivers, but not yet promoted to the generic kernel API.
diff --git a/include/net/libeth/rx.h b/include/net/libeth/rx.h
index aaf9c2cdf7fd..3db2bda4eab6 100644
--- a/include/net/libeth/rx.h
+++ b/include/net/libeth/rx.h
@@ -4,8 +4,125 @@
#ifndef __LIBETH_RX_H
#define __LIBETH_RX_H

+#include <linux/if_vlan.h>
+
+#include <net/page_pool/helpers.h>
#include <net/xdp.h>

+/* Rx buffer management */
+
+/* Space reserved in front of each frame */
+#define LIBETH_SKB_HEADROOM (NET_SKB_PAD + NET_IP_ALIGN)
+/* Maximum headroom for worst-case calculations */
+#define LIBETH_MAX_HEADROOM LIBETH_SKB_HEADROOM
+/* Link layer / L2 overhead: Ethernet, 2 VLAN tags (C + S), FCS */
+#define LIBETH_RX_LL_LEN (ETH_HLEN + 2 * VLAN_HLEN + ETH_FCS_LEN)
+
+/* Always use order-0 pages */
+#define LIBETH_RX_PAGE_ORDER 0
+/* Pick a sane buffer stride and align to a cacheline boundary */
+#define LIBETH_RX_BUF_STRIDE SKB_DATA_ALIGN(128)
+/* HW-writeable space in one buffer: truesize - headroom/tailroom, aligned */
+#define LIBETH_RX_PAGE_LEN(hr) \
+ ALIGN_DOWN(SKB_MAX_ORDER(hr, LIBETH_RX_PAGE_ORDER), \
+ LIBETH_RX_BUF_STRIDE)
+
+/**
+ * struct libeth_fqe - structure representing an Rx buffer
+ * @page: page holding the buffer
+ * @offset: offset from the page start (to the headroom)
+ * @truesize: total space occupied by the buffer (w/ headroom and tailroom)
+ *
+ * Depending on the MTU, API switches between one-page-per-frame and shared
+ * page model (to conserve memory on bigger-page platforms). In case of the
+ * former, @offset is always 0 and @truesize is always ```PAGE_SIZE```.
+ */
+struct libeth_fqe {
+ struct page *page;
+ u32 offset;
+ u32 truesize;
+} __aligned_largest;
+
+/**
+ * struct libeth_fq - structure representing a buffer queue
+ * @fp: hotpath part of the structure
+ * @pp: &page_pool for buffer management
+ * @fqes: array of Rx buffers
+ * @truesize: size to allocate per buffer, w/overhead
+ * @count: number of descriptors/buffers the queue has
+ * @buf_len: HW-writeable length per each buffer
+ * @nid: ID of the closest NUMA node with memory
+ */
+struct libeth_fq {
+ struct_group_tagged(libeth_fq_fp, fp,
+ struct page_pool *pp;
+ struct libeth_fqe *fqes;
+
+ u32 truesize;
+ u32 count;
+ );
+
+ /* Cold fields */
+ u32 buf_len;
+ int nid;
+};
+
+int libeth_rx_fq_create(struct libeth_fq *fq, struct napi_struct *napi);
+void libeth_rx_fq_destroy(struct libeth_fq *fq);
+
+/**
+ * libeth_rx_alloc - allocate a new Rx buffer
+ * @fq: buffer queue to allocate for
+ * @i: index of the buffer within the queue
+ *
+ * Return: DMA address to be passed to HW for Rx on successful allocation,
+ * ```DMA_MAPPING_ERROR``` otherwise.
+ */
+static inline dma_addr_t libeth_rx_alloc(const struct libeth_fq_fp *fq, u32 i)
+{
+ struct libeth_fqe *buf = &fq->fqes[i];
+
+ buf->truesize = fq->truesize;
+ buf->page = page_pool_dev_alloc(fq->pp, &buf->offset, &buf->truesize);
+ if (unlikely(!buf->page))
+ return DMA_MAPPING_ERROR;
+
+ return page_pool_get_dma_addr(buf->page) + buf->offset +
+ fq->pp->p.offset;
+}
+
+void libeth_rx_recycle_slow(struct page *page);
+
+/**
+ * libeth_rx_sync_for_cpu - synchronize or recycle buffer post DMA
+ * @fqe: buffer to process
+ * @len: frame length from the descriptor
+ *
+ * Process the buffer after it's written by HW. The regular path is to
+ * synchronize DMA for CPU, but in case of no data it will be immediately
+ * recycled back to its PP.
+ *
+ * Return: true when there's data to process, false otherwise.
+ */
+static inline bool libeth_rx_sync_for_cpu(const struct libeth_fqe *fqe,
+ u32 len)
+{
+ struct page *page = fqe->page;
+
+ /* Very rare, but possible case. The most common reason:
+ * the last fragment contained FCS only, which was then
+ * stripped by the HW.
+ */
+ if (unlikely(!len)) {
+ libeth_rx_recycle_slow(page);
+ return false;
+ }
+
+ page_pool_dma_sync_for_cpu(page->pp, page, fqe->offset, len);
+
+ return true;
+}
+
/* Converting abstract packet type numbers into a software structure with
* the packet parameters to do O(1) lookup on Rx.
*/
diff --git a/drivers/net/ethernet/intel/libeth/rx.c b/drivers/net/ethernet/intel/libeth/rx.c
index 86f17e29b47d..a557a1ebcbe5 100644
--- a/drivers/net/ethernet/intel/libeth/rx.c
+++ b/drivers/net/ethernet/intel/libeth/rx.c
@@ -3,6 +3,104 @@

#include <net/libeth/rx.h>

+/* Rx buffer management */
+
+/**
+ * libeth_rx_hw_len - get the actual buffer size to be passed to HW
+ * @pp: &page_pool_params of the netdev to calculate the size for
+ * @max_len: maximum buffer size for a single descriptor
+ *
+ * Return: HW-writeable length per one buffer to pass it to the HW accounting:
+ * MTU the @dev has, HW required alignment, minimum and maximum allowed values,
+ * and system's page size.
+ */
+static u32 libeth_rx_hw_len(const struct page_pool_params *pp, u32 max_len)
+{
+ u32 len;
+
+ len = READ_ONCE(pp->netdev->mtu) + LIBETH_RX_LL_LEN;
+ len = ALIGN(len, LIBETH_RX_BUF_STRIDE);
+ len = min3(len, ALIGN_DOWN(max_len ? : U32_MAX, LIBETH_RX_BUF_STRIDE),
+ pp->max_len);
+
+ return len;
+}
+
+/**
+ * libeth_rx_fq_create - create a PP with the default libeth settings
+ * @fq: buffer queue struct to fill
+ * @napi: &napi_struct covering this PP (no usage outside its poll loops)
+ *
+ * Return: %0 on success, -%errno on failure.
+ */
+int libeth_rx_fq_create(struct libeth_fq *fq, struct napi_struct *napi)
+{
+ struct page_pool_params pp = {
+ .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
+ .order = LIBETH_RX_PAGE_ORDER,
+ .pool_size = fq->count,
+ .nid = fq->nid,
+ .dev = napi->dev->dev.parent,
+ .netdev = napi->dev,
+ .napi = napi,
+ .dma_dir = DMA_FROM_DEVICE,
+ .offset = LIBETH_SKB_HEADROOM,
+ };
+ struct libeth_fqe *fqes;
+ struct page_pool *pool;
+
+ /* HW-writeable / syncable length per one page */
+ pp.max_len = LIBETH_RX_PAGE_LEN(pp.offset);
+
+ /* HW-writeable length per buffer */
+ fq->buf_len = libeth_rx_hw_len(&pp, fq->buf_len);
+ /* Buffer size to allocate */
+ fq->truesize = roundup_pow_of_two(SKB_HEAD_ALIGN(pp.offset +
+ fq->buf_len));
+
+ pool = page_pool_create(&pp);
+ if (IS_ERR(pool))
+ return PTR_ERR(pool);
+
+ fqes = kvcalloc_node(fq->count, sizeof(*fqes), GFP_KERNEL, fq->nid);
+ if (!fqes)
+ goto err_buf;
+
+ fq->fqes = fqes;
+ fq->pp = pool;
+
+ return 0;
+
+err_buf:
+ page_pool_destroy(pool);
+
+ return -ENOMEM;
+}
+EXPORT_SYMBOL_NS_GPL(libeth_rx_fq_create, LIBETH);
+
+/**
+ * libeth_rx_fq_destroy - destroy a &page_pool created by libeth
+ * @fq: buffer queue to process
+ */
+void libeth_rx_fq_destroy(struct libeth_fq *fq)
+{
+ kvfree(fq->fqes);
+ page_pool_destroy(fq->pp);
+}
+EXPORT_SYMBOL_NS_GPL(libeth_rx_fq_destroy, LIBETH);
+
+/**
+ * libeth_rx_recycle_slow - recycle a libeth page from the NAPI context
+ * @page: page to recycle
+ *
+ * To be used on exceptions or rare cases not requiring fast inline recycling.
+ */
+void libeth_rx_recycle_slow(struct page *page)
+{
+ page_pool_recycle_direct(page->pp, page);
+}
+EXPORT_SYMBOL_NS_GPL(libeth_rx_recycle_slow, LIBETH);
+
/* Converting abstract packet type numbers into a software structure with
* the packet parameters to do O(1) lookup on Rx.
*/
--
2.44.0


2024-04-04 16:01:52

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH net-next v9 5/9] page_pool: constify some read-only function arguments

There are several functions taking pointers to data they don't modify.
This includes statistics fetching, page and page_pool parameters, etc.
Constify the pointers, so that call sites will be able to pass const
pointers as well.
No functional changes, no visible changes in functions sizes.

Reviewed-by: Ilias Apalodimas <[email protected]>
Signed-off-by: Alexander Lobakin <[email protected]>
---
include/net/page_pool/types.h | 4 ++--
include/net/page_pool/helpers.h | 10 +++++-----
net/core/page_pool.c | 10 +++++-----
3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 5e43a08d3231..a6ebed002216 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -213,7 +213,7 @@ struct xdp_mem_info;
#ifdef CONFIG_PAGE_POOL
void page_pool_destroy(struct page_pool *pool);
void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *),
- struct xdp_mem_info *mem);
+ const struct xdp_mem_info *mem);
void page_pool_put_page_bulk(struct page_pool *pool, void **data,
int count);
#else
@@ -223,7 +223,7 @@ static inline void page_pool_destroy(struct page_pool *pool)

static inline void page_pool_use_xdp_mem(struct page_pool *pool,
void (*disconnect)(void *),
- struct xdp_mem_info *mem)
+ const struct xdp_mem_info *mem)
{
}

diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 1d397c1a0043..c7bb06750e85 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -58,7 +58,7 @@
/* Deprecated driver-facing API, use netlink instead */
int page_pool_ethtool_stats_get_count(void);
u8 *page_pool_ethtool_stats_get_strings(u8 *data);
-u64 *page_pool_ethtool_stats_get(u64 *data, void *stats);
+u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats);

bool page_pool_get_stats(const struct page_pool *pool,
struct page_pool_stats *stats);
@@ -73,7 +73,7 @@ static inline u8 *page_pool_ethtool_stats_get_strings(u8 *data)
return data;
}

-static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
+static inline u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats)
{
return data;
}
@@ -204,8 +204,8 @@ static inline void *page_pool_dev_alloc_va(struct page_pool *pool,
* Get the stored dma direction. A driver might decide to store this locally
* and avoid the extra cache line from page_pool to determine the direction.
*/
-static
-inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
+static inline enum dma_data_direction
+page_pool_get_dma_dir(const struct page_pool *pool)
{
return pool->p.dma_dir;
}
@@ -370,7 +370,7 @@ static inline void page_pool_free_va(struct page_pool *pool, void *va,
* Fetch the DMA address of the page. The page pool to which the page belongs
* must had been created with PP_FLAG_DMA_MAP.
*/
-static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
+static inline dma_addr_t page_pool_get_dma_addr(const struct page *page)
{
dma_addr_t ret = page->dma_addr;

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 4c175091fc0a..273c24429bce 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -123,9 +123,9 @@ int page_pool_ethtool_stats_get_count(void)
}
EXPORT_SYMBOL(page_pool_ethtool_stats_get_count);

-u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
+u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats)
{
- struct page_pool_stats *pool_stats = stats;
+ const struct page_pool_stats *pool_stats = stats;

*data++ = pool_stats->alloc_stats.fast;
*data++ = pool_stats->alloc_stats.slow;
@@ -383,8 +383,8 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
return page;
}

-static void page_pool_dma_sync_for_device(struct page_pool *pool,
- struct page *page,
+static void page_pool_dma_sync_for_device(const struct page_pool *pool,
+ const struct page *page,
unsigned int dma_sync_size)
{
dma_addr_t dma_addr = page_pool_get_dma_addr(page);
@@ -987,7 +987,7 @@ static void page_pool_release_retry(struct work_struct *wq)
}

void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *),
- struct xdp_mem_info *mem)
+ const struct xdp_mem_info *mem)
{
refcount_inc(&pool->user_cnt);
pool->disconnect = disconnect;
--
2.44.0


2024-04-04 16:14:26

by Alexander Lobakin

[permalink] [raw]
Subject: [PATCH net-next v9 6/9] page_pool: add DMA-sync-for-CPU inline helper

Each driver is responsible for syncing buffers written by HW for CPU
before accessing them. Almost each PP-enabled driver uses the same
pattern, which could be shorthanded into a static inline to make driver
code a little bit more compact.
Introduce a simple helper which performs DMA synchronization for the
size passed from the driver. It can be used even when the pool doesn't
manage DMA-syncs-for-device, just make sure the page has a correct DMA
address set via page_pool_set_dma_addr().

Signed-off-by: Alexander Lobakin <[email protected]>
---
include/net/page_pool/helpers.h | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index c7bb06750e85..873631c79ab1 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -52,6 +52,8 @@
#ifndef _NET_PAGE_POOL_HELPERS_H
#define _NET_PAGE_POOL_HELPERS_H

+#include <linux/dma-mapping.h>
+
#include <net/page_pool/types.h>

#ifdef CONFIG_PAGE_POOL_STATS
@@ -395,6 +397,28 @@ static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
return false;
}

+/**
+ * page_pool_dma_sync_for_cpu - sync Rx page for CPU after it's written by HW
+ * @pool: &page_pool the @page belongs to
+ * @page: page to sync
+ * @offset: offset from page start to "hard" start if using PP frags
+ * @dma_sync_size: size of the data written to the page
+ *
+ * Can be used as a shorthand to sync Rx pages before accessing them in the
+ * driver. Caller must ensure the pool was created with ``PP_FLAG_DMA_MAP``.
+ * Note that this version performs DMA sync unconditionally, even if the
+ * associated PP doesn't perform sync-for-device.
+ */
+static inline void page_pool_dma_sync_for_cpu(const struct page_pool *pool,
+ const struct page *page,
+ u32 offset, u32 dma_sync_size)
+{
+ dma_sync_single_range_for_cpu(pool->p.dev,
+ page_pool_get_dma_addr(page),
+ offset + pool->p.offset, dma_sync_size,
+ page_pool_get_dma_dir(pool));
+}
+
static inline bool page_pool_put(struct page_pool *pool)
{
return refcount_dec_and_test(&pool->user_cnt);
--
2.44.0


2024-04-05 10:13:12

by Przemek Kitszel

[permalink] [raw]
Subject: Re: [PATCH net-next v9 4/9] slab: introduce kvmalloc_array_node() and kvcalloc_node()

On 4/4/24 17:43, Alexander Lobakin wrote:
> Add NUMA-aware counterparts for kvmalloc_array() and kvcalloc() to be
> able to flexibly allocate arrays for a particular node.
> Rewrite kvmalloc_array() to kvmalloc_array_node(NUMA_NO_NODE) call.
>
> Signed-off-by: Alexander Lobakin <[email protected]>
> ---
> include/linux/slab.h | 17 +++++++++++++++--
> 1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index e53cbfa18325..d1d1fa5e7983 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -774,14 +774,27 @@ static inline __alloc_size(1) void *kvzalloc(size_t size, gfp_t flags)
> return kvmalloc(size, flags | __GFP_ZERO);
> }
>
> -static inline __alloc_size(1, 2) void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
> +static inline __alloc_size(1, 2) void *
> +kvmalloc_array_node(size_t n, size_t size, gfp_t flags, int node)
> {
> size_t bytes;
>
> if (unlikely(check_mul_overflow(n, size, &bytes)))
> return NULL;
>
> - return kvmalloc(bytes, flags);
> + return kvmalloc_node(bytes, flags, node);
> +}
> +
> +static inline __alloc_size(1, 2) void *
> +kvmalloc_array(size_t n, size_t size, gfp_t flags)
> +{
> + return kvmalloc_array_node(n, size, flags, NUMA_NO_NODE);
> +}
> +
> +static inline __alloc_size(1, 2) void *
> +kvcalloc_node(size_t n, size_t size, gfp_t flags, int node)
> +{
> + return kvmalloc_array_node(n, size, flags | __GFP_ZERO, node);
> }
>
> static inline __alloc_size(1, 2) void *kvcalloc(size_t n, size_t size, gfp_t flags)

Reviewed-by: Przemek Kitszel <[email protected]>

2024-04-05 10:16:27

by Przemek Kitszel

[permalink] [raw]
Subject: Re: [PATCH net-next v9 2/9] iavf: kill "legacy-rx" for good

On 4/4/24 17:43, Alexander Lobakin wrote:
> Ever since build_skb() became stable, the old way with allocating an skb
> for storing the headers separately, which will be then copied manually,
> was slower, less flexible, and thus obsolete.
>
> * It had higher pressure on MM since it actually allocates new pages,
> which then get split and refcount-biased (NAPI page cache);
> * It implies memcpy() of packet headers (40+ bytes per each frame);
> * the actual header length was calculated via eth_get_headlen(), which
> invokes Flow Dissector and thus wastes a bunch of CPU cycles;
> * XDP makes it even more weird since it requires headroom for long and
> also tailroom for some time (since mbuf landed). Take a look at the
> ice driver, which is built around work-arounds to make XDP work with
> it.
>
> Even on some quite low-end hardware (not a common case for 100G NICs) it
> was performing worse.
> The only advantage "legacy-rx" had is that it didn't require any
> reserved headroom and tailroom. But iavf didn't use this, as it always
> splits pages into two halves of 2k, while that save would only be useful
> when striding. And again, XDP effectively removes that sole pro.
>
> There's a train of features to land in IAVF soon: Page Pool, XDP, XSk,
> multi-buffer etc. Each new would require adding more and more Danse
> Macabre for absolutely no reason, besides making hotpath less and less
> effective.
> Remove the "feature" with all the related code. This includes at least
> one very hot branch (typically hit on each new frame), which was either
> always-true or always-false at least for a complete NAPI bulk of 64
> frames, the whole private flags cruft, and so on. Some stats:
>
> Function: add/remove: 0/4 grow/shrink: 0/7 up/down: 0/-721 (-721)
> RO Data: add/remove: 0/1 grow/shrink: 0/0 up/down: 0/-40 (-40)
>
> Reviewed-by: Alexander Duyck <[email protected]>
> Signed-off-by: Alexander Lobakin <[email protected]>
> ---
> drivers/net/ethernet/intel/iavf/iavf.h | 2 +-
> drivers/net/ethernet/intel/iavf/iavf_txrx.h | 27 +---
> .../net/ethernet/intel/iavf/iavf_ethtool.c | 140 ------------------
> drivers/net/ethernet/intel/iavf/iavf_main.c | 10 +-
> drivers/net/ethernet/intel/iavf/iavf_txrx.c | 82 +---------
> .../net/ethernet/intel/iavf/iavf_virtchnl.c | 3 +-
> 6 files changed, 8 insertions(+), 256 deletions(-)
>
Awesome!
Reviewed-by: Przemek Kitszel <[email protected]>

2024-04-05 10:38:58

by Przemek Kitszel

[permalink] [raw]
Subject: Re: [PATCH net-next v9 7/9] libeth: add Rx buffer management

On 4/4/24 17:44, Alexander Lobakin wrote:
> Add a couple intuitive helpers to hide Rx buffer implementation details
> in the library and not multiplicate it between drivers. The settings are
> sorta optimized for 100G+ NICs, but nothing really HW-specific here.
> Use the new page_pool_dev_alloc() to dynamically switch between
> split-page and full-page modes depending on MTU, page size, required
> headroom etc. For example, on x86_64 with the default driver settings
> each page is shared between 2 buffers. Turning on XDP (not in this
> series) -> increasing headroom requirement pushes truesize out of 2048
> boundary, leading to that each buffer starts getting a full page.
> The "ceiling" limit is %PAGE_SIZE, as only order-0 pages are used to
> avoid compound overhead. For the above architecture, this means maximum
> linear frame size of 3712 w/o XDP.
> Not that &libeth_buf_queue is not a complete queue/ring structure for
> now, rather a shim, but eventually the libeth-enabled drivers will move
> to it, with iavf being the first one.
>
> Signed-off-by: Alexander Lobakin <[email protected]>
> ---
> drivers/net/ethernet/intel/libeth/Kconfig | 1 +
> include/net/libeth/rx.h | 117 ++++++++++++++++++++++
> drivers/net/ethernet/intel/libeth/rx.c | 98 ++++++++++++++++++
> 3 files changed, 216 insertions(+)
>
[...]

> +/**
> + * struct libeth_fqe - structure representing an Rx buffer
> + * @page: page holding the buffer
> + * @offset: offset from the page start (to the headroom)
> + * @truesize: total space occupied by the buffer (w/ headroom and tailroom)
> + *
> + * Depending on the MTU, API switches between one-page-per-frame and shared
> + * page model (to conserve memory on bigger-page platforms). In case of the
> + * former, @offset is always 0 and @truesize is always ```PAGE_SIZE```.
> + */
> +struct libeth_fqe {
> + struct page *page;
> + u32 offset;
> + u32 truesize;
> +} __aligned_largest;
> +
> +/**
> + * struct libeth_fq - structure representing a buffer queue
> + * @fp: hotpath part of the structure
> + * @pp: &page_pool for buffer management
> + * @fqes: array of Rx buffers
> + * @truesize: size to allocate per buffer, w/overhead
> + * @count: number of descriptors/buffers the queue has
> + * @buf_len: HW-writeable length per each buffer
> + * @nid: ID of the closest NUMA node with memory
> + */
> +struct libeth_fq {
> + struct_group_tagged(libeth_fq_fp, fp,
> + struct page_pool *pp;
> + struct libeth_fqe *fqes;
> +
> + u32 truesize;
> + u32 count;
> + );
> +
> + /* Cold fields */
> + u32 buf_len;
> + int nid;
> +};

[...]

Could you please unpack the meaning of `fq` and `fqe` acronyms here?

otherwise the whole series is very good for me, thank you very much!


2024-04-05 10:44:42

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH net-next v9 4/9] slab: introduce kvmalloc_array_node() and kvcalloc_node()

On 4/4/24 5:43 PM, Alexander Lobakin wrote:
> Add NUMA-aware counterparts for kvmalloc_array() and kvcalloc() to be
> able to flexibly allocate arrays for a particular node.
> Rewrite kvmalloc_array() to kvmalloc_array_node(NUMA_NO_NODE) call.
>
> Signed-off-by: Alexander Lobakin <[email protected]>

Acked-by: Vlastimil Babka <vbabka@suse>

This will however cause some conflicts with alloc tagging series with mm
tree in next and the new wrappers will have to be adjusted.

> ---
> include/linux/slab.h | 17 +++++++++++++++--
> 1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index e53cbfa18325..d1d1fa5e7983 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -774,14 +774,27 @@ static inline __alloc_size(1) void *kvzalloc(size_t size, gfp_t flags)
> return kvmalloc(size, flags | __GFP_ZERO);
> }
>
> -static inline __alloc_size(1, 2) void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
> +static inline __alloc_size(1, 2) void *
> +kvmalloc_array_node(size_t n, size_t size, gfp_t flags, int node)
> {
> size_t bytes;
>
> if (unlikely(check_mul_overflow(n, size, &bytes)))
> return NULL;
>
> - return kvmalloc(bytes, flags);
> + return kvmalloc_node(bytes, flags, node);
> +}
> +
> +static inline __alloc_size(1, 2) void *
> +kvmalloc_array(size_t n, size_t size, gfp_t flags)
> +{
> + return kvmalloc_array_node(n, size, flags, NUMA_NO_NODE);
> +}
> +
> +static inline __alloc_size(1, 2) void *
> +kvcalloc_node(size_t n, size_t size, gfp_t flags, int node)
> +{
> + return kvmalloc_array_node(n, size, flags | __GFP_ZERO, node);
> }
>
> static inline __alloc_size(1, 2) void *kvcalloc(size_t n, size_t size, gfp_t flags)


2024-04-06 04:25:25

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next v9 7/9] libeth: add Rx buffer management

On Thu, 4 Apr 2024 17:44:00 +0200 Alexander Lobakin wrote:
> +/**
> + * struct libeth_fq - structure representing a buffer queue
> + * @fp: hotpath part of the structure

Second time this happens this week, so maybe some tooling change in 6.9
but apparently kdoc does not want to know about the tagged struct:

include/net/libeth/rx.h:69: warning: Excess struct member 'fp' description in 'libeth_fq'

> + * @pp: &page_pool for buffer management
> + * @fqes: array of Rx buffers
> + * @truesize: size to allocate per buffer, w/overhead
> + * @count: number of descriptors/buffers the queue has
> + * @buf_len: HW-writeable length per each buffer
> + * @nid: ID of the closest NUMA node with memory
> + */
> +struct libeth_fq {
> + struct_group_tagged(libeth_fq_fp, fp,
> + struct page_pool *pp;
> + struct libeth_fqe *fqes;
> +
> + u32 truesize;
> + u32 count;
> + );
> +
> + /* Cold fields */
> + u32 buf_len;
> + int nid;
> +};

2024-04-08 09:12:57

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next v9 7/9] libeth: add Rx buffer management

From: Przemek Kitszel <[email protected]>
Date: Fri, 5 Apr 2024 12:32:55 +0200

> On 4/4/24 17:44, Alexander Lobakin wrote:
>> Add a couple intuitive helpers to hide Rx buffer implementation details

[...]

>> +struct libeth_fqe {
>> +    struct page        *page;
>> +    u32            offset;
>> +    u32            truesize;
>> +} __aligned_largest;
>> +
>> +/**
>> + * struct libeth_fq - structure representing a buffer queue
>> + * @fp: hotpath part of the structure
>> + * @pp: &page_pool for buffer management
>> + * @fqes: array of Rx buffers
>> + * @truesize: size to allocate per buffer, w/overhead
>> + * @count: number of descriptors/buffers the queue has
>> + * @buf_len: HW-writeable length per each buffer
>> + * @nid: ID of the closest NUMA node with memory
>> + */
>> +struct libeth_fq {
>> +    struct_group_tagged(libeth_fq_fp, fp,
>> +        struct page_pool    *pp;
>> +        struct libeth_fqe    *fqes;
>> +
>> +        u32            truesize;
>> +        u32            count;
>> +    );
>> +
>> +    /* Cold fields */
>> +    u32            buf_len;
>> +    int            nid;
>> +};
>
> [...]
>
> Could you please unpack the meaning of `fq` and `fqe` acronyms here?

Rx:

RQ -- receive queue, on which you get Rx DMA complete descriptors
FQ -- fill queue, the one you fill with free buffers
FQE -- fill queue element, i.e. smth like "iavf_rx_buffer" or whatever

Tx:

SQ -- send queue, the one you fill with buffers to transmit
SQE -- send queue element, i.e. "iavf_tx_buffer"
CQ -- completion queue, on which you get Tx DMA complete descriptors

XDPSQ, XSkRQ etc. -- same as above, but for XDP / XSk

I know that rxq, txq, bufq, complq is more common since it's been used
for years, but I like these "new" ones more :>

>
> otherwise the whole series is very good for me, thank you very much!
>

Thanks,
Olek

2024-04-08 09:13:28

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next v9 7/9] libeth: add Rx buffer management

From: Jakub Kicinski <[email protected]>
Date: Fri, 5 Apr 2024 21:25:13 -0700

> On Thu, 4 Apr 2024 17:44:00 +0200 Alexander Lobakin wrote:
>> +/**
>> + * struct libeth_fq - structure representing a buffer queue
>> + * @fp: hotpath part of the structure
>
> Second time this happens this week, so maybe some tooling change in 6.9
> but apparently kdoc does not want to know about the tagged struct:
>
> include/net/libeth/rx.h:69: warning: Excess struct member 'fp' description in 'libeth_fq'

Oh no, maybe we should teach kdoc to parse struct_group*()?

>
>> + * @pp: &page_pool for buffer management
>> + * @fqes: array of Rx buffers
>> + * @truesize: size to allocate per buffer, w/overhead
>> + * @count: number of descriptors/buffers the queue has
>> + * @buf_len: HW-writeable length per each buffer
>> + * @nid: ID of the closest NUMA node with memory
>> + */
>> +struct libeth_fq {
>> + struct_group_tagged(libeth_fq_fp, fp,
>> + struct page_pool *pp;
>> + struct libeth_fqe *fqes;
>> +
>> + u32 truesize;
>> + u32 count;
>> + );
>> +
>> + /* Cold fields */
>> + u32 buf_len;
>> + int nid;
>> +};

Thanks,
Olek

2024-04-08 09:47:42

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next v9 7/9] libeth: add Rx buffer management

From: Alexander Lobakin <[email protected]>
Date: Mon, 8 Apr 2024 11:11:12 +0200

> From: Jakub Kicinski <[email protected]>
> Date: Fri, 5 Apr 2024 21:25:13 -0700
>
>> On Thu, 4 Apr 2024 17:44:00 +0200 Alexander Lobakin wrote:
>>> +/**
>>> + * struct libeth_fq - structure representing a buffer queue
>>> + * @fp: hotpath part of the structure
>>
>> Second time this happens this week, so maybe some tooling change in 6.9
>> but apparently kdoc does not want to know about the tagged struct:
>>
>> include/net/libeth/rx.h:69: warning: Excess struct member 'fp' description in 'libeth_fq'
>
> Oh no, maybe we should teach kdoc to parse struct_group*()?

scripts/kernel-doc apparently can handle them...

+ Kees

>
>>
>>> + * @pp: &page_pool for buffer management
>>> + * @fqes: array of Rx buffers
>>> + * @truesize: size to allocate per buffer, w/overhead
>>> + * @count: number of descriptors/buffers the queue has
>>> + * @buf_len: HW-writeable length per each buffer
>>> + * @nid: ID of the closest NUMA node with memory
>>> + */
>>> +struct libeth_fq {
>>> + struct_group_tagged(libeth_fq_fp, fp,
>>> + struct page_pool *pp;
>>> + struct libeth_fqe *fqes;
>>> +
>>> + u32 truesize;
>>> + u32 count;
>>> + );
>>> +
>>> + /* Cold fields */
>>> + u32 buf_len;
>>> + int nid;
>>> +};

Thanks,
Olek

2024-04-09 10:59:32

by Przemek Kitszel

[permalink] [raw]
Subject: Re: [PATCH net-next v9 7/9] libeth: add Rx buffer management

On 4/8/24 11:09, Alexander Lobakin wrote:
> From: Przemek Kitszel <[email protected]>
> Date: Fri, 5 Apr 2024 12:32:55 +0200
>
>> On 4/4/24 17:44, Alexander Lobakin wrote:
>>> Add a couple intuitive helpers to hide Rx buffer implementation details
>
> [...]
>
>>> +struct libeth_fqe {
>>> +    struct page        *page;
>>> +    u32            offset;
>>> +    u32            truesize;
>>> +} __aligned_largest;
>>> +
>>> +/**
>>> + * struct libeth_fq - structure representing a buffer queue
>>> + * @fp: hotpath part of the structure
>>> + * @pp: &page_pool for buffer management
>>> + * @fqes: array of Rx buffers
>>> + * @truesize: size to allocate per buffer, w/overhead
>>> + * @count: number of descriptors/buffers the queue has
>>> + * @buf_len: HW-writeable length per each buffer
>>> + * @nid: ID of the closest NUMA node with memory
>>> + */
>>> +struct libeth_fq {
>>> +    struct_group_tagged(libeth_fq_fp, fp,
>>> +        struct page_pool    *pp;
>>> +        struct libeth_fqe    *fqes;
>>> +
>>> +        u32            truesize;
>>> +        u32            count;
>>> +    );
>>> +
>>> +    /* Cold fields */
>>> +    u32            buf_len;
>>> +    int            nid;
>>> +};
>>
>> [...]
>>
>> Could you please unpack the meaning of `fq` and `fqe` acronyms here?
>
> Rx:
>
> RQ -- receive queue, on which you get Rx DMA complete descriptors
> FQ -- fill queue, the one you fill with free buffers
> FQE -- fill queue element, i.e. smth like "iavf_rx_buffer" or whatever
>
> Tx:
>
> SQ -- send queue, the one you fill with buffers to transmit
> SQE -- send queue element, i.e. "iavf_tx_buffer"
> CQ -- completion queue, on which you get Tx DMA complete descriptors
>
> XDPSQ, XSkRQ etc. -- same as above, but for XDP / XSk
>
> I know that rxq, txq, bufq, complq is more common since it's been used
> for years, but I like these "new" ones more :>
>

Thank you, that sounds right. If you happen to sent v10, a bit of code
comment with this info would be useful ;)

>>
>> otherwise the whole series is very good for me, thank you very much!
>>
>
> Thanks,
> Olek


2024-04-09 17:01:10

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH net-next v9 7/9] libeth: add Rx buffer management

On Mon, Apr 08, 2024 at 11:45:32AM +0200, Alexander Lobakin wrote:
> From: Alexander Lobakin <[email protected]>
> Date: Mon, 8 Apr 2024 11:11:12 +0200
>
> > From: Jakub Kicinski <[email protected]>
> > Date: Fri, 5 Apr 2024 21:25:13 -0700
> >
> >> On Thu, 4 Apr 2024 17:44:00 +0200 Alexander Lobakin wrote:
> >>> +/**
> >>> + * struct libeth_fq - structure representing a buffer queue
> >>> + * @fp: hotpath part of the structure
> >>
> >> Second time this happens this week, so maybe some tooling change in 6.9
> >> but apparently kdoc does not want to know about the tagged struct:
> >>
> >> include/net/libeth/rx.h:69: warning: Excess struct member 'fp' description in 'libeth_fq'
> >
> > Oh no, maybe we should teach kdoc to parse struct_group*()?
>
> scripts/kernel-doc apparently can handle them...
>
> + Kees
>

Ah, hm, scripts/kernel-doc throws away the early arguments of
struct_group_tagged, but I suspect it needs to create a synthetic member
for the tag. i.e. instead of:

struct_group_tagged(tag, name, members...)

becoming

members...

it needs to become

struct tag name;
members...

It seems this is the first place anyone has tried to document the tagged
struct name! :)

Does this work? I haven't tested it...

diff --git a/scripts/kernel-doc b/scripts/kernel-doc
index 967f1abb0edb..64a19228d5dd 100755
--- a/scripts/kernel-doc
+++ b/scripts/kernel-doc
@@ -1151,7 +1151,8 @@ sub dump_struct($$) {
# - first eat non-declaration parameters and rewrite for final match
# - then remove macro, outer parens, and trailing semicolon
$members =~ s/\bstruct_group\s*\(([^,]*,)/STRUCT_GROUP(/gos;
- $members =~ s/\bstruct_group_(attr|tagged)\s*\(([^,]*,){2}/STRUCT_GROUP(/gos;
+ $members =~ s/\bstruct_group_attr\s*\(([^,]*,){2}/STRUCT_GROUP(/gos;
+ $members =~ s/\bstruct_group_tagged\s*\(([^,]*,)([^,]*,)/struct $1 $2; STRUCT_GROUP(/gos;
$members =~ s/\b__struct_group\s*\(([^,]*,){3}/STRUCT_GROUP(/gos;
$members =~ s/\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;/$2/gos;


> >
> >>
> >>> + * @pp: &page_pool for buffer management
> >>> + * @fqes: array of Rx buffers
> >>> + * @truesize: size to allocate per buffer, w/overhead
> >>> + * @count: number of descriptors/buffers the queue has
> >>> + * @buf_len: HW-writeable length per each buffer
> >>> + * @nid: ID of the closest NUMA node with memory
> >>> + */
> >>> +struct libeth_fq {
> >>> + struct_group_tagged(libeth_fq_fp, fp,
> >>> + struct page_pool *pp;
> >>> + struct libeth_fqe *fqes;
> >>> +
> >>> + u32 truesize;
> >>> + u32 count;
> >>> + );
> >>> +
> >>> + /* Cold fields */
> >>> + u32 buf_len;
> >>> + int nid;
> >>> +};
>
> Thanks,
> Olek

--
Kees Cook

2024-04-10 11:54:23

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next v9 7/9] libeth: add Rx buffer management

From: Przemek Kitszel <[email protected]>
Date: Tue, 9 Apr 2024 12:58:33 +0200

> On 4/8/24 11:09, Alexander Lobakin wrote:
>> From: Przemek Kitszel <[email protected]>
>> Date: Fri, 5 Apr 2024 12:32:55 +0200
>>
>>> On 4/4/24 17:44, Alexander Lobakin wrote:
>>>> Add a couple intuitive helpers to hide Rx buffer implementation details
>>
>> [...]
>>
>>>> +struct libeth_fqe {
>>>> +    struct page        *page;
>>>> +    u32            offset;
>>>> +    u32            truesize;
>>>> +} __aligned_largest;
>>>> +
>>>> +/**
>>>> + * struct libeth_fq - structure representing a buffer queue
>>>> + * @fp: hotpath part of the structure
>>>> + * @pp: &page_pool for buffer management
>>>> + * @fqes: array of Rx buffers
>>>> + * @truesize: size to allocate per buffer, w/overhead
>>>> + * @count: number of descriptors/buffers the queue has
>>>> + * @buf_len: HW-writeable length per each buffer
>>>> + * @nid: ID of the closest NUMA node with memory
>>>> + */
>>>> +struct libeth_fq {
>>>> +    struct_group_tagged(libeth_fq_fp, fp,
>>>> +        struct page_pool    *pp;
>>>> +        struct libeth_fqe    *fqes;
>>>> +
>>>> +        u32            truesize;
>>>> +        u32            count;
>>>> +    );
>>>> +
>>>> +    /* Cold fields */
>>>> +    u32            buf_len;
>>>> +    int            nid;
>>>> +};
>>>
>>> [...]
>>>
>>> Could you please unpack the meaning of `fq` and `fqe` acronyms here?
>>
>> Rx:
>>
>> RQ -- receive queue, on which you get Rx DMA complete descriptors
>> FQ -- fill queue, the one you fill with free buffers
>>    FQE -- fill queue element, i.e. smth like "iavf_rx_buffer" or whatever
>>
>> Tx:
>>
>> SQ -- send queue, the one you fill with buffers to transmit
>>    SQE -- send queue element, i.e. "iavf_tx_buffer"
>> CQ -- completion queue, on which you get Tx DMA complete descriptors
>>
>> XDPSQ, XSkRQ etc. -- same as above, but for XDP / XSk
>>
>> I know that rxq, txq, bufq, complq is more common since it's been used
>> for years, but I like these "new" ones more :>
>>
>
> Thank you, that sounds right. If you happen to sent v10, a bit of code
> comment with this info would be useful ;)

The current kdoc in front of each struct and function declaration is not
enough? :D

Thanks,
Olek

2024-04-10 13:09:00

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next v9 7/9] libeth: add Rx buffer management

From: Przemek Kitszel <[email protected]>
Date: Wed, 10 Apr 2024 15:01:32 +0200

> On 4/10/24 13:49, Alexander Lobakin wrote:
>> From: Przemek Kitszel <[email protected]>
>> Date: Tue, 9 Apr 2024 12:58:33 +0200
>>
>>> On 4/8/24 11:09, Alexander Lobakin wrote:
>>>> From: Przemek Kitszel <[email protected]>
>>>> Date: Fri, 5 Apr 2024 12:32:55 +0200
>>>>
>>>>> On 4/4/24 17:44, Alexander Lobakin wrote:
>>>>>> Add a couple intuitive helpers to hide Rx buffer implementation
>>>>>> details
>>>>
>>>> [...]
>>>>
>>>>>> +struct libeth_fqe {
>>>>>> +    struct page        *page;
>>>>>> +    u32            offset;
>>>>>> +    u32            truesize;
>>>>>> +} __aligned_largest;
>>>>>> +
>>>>>> +/**
>>>>>> + * struct libeth_fq - structure representing a buffer queue
>>>>>> + * @fp: hotpath part of the structure
>>>>>> + * @pp: &page_pool for buffer management
>>>>>> + * @fqes: array of Rx buffers
>>>>>> + * @truesize: size to allocate per buffer, w/overhead
>>>>>> + * @count: number of descriptors/buffers the queue has
>>>>>> + * @buf_len: HW-writeable length per each buffer
>>>>>> + * @nid: ID of the closest NUMA node with memory
>>>>>> + */
>>>>>> +struct libeth_fq {
>>>>>> +    struct_group_tagged(libeth_fq_fp, fp,
>>>>>> +        struct page_pool    *pp;
>>>>>> +        struct libeth_fqe    *fqes;
>>>>>> +
>>>>>> +        u32            truesize;
>>>>>> +        u32            count;
>>>>>> +    );
>>>>>> +
>>>>>> +    /* Cold fields */
>>>>>> +    u32            buf_len;
>>>>>> +    int            nid;
>>>>>> +};
>>>>>
>>>>> [...]
>>>>>
>>>>> Could you please unpack the meaning of `fq` and `fqe` acronyms here?
>>>>
>>>> Rx:
>>>>
>>>> RQ -- receive queue, on which you get Rx DMA complete descriptors
>>>> FQ -- fill queue, the one you fill with free buffers
>>>>     FQE -- fill queue element, i.e. smth like "iavf_rx_buffer" or
>>>> whatever
>>>>
>>>> Tx:
>>>>
>>>> SQ -- send queue, the one you fill with buffers to transmit
>>>>     SQE -- send queue element, i.e. "iavf_tx_buffer"
>>>> CQ -- completion queue, on which you get Tx DMA complete descriptors
>>>>
>>>> XDPSQ, XSkRQ etc. -- same as above, but for XDP / XSk
>>>>
>>>> I know that rxq, txq, bufq, complq is more common since it's been used
>>>> for years, but I like these "new" ones more :>
>>>>
>>>
>>> Thank you, that sounds right. If you happen to sent v10, a bit of code
>>> comment with this info would be useful ;)
>>
>> The current kdoc in front of each struct and function declaration is not
>> enough? :D
>>
>> Thanks,
>> Olek
>
> I've asked my initial question just after reading it thrice, without
> your reply `FQE` was as meaningful as `ABC`

In the commit:

+ * struct libeth_fqe - structure representing an Rx buffer

:z

or you want me to expand FQ, FQE etc. abbrevs in the kdoc?

Thanks,
Olek

2024-04-10 13:30:23

by Przemek Kitszel

[permalink] [raw]
Subject: Re: [PATCH net-next v9 7/9] libeth: add Rx buffer management

On 4/10/24 15:01, Alexander Lobakin wrote:
> From: Przemek Kitszel <[email protected]>
> Date: Wed, 10 Apr 2024 15:01:32 +0200
>
>> On 4/10/24 13:49, Alexander Lobakin wrote:
>>> From: Przemek Kitszel <[email protected]>
>>> Date: Tue, 9 Apr 2024 12:58:33 +0200
>>>
>>>> On 4/8/24 11:09, Alexander Lobakin wrote:
>>>>> From: Przemek Kitszel <[email protected]>
>>>>> Date: Fri, 5 Apr 2024 12:32:55 +0200
>>>>>
>>>>>> On 4/4/24 17:44, Alexander Lobakin wrote:
>>>>>>> Add a couple intuitive helpers to hide Rx buffer implementation
>>>>>>> details
>>>>>
>>>>> [...]
>>>>>
>>>>>>> +struct libeth_fqe {
>>>>>>> +    struct page        *page;
>>>>>>> +    u32            offset;
>>>>>>> +    u32            truesize;
>>>>>>> +} __aligned_largest;
>>>>>>> +
>>>>>>> +/**
>>>>>>> + * struct libeth_fq - structure representing a buffer queue
>>>>>>> + * @fp: hotpath part of the structure
>>>>>>> + * @pp: &page_pool for buffer management
>>>>>>> + * @fqes: array of Rx buffers
>>>>>>> + * @truesize: size to allocate per buffer, w/overhead
>>>>>>> + * @count: number of descriptors/buffers the queue has
>>>>>>> + * @buf_len: HW-writeable length per each buffer
>>>>>>> + * @nid: ID of the closest NUMA node with memory
>>>>>>> + */
>>>>>>> +struct libeth_fq {
>>>>>>> +    struct_group_tagged(libeth_fq_fp, fp,
>>>>>>> +        struct page_pool    *pp;
>>>>>>> +        struct libeth_fqe    *fqes;
>>>>>>> +
>>>>>>> +        u32            truesize;
>>>>>>> +        u32            count;
>>>>>>> +    );
>>>>>>> +
>>>>>>> +    /* Cold fields */
>>>>>>> +    u32            buf_len;
>>>>>>> +    int            nid;
>>>>>>> +};
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>> Could you please unpack the meaning of `fq` and `fqe` acronyms here?
>>>>>
>>>>> Rx:
>>>>>
>>>>> RQ -- receive queue, on which you get Rx DMA complete descriptors
>>>>> FQ -- fill queue, the one you fill with free buffers
>>>>>     FQE -- fill queue element, i.e. smth like "iavf_rx_buffer" or
>>>>> whatever
>>>>>
>>>>> Tx:
>>>>>
>>>>> SQ -- send queue, the one you fill with buffers to transmit
>>>>>     SQE -- send queue element, i.e. "iavf_tx_buffer"
>>>>> CQ -- completion queue, on which you get Tx DMA complete descriptors
>>>>>
>>>>> XDPSQ, XSkRQ etc. -- same as above, but for XDP / XSk
>>>>>
>>>>> I know that rxq, txq, bufq, complq is more common since it's been used
>>>>> for years, but I like these "new" ones more :>
>>>>>
>>>>
>>>> Thank you, that sounds right. If you happen to sent v10, a bit of code
>>>> comment with this info would be useful ;)
>>>
>>> The current kdoc in front of each struct and function declaration is not
>>> enough? :D
>>>
>>> Thanks,
>>> Olek
>>
>> I've asked my initial question just after reading it thrice, without
>> your reply `FQE` was as meaningful as `ABC`
>
> In the commit:
>
> + * struct libeth_fqe - structure representing an Rx buffer
>
> :z
>
> or you want me to expand FQ, FQE etc. abbrevs in the kdoc?

yes, please very much!

could be as an additional comment at the top of the file too,
with that you could reuse the naming convention later in the file

>
> Thanks,
> Olek


2024-04-10 13:43:54

by Przemek Kitszel

[permalink] [raw]
Subject: Re: [PATCH net-next v9 7/9] libeth: add Rx buffer management

On 4/10/24 13:49, Alexander Lobakin wrote:
> From: Przemek Kitszel <[email protected]>
> Date: Tue, 9 Apr 2024 12:58:33 +0200
>
>> On 4/8/24 11:09, Alexander Lobakin wrote:
>>> From: Przemek Kitszel <[email protected]>
>>> Date: Fri, 5 Apr 2024 12:32:55 +0200
>>>
>>>> On 4/4/24 17:44, Alexander Lobakin wrote:
>>>>> Add a couple intuitive helpers to hide Rx buffer implementation details
>>>
>>> [...]
>>>
>>>>> +struct libeth_fqe {
>>>>> +    struct page        *page;
>>>>> +    u32            offset;
>>>>> +    u32            truesize;
>>>>> +} __aligned_largest;
>>>>> +
>>>>> +/**
>>>>> + * struct libeth_fq - structure representing a buffer queue
>>>>> + * @fp: hotpath part of the structure
>>>>> + * @pp: &page_pool for buffer management
>>>>> + * @fqes: array of Rx buffers
>>>>> + * @truesize: size to allocate per buffer, w/overhead
>>>>> + * @count: number of descriptors/buffers the queue has
>>>>> + * @buf_len: HW-writeable length per each buffer
>>>>> + * @nid: ID of the closest NUMA node with memory
>>>>> + */
>>>>> +struct libeth_fq {
>>>>> +    struct_group_tagged(libeth_fq_fp, fp,
>>>>> +        struct page_pool    *pp;
>>>>> +        struct libeth_fqe    *fqes;
>>>>> +
>>>>> +        u32            truesize;
>>>>> +        u32            count;
>>>>> +    );
>>>>> +
>>>>> +    /* Cold fields */
>>>>> +    u32            buf_len;
>>>>> +    int            nid;
>>>>> +};
>>>>
>>>> [...]
>>>>
>>>> Could you please unpack the meaning of `fq` and `fqe` acronyms here?
>>>
>>> Rx:
>>>
>>> RQ -- receive queue, on which you get Rx DMA complete descriptors
>>> FQ -- fill queue, the one you fill with free buffers
>>>    FQE -- fill queue element, i.e. smth like "iavf_rx_buffer" or whatever
>>>
>>> Tx:
>>>
>>> SQ -- send queue, the one you fill with buffers to transmit
>>>    SQE -- send queue element, i.e. "iavf_tx_buffer"
>>> CQ -- completion queue, on which you get Tx DMA complete descriptors
>>>
>>> XDPSQ, XSkRQ etc. -- same as above, but for XDP / XSk
>>>
>>> I know that rxq, txq, bufq, complq is more common since it's been used
>>> for years, but I like these "new" ones more :>
>>>
>>
>> Thank you, that sounds right. If you happen to sent v10, a bit of code
>> comment with this info would be useful ;)
>
> The current kdoc in front of each struct and function declaration is not
> enough? :D
>
> Thanks,
> Olek

I've asked my initial question just after reading it thrice, without
your reply `FQE` was as meaningful as `ABC`

2024-04-10 13:46:58

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next v9 7/9] libeth: add Rx buffer management

From: Kees Cook <[email protected]>
Date: Tue, 9 Apr 2024 09:17:53 -0700

> On Mon, Apr 08, 2024 at 11:45:32AM +0200, Alexander Lobakin wrote:
>> From: Alexander Lobakin <[email protected]>
>> Date: Mon, 8 Apr 2024 11:11:12 +0200
>>
>>> From: Jakub Kicinski <[email protected]>
>>> Date: Fri, 5 Apr 2024 21:25:13 -0700
>>>
>>>> On Thu, 4 Apr 2024 17:44:00 +0200 Alexander Lobakin wrote:
>>>>> +/**
>>>>> + * struct libeth_fq - structure representing a buffer queue
>>>>> + * @fp: hotpath part of the structure
>>>>
>>>> Second time this happens this week, so maybe some tooling change in 6.9
>>>> but apparently kdoc does not want to know about the tagged struct:
>>>>
>>>> include/net/libeth/rx.h:69: warning: Excess struct member 'fp' description in 'libeth_fq'
>>>
>>> Oh no, maybe we should teach kdoc to parse struct_group*()?
>>
>> scripts/kernel-doc apparently can handle them...
>>
>> + Kees
>>
>
> Ah, hm, scripts/kernel-doc throws away the early arguments of
> struct_group_tagged, but I suspect it needs to create a synthetic member
> for the tag. i.e. instead of:
>
> struct_group_tagged(tag, name, members...)
>
> becoming
>
> members...
>
> it needs to become
>
> struct tag name;
> members...
>
> It seems this is the first place anyone has tried to document the tagged
> struct name! :)

It makes sense and TBH I expected kdoc to warn that an element
description is missing :D

>
> Does this work? I haven't tested it...
>
> diff --git a/scripts/kernel-doc b/scripts/kernel-doc
> index 967f1abb0edb..64a19228d5dd 100755
> --- a/scripts/kernel-doc
> +++ b/scripts/kernel-doc
> @@ -1151,7 +1151,8 @@ sub dump_struct($$) {
> # - first eat non-declaration parameters and rewrite for final match
> # - then remove macro, outer parens, and trailing semicolon
> $members =~ s/\bstruct_group\s*\(([^,]*,)/STRUCT_GROUP(/gos;
> - $members =~ s/\bstruct_group_(attr|tagged)\s*\(([^,]*,){2}/STRUCT_GROUP(/gos;
> + $members =~ s/\bstruct_group_attr\s*\(([^,]*,){2}/STRUCT_GROUP(/gos;
> + $members =~ s/\bstruct_group_tagged\s*\(([^,]*,)([^,]*,)/struct $1 $2; STRUCT_GROUP(/gos;

This one does not. We need to exclude ','s from the groups...
+ $members =~
s/\bstruct_group_tagged\s*\(([^,]*),([^,]*),/struct $1 $2;
STRUCT_GROUP(/gos;

That one is fine. $members:

include/net/libeth/rx.h:91: warning: struct libeth_fq_fp fp;
STRUCT_GROUP( struct page_pool *pp; struct libeth_fqe
*fqes; u32 truesize; u32 count;
); enum libeth_fqe_type type:2; bool hsplit:1;
bool xdp:1; u32 buf_len; int
nid;

So you almost fixed it :D

Which tree this should go through? Should I include this patch to this
series with libeth or it's better to push this through kees/linux and
then pull to net-next?


> $members =~ s/\b__struct_group\s*\(([^,]*,){3}/STRUCT_GROUP(/gos;
> $members =~ s/\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;/$2/gos;

Thanks,
Olek

2024-04-11 00:55:25

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next v9 7/9] libeth: add Rx buffer management

On Wed, 10 Apr 2024 15:36:13 +0200 Alexander Lobakin wrote:
> Which tree this should go through? Should I include this patch to this
> series with libeth or it's better to push this through kees/linux and
> then pull to net-next?

I think doc tree is a strong candidate, or at least we should not
merge without consulting Jon. Please post and we'll figure it out.

The question someone may ask, however, is whether it causes new
warnings to appear?

2024-04-11 09:10:51

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [PATCH net-next v9 7/9] libeth: add Rx buffer management

From: Jakub Kicinski <[email protected]>
Date: Wed, 10 Apr 2024 17:54:24 -0700

> On Wed, 10 Apr 2024 15:36:13 +0200 Alexander Lobakin wrote:
>> Which tree this should go through? Should I include this patch to this
>> series with libeth or it's better to push this through kees/linux and
>> then pull to net-next?
>
> I think doc tree is a strong candidate, or at least we should not
> merge without consulting Jon. Please post and we'll figure it out.

Can this series go simultaneously or it needs to wait for the fix first?

>
> The question someone may ask, however, is whether it causes new
> warnings to appear?

I tested `make W=12 KDOCFLAGS=-Wall all` yesterday and haven't noticed
any new issues, although expected them.

Thanks,
Olek

2024-04-11 14:09:25

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next v9 7/9] libeth: add Rx buffer management

On Thu, 11 Apr 2024 11:07:24 +0200 Alexander Lobakin wrote:
> > I think doc tree is a strong candidate, or at least we should not
> > merge without consulting Jon. Please post and we'll figure it out.
>
> Can this series go simultaneously or it needs to wait for the fix first?

You can send both maybe just mention under the --- that "this one will
generate a known kdoc warning, I'll be fixing kdoc script separately".

> > The question someone may ask, however, is whether it causes new
> > warnings to appear?
>
> I tested `make W=12 KDOCFLAGS=-Wall all` yesterday and haven't noticed
> any new issues, although expected them.

Surprising but nice.