2023-05-25 13:21:50

by Alexander Lobakin

Subject: [PATCH net-next v2 03/12] iavf: optimize Rx buffer allocation a bunch

The Rx hotpath code of IAVF is not well-optimized TBH. Before doing any
further buffer model changes, shake it up a bit. Notably:

1. Cache more variables on the stack.
DMA device, Rx page size, NTC -- these are the most common things
used all throughout the hotpath, often in loops on each iteration.
Instead of fetching (or even calculating, as with the page size) them
from the ring all the time, cache them on the stack at the beginning
of the NAPI polling callback. NTC will be written back at the end,
the rest are used read-only, so no sync needed.
2. Don't move the recycled buffers around the ring.
The idea of passing the page of the just-recycled buffer to a
different buffer (in this case, the first one that needs to be
allocated), and doing that on each new frame, is fundamentally wrong.
It involves a few fetches, branches and then writes (and one Rx buffer
struct is at least 32 bytes) where they're completely unneeded, but
gives no benefit -- the result is the same as if we'd recycled it in
place, at the same position where it was used. So drop this and let
the main refilling function take care of all the buffers that were
processed and now need to be recycled/refilled.
3. Don't allocate with %GFP_ATOMIC on ifup.
This involved introducing the @gfp parameter to a couple of functions.
Doesn't change anything for Rx -> softirq.
4. 1 budget unit == 1 descriptor, not skb.
There could be underflow when receiving a lot of fragmented frames.
If each of them consisted of 2 frags, it would mean we'd process
64 descriptors at the point where we pass the 32nd skb to the stack.
But the driver would count that as only half the budget, which could
make NAPI re-enable interrupts prematurely and create unnecessary CPU
load.
5. Shortcut !size case.
It's super rare, but possible -- for example, if the last buffer of
the fragmented frame contained only FCS, which was then stripped by
the HW. Instead of checking for size several times when processing,
quickly reuse the buffer and jump to the skb fields part.
6. Refill the ring after finishing the polling loop.
Previously, the loop wasn't starting a new iteration after the 64th
desc, meaning that we were always leaving 16 buffers non-refilled
until the next NAPI poll. It's better to refill them while they're
still hot, so do that right after exiting the loop as well.
For a full cycle of 64 descs, there will be 4 refills of 16 descs
from now on; a condensed sketch of the resulting flow follows right
below.
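
Condensed, the new polling flow looks as follows. This is a simplified
sketch of the diff below, not verbatim code: stats updates, the skb
processing and the allocation-failure handling are omitted.

	static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
	{
		/* 1: cache the hot ring fields on the stack once per poll */
		const gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN;
		u32 to_refill = IAVF_DESC_UNUSED(rx_ring);
		u32 pg_size = iavf_rx_pg_size(rx_ring);
		struct device *dev = rx_ring->dev;
		u32 ntc = rx_ring->next_to_clean;
		u32 ring_size = rx_ring->count;
		u32 cleaned_count = 0;

		/* 4: 1 budget unit == 1 descriptor */
		while (cleaned_count < budget) {
			/* 2: refill in batches, buffers recycle in place */
			if (to_refill >= IAVF_RX_BUFFER_WRITE)
				to_refill = __iavf_alloc_rx_buffers(rx_ring,
								    to_refill,
								    gfp);

			/* ... process the descriptor at @ntc, using
			 * @dev and @pg_size for the buffer helpers ...
			 */

			cleaned_count++;
			to_refill++;
			if (unlikely(++ntc == ring_size))
				ntc = 0;
		}

		/* NTC is written back only once, at the end */
		rx_ring->next_to_clean = ntc;

		/* 6: refill once more right after exiting the loop */
		if (to_refill >= IAVF_RX_BUFFER_WRITE)
			to_refill = __iavf_alloc_rx_buffers(rx_ring,
							    to_refill, gfp);

		return cleaned_count;
	}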

Function: add/remove: 4/2 grow/shrink: 0/5 up/down: 473/-647 (-174)

+ up to 2% performance.

Signed-off-by: Alexander Lobakin <[email protected]>
---
drivers/net/ethernet/intel/iavf/iavf_main.c | 2 +-
drivers/net/ethernet/intel/iavf/iavf_txrx.c | 259 +++++++++-----------
drivers/net/ethernet/intel/iavf/iavf_txrx.h | 3 +-
3 files changed, 114 insertions(+), 150 deletions(-)

diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index a5a6c9861a93..ade32aa1ed78 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -1236,7 +1236,7 @@ static void iavf_configure(struct iavf_adapter *adapter)
for (i = 0; i < adapter->num_active_queues; i++) {
struct iavf_ring *ring = &adapter->rx_rings[i];

- iavf_alloc_rx_buffers(ring, IAVF_DESC_UNUSED(ring));
+ iavf_alloc_rx_buffers(ring);
}
}

diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index a7121dc5c32b..fd08ce67380e 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -736,7 +736,6 @@ void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
/* Zero out the descriptor ring */
memset(rx_ring->desc, 0, rx_ring->size);

- rx_ring->next_to_alloc = 0;
rx_ring->next_to_clean = 0;
rx_ring->next_to_use = 0;
}
@@ -792,7 +791,6 @@ int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
goto err;
}

- rx_ring->next_to_alloc = 0;
rx_ring->next_to_clean = 0;
rx_ring->next_to_use = 0;

@@ -812,9 +810,6 @@ static inline void iavf_release_rx_desc(struct iavf_ring *rx_ring, u32 val)
{
rx_ring->next_to_use = val;

- /* update next to alloc since we have filled the ring */
- rx_ring->next_to_alloc = val;
-
/* Force memory writes to complete before letting h/w
* know there are new descriptors to fetch. (Only
* applicable for weak-ordered memory model archs,
@@ -828,12 +823,17 @@ static inline void iavf_release_rx_desc(struct iavf_ring *rx_ring, u32 val)
* iavf_alloc_mapped_page - recycle or make a new page
* @rx_ring: ring to use
* @bi: rx_buffer struct to modify
+ * @dev: device used for DMA mapping
+ * @order: page order to allocate
+ * @gfp: GFP mask to allocate page
*
* Returns true if the page was successfully allocated or
* reused.
**/
static bool iavf_alloc_mapped_page(struct iavf_ring *rx_ring,
- struct iavf_rx_buffer *bi)
+ struct iavf_rx_buffer *bi,
+ struct device *dev, u32 order,
+ gfp_t gfp)
{
struct page *page = bi->page;
dma_addr_t dma;
@@ -845,23 +845,21 @@ static bool iavf_alloc_mapped_page(struct iavf_ring *rx_ring,
}

/* alloc new page for storage */
- page = dev_alloc_pages(iavf_rx_pg_order(rx_ring));
+ page = __dev_alloc_pages(gfp, order);
if (unlikely(!page)) {
rx_ring->rx_stats.alloc_page_failed++;
return false;
}

/* map page for use */
- dma = dma_map_page_attrs(rx_ring->dev, page, 0,
- iavf_rx_pg_size(rx_ring),
- DMA_FROM_DEVICE,
- IAVF_RX_DMA_ATTR);
+ dma = dma_map_page_attrs(dev, page, 0, PAGE_SIZE << order,
+ DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);

/* if mapping failed free memory back to system since
* there isn't much point in holding memory we can't use
*/
- if (dma_mapping_error(rx_ring->dev, dma)) {
- __free_pages(page, iavf_rx_pg_order(rx_ring));
+ if (dma_mapping_error(dev, dma)) {
+ __free_pages(page, order);
rx_ring->rx_stats.alloc_page_failed++;
return false;
}
@@ -898,32 +896,36 @@ static void iavf_receive_skb(struct iavf_ring *rx_ring,
}

/**
- * iavf_alloc_rx_buffers - Replace used receive buffers
+ * __iavf_alloc_rx_buffers - Replace used receive buffers
* @rx_ring: ring to place buffers on
- * @cleaned_count: number of buffers to replace
+ * @to_refill: number of buffers to replace
+ * @gfp: GFP mask to allocate pages
*
- * Returns false if all allocations were successful, true if any fail
+ * Returns 0 if all allocations were successful or the number of buffers left
+ * to refill in case of an allocation failure.
**/
-bool iavf_alloc_rx_buffers(struct iavf_ring *rx_ring, u16 cleaned_count)
+static u32 __iavf_alloc_rx_buffers(struct iavf_ring *rx_ring, u32 to_refill,
+ gfp_t gfp)
{
- u16 ntu = rx_ring->next_to_use;
+ u32 order = iavf_rx_pg_order(rx_ring);
+ struct device *dev = rx_ring->dev;
+ u32 ntu = rx_ring->next_to_use;
union iavf_rx_desc *rx_desc;
struct iavf_rx_buffer *bi;

/* do nothing if no valid netdev defined */
- if (!rx_ring->netdev || !cleaned_count)
- return false;
+ if (unlikely(!rx_ring->netdev || !to_refill))
+ return 0;

rx_desc = IAVF_RX_DESC(rx_ring, ntu);
bi = &rx_ring->rx_bi[ntu];

do {
- if (!iavf_alloc_mapped_page(rx_ring, bi))
- goto no_buffers;
+ if (!iavf_alloc_mapped_page(rx_ring, bi, dev, order, gfp))
+ break;

/* sync the buffer for use by the device */
- dma_sync_single_range_for_device(rx_ring->dev, bi->dma,
- bi->page_offset,
+ dma_sync_single_range_for_device(dev, bi->dma, bi->page_offset,
rx_ring->rx_buf_len,
DMA_FROM_DEVICE);

@@ -943,23 +945,17 @@ bool iavf_alloc_rx_buffers(struct iavf_ring *rx_ring, u16 cleaned_count)

/* clear the status bits for the next_to_use descriptor */
rx_desc->wb.qword1.status_error_len = 0;
-
- cleaned_count--;
- } while (cleaned_count);
+ } while (--to_refill);

if (rx_ring->next_to_use != ntu)
iavf_release_rx_desc(rx_ring, ntu);

- return false;
-
-no_buffers:
- if (rx_ring->next_to_use != ntu)
- iavf_release_rx_desc(rx_ring, ntu);
+ return to_refill;
+}

- /* make sure to come back via polling to try again after
- * allocation failure
- */
- return true;
+void iavf_alloc_rx_buffers(struct iavf_ring *rxr)
+{
+ __iavf_alloc_rx_buffers(rxr, IAVF_DESC_UNUSED(rxr), GFP_KERNEL);
}

/**
@@ -1104,32 +1100,6 @@ static bool iavf_cleanup_headers(struct iavf_ring *rx_ring, struct sk_buff *skb)
return false;
}

-/**
- * iavf_reuse_rx_page - page flip buffer and store it back on the ring
- * @rx_ring: rx descriptor ring to store buffers on
- * @old_buff: donor buffer to have page reused
- *
- * Synchronizes page for reuse by the adapter
- **/
-static void iavf_reuse_rx_page(struct iavf_ring *rx_ring,
- struct iavf_rx_buffer *old_buff)
-{
- struct iavf_rx_buffer *new_buff;
- u16 nta = rx_ring->next_to_alloc;
-
- new_buff = &rx_ring->rx_bi[nta];
-
- /* update, and store next to alloc */
- nta++;
- rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
-
- /* transfer page from old buffer to new buffer */
- new_buff->dma = old_buff->dma;
- new_buff->page = old_buff->page;
- new_buff->page_offset = old_buff->page_offset;
- new_buff->pagecnt_bias = old_buff->pagecnt_bias;
-}
-
/**
* iavf_can_reuse_rx_page - Determine if this page can be reused by
* the adapter for another receive
@@ -1191,30 +1161,26 @@ static bool iavf_can_reuse_rx_page(struct iavf_rx_buffer *rx_buffer)

/**
* iavf_add_rx_frag - Add contents of Rx buffer to sk_buff
- * @rx_ring: rx descriptor ring to transact packets on
- * @rx_buffer: buffer containing page to add
* @skb: sk_buff to place the data into
+ * @rx_buffer: buffer containing page to add
* @size: packet length from rx_desc
+ * @pg_size: Rx buffer page size
*
* This function will add the data contained in rx_buffer->page to the skb.
* It will just attach the page as a frag to the skb.
*
* The function will then update the page offset.
**/
-static void iavf_add_rx_frag(struct iavf_ring *rx_ring,
+static void iavf_add_rx_frag(struct sk_buff *skb,
struct iavf_rx_buffer *rx_buffer,
- struct sk_buff *skb,
- unsigned int size)
+ u32 size, u32 pg_size)
{
#if (PAGE_SIZE < 8192)
- unsigned int truesize = iavf_rx_pg_size(rx_ring) / 2;
+ unsigned int truesize = pg_size / 2;
#else
unsigned int truesize = SKB_DATA_ALIGN(size + IAVF_SKB_PAD);
#endif

- if (!size)
- return;
-
skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page,
rx_buffer->page_offset, size, truesize);

@@ -1224,63 +1190,47 @@ static void iavf_add_rx_frag(struct iavf_ring *rx_ring,
#else
rx_buffer->page_offset += truesize;
#endif
+
+ /* We have pulled a buffer for use, so decrement pagecnt_bias */
+ rx_buffer->pagecnt_bias--;
}

/**
- * iavf_get_rx_buffer - Fetch Rx buffer and synchronize data for use
- * @rx_ring: rx descriptor ring to transact packets on
- * @size: size of buffer to add to skb
+ * iavf_sync_rx_buffer - Synchronize received data for use
+ * @dev: device used for DMA mapping
+ * @buf: Rx buffer containing the data
+ * @size: size of the received data
*
- * This function will pull an Rx buffer from the ring and synchronize it
- * for use by the CPU.
+ * This function will synchronize the Rx buffer for use by the CPU.
*/
-static struct iavf_rx_buffer *iavf_get_rx_buffer(struct iavf_ring *rx_ring,
- const unsigned int size)
+static void iavf_sync_rx_buffer(struct device *dev, struct iavf_rx_buffer *buf,
+ u32 size)
{
- struct iavf_rx_buffer *rx_buffer;
-
- rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
- prefetchw(rx_buffer->page);
- if (!size)
- return rx_buffer;
-
- /* we are reusing so sync this buffer for CPU use */
- dma_sync_single_range_for_cpu(rx_ring->dev,
- rx_buffer->dma,
- rx_buffer->page_offset,
- size,
+ dma_sync_single_range_for_cpu(dev, buf->dma, buf->page_offset, size,
DMA_FROM_DEVICE);
-
- /* We have pulled a buffer for use, so decrement pagecnt_bias */
- rx_buffer->pagecnt_bias--;
-
- return rx_buffer;
}

/**
* iavf_build_skb - Build skb around an existing buffer
- * @rx_ring: Rx descriptor ring to transact packets on
- * @rx_buffer: Rx buffer to pull data from
- * @size: size of buffer to add to skb
+ * @rx_buffer: Rx buffer with the data
+ * @size: size of the data
+ * @pg_size: size of the Rx page
*
* This function builds an skb around an existing Rx buffer, taking care
* to set up the skb correctly and avoid any memcpy overhead.
*/
-static struct sk_buff *iavf_build_skb(struct iavf_ring *rx_ring,
- struct iavf_rx_buffer *rx_buffer,
- unsigned int size)
+static struct sk_buff *iavf_build_skb(struct iavf_rx_buffer *rx_buffer,
+ u32 size, u32 pg_size)
{
void *va;
#if (PAGE_SIZE < 8192)
- unsigned int truesize = iavf_rx_pg_size(rx_ring) / 2;
+ unsigned int truesize = pg_size / 2;
#else
unsigned int truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
SKB_DATA_ALIGN(IAVF_SKB_PAD + size);
#endif
struct sk_buff *skb;

- if (!rx_buffer || !size)
- return NULL;
/* prefetch first cache line of first page */
va = page_address(rx_buffer->page) + rx_buffer->page_offset;
net_prefetch(va);
@@ -1301,36 +1251,33 @@ static struct sk_buff *iavf_build_skb(struct iavf_ring *rx_ring,
rx_buffer->page_offset += truesize;
#endif

+ rx_buffer->pagecnt_bias--;
+
return skb;
}

/**
- * iavf_put_rx_buffer - Clean up used buffer and either recycle or free
+ * iavf_put_rx_buffer - Recycle or free used buffer
* @rx_ring: rx descriptor ring to transact packets on
- * @rx_buffer: rx buffer to pull data from
+ * @dev: device used for DMA mapping
+ * @rx_buffer: Rx buffer to handle
+ * @pg_size: Rx page size
*
- * This function will clean up the contents of the rx_buffer. It will
- * either recycle the buffer or unmap it and free the associated resources.
+ * Either recycle the buffer if possible or unmap and free the page.
*/
-static void iavf_put_rx_buffer(struct iavf_ring *rx_ring,
- struct iavf_rx_buffer *rx_buffer)
+static void iavf_put_rx_buffer(struct iavf_ring *rx_ring, struct device *dev,
+ struct iavf_rx_buffer *rx_buffer, u32 pg_size)
{
- if (!rx_buffer)
- return;
-
if (iavf_can_reuse_rx_page(rx_buffer)) {
- /* hand second half of page back to the ring */
- iavf_reuse_rx_page(rx_ring, rx_buffer);
rx_ring->rx_stats.page_reuse_count++;
- } else {
- /* we are not reusing the buffer so unmap it */
- dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
- iavf_rx_pg_size(rx_ring),
- DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
- __page_frag_cache_drain(rx_buffer->page,
- rx_buffer->pagecnt_bias);
+ return;
}

+ /* we are not reusing the buffer so unmap it */
+ dma_unmap_page_attrs(dev, rx_buffer->dma, pg_size,
+ DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
+ __page_frag_cache_drain(rx_buffer->page, rx_buffer->pagecnt_bias);
+
/* clear contents of buffer_info */
rx_buffer->page = NULL;
}
@@ -1350,14 +1297,6 @@ static bool iavf_is_non_eop(struct iavf_ring *rx_ring,
union iavf_rx_desc *rx_desc,
struct sk_buff *skb)
{
- u32 ntc = rx_ring->next_to_clean + 1;
-
- /* fetch, update, and store next to clean */
- ntc = (ntc < rx_ring->count) ? ntc : 0;
- rx_ring->next_to_clean = ntc;
-
- prefetch(IAVF_RX_DESC(rx_ring, ntc));
-
/* if we are the last buffer then there is nothing else to do */
#define IAVF_RXD_EOF BIT(IAVF_RX_DESC_STATUS_EOF_SHIFT)
if (likely(iavf_test_staterr(rx_desc, IAVF_RXD_EOF)))
@@ -1383,11 +1322,16 @@ static bool iavf_is_non_eop(struct iavf_ring *rx_ring,
static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
{
unsigned int total_rx_bytes = 0, total_rx_packets = 0;
+ const gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN;
+ u32 to_refill = IAVF_DESC_UNUSED(rx_ring);
+ u32 pg_size = iavf_rx_pg_size(rx_ring);
struct sk_buff *skb = rx_ring->skb;
- u16 cleaned_count = IAVF_DESC_UNUSED(rx_ring);
- bool failure = false;
+ struct device *dev = rx_ring->dev;
+ u32 ntc = rx_ring->next_to_clean;
+ u32 ring_size = rx_ring->count;
+ u32 cleaned_count = 0;

- while (likely(total_rx_packets < (unsigned int)budget)) {
+ while (likely(cleaned_count < budget)) {
struct iavf_rx_buffer *rx_buffer;
union iavf_rx_desc *rx_desc;
unsigned int size;
@@ -1396,13 +1340,11 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
u64 qword;

/* return some buffers to hardware, one at a time is too slow */
- if (cleaned_count >= IAVF_RX_BUFFER_WRITE) {
- failure = failure ||
- iavf_alloc_rx_buffers(rx_ring, cleaned_count);
- cleaned_count = 0;
- }
+ if (to_refill >= IAVF_RX_BUFFER_WRITE)
+ to_refill = __iavf_alloc_rx_buffers(rx_ring, to_refill,
+ gfp);

- rx_desc = IAVF_RX_DESC(rx_ring, rx_ring->next_to_clean);
+ rx_desc = IAVF_RX_DESC(rx_ring, ntc);

/* status_error_len will always be zero for unused descriptors
* because it's cleared in cleanup, and overlaps with hdr_addr
@@ -1424,24 +1366,38 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
IAVF_RXD_QW1_LENGTH_PBUF_SHIFT;

iavf_trace(clean_rx_irq, rx_ring, rx_desc, skb);
- rx_buffer = iavf_get_rx_buffer(rx_ring, size);
+ rx_buffer = &rx_ring->rx_bi[ntc];
+
+ /* Very rare, but possible case. The most common reason:
+ * the last fragment contained FCS only, which was then
+ * stripped by the HW.
+ */
+ if (unlikely(!size))
+ goto skip_data;
+
+ iavf_sync_rx_buffer(dev, rx_buffer, size);

/* retrieve a buffer from the ring */
if (skb)
- iavf_add_rx_frag(rx_ring, rx_buffer, skb, size);
+ iavf_add_rx_frag(skb, rx_buffer, size, pg_size);
else
- skb = iavf_build_skb(rx_ring, rx_buffer, size);
+ skb = iavf_build_skb(rx_buffer, size, pg_size);

/* exit if we failed to retrieve a buffer */
if (!skb) {
rx_ring->rx_stats.alloc_buff_failed++;
- if (rx_buffer && size)
- rx_buffer->pagecnt_bias++;
break;
}

- iavf_put_rx_buffer(rx_ring, rx_buffer);
+skip_data:
+ iavf_put_rx_buffer(rx_ring, dev, rx_buffer, pg_size);
+
cleaned_count++;
+ to_refill++;
+ if (unlikely(++ntc == ring_size))
+ ntc = 0;
+
+ prefetch(IAVF_RX_DESC(rx_ring, ntc));

if (iavf_is_non_eop(rx_ring, rx_desc, skb))
continue;
@@ -1488,8 +1444,18 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
total_rx_packets++;
}

+ rx_ring->next_to_clean = ntc;
rx_ring->skb = skb;

+ if (to_refill >= IAVF_RX_BUFFER_WRITE) {
+ to_refill = __iavf_alloc_rx_buffers(rx_ring, to_refill, gfp);
+ /* guarantee a trip back through this routine if there was
+ * a failure
+ */
+ if (unlikely(to_refill))
+ cleaned_count = budget;
+ }
+
u64_stats_update_begin(&rx_ring->syncp);
rx_ring->stats.packets += total_rx_packets;
rx_ring->stats.bytes += total_rx_bytes;
@@ -1497,8 +1463,7 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
rx_ring->q_vector->rx.total_packets += total_rx_packets;
rx_ring->q_vector->rx.total_bytes += total_rx_bytes;

- /* guarantee a trip back through this routine if there was a failure */
- return failure ? budget : (int)total_rx_packets;
+ return cleaned_count;
}

static inline u32 iavf_buildreg_itr(const int type, u16 itr)
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.h b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
index 234e189c1987..9c6661a6edf2 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.h
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
@@ -383,7 +383,6 @@ struct iavf_ring {
struct iavf_q_vector *q_vector; /* Backreference to associated vector */

struct rcu_head rcu; /* to avoid race on free */
- u16 next_to_alloc;
struct sk_buff *skb; /* When iavf_clean_rx_ring_irq() must
* return before it sees the EOP for
* the current packet, we save that skb
@@ -426,7 +425,7 @@ static inline unsigned int iavf_rx_pg_order(struct iavf_ring *ring)

#define iavf_rx_pg_size(_ring) (PAGE_SIZE << iavf_rx_pg_order(_ring))

-bool iavf_alloc_rx_buffers(struct iavf_ring *rxr, u16 cleaned_count);
+void iavf_alloc_rx_buffers(struct iavf_ring *rxr);
netdev_tx_t iavf_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
void iavf_clean_tx_ring(struct iavf_ring *tx_ring);
void iavf_clean_rx_ring(struct iavf_ring *rx_ring);
--
2.40.1



2023-05-30 16:27:20

by Alexander Duyck

Subject: Re: [PATCH net-next v2 03/12] iavf: optimize Rx buffer allocation a bunch

On Thu, 2023-05-25 at 14:57 +0200, Alexander Lobakin wrote:
> The Rx hotpath code of IAVF is not well-optimized TBH. Before doing any
> further buffer model changes, shake it up a bit. Notably:
>
> 1. Cache more variables on the stack.
> DMA device, Rx page size, NTC -- these are the most common things
> used all throughout the hotpath, often in loops on each iteration.
> Instead of fetching (or even calculating, as with the page size) them
> from the ring all the time, cache them on the stack at the beginning
> of the NAPI polling callback. NTC will be written back at the end,
> the rest are used read-only, so no sync needed.

The advantage of this is going to vary based on the attribute. One of
the reasons why I left most of this on the ring is that the section
of the ring holding most of these variables is meant to be read-mostly,
so it shouldn't have resulted in any additional overhead versus
accessing them from the stack.
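
Roughly, the intent is a split like this. Purely illustrative -- the
actual grouping of the fields in struct iavf_ring differs, only the
field names below are real:

	struct iavf_ring_layout_sketch {
		/* read-mostly in the hotpath: written at setup time
		 * only, so the cacheline stays clean and shared, and
		 * a fetch costs about the same as a stack access
		 */
		struct device *dev;
		u16 count;
		u16 rx_buf_len;

		/* read/write on every poll: updating these doesn't
		 * dirty the read-mostly line above
		 */
		u16 next_to_use;
		u16 next_to_clean;
	};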

> 2. Don't move the recycled buffers around the ring.
> The idea of passing the page of the just-recycled buffer to a
> different buffer (in this case, the first one that needs to be
> allocated), and doing that on each new frame, is fundamentally wrong.
> It involves a few fetches, branches and then writes (and one Rx buffer
> struct is at least 32 bytes) where they're completely unneeded, but
> gives no benefit -- the result is the same as if we'd recycled it in
> place, at the same position where it was used. So drop this and let
> the main refilling function take care of all the buffers that were
> processed and now need to be recycled/refilled.

The next_to_alloc logic was put in place to deal with systems that are
experiencing memory issues. Specifically what can end up happening is
that the ring can stall due to failing memory allocations and the
memory can get stuck on the ring. For that reason we were essentially
defragmenting the buffers when we started suffering memory pressure so
that they could be reused and/or freed following immediate use.

Basically what you are trading off is some exception handling for
performance by removing it.
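
For reference, the exception path being dropped here is small;
condensed from the iavf_reuse_rx_page() hunk above, it amounts to:

	/* park the just-used page at the next slot that will be
	 * allocated: under memory pressure this compacts reusable
	 * pages towards next_to_alloc instead of leaving them
	 * scattered behind slots whose allocations failed
	 */
	static void iavf_reuse_rx_page(struct iavf_ring *rx_ring,
				       struct iavf_rx_buffer *old_buff)
	{
		u16 nta = rx_ring->next_to_alloc;
		struct iavf_rx_buffer *new_buff = &rx_ring->rx_bi[nta];

		rx_ring->next_to_alloc = (nta + 1 < rx_ring->count) ?
					 nta + 1 : 0;

		/* dma, page, page_offset and pagecnt_bias move over */
		*new_buff = *old_buff;
	}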

> 3. Don't allocate with %GFP_ATOMIC on ifup.
> This involved introducing the @gfp parameter to a couple of functions.
> Doesn't change anything for Rx -> softirq.

Any specific reason for this? Just wondering if this is meant to
address some sort of memory pressure issue since it basically just
means the allocation can go out and try to free other memory.

> 4. 1 budget unit == 1 descriptor, not skb.
> There could be underflow when receiving a lot of fragmented frames.
> If each of them consisted of 2 frags, it would mean we'd process
> 64 descriptors at the point where we pass the 32nd skb to the stack.
> But the driver would count that as only half the budget, which could
> make NAPI re-enable interrupts prematurely and create unnecessary CPU
> load.

Not sure I agree with this. The problem is the overhead for an skb
going up the stack versus a fragment is pretty significant. Keep in
mind that most of the overhead for a single buffer occurs w/
napi_gro_receive and is not actually at the driver itself. The whole
point of the budget is to meter out units of work, not to keep you in
the busy loop. This starts looking like the old code where the Intel
drivers were returning either budget or 0 instead of supporting the
middle ground.

> 5. Shortcut !size case.
> It's super rare, but possible -- for example, if the last buffer of
> the fragmented frame contained only FCS, which was then stripped by
> the HW. Instead of checking for size several times when processing,
> quickly reuse the buffer and jump to the skb fields part.
> 6. Refill the ring after finishing the polling loop.
> Previously, the loop wasn't starting a new iteration after the 64th
> desc, meaning that we were always leaving 16 buffers non-refilled
> until the next NAPI poll. It's better to refill them while they're
> still hot, so do that right after exiting the loop as well.
> For a full cycle of 64 descs, there will be 4 refills of 16 descs
> from now on.
>
> Function: add/remove: 4/2 grow/shrink: 0/5 up/down: 473/-647 (-174)
>
> + up to 2% performance.
>

What is the test you saw the 2% performance improvement in? Is it
something XDP related or a full stack test?

> Signed-off-by: Alexander Lobakin <[email protected]>

Also one thing I am not a huge fan of is a patch that is really a
patchset unto itself. With all 6 items called out here I would have
preferred to see this as 6 patches as it would have been easier to
review.

> ---
> drivers/net/ethernet/intel/iavf/iavf_main.c | 2 +-
> drivers/net/ethernet/intel/iavf/iavf_txrx.c | 259 +++++++++-----------
> drivers/net/ethernet/intel/iavf/iavf_txrx.h | 3 +-
> 3 files changed, 114 insertions(+), 150 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
> index a5a6c9861a93..ade32aa1ed78 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_main.c
> +++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
> @@ -1236,7 +1236,7 @@ static void iavf_configure(struct iavf_adapter *adapter)
> for (i = 0; i < adapter->num_active_queues; i++) {
> struct iavf_ring *ring = &adapter->rx_rings[i];
>
> - iavf_alloc_rx_buffers(ring, IAVF_DESC_UNUSED(ring));
> + iavf_alloc_rx_buffers(ring);
> }
> }
>
> diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
> index a7121dc5c32b..fd08ce67380e 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
> +++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
> @@ -736,7 +736,6 @@ void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
> /* Zero out the descriptor ring */
> memset(rx_ring->desc, 0, rx_ring->size);
>
> - rx_ring->next_to_alloc = 0;
> rx_ring->next_to_clean = 0;
> rx_ring->next_to_use = 0;
> }
> @@ -792,7 +791,6 @@ int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
> goto err;
> }
>
> - rx_ring->next_to_alloc = 0;
> rx_ring->next_to_clean = 0;
> rx_ring->next_to_use = 0;
>
> @@ -812,9 +810,6 @@ static inline void iavf_release_rx_desc(struct iavf_ring *rx_ring, u32 val)
> {
> rx_ring->next_to_use = val;
>
> - /* update next to alloc since we have filled the ring */
> - rx_ring->next_to_alloc = val;
> -
> /* Force memory writes to complete before letting h/w
> * know there are new descriptors to fetch. (Only
> * applicable for weak-ordered memory model archs,
> @@ -828,12 +823,17 @@ static inline void iavf_release_rx_desc(struct iavf_ring *rx_ring, u32 val)
> * iavf_alloc_mapped_page - recycle or make a new page
> * @rx_ring: ring to use
> * @bi: rx_buffer struct to modify
> + * @dev: device used for DMA mapping
> + * @order: page order to allocate
> + * @gfp: GFP mask to allocate page
> *
> * Returns true if the page was successfully allocated or
> * reused.
> **/
> static bool iavf_alloc_mapped_page(struct iavf_ring *rx_ring,
> - struct iavf_rx_buffer *bi)
> + struct iavf_rx_buffer *bi,
> + struct device *dev, u32 order,
> + gfp_t gfp)
> {
> struct page *page = bi->page;
> dma_addr_t dma;
> @@ -845,23 +845,21 @@ static bool iavf_alloc_mapped_page(struct iavf_ring *rx_ring,
> }
>
> /* alloc new page for storage */
> - page = dev_alloc_pages(iavf_rx_pg_order(rx_ring));
> + page = __dev_alloc_pages(gfp, order);
> if (unlikely(!page)) {
> rx_ring->rx_stats.alloc_page_failed++;
> return false;
> }
>
> /* map page for use */
> - dma = dma_map_page_attrs(rx_ring->dev, page, 0,
> - iavf_rx_pg_size(rx_ring),
> - DMA_FROM_DEVICE,
> - IAVF_RX_DMA_ATTR);
> + dma = dma_map_page_attrs(dev, page, 0, PAGE_SIZE << order,
> + DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
>
> /* if mapping failed free memory back to system since
> * there isn't much point in holding memory we can't use
> */
> - if (dma_mapping_error(rx_ring->dev, dma)) {
> - __free_pages(page, iavf_rx_pg_order(rx_ring));
> + if (dma_mapping_error(dev, dma)) {
> + __free_pages(page, order);
> rx_ring->rx_stats.alloc_page_failed++;
> return false;
> }
> @@ -898,32 +896,36 @@ static void iavf_receive_skb(struct iavf_ring *rx_ring,
> }
>
> /**
> - * iavf_alloc_rx_buffers - Replace used receive buffers
> + * __iavf_alloc_rx_buffers - Replace used receive buffers
> * @rx_ring: ring to place buffers on
> - * @cleaned_count: number of buffers to replace
> + * @to_refill: number of buffers to replace
> + * @gfp: GFP mask to allocate pages
> *
> - * Returns false if all allocations were successful, true if any fail
> + * Returns 0 if all allocations were successful or the number of buffers left
> + * to refill in case of an allocation failure.
> **/
> -bool iavf_alloc_rx_buffers(struct iavf_ring *rx_ring, u16 cleaned_count)
> +static u32 __iavf_alloc_rx_buffers(struct iavf_ring *rx_ring, u32 to_refill,
> + gfp_t gfp)
> {
> - u16 ntu = rx_ring->next_to_use;
> + u32 order = iavf_rx_pg_order(rx_ring);
> + struct device *dev = rx_ring->dev;
> + u32 ntu = rx_ring->next_to_use;
> union iavf_rx_desc *rx_desc;
> struct iavf_rx_buffer *bi;
>
> /* do nothing if no valid netdev defined */
> - if (!rx_ring->netdev || !cleaned_count)
> - return false;
> + if (unlikely(!rx_ring->netdev || !to_refill))
> + return 0;
>
> rx_desc = IAVF_RX_DESC(rx_ring, ntu);
> bi = &rx_ring->rx_bi[ntu];
>
> do {
> - if (!iavf_alloc_mapped_page(rx_ring, bi))
> - goto no_buffers;
> + if (!iavf_alloc_mapped_page(rx_ring, bi, dev, order, gfp))
> + break;
>
> /* sync the buffer for use by the device */
> - dma_sync_single_range_for_device(rx_ring->dev, bi->dma,
> - bi->page_offset,
> + dma_sync_single_range_for_device(dev, bi->dma, bi->page_offset,
> rx_ring->rx_buf_len,
> DMA_FROM_DEVICE);
>
> @@ -943,23 +945,17 @@ bool iavf_alloc_rx_buffers(struct iavf_ring *rx_ring, u16 cleaned_count)
>
> /* clear the status bits for the next_to_use descriptor */
> rx_desc->wb.qword1.status_error_len = 0;
> -
> - cleaned_count--;
> - } while (cleaned_count);
> + } while (--to_refill);

Just a nit. You might want to break this up into two statements like I
had before. I know some people within Intel weren't a huge fan of when
I used to do that kind of thing all the time in loops where I would do
the decrement and test in one line.. :)

>
> if (rx_ring->next_to_use != ntu)
> iavf_release_rx_desc(rx_ring, ntu);
>
> - return false;
> -
> -no_buffers:
> - if (rx_ring->next_to_use != ntu)
> - iavf_release_rx_desc(rx_ring, ntu);
> + return to_refill;
> +}
>
> - /* make sure to come back via polling to try again after
> - * allocation failure
> - */
> - return true;
> +void iavf_alloc_rx_buffers(struct iavf_ring *rxr)
> +{
> + __iavf_alloc_rx_buffers(rxr, IAVF_DESC_UNUSED(rxr), GFP_KERNEL);
> }
>
> /**
> @@ -1104,32 +1100,6 @@ static bool iavf_cleanup_headers(struct iavf_ring *rx_ring, struct sk_buff *skb)
> return false;
> }
>
> -/**
> - * iavf_reuse_rx_page - page flip buffer and store it back on the ring
> - * @rx_ring: rx descriptor ring to store buffers on
> - * @old_buff: donor buffer to have page reused
> - *
> - * Synchronizes page for reuse by the adapter
> - **/
> -static void iavf_reuse_rx_page(struct iavf_ring *rx_ring,
> - struct iavf_rx_buffer *old_buff)
> -{
> - struct iavf_rx_buffer *new_buff;
> - u16 nta = rx_ring->next_to_alloc;
> -
> - new_buff = &rx_ring->rx_bi[nta];
> -
> - /* update, and store next to alloc */
> - nta++;
> - rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
> -
> - /* transfer page from old buffer to new buffer */
> - new_buff->dma = old_buff->dma;
> - new_buff->page = old_buff->page;
> - new_buff->page_offset = old_buff->page_offset;
> - new_buff->pagecnt_bias = old_buff->pagecnt_bias;
> -}
> -
> /**
> * iavf_can_reuse_rx_page - Determine if this page can be reused by
> * the adapter for another receive
> @@ -1191,30 +1161,26 @@ static bool iavf_can_reuse_rx_page(struct iavf_rx_buffer *rx_buffer)
>
> /**
> * iavf_add_rx_frag - Add contents of Rx buffer to sk_buff
> - * @rx_ring: rx descriptor ring to transact packets on
> - * @rx_buffer: buffer containing page to add
> * @skb: sk_buff to place the data into
> + * @rx_buffer: buffer containing page to add
> * @size: packet length from rx_desc
> + * @pg_size: Rx buffer page size
> *
> * This function will add the data contained in rx_buffer->page to the skb.
> * It will just attach the page as a frag to the skb.
> *
> * The function will then update the page offset.
> **/
> -static void iavf_add_rx_frag(struct iavf_ring *rx_ring,
> +static void iavf_add_rx_frag(struct sk_buff *skb,
> struct iavf_rx_buffer *rx_buffer,
> - struct sk_buff *skb,
> - unsigned int size)
> + u32 size, u32 pg_size)
> {
> #if (PAGE_SIZE < 8192)
> - unsigned int truesize = iavf_rx_pg_size(rx_ring) / 2;
> + unsigned int truesize = pg_size / 2;
> #else
> unsigned int truesize = SKB_DATA_ALIGN(size + IAVF_SKB_PAD);
> #endif
>
> - if (!size)
> - return;
> -
> skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page,
> rx_buffer->page_offset, size, truesize);
>
> @@ -1224,63 +1190,47 @@ static void iavf_add_rx_frag(struct iavf_ring *rx_ring,
> #else
> rx_buffer->page_offset += truesize;
> #endif
> +
> + /* We have pulled a buffer for use, so decrement pagecnt_bias */
> + rx_buffer->pagecnt_bias--;
> }
>
> /**
> - * iavf_get_rx_buffer - Fetch Rx buffer and synchronize data for use
> - * @rx_ring: rx descriptor ring to transact packets on
> - * @size: size of buffer to add to skb
> + * iavf_sync_rx_buffer - Synchronize received data for use
> + * @dev: device used for DMA mapping
> + * @buf: Rx buffer containing the data
> + * @size: size of the received data
> *
> - * This function will pull an Rx buffer from the ring and synchronize it
> - * for use by the CPU.
> + * This function will synchronize the Rx buffer for use by the CPU.
> */
> -static struct iavf_rx_buffer *iavf_get_rx_buffer(struct iavf_ring *rx_ring,
> - const unsigned int size)
> +static void iavf_sync_rx_buffer(struct device *dev, struct iavf_rx_buffer *buf,
> + u32 size)
> {
> - struct iavf_rx_buffer *rx_buffer;
> -
> - rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
> - prefetchw(rx_buffer->page);
> - if (!size)
> - return rx_buffer;
> -
> - /* we are reusing so sync this buffer for CPU use */
> - dma_sync_single_range_for_cpu(rx_ring->dev,
> - rx_buffer->dma,
> - rx_buffer->page_offset,
> - size,
> + dma_sync_single_range_for_cpu(dev, buf->dma, buf->page_offset, size,
> DMA_FROM_DEVICE);
> -
> - /* We have pulled a buffer for use, so decrement pagecnt_bias */
> - rx_buffer->pagecnt_bias--;
> -
> - return rx_buffer;
> }
>
> /**
> * iavf_build_skb - Build skb around an existing buffer
> - * @rx_ring: Rx descriptor ring to transact packets on
> - * @rx_buffer: Rx buffer to pull data from
> - * @size: size of buffer to add to skb
> + * @rx_buffer: Rx buffer with the data
> + * @size: size of the data
> + * @pg_size: size of the Rx page
> *
> * This function builds an skb around an existing Rx buffer, taking care
> * to set up the skb correctly and avoid any memcpy overhead.
> */
> -static struct sk_buff *iavf_build_skb(struct iavf_ring *rx_ring,
> - struct iavf_rx_buffer *rx_buffer,
> - unsigned int size)
> +static struct sk_buff *iavf_build_skb(struct iavf_rx_buffer *rx_buffer,
> + u32 size, u32 pg_size)
> {
> void *va;
> #if (PAGE_SIZE < 8192)
> - unsigned int truesize = iavf_rx_pg_size(rx_ring) / 2;
> + unsigned int truesize = pg_size / 2;
> #else
> unsigned int truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
> SKB_DATA_ALIGN(IAVF_SKB_PAD + size);
> #endif
> struct sk_buff *skb;
>
> - if (!rx_buffer || !size)
> - return NULL;
> /* prefetch first cache line of first page */
> va = page_address(rx_buffer->page) + rx_buffer->page_offset;
> net_prefetch(va);
> @@ -1301,36 +1251,33 @@ static struct sk_buff *iavf_build_skb(struct iavf_ring *rx_ring,
> rx_buffer->page_offset += truesize;
> #endif
>
> + rx_buffer->pagecnt_bias--;
> +
> return skb;
> }
>
> /**
> - * iavf_put_rx_buffer - Clean up used buffer and either recycle or free
> + * iavf_put_rx_buffer - Recycle or free used buffer
> * @rx_ring: rx descriptor ring to transact packets on
> - * @rx_buffer: rx buffer to pull data from
> + * @dev: device used for DMA mapping
> + * @rx_buffer: Rx buffer to handle
> + * @pg_size: Rx page size
> *
> - * This function will clean up the contents of the rx_buffer. It will
> - * either recycle the buffer or unmap it and free the associated resources.
> + * Either recycle the buffer if possible or unmap and free the page.
> */
> -static void iavf_put_rx_buffer(struct iavf_ring *rx_ring,
> - struct iavf_rx_buffer *rx_buffer)
> +static void iavf_put_rx_buffer(struct iavf_ring *rx_ring, struct device *dev,
> + struct iavf_rx_buffer *rx_buffer, u32 pg_size)
> {
> - if (!rx_buffer)
> - return;
> -
> if (iavf_can_reuse_rx_page(rx_buffer)) {
> - /* hand second half of page back to the ring */
> - iavf_reuse_rx_page(rx_ring, rx_buffer);
> rx_ring->rx_stats.page_reuse_count++;
> - } else {
> - /* we are not reusing the buffer so unmap it */
> - dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
> - iavf_rx_pg_size(rx_ring),
> - DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
> - __page_frag_cache_drain(rx_buffer->page,
> - rx_buffer->pagecnt_bias);
> + return;
> }
>
> + /* we are not reusing the buffer so unmap it */
> + dma_unmap_page_attrs(dev, rx_buffer->dma, pg_size,
> + DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
> + __page_frag_cache_drain(rx_buffer->page, rx_buffer->pagecnt_bias);
> +
> /* clear contents of buffer_info */
> rx_buffer->page = NULL;
> }
> @@ -1350,14 +1297,6 @@ static bool iavf_is_non_eop(struct iavf_ring *rx_ring,
> union iavf_rx_desc *rx_desc,
> struct sk_buff *skb)
> {
> - u32 ntc = rx_ring->next_to_clean + 1;
> -
> - /* fetch, update, and store next to clean */
> - ntc = (ntc < rx_ring->count) ? ntc : 0;
> - rx_ring->next_to_clean = ntc;
> -
> - prefetch(IAVF_RX_DESC(rx_ring, ntc));
> -
> /* if we are the last buffer then there is nothing else to do */
> #define IAVF_RXD_EOF BIT(IAVF_RX_DESC_STATUS_EOF_SHIFT)
> if (likely(iavf_test_staterr(rx_desc, IAVF_RXD_EOF)))

You may want to see if you can get rid of this function entirely,
perhaps you do in a later patch. This function was added for ixgbe back
in the day to allow us to place the skb back in the ring for the RSC
based workloads where we had to deal with interleaved frames in the Rx
path.

For example, one question here would be why are we passing skb? It
isn't used as far as I can tell.
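
With the NTC bookkeeping hoisted out by this patch, dropping the
unused @skb would leave only something like this (a sketch of the
possible cleanup, assuming the non-EOP stat bump stays):

	static bool iavf_is_non_eop(struct iavf_ring *rx_ring,
				    union iavf_rx_desc *rx_desc)
	{
		/* if we are the last buffer then there is nothing else to do */
		if (likely(iavf_test_staterr(rx_desc, IAVF_RXD_EOF)))
			return false;

		rx_ring->rx_stats.non_eop_descs++;

		return true;
	}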

> @@ -1383,11 +1322,16 @@ static bool iavf_is_non_eop(struct iavf_ring *rx_ring,
> static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
> {
> unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> + const gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN;
> + u32 to_refill = IAVF_DESC_UNUSED(rx_ring);
> + u32 pg_size = iavf_rx_pg_size(rx_ring);
> struct sk_buff *skb = rx_ring->skb;
> - u16 cleaned_count = IAVF_DESC_UNUSED(rx_ring);
> - bool failure = false;
> + struct device *dev = rx_ring->dev;
> + u32 ntc = rx_ring->next_to_clean;
> + u32 ring_size = rx_ring->count;
> + u32 cleaned_count = 0;
>
> - while (likely(total_rx_packets < (unsigned int)budget)) {
> + while (likely(cleaned_count < budget)) {
> struct iavf_rx_buffer *rx_buffer;
> union iavf_rx_desc *rx_desc;
> unsigned int size;
> @@ -1396,13 +1340,11 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
> u64 qword;
>
> /* return some buffers to hardware, one at a time is too slow */
> - if (cleaned_count >= IAVF_RX_BUFFER_WRITE) {
> - failure = failure ||
> - iavf_alloc_rx_buffers(rx_ring, cleaned_count);
> - cleaned_count = 0;
> - }
> + if (to_refill >= IAVF_RX_BUFFER_WRITE)
> + to_refill = __iavf_alloc_rx_buffers(rx_ring, to_refill,
> + gfp);
>
> - rx_desc = IAVF_RX_DESC(rx_ring, rx_ring->next_to_clean);
> + rx_desc = IAVF_RX_DESC(rx_ring, ntc);
>
> /* status_error_len will always be zero for unused descriptors
> * because it's cleared in cleanup, and overlaps with hdr_addr
> @@ -1424,24 +1366,38 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
> IAVF_RXD_QW1_LENGTH_PBUF_SHIFT;
>
> iavf_trace(clean_rx_irq, rx_ring, rx_desc, skb);
> - rx_buffer = iavf_get_rx_buffer(rx_ring, size);
> + rx_buffer = &rx_ring->rx_bi[ntc];
> +
> + /* Very rare, but possible case. The most common reason:
> + * the last fragment contained FCS only, which was then
> + * stripped by the HW.
> + */
> + if (unlikely(!size))
> + goto skip_data;
> +
> + iavf_sync_rx_buffer(dev, rx_buffer, size);
>
> /* retrieve a buffer from the ring */
> if (skb)
> - iavf_add_rx_frag(rx_ring, rx_buffer, skb, size);
> + iavf_add_rx_frag(skb, rx_buffer, size, pg_size);
> else
> - skb = iavf_build_skb(rx_ring, rx_buffer, size);
> + skb = iavf_build_skb(rx_buffer, size, pg_size);
>
> /* exit if we failed to retrieve a buffer */
> if (!skb) {
> rx_ring->rx_stats.alloc_buff_failed++;
> - if (rx_buffer && size)
> - rx_buffer->pagecnt_bias++;
> break;
> }
>
> - iavf_put_rx_buffer(rx_ring, rx_buffer);
> +skip_data:
> + iavf_put_rx_buffer(rx_ring, dev, rx_buffer, pg_size);
> +
> cleaned_count++;
> + to_refill++;
> + if (unlikely(++ntc == ring_size))
> + ntc = 0;
> +
> + prefetch(IAVF_RX_DESC(rx_ring, ntc));
>
> if (iavf_is_non_eop(rx_ring, rx_desc, skb))
> continue;
> @@ -1488,8 +1444,18 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
> total_rx_packets++;
> }
>
> + rx_ring->next_to_clean = ntc;
> rx_ring->skb = skb;
>
> + if (to_refill >= IAVF_RX_BUFFER_WRITE) {
> + to_refill = __iavf_alloc_rx_buffers(rx_ring, to_refill, gfp);
> + /* guarantee a trip back through this routine if there was
> + * a failure
> + */
> + if (unlikely(to_refill))
> + cleaned_count = budget;
> + }
> +
> u64_stats_update_begin(&rx_ring->syncp);
> rx_ring->stats.packets += total_rx_packets;
> rx_ring->stats.bytes += total_rx_bytes;
> @@ -1497,8 +1463,7 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
> rx_ring->q_vector->rx.total_packets += total_rx_packets;
> rx_ring->q_vector->rx.total_bytes += total_rx_bytes;
>
> - /* guarantee a trip back through this routine if there was a failure */
> - return failure ? budget : (int)total_rx_packets;
> + return cleaned_count;
> }
>
> static inline u32 iavf_buildreg_itr(const int type, u16 itr)
> diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.h b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
> index 234e189c1987..9c6661a6edf2 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_txrx.h
> +++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
> @@ -383,7 +383,6 @@ struct iavf_ring {
> struct iavf_q_vector *q_vector; /* Backreference to associated vector */
>
> struct rcu_head rcu; /* to avoid race on free */
> - u16 next_to_alloc;
> struct sk_buff *skb; /* When iavf_clean_rx_ring_irq() must
> * return before it sees the EOP for
> * the current packet, we save that skb
> @@ -426,7 +425,7 @@ static inline unsigned int iavf_rx_pg_order(struct iavf_ring *ring)
>
> #define iavf_rx_pg_size(_ring) (PAGE_SIZE << iavf_rx_pg_order(_ring))
>
> -bool iavf_alloc_rx_buffers(struct iavf_ring *rxr, u16 cleaned_count);
> +void iavf_alloc_rx_buffers(struct iavf_ring *rxr);
> netdev_tx_t iavf_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
> void iavf_clean_tx_ring(struct iavf_ring *tx_ring);
> void iavf_clean_rx_ring(struct iavf_ring *rx_ring);


2023-05-31 11:19:36

by Maciej Fijalkowski

Subject: Re: [PATCH net-next v2 03/12] iavf: optimize Rx buffer allocation a bunch

On Tue, May 30, 2023 at 09:18:40AM -0700, Alexander H Duyck wrote:

FWIW I agree with what Alex is saying over here.

> On Thu, 2023-05-25 at 14:57 +0200, Alexander Lobakin wrote:
> > The Rx hotpath code of IAVF is not well-optimized TBH. Before doing any
> > further buffer model changes, shake it up a bit. Notably:
> >
> > 1. Cache more variables on the stack.
> > DMA device, Rx page size, NTC -- these are the most common things
> > used all throughout the hotpath, often in loops on each iteration.
> > Instead of fetching (or even calculating, as with the page size) them
> > from the ring all the time, cache them on the stack at the beginning
> > of the NAPI polling callback. NTC will be written back at the end,
> > the rest are used read-only, so no sync needed.
>
> The advantage of this is going to vary based on the attribute. One of
> the reasons why I left most of this on the ring is that the section
> of the ring holding most of these variables is meant to be read-mostly,
> so it shouldn't have resulted in any additional overhead versus
> accessing them from the stack.

I believe it depends on the ring struct layout, which varies across our
drivers, no? On ice, making more use of the stack as described above
improved perf.

>
> > 2. Don't move the recycled buffers around the ring.
> > The idea of passing the page of the just-recycled buffer to a
> > different buffer (in this case, the first one that needs to be
> > allocated), and doing that on each new frame, is fundamentally wrong.
> > It involves a few fetches, branches and then writes (and one Rx buffer
> > struct is at least 32 bytes) where they're completely unneeded, but
> > gives no benefit -- the result is the same as if we'd recycled it in
> > place, at the same position where it was used. So drop this and let
> > the main refilling function take care of all the buffers that were
> > processed and now need to be recycled/refilled.
>
> The next_to_alloc logic was put in place to deal with systems that are
> experiencing memory issues. Specifically what can end up happening is
> that the ring can stall due to failing memory allocations and the
> memory can get stuck on the ring. For that reason we were essentially
> defragmenting the buffers when we started suffering memory pressure so
> that they could be reused and/or freed following immediate use.
>
> Basically what you are trading off is some exception handling for
> performance by removing it.

With the mix of changes this patch carries, I find it hard to tell
from the description which parts of the diff I should be looking at.

>
> > 3. Don't allocate with %GFP_ATOMIC on ifup.
> > This involved introducing the @gfp parameter to a couple of functions.
> > Doesn't change anything for Rx -> softirq.
>
> Any specific reason for this? Just wondering if this is meant to
> address some sort of memory pressure issue since it basically just
> means the allocation can go out and try to free other memory.
>
> > 4. 1 budget unit == 1 descriptor, not skb.
> > There could be underflow when receiving a lot of fragmented frames.
> > If each of them consisted of 2 frags, it would mean we'd process
> > 64 descriptors at the point where we pass the 32nd skb to the stack.
> > But the driver would count that as only half the budget, which could
> > make NAPI re-enable interrupts prematurely and create unnecessary CPU
> > load.
>
> Not sure I agree with this. The problem is the overhead for an skb
> going up the stack versus a fragment is pretty significant. Keep in
> mind that most of the overhead for a single buffer occurs w/
> napi_gro_receive and is not actually at the driver itself. The whole
> point of the budget is to meter out units of work, not to keep you in
> the busy loop. This starts looking like the old code where the Intel
> drivers were returning either budget or 0 instead of supporting the
> middle ground.
>
> > 5. Shortcut !size case.
> > It's super rare, but possible -- for example, if the last buffer of
> > the fragmented frame contained only FCS, which was then stripped by
> > the HW. Instead of checking for size several times when processing,
> > quickly reuse the buffer and jump to the skb fields part.
> > 6. Refill the ring after finishing the polling loop.
> > Previously, the loop wasn't starting a new iteration after the 64th
> > desc, meaning that we were always leaving 16 buffers non-refilled
> > until the next NAPI poll. It's better to refill them while they're
> > still hot, so do that right after exiting the loop as well.
> > For a full cycle of 64 descs, there will be 4 refills of 16 descs
> > from now on.
> >
> > Function: add/remove: 4/2 grow/shrink: 0/5 up/down: 473/-647 (-174)
> >
> > + up to 2% performance.
> >
>
> What is the test you saw the 2% performance improvement in? Is it
> something XDP related or a full stack test?

+1, can you say more about that measurement?

>
> > Signed-off-by: Alexander Lobakin <[email protected]>
>
> Also one thing I am not a huge fan of is a patch that is really a
> patchset unto itself. With all 6 items called out here I would have
> preferred to see this as 6 patches as it would have been easier to
> review.

+1

>
> > ---
> > drivers/net/ethernet/intel/iavf/iavf_main.c | 2 +-
> > drivers/net/ethernet/intel/iavf/iavf_txrx.c | 259 +++++++++-----------
> > drivers/net/ethernet/intel/iavf/iavf_txrx.h | 3 +-
> > 3 files changed, 114 insertions(+), 150 deletions(-)
> >

(...)

> > }
> > @@ -1350,14 +1297,6 @@ static bool iavf_is_non_eop(struct iavf_ring *rx_ring,
> > union iavf_rx_desc *rx_desc,
> > struct sk_buff *skb)
> > {
> > - u32 ntc = rx_ring->next_to_clean + 1;
> > -
> > - /* fetch, update, and store next to clean */
> > - ntc = (ntc < rx_ring->count) ? ntc : 0;
> > - rx_ring->next_to_clean = ntc;
> > -
> > - prefetch(IAVF_RX_DESC(rx_ring, ntc));
> > -
> > /* if we are the last buffer then there is nothing else to do */
> > #define IAVF_RXD_EOF BIT(IAVF_RX_DESC_STATUS_EOF_SHIFT)
> > if (likely(iavf_test_staterr(rx_desc, IAVF_RXD_EOF)))
>
> You may want to see if you can get rid of this function entirely,
> perhaps you do in a later patch. This function was added for ixgbe back
> in the day to allow us to place the skb back in the ring for the RSC
> based workloads where we had to deal with interleaved frames in the Rx
> path.
>
> For example, one question here would be why are we passing skb? It
> isn't used as far as I can tell.
>
This was used back when the skb was stored within the Rx buffer; now
we just store the skb on the Rx ring struct, so good catch, this arg
is redundant.

I'll go and take a look at code on v3.

2023-05-31 15:45:59

by Alexander Lobakin

Subject: Re: [PATCH net-next v2 03/12] iavf: optimize Rx buffer allocation a bunch

From: Maciej Fijalkowski <[email protected]>
Date: Wed, 31 May 2023 13:14:12 +0200

> On Tue, May 30, 2023 at 09:18:40AM -0700, Alexander H Duyck wrote:
>
> FWIW I agree with what Alex is saying over here.

There are 2 Alexes, "choose wisely" :P

>
>> On Thu, 2023-05-25 at 14:57 +0200, Alexander Lobakin wrote:
>>> The Rx hotpath code of IAVF is not well-optimized TBH. Before doing any
>>> further buffer model changes, shake it up a bit. Notably:
>>>
>>> 1. Cache more variables on the stack.
>>> DMA device, Rx page size, NTC -- these are the most common things
>>> used all throughout the hotpath, often in loops on each iteration.
>>> Instead of fetching (or even calculating, as with the page size) them
>>> from the ring all the time, cache them on the stack at the beginning
>>> of the NAPI polling callback. NTC will be written back at the end,
>>> the rest are used read-only, so no sync needed.
>>
>> The advantage of this is going to vary based on the attribute. One of
>> the reasons why I left most of this on the ring is that the section
>> of the ring holding most of these variables is meant to be read-mostly,
>> so it shouldn't have resulted in any additional overhead versus
>> accessing them from the stack.
>
> I believe it depends on the ring struct layout, which varies across
> our drivers, no? On ice, making more use of the stack as described
> above improved perf.

It's +/- the same; most layout changes usually come from us moving
stuff around to optimize paddings and cachelines lol. It's the same
here as with ice: I don't think it's driver-specific to get some
positive results from shortcutting more hot fields. The sole time I
was surprised was when you were getting worse results storing xdp_buff
on the stack vs on the ring :D

>
>>
>>> 2. Don't move the recycled buffers around the ring.
>>> The idea of passing the page of the just-recycled buffer to a
>>> different buffer (in this case, the first one that needs to be
>>> allocated), and doing that on each new frame, is fundamentally wrong.
>>> It involves a few fetches, branches and then writes (and one Rx buffer
>>> struct is at least 32 bytes) where they're completely unneeded, but
>>> gives no benefit -- the result is the same as if we'd recycled it in
>>> place, at the same position where it was used. So drop this and let
>>> the main refilling function take care of all the buffers that were
>>> processed and now need to be recycled/refilled.
>>
>> The next_to_alloc logic was put in place to deal with systems that are
>> experiencing memory issues. Specifically what can end up happening is
>> that the ring can stall due to failing memory allocations and the
>> memory can get stuck on the ring. For that reason we were essentially
>> defragmenting the buffers when we started suffering memory pressure so
>> that they could be reused and/or freed following immediate use.
>>
>> Basically what you are trading off is some exception handling for
>> performance by removing it.
>
> With all of the mix of the changes this patch carries, I find it hard to
> follow from description which parts of diff I should be looking at.

Huge piece removed before add_rx_frag_blah.

>
>>
>>> 3. Don't allocate with %GFP_ATOMIC on ifup.
>>> This involved introducing the @gfp parameter to a couple of functions.
>>> Doesn't change anything for Rx -> softirq.

[...]

>>> + up to 2% performance.
>>>
>>
>> What is the test you saw the 2% performance improvement in? Is it
>> something XDP related or a full stack test?
>
> +1, can you say more about that measurement?

My prev reply to Alex.

>
>>
>>> Signed-off-by: Alexander Lobakin <[email protected]>
>>
>> Also one thing I am not a huge fan of is a patch that is really a
>> patchset unto itself. With all 6 items called out here I would have
>> preferred to see this as 6 patches as it would have been easier to
>> review.
>
> +1

+1 :D

[...]

>>> /* if we are the last buffer then there is nothing else to do */
>>> #define IAVF_RXD_EOF BIT(IAVF_RX_DESC_STATUS_EOF_SHIFT)
>>> if (likely(iavf_test_staterr(rx_desc, IAVF_RXD_EOF)))
>>
>> You may want to see if you can get rid of this function entirely,
>> perhaps you do in a later patch. This function was added for ixgbe back
>> in the day to allow us to place the skb back in the ring for the RSC
>> based workloads where we had to deal with interleaved frames in the Rx
>> path.
>>
>> For example, one question here would be why are we passing skb? It
>> isn't used as far as I can tell.
>>
> This was used back when the skb was stored within the Rx buffer; now
> we just store the skb on the Rx ring struct, so good catch, this arg
> is redundant.

Also prev reply. I'm removing it later in the series hehe.

>
> I'll go and take a look at code on v3.

No changes apart from fixing OcteonTX2 compilation =\

Thanks,
Olek

2023-05-31 15:46:21

by Alexander Lobakin

Subject: Re: [Intel-wired-lan] [PATCH net-next v2 03/12] iavf: optimize Rx buffer allocation a bunch

From: Alexander H Duyck <[email protected]>
Date: Tue, 30 May 2023 09:18:40 -0700

> On Thu, 2023-05-25 at 14:57 +0200, Alexander Lobakin wrote:
>> The Rx hotpath code of IAVF is not well-optimized TBH. Before doing any
>> further buffer model changes, shake it up a bit. Notably:
>>
>> 1. Cache more variables on the stack.
>> DMA device, Rx page size, NTC -- these are the most common things
>> used all throughout the hotpath, often in loops on each iteration.
>> Instead of fetching (or even calculating, as with the page size) them
>> from the ring all the time, cache them on the stack at the beginning
>> of the NAPI polling callback. NTC will be written back at the end,
>> the rest are used read-only, so no sync needed.
>
> The advantage of this is going to vary based on the attribute. One of
> the reasons why I left most of this on the ring is that the section
> of the ring holding most of these variables is meant to be read-mostly,
> so it shouldn't have resulted in any additional overhead versus
> accessing them from the stack.

But not all of these variables are read-only. E.g. NTC is often
modified. The page size was calculated per descriptor, but can be
computed once when a poll cycle starts, and so on.
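
For example (condensed from the truesize math in the diff,
PAGE_SIZE < 8192 branch):

	/* before: derived from the ring for every processed buffer */
	unsigned int truesize = iavf_rx_pg_size(rx_ring) / 2;
		/* == (PAGE_SIZE << iavf_rx_pg_order(rx_ring)) / 2 */

	/* after: computed once when the poll starts, then passed down */
	u32 pg_size = iavf_rx_pg_size(rx_ring);
	/* ... */
	unsigned int truesize = pg_size / 2;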

>
>> 2. Don't move the recycled buffers around the ring.
>> The idea of passing the page of the just-recycled buffer to a
>> different buffer (in this case, the first one that needs to be
>> allocated), and doing that on each new frame, is fundamentally wrong.
>> It involves a few fetches, branches and then writes (and one Rx buffer
>> struct is at least 32 bytes) where they're completely unneeded, but
>> gives no benefit -- the result is the same as if we'd recycled it in
>> place, at the same position where it was used. So drop this and let
>> the main refilling function take care of all the buffers that were
>> processed and now need to be recycled/refilled.
>
> The next_to_alloc logic was put in place to deal with systems that are
> experiencing memory issues. Specifically what can end up happening is
> that the ring can stall due to failing memory allocations and the
> memory can get stuck on the ring. For that reason we were essentially
> defragmenting the buffers when we started suffering memory pressure so
> that they could be reusued and/or freed following immediate use.
>
> Basically what you are trading off is some exception handling for
> performance by removing it.

I'm not sure this helps a lot, but OTOH it really slows things down,
esp. given that this code runs all the time, not only when a memory
allocation failure happens.
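
For context, a rough sketch of the two schemes (hypothetical helper and
field names, not the actual iavf code):

/* Old scheme: copy the recycled page into the next-to-alloc slot,
 * fetching and writing a second buffer struct on every frame.
 */
static void rx_reuse_page(struct rx_ring *ring, struct rx_buffer *old)
{
	struct rx_buffer *new = &ring->bufs[ring->next_to_alloc];

	if (unlikely(++ring->next_to_alloc == ring->count))
		ring->next_to_alloc = 0;

	new->page = old->page;
	new->dma = old->dma;
	new->page_offset = old->page_offset;
}

/* In-place scheme: nothing to do here at all -- the buffer stays in
 * its slot and the bulk refill function reuses or reallocates it.
 */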

>
>> 3. Don't allocate with %GPF_ATOMIC on ifup.
>> This involved introducing the @gfp parameter to a couple functions.
>> Doesn't change anything for Rx -> softirq.
>
> Any specific reason for this? Just wondering if this is meant to
> address some sort of memory pressure issue since it basically just
> means the allocation can go out and try to free other memory.

Yes, I'm no MM expert, but I've seen plenty of messages from the MM
folks that ATOMIC shouldn't be used in non-atomic contexts. Atomic
allocation is able to grab memory from some sort of critical reserves
and all that, and the less we touch those, the better. Outside of
atomic contexts they should not be touched.
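
A minimal sketch of the @gfp plumbing (the wrapper/helper names here are
assumptions loosely following the patch; IAVF_RX_DMA_ATTR is the
driver's existing DMA attrs):

static struct page *iavf_alloc_rx_page(struct device *dev, dma_addr_t *dma,
				       gfp_t gfp)
{
	struct page *page = __dev_alloc_page(gfp);

	if (unlikely(!page))
		return NULL;

	*dma = dma_map_page_attrs(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE,
				  IAVF_RX_DMA_ATTR);
	if (dma_mapping_error(dev, *dma)) {
		__free_page(page);
		return NULL;
	}

	return page;
}

/* process context (ifup): full-ring fill, may sleep and reclaim */
iavf_alloc_rx_buffers(rx_ring); /* uses GFP_KERNEL internally */
/* softirq (NAPI refill): must not sleep -- assumed internal helper */
__iavf_alloc_rx_buffers(rx_ring, to_refill, GFP_ATOMIC);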

>
>> 4. 1 budget unit == 1 descriptor, not skb.
>> There could be underflow when receiving a lot of fragmented frames.
>> If each of them would consist of 2 frags, it means that we'd process
>> 64 descriptors at the point where we pass the 32th skb to the stack.
>> But the driver would count that only as a half, which could make NAPI
>> re-enable interrupts prematurely and create unnecessary CPU load.
>
> Not sure I agree with this. The problem is the overhead for an skb
> going up the stack versus a fragment are pretty signficant. Keep in
> mind that most of the overhead for a single buffer occurs w/
> napi_gro_receive and is not actually at the driver itself. The whole
> point of the budget is to meter out units of work, not to keep you in
> the busy loop. This starts looking like the old code where the Intel
> drivers were returning either budget or 0 instead of supporting the
> middle ground.

No, certainly not this :D

The point of budget is to limit the amount of time drivers can spend on
cleaning their rings. Making the skb the unit is logical and flexible,
but I'd say the unit should always have a fixed cost. Imagine you get a
frame which got spanned across 5 buffers. You spend roughly 5x the time
to build an skb and pass it up the stack vs when you get a linear frame
in one buffer, but according to your logic both of these cases count as
1 unit, while the amount of time spent differs significantly. I can't
say that's fair.
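
For reference, a sketch of the descriptor-based accounting (the helper
names match the driver, but the loop is heavily abridged):

u32 total_rx_descs = 0;

while (likely(total_rx_descs < budget)) {
	/* ...fetch the descriptor, add its buffer to the skb... */
	total_rx_descs++;		/* 1 descriptor == 1 budget unit */

	if (iavf_is_non_eop(rx_ring, rx_desc, skb))
		continue;		/* frame spans more buffers */

	napi_gro_receive(&rx_ring->q_vector->napi, skb);
	skb = NULL;
}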

>
>> 5. Shortcut !size case.
>> It's super rare, but possible -- for example, if the last buffer of
>> the fragmented frame contained only FCS, which was then stripped by
>> the HW. Instead of checking for size several times when processing,
>> quickly reuse the buffer and jump to the skb fields part.
>> 6. Refill the ring after finishing the polling loop.
>> Previously, the loop wasn't starting a new iteration after the 64th
>> desc, meaning that we were always leaving 16 buffers non-refilled
>> until the next NAPI poll. It's better to refill them while they're
>> still hot, so do that right after exiting the loop as well.
>> For a full cycle of 64 descs, there will be 4 refills of 16 descs
>> from now on.
>>
>> Function: add/remove: 4/2 grow/shrink: 0/5 up/down: 473/-647 (-174)
>>
>> + up to 2% performance.
>>
>
> What is the test you saw the 2% performance improvement in? Is it
> something XDP related or a full stack test?

Not XDP, it's not present in this driver at this point :D
Stack test, but without the usercopy overhead. Trafgen bombs the NIC,
the driver builds skbs and passes them up the stack, the stack does GRO
etc., and then the frames get dropped on IP input because there's no
socket.

>
>> Signed-off-by: Alexander Lobakin <[email protected]>
>
> Also one thing I am not a huge fan of is a patch that is really a
> patchset onto itself. With all 6 items called out here I would have
> preferred to see this as 6 patches as it would have been easier to
> review.

Agree BTW, I'm not a fan of this patch either. I wasn't sure what to do
with it, as splitting it into 6 explodes the series into a monster, but
proceeding without it increases the diffstat and complicates things later
on. I'll try the latter, but we'll see. 17 patches is not the End of Days
after all.

>
>> ---
>> drivers/net/ethernet/intel/iavf/iavf_main.c | 2 +-
>> drivers/net/ethernet/intel/iavf/iavf_txrx.c | 259 +++++++++-----------
>> drivers/net/ethernet/intel/iavf/iavf_txrx.h | 3 +-
>> 3 files changed, 114 insertions(+), 150 deletions(-)

[...]

>> @@ -943,23 +945,17 @@ bool iavf_alloc_rx_buffers(struct iavf_ring *rx_ring, u16 cleaned_count)
>>
>> /* clear the status bits for the next_to_use descriptor */
>> rx_desc->wb.qword1.status_error_len = 0;
>> -
>> - cleaned_count--;
>> - } while (cleaned_count);
>> + } while (--to_refill);
>
> Just a nit. You might want to break this up into two statements like I
> had before. I know some people within Intel weren't a huge fan of when
> I used to do that kind of thing all the time in loops where I would do
> the decrement and test in one line.. :)

Should I please them or do it the way I want? :D I realize that from the
compiler's PoV it's most likely the same, but dunno, why not.
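
For the record, the two spellings under debate (identical object code in
practice):

/* one-liner, as in the patch */
do {
	/* refill one buffer + descriptor */
} while (--to_refill);

/* split variant some reviewers prefer */
do {
	/* refill one buffer + descriptor */
	to_refill--;
} while (to_refill);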

>
>>
>> if (rx_ring->next_to_use != ntu)
>> iavf_release_rx_desc(rx_ring, ntu);
>>
>> - return false;

[...]

>> /* if we are the last buffer then there is nothing else to do */
>> #define IAVF_RXD_EOF BIT(IAVF_RX_DESC_STATUS_EOF_SHIFT)
>> if (likely(iavf_test_staterr(rx_desc, IAVF_RXD_EOF)))
>
> You may want to see if you can get rid of this function entirely,
> perhaps you do in a later patch. This function was added for ixgbe back
> in the day to allow us to place the skb back in the ring for the RSC
> based workloads where we had to deal with interleaved frames in the Rx
> path.
>
> For example, one question here would be why are we passing skb? It
> isn't used as far as I can tell.

Yes, I'm optimizing all this out later in the series. I was surprised
just as much as you when I saw skb getting passed to do nothing ._.

[...]

Thanks for the detailed reviews, stuff that Intel often lacks :s :D

Olek

2023-05-31 17:40:43

by Alexander Duyck

[permalink] [raw]
Subject: Re: [Intel-wired-lan] [PATCH net-next v2 03/12] iavf: optimize Rx buffer allocation a bunch

On Wed, May 31, 2023 at 8:14 AM Alexander Lobakin
<[email protected]> wrote:
>
> From: Alexander H Duyck <[email protected]>
> Date: Tue, 30 May 2023 09:18:40 -0700
>
> > On Thu, 2023-05-25 at 14:57 +0200, Alexander Lobakin wrote:
> >> The Rx hotpath code of IAVF is not well-optimized TBH. Before doing any
> >> further buffer model changes, shake it up a bit. Notably:
> >>
> >> 1. Cache more variables on the stack.
> >> DMA device, Rx page size, NTC -- these are the most common things
> >> used all throughout the hotpath, often in loops on each iteration.
> >> Instead of fetching (or even calculating, as with the page size) them
> >> from the ring all the time, cache them on the stack at the beginning
> >> of the NAPI polling callback. NTC will be written back at the end,
> >> the rest are used read-only, so no sync needed.
> >
> > The advantage of this is going to vary based on the attribute. One of
> > the reasons why I left most of this on the ring is because the section
> > of the ring most of these variables were meant to be read-mostly and
> > shouldn't have resulted in any additional overhead versus accessing
> > them from the stack.
>
> But not all of these variables are read-only. E.g. NTC is often
> modified. Page size was calculated per descriptor, but could be once a
> poll cycle starts, and so on.

Yeah, the ntc should be carried on the stack. The only reason for
using the ring variable was because in the case of ixgbe we had to do
some tricks with it to deal with RSC, as we were either accessing ntc
or the buffer pointed to by the descriptor. I think most of that code
has been removed for i40e though.

> >
> >> 2. Don't move the recycled buffers around the ring.
> >> The idea of passing the page of the right-now-recycled-buffer to a
> >> different buffer, in this case, the first one that needs to be
> >> allocated, moreover, on each new frame, is fundamentally wrong. It
> >> involves a few o' fetches, branches and then writes (and one Rx
> >> buffer struct is at least 32 bytes) where they're completely unneeded,
> >> but gives no good -- the result is the same as if we'd recycle it
> >> inplace, at the same position where it was used. So drop this and let
> >> the main refilling function take care of all the buffers, which were
> >> processed and now need to be recycled/refilled.
> >
> > The next_to_alloc logic was put in place to deal with systems that are
> > experiencing memory issues. Specifically what can end up happening is
> > that the ring can stall due to failing memory allocations and the
> > memory can get stuck on the ring. For that reason we were essentially
> > defragmenting the buffers when we started suffering memory pressure so
> > that they could be reusued and/or freed following immediate use.
> >
> > Basically what you are trading off is some exception handling for
> > performance by removing it.
>
> I'm not sure this helps a lot, but OTOH this really slows down things,
> esp. given that this code is run all the time, not only when a memory
> allocation fail happens.

The thing is, when things break they have to be fixed. For our purposes,
since you are getting rid of the recycling anyway, it is going to be
ripped out one way or the other. I just wanted to provide a bit of
history as to why it is being done that way.

> >
> >> 3. Don't allocate with %GPF_ATOMIC on ifup.
> >> This involved introducing the @gfp parameter to a couple functions.
> >> Doesn't change anything for Rx -> softirq.
> >
> > Any specific reason for this? Just wondering if this is meant to
> > address some sort of memory pressure issue since it basically just
> > means the allocation can go out and try to free other memory.
>
> Yes, I'm no MM expert, but I've seen plenty of times messages from the
> MM folks that ATOMIC shouldn't be used in non-atomic contexts. Atomic
> allocation is able to grab memory from some sort of critical reservs and
> all that, and the less we touch them, the better. Outside of atomic
> contexts they should not be touched.

For our purposes though the Rx path is more-or-less always in
interrupt context. That is why it had defaulted to just always using
GFP_ATOMIC. For your purposes you could probably leave it that way
since you are going to be pulling out most of this code anyway.

> >
> >> 4. 1 budget unit == 1 descriptor, not skb.
> >> There could be underflow when receiving a lot of fragmented frames.
> >> If each of them would consist of 2 frags, it means that we'd process
> >> 64 descriptors at the point where we pass the 32th skb to the stack.
> >> But the driver would count that only as a half, which could make NAPI
> >> re-enable interrupts prematurely and create unnecessary CPU load.
> >
> > Not sure I agree with this. The problem is the overhead for an skb
> > going up the stack versus a fragment are pretty signficant. Keep in
> > mind that most of the overhead for a single buffer occurs w/
> > napi_gro_receive and is not actually at the driver itself. The whole
> > point of the budget is to meter out units of work, not to keep you in
> > the busy loop. This starts looking like the old code where the Intel
> > drivers were returning either budget or 0 instead of supporting the
> > middle ground.
>
> No, certainly not this :D
>
> The point of budget is to limit the amount of time drivers can spend on
> cleaning their rings. Making skb the unit makes the unit very logical
> and flexible, but I'd say it should always be solid. Imagine you get a
> frame which got spanned across 5 buffers. You spend x5 time (roughly) to
> build an skb and pass it up the stack vs when you get a linear frame in
> one buffer, but according to your logics both of these cases count as 1
> unit, while the amount of time spent differs significantly. I can't say
> that's fair enough.

I would say it is. Like I said, most of the overhead is in the stack, not
the driver. So if we are cleaning 5 descriptors but only processing
one skb then I would say it is only one unit in terms of budget. This
is one of the reasons why we don't charge Tx to the NAPI budget. Tx
cleanup is extremely lightweight as it is only freeing memory, and in
cases of Tx and Rx being mixed it can essentially be folded in as Tx
buffers could be reused for Rx.

If we are wanting to increase the work being done per poll it would
make more sense to stick to interrupts and force it to backlog more
packets per interrupt so that it is processing 64 skbs per call.

> >
> >> 5. Shortcut !size case.
> >> It's super rare, but possible -- for example, if the last buffer of
> >> the fragmented frame contained only FCS, which was then stripped by
> >> the HW. Instead of checking for size several times when processing,
> >> quickly reuse the buffer and jump to the skb fields part.
> >> 6. Refill the ring after finishing the polling loop.
> >> Previously, the loop wasn't starting a new iteration after the 64th
> >> desc, meaning that we were always leaving 16 buffers non-refilled
> >> until the next NAPI poll. It's better to refill them while they're
> >> still hot, so do that right after exiting the loop as well.
> >> For a full cycle of 64 descs, there will be 4 refills of 16 descs
> >> from now on.
> >>
> >> Function: add/remove: 4/2 grow/shrink: 0/5 up/down: 473/-647 (-174)
> >>
> >> + up to 2% performance.
> >>
> >
> > What is the test you saw the 2% performance improvement in? Is it
> > something XDP related or a full stack test?
>
> Not XDP, it's not present in this driver at this point :D
> Stack test, but without usercopy overhead. Trafgen bombs the NIC, the
> driver builds skbs and passes it up the stack, the stack does GRO etc,
> and then the frames get dropped on IP input because there's no socket.

So one thing you might want to look at would be a full-stack test w/
something such as netperf versus optimizing for a drop-only test.
Otherwise that can lead to optimizations that will actually hurt
driver performance in the long run.

> >
> >> Signed-off-by: Alexander Lobakin <[email protected]>
> >
> > Also one thing I am not a huge fan of is a patch that is really a
> > patchset onto itself. With all 6 items called out here I would have
> > preferred to see this as 6 patches as it would have been easier to
> > review.
>
> Agree BTW, I'm not a fan of this patch either. I wasn't sure what to do
> with it, as splitting it into 6 explodes the series into a monster, but
> proceeding without it increases diffstat and complicates things later
> on. I'll try the latter, but will see. 17 patches is not the End of Days
> after all.

One thing you may want to consider to condense some of these patches
would be to look at possibly combining patches 4 and 5 which disable
recycling and use a full 4K page. It seems like of those patches one
ends up redoing the other since so many of the dma_sync calls are
updated in both.

> >
> >> ---
> >> drivers/net/ethernet/intel/iavf/iavf_main.c | 2 +-
> >> drivers/net/ethernet/intel/iavf/iavf_txrx.c | 259 +++++++++-----------
> >> drivers/net/ethernet/intel/iavf/iavf_txrx.h | 3 +-
> >> 3 files changed, 114 insertions(+), 150 deletions(-)
>
> [...]
>
> >> @@ -943,23 +945,17 @@ bool iavf_alloc_rx_buffers(struct iavf_ring *rx_ring, u16 cleaned_count)
> >>
> >> /* clear the status bits for the next_to_use descriptor */
> >> rx_desc->wb.qword1.status_error_len = 0;
> >> -
> >> - cleaned_count--;
> >> - } while (cleaned_count);
> >> + } while (--to_refill);
> >
> > Just a nit. You might want to break this up into two statements like I
> > had before. I know some people within Intel weren't a huge fan of when
> > I used to do that kind of thing all the time in loops where I would do
> > the decrement and test in one line.. :)
>
> Should I please them or do it as I want to? :D I realize from the
> compiler's PoV it's most likely the same, but dunno, why not.

If nobody internally is bugging you about it then I am fine with it. I
just know back during my era people would complain about that from a
maintainability perspective. I guess I got trained to catch those kinds
of things as a result.

> >
> >>
> >> if (rx_ring->next_to_use != ntu)
> >> iavf_release_rx_desc(rx_ring, ntu);
> >>
> >> - return false;
>
> [...]
>
> >> /* if we are the last buffer then there is nothing else to do */
> >> #define IAVF_RXD_EOF BIT(IAVF_RX_DESC_STATUS_EOF_SHIFT)
> >> if (likely(iavf_test_staterr(rx_desc, IAVF_RXD_EOF)))
> >
> > You may want to see if you can get rid of this function entirely,
> > perhaps you do in a later patch. This function was added for ixgbe back
> > in the day to allow us to place the skb back in the ring for the RSC
> > based workloads where we had to deal with interleaved frames in the Rx
> > path.
> >
> > For example, one question here would be why are we passing skb? It
> > isn't used as far as I can tell.
>
> Yes, I'm optimizing all this out later in the series. I was surprised
> just as much as you when I saw skb getting passed to do nothing ._.

The funny part for me is that it is like reviewing code written via a
game of telephone. I recognize the code but have to think about it
since there are all the bits of changes and such from the original
ixgbe.

> [...]
>
> Thanks for the detailed reviews, stuff that Intel often lacks :s :D

No problem, it was the least I could do since I am responsible for so
much of this code in the earlier drivers anyway. If nothing else I
figured I could provide a bit of history on why some of this was the
way it was.

2023-06-02 14:25:44

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [Intel-wired-lan] [PATCH net-next v2 03/12] iavf: optimize Rx buffer allocation a bunch

From: Alexander Duyck <[email protected]>
Date: Wed, 31 May 2023 10:22:18 -0700

> On Wed, May 31, 2023 at 8:14 AM Alexander Lobakin
> <[email protected]> wrote:

[...]

>> But not all of these variables are read-only. E.g. NTC is often
>> modified. Page size was calculated per descriptor, but could be once a
>> poll cycle starts, and so on.
>
> Yeah, the ntc should be carried in the stack. The only reason for
> using the ring variable was because in the case of ixgbe we had to do
> some tricks with it to deal with RSC as we were either accessing ntc
> or the buffer pointed to by the descriptor. I think most of that code
> has been removed for i40e though.

IAVF was forked off ixgbe as per Jesse's statement :D

[...]

>>> Any specific reason for this? Just wondering if this is meant to
>>> address some sort of memory pressure issue since it basically just
>>> means the allocation can go out and try to free other memory.
>>
>> Yes, I'm no MM expert, but I've seen plenty of times messages from the
>> MM folks that ATOMIC shouldn't be used in non-atomic contexts. Atomic
>> allocation is able to grab memory from some sort of critical reservs and
>> all that, and the less we touch them, the better. Outside of atomic
>> contexts they should not be touched.
>
> For our purposes though the Rx path is more-or-less always in
> interrupt context. That is why it had defaulted to just always using
> GFP_ATOMIC. For your purposes you could probably leave it that way
> since you are going to be pulling out most of this code anyway.

That's for the Rx path, but don't forget that the initial allocation on
ifup is done in process context. That's what the maintainers and
reviewers usually warn about: do not allocate with %GFP_ATOMIC on ifup.

[...]

>> The point of budget is to limit the amount of time drivers can spend on
>> cleaning their rings. Making skb the unit makes the unit very logical
>> and flexible, but I'd say it should always be solid. Imagine you get a
>> frame which got spanned across 5 buffers. You spend x5 time (roughly) to
>> build an skb and pass it up the stack vs when you get a linear frame in
>> one buffer, but according to your logics both of these cases count as 1
>> unit, while the amount of time spent differs significantly. I can't say
>> that's fair enough.
>
> I would say it is. Like I said most of the overhead is the stack, not
> the driver. So if we are cleaning 5 descriptors but only processing
> one skb then I would say it is only one unit in terms of budget. This
> is one of the reasons why we don't charge Tx to the NAPI budget. Tx
> clean up is extremely lightweight as it is only freeing memory, and in
> cases of Tx and Rx being mixed can essentially be folded in as Tx
> buffers could be reused for Rx.
>
> If we are wanting to increase the work being done per poll it would
> make more sense to stick to interrupts and force it to backlog more
> packets per interrupt so that it is processing 64 skbs per call.

Oh, I feel like I'm starting to agree :D OK, then the following question
doesn't really get out of my head: why do we store the skb pointer on
the ring then, if we count 1 skb as 1 unit, so that we won't leave the
loop until the EOP? Only to handle allocation failures? But the skb is
already allocated at this point... <confused>

[...]

>>> What is the test you saw the 2% performance improvement in? Is it
>>> something XDP related or a full stack test?
>>
>> Not XDP, it's not present in this driver at this point :D
>> Stack test, but without usercopy overhead. Trafgen bombs the NIC, the
>> driver builds skbs and passes it up the stack, the stack does GRO etc,
>> and then the frames get dropped on IP input because there's no socket.
>
> So one thing you might want to look at would be a full stack test w/
> something such as netperf versus optimizing for a drop only test.
> Otherwise that can lead to optimizations that will actually hurt
> driver performance in the long run.

I was doing some netperf (or that Microsoft's tool, don't remember the
name) tests, but the problem is that usercopy is such a bottleneck that
you don't notice any optimizations or regressions most of the time.
Also, userspace tools usually just pass huge payload chunks which the
drivers then GSO into MTU-sized frames, so you always get line rate and
that's it. Short frames or interleave/imix (randomly mixed sizes) are
the most stressful in my experience and are able to show the actual
outcome.

>
>>>
>>>> Signed-off-by: Alexander Lobakin <[email protected]>
>>>
>>> Also one thing I am not a huge fan of is a patch that is really a
>>> patchset onto itself. With all 6 items called out here I would have
>>> preferred to see this as 6 patches as it would have been easier to
>>> review.
>>
>> Agree BTW, I'm not a fan of this patch either. I wasn't sure what to do
>> with it, as splitting it into 6 explodes the series into a monster, but
>> proceeding without it increases diffstat and complicates things later
>> on. I'll try the latter, but will see. 17 patches is not the End of Days
>> after all.
>
> One thing you may want to consider to condense some of these patches
> would be to look at possibly combining patches 4 and 5 which disable
> recycling and use a full 4K page. It seems like of those patches one
> ends up redoing the other since so many of the dma_sync calls are
> updated in both.

Or maybe I'll move this one into the subsequent series, since it's only
pt. 1 of the Rx optimizations. There's also a second commit, but it's
probably as messy as this one, and the two could just be converted into
a series of their own.

[...]

>>> Just a nit. You might want to break this up into two statements like I
>>> had before. I know some people within Intel weren't a huge fan of when
>>> I used to do that kind of thing all the time in loops where I would do
>>> the decrement and test in one line.. :)
>>
>> Should I please them or do it as I want to? :D I realize from the
>> compiler's PoV it's most likely the same, but dunno, why not.
>
> If nobody internally is bugging you about it then I am fine with it. I
> just know back during my era people would complain about that from a
> maintainability perspective. I guess I got trained to catch those kind
> of things as a result.

Haha, understood. I usually say: "please give some good arguments or I
didn't hear this"; maybe that's why nobody has complained about `--var`
yet :D

[...]

>> Yes, I'm optimizing all this out later in the series. I was surprised
>> just as much as you when I saw skb getting passed to do nothing ._.
>
> The funny part for me is that it is like reviewing code written via a
> game of telephone. I recognize the code but have to think about it
> since there are all the bits of changes and such from the original
> ixgbe.

Lots of things are still recognizable even in IDPF. That's how this
series was born... :D

>
>> [...]
>>
>> Thanks for the detailed reviews, stuff that Intel often lacks :s :D
>
> No problem, it was the least I could do since I am responsible for so
> much of this code in the earlier drivers anyway. If nothing else I
> figured I could provide a bit of history on why some of this was the
> way it was.

These history bits are nice and interesting to read actually! And also
useful since they give some context and understanding of what is
obsolete and can be removed/changed.

Thanks,
Olek

2023-06-02 15:09:30

by Alexander Duyck

[permalink] [raw]
Subject: Re: [Intel-wired-lan] [PATCH net-next v2 03/12] iavf: optimize Rx buffer allocation a bunch

On Fri, Jun 2, 2023 at 7:00 AM Alexander Lobakin
<[email protected]> wrote:
>
> From: Alexander Duyck <[email protected]>
> Date: Wed, 31 May 2023 10:22:18 -0700
>
> > On Wed, May 31, 2023 at 8:14 AM Alexander Lobakin
> > <[email protected]> wrote:
>
> [...]
>
> >> But not all of these variables are read-only. E.g. NTC is often
> >> modified. Page size was calculated per descriptor, but could be once a
> >> poll cycle starts, and so on.
> >
> > Yeah, the ntc should be carried in the stack. The only reason for
> > using the ring variable was because in the case of ixgbe we had to do
> > some tricks with it to deal with RSC as we were either accessing ntc
> > or the buffer pointed to by the descriptor. I think most of that code
> > has been removed for i40e though.
>
> IAVF was forked off ixgbe as per Jesse's statement :D

Yes, but the point is they are forked off the same driver and this code
has fallen a bit behind i40e. Really, both should probably have been
updated at the same time.

The fact is everything since igb is more or less based on the same
design. I just kept tweaking it as I moved from one driver to the
next. So in terms of refactoring to use a common library you could
probably go back that far without too much trouble. The only
exceptions to all that are fm10k and igbvf which while being similar
also have some significant design differences that might make it a bit
more difficult.

> [...]
>
> >>> Any specific reason for this? Just wondering if this is meant to
> >>> address some sort of memory pressure issue since it basically just
> >>> means the allocation can go out and try to free other memory.
> >>
> >> Yes, I'm no MM expert, but I've seen plenty of times messages from the
> >> MM folks that ATOMIC shouldn't be used in non-atomic contexts. Atomic
> >> allocation is able to grab memory from some sort of critical reservs and
> >> all that, and the less we touch them, the better. Outside of atomic
> >> contexts they should not be touched.
> >
> > For our purposes though the Rx path is more-or-less always in
> > interrupt context. That is why it had defaulted to just always using
> > GFP_ATOMIC. For your purposes you could probably leave it that way
> > since you are going to be pulling out most of this code anyway.
>
> That's for Rx path, but don't forget that the initial allocation on ifup
> is done in the process context. That's what the maintainers and
> reviewers usually warn about: to not allocate with %GFP_ATOMIC on ifups.

I can see that for the static values like the queue vectors and rings;
for the buffers themselves, however, I don't see the point in doing
that for the regular buffer allocations. Basically it is adding
overhead for something that should have minimal impact, as it usually
happens early on during boot when the memory should be free anyway, so
GFP_ATOMIC vs GFP_KERNEL wouldn't have much impact in either case.

> [...]
>
> >> The point of budget is to limit the amount of time drivers can spend on
> >> cleaning their rings. Making skb the unit makes the unit very logical
> >> and flexible, but I'd say it should always be solid. Imagine you get a
> >> frame which got spanned across 5 buffers. You spend x5 time (roughly) to
> >> build an skb and pass it up the stack vs when you get a linear frame in
> >> one buffer, but according to your logics both of these cases count as 1
> >> unit, while the amount of time spent differs significantly. I can't say
> >> that's fair enough.
> >
> > I would say it is. Like I said most of the overhead is the stack, not
> > the driver. So if we are cleaning 5 descriptors but only processing
> > one skb then I would say it is only one unit in terms of budget. This
> > is one of the reasons why we don't charge Tx to the NAPI budget. Tx
> > clean up is extremely lightweight as it is only freeing memory, and in
> > cases of Tx and Rx being mixed can essentially be folded in as Tx
> > buffers could be reused for Rx.
> >
> > If we are wanting to increase the work being done per poll it would
> > make more sense to stick to interrupts and force it to backlog more
> > packets per interrupt so that it is processing 64 skbs per call.
>
> Oh, I feel like I'm starting to agree :D OK, then the following doesn't
> really get out of my head: why do we store skb pointer on the ring then,
> if we count 1 skb as 1 unit, so that we won't leave the loop until the
> EOP? Only to handle allocation failures? But skb is already allocated at
> this point... <confused>

The skb is there to essentially hold the frags. Keep in mind that when
ixgbe was coded up XDP didn't exist yet.

I think there are drivers that are already getting away from this,
such as mvneta, by storing an xdp_buff instead of an skb. In theory we
could do away with most of this and just use a shared_info structure,
but since that exists in the first frag we still need a pointer to the
first frag as well.
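
A rough sketch of that pattern (hypothetical names; mvneta and later ice
do something along these lines):

/* Keep an xdp_buff, not an skb, on the ring for the in-progress frame.
 * The frag array lives in the skb_shared_info at the end of the first
 * buffer, so an xdp_buff pointing at that buffer is all that's needed.
 */
struct my_rx_queue {
	struct xdp_buff xdp;		/* in-flight multi-buffer frame */
};

static void my_rx_add_frag(struct xdp_buff *xdp, struct page *page,
			   u32 off, u32 size)
{
	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
	skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags++];

	__skb_frag_set_page(frag, page);
	skb_frag_off_set(frag, off);
	skb_frag_size_set(frag, size);
}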

Also multi-frag frames are typically not that likely on a normal
network as most of the frames are less than 1514B in length. In
addition as I mentioned before a jumbo frame workload will be less
demanding since the frame rates are so much lower. So when I coded
this up I had optimized for the non-fragged case with the fragmented
case being more of an afterthought needed mostly as exception
handling.

> [...]
>
> >>> What is the test you saw the 2% performance improvement in? Is it
> >>> something XDP related or a full stack test?
> >>
> >> Not XDP, it's not present in this driver at this point :D
> >> Stack test, but without usercopy overhead. Trafgen bombs the NIC, the
> >> driver builds skbs and passes it up the stack, the stack does GRO etc,
> >> and then the frames get dropped on IP input because there's no socket.
> >
> > So one thing you might want to look at would be a full stack test w/
> > something such as netperf versus optimizing for a drop only test.
> > Otherwise that can lead to optimizations that will actually hurt
> > driver performance in the long run.
>
> I was doing some netperf (or that Microsoft's tool, don't remember the
> name) tests, but the problem is that usercopy is such a bottleneck, so
> that you don't notice any optimizations or regressions most of time.
> Also, userspace tools usually just pass huge payload chunks and then the
> drivers GSO them into MTU-sized frames, so you always get line rate and
> that's it. Short frames or interleave/imix (randomly-mix-sized) are the
> most stressful from my experience and are able to show actual outcome.

That is kind of what I figured. So one thing to watch out for is
stating performance improvements without providing context on what
exactly it is you are doing to see that gain. So essentially what we
have is a microbenchmark that is seeing the gain.

Admittedly my goto used to be IPv4 routing since that exercised both
the Tx and Rx path for much the same reason. However one thing you
need to keep in mind is that if you cannot see a gain in the
full-stack test odds are most users may not notice much of an impact.

> >
> >>>
> >>>> Signed-off-by: Alexander Lobakin <[email protected]>
> >>>
> >>> Also one thing I am not a huge fan of is a patch that is really a
> >>> patchset onto itself. With all 6 items called out here I would have
> >>> preferred to see this as 6 patches as it would have been easier to
> >>> review.
> >>
> >> Agree BTW, I'm not a fan of this patch either. I wasn't sure what to do
> >> with it, as splitting it into 6 explodes the series into a monster, but
> >> proceeding without it increases diffstat and complicates things later
> >> on. I'll try the latter, but will see. 17 patches is not the End of Days
> >> after all.
> >
> > One thing you may want to consider to condense some of these patches
> > would be to look at possibly combining patches 4 and 5 which disable
> > recycling and use a full 4K page. It seems like of those patches one
> > ends up redoing the other since so many of the dma_sync calls are
> > updated in both.
>
> Or maybe I'll move this one into the subsequent series, since it's only
> pt. 1 of Rx optimizations. There's also the second commit, but it's
> probably as messy as this one and these two could be just converted into
> a series.
>
> [...]
>
> >>> Just a nit. You might want to break this up into two statements like I
> >>> had before. I know some people within Intel weren't a huge fan of when
> >>> I used to do that kind of thing all the time in loops where I would do
> >>> the decrement and test in one line.. :)
> >>
> >> Should I please them or do it as I want to? :D I realize from the
> >> compiler's PoV it's most likely the same, but dunno, why not.
> >
> > If nobody internally is bugging you about it then I am fine with it. I
> > just know back during my era people would complain about that from a
> > maintainability perspective. I guess I got trained to catch those kind
> > of things as a result.
>
> Haha understand. I usually say: "please some good arguments or I didn't
> hear this", maybe that's why nobody complained on `--var` yet :D

Either that or they were already worn down by the time you started
adding this type of stuff.. :)

The one I used to do that would really drive people nuts was:
for (i = loop_count; i--;)

It is more efficient since I don't have to do the comparison to the
loop counter, but it is definitely counterintuitive to run loops
backwards like that. I tried to break myself of the habit of using
those sort of loops anywhere that wasn't performance critical such as
driver init.
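
For readers following along, the two forms side by side (process() is a
stand-in):

/* conventional: a compare against count on every iteration */
for (i = 0; i < count; i++)
	process(i);

/* backwards: the decrement itself sets the flags, no separate compare --
 * but it walks the items in reverse order
 */
for (i = count; i--; )
	process(i);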

> [...]
>
> >> Yes, I'm optimizing all this out later in the series. I was surprised
> >> just as much as you when I saw skb getting passed to do nothing ._.
> >
> > The funny part for me is that it is like reviewing code written via a
> > game of telephone. I recognize the code but have to think about it
> > since there are all the bits of changes and such from the original
> > ixgbe.
>
> Lots of things are still recognizable even in IDPF. That's how this
> series was born... :D

Yep, now the question is how many drivers can be pulled into using
this library. The issue is going to be that all the extra features and
workarounds outside of your basic Tx/Rx will complicate the code, since
all the drivers implement them a bit differently. One of the reasons
for not consolidating them was to allow for performance optimization of
each driver. By combining them you are likely going to need to add a
number of new conditional paths to the fast path.


> >
> >> [...]
> >>
> >> Thanks for the detailed reviews, stuff that Intel often lacks :s :D
> >
> > No problem, it was the least I could do since I am responsible for so
> > much of this code in the earlier drivers anyway. If nothing else I
> > figured I could provide a bit of history on why some of this was the
> > way it was.
> These history bits are nice and interesting to read actually! And also
> useful since they give some context and understanding of what is
> obsolete and can be removed/changed.

Yeah, it is easiest to do these sorts of refactors when you have
somebody to answer the "why" of most of this. I recall going through
this when I was refactoring the igb/ixgbe drivers back in the day and
having to purge the dead e1000 code throughout. Of course, after this
refactor it will be all yours, right?.. :D

2023-06-02 16:21:53

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [Intel-wired-lan] [PATCH net-next v2 03/12] iavf: optimize Rx buffer allocation a bunch

From: Alexander Duyck <[email protected]>
Date: Fri, 2 Jun 2023 08:04:53 -0700

> On Fri, Jun 2, 2023 at 7:00 AM Alexander Lobakin
> <[email protected]> wrote:

[...]

>> That's for Rx path, but don't forget that the initial allocation on ifup
>> is done in the process context. That's what the maintainers and
>> reviewers usually warn about: to not allocate with %GFP_ATOMIC on ifups.
>
> I can see that for the static values like the queue vectors and rings,
> however for the buffers themselves, but I don't see the point in doing
> that for the regular buffer allocations. Basically it is adding
> overhead for something that should have minimal impact as it usually
> happens early on during boot when the memory should be free anyway so
> GFP_ATOMIC vs GFP_KERNEL wouldn't have much impact in either case

Queue vectors and rings get allocated earlier than buffers, on device
probing :D ifup happens later and depends on the networking scripts
etc. -- nowadays every init system enables all the interfaces when
booting up, like systemd does. Plus, ifdown-ifup cycles can occur during
normal system operation -- resets, XDP setup/removal, Ethtool
configuration, and so on. I wouldn't say Rx buffer allocation happens
only during early boot.

[...]

>> Oh, I feel like I'm starting to agree :D OK, then the following doesn't
>> really get out of my head: why do we store skb pointer on the ring then,
>> if we count 1 skb as 1 unit, so that we won't leave the loop until the
>> EOP? Only to handle allocation failures? But skb is already allocated at
>> this point... <confused>
>
> The skb is there to essentially hold the frags. Keep in mind that when
> ixgbe was coded up XDP didn't exist yet.

Ok, maybe I phrased it badly.
If we don't stop the loop until the skb is passed up the stack, how can
we go out of the loop with an unfinished skb? Previously, I thought lots
of drivers do that, as you may exhaust your budget prior to reaching the
last fragment, so you'd get back to the skb on the next poll.
But if we count 1 skb as the budget unit, not a descriptor, how can we
end up breaking the loop prior to finishing the skb? I can imagine only
one situation: the HW gave us some buffers, but is still processing the
EOP buffer, so we don't have any more descriptors to process, but the
skb is still unfinished. That sounds weird TBH, I thought HW processes
frames "atomically", i.e. it doesn't give you buffers until they hold
the whole frame :D

>
> I think there are drivers that are already getting away from this,
> such as mvneta, by storing an xdp_buff instead of an skb. In theory we
> could do away with most of this and just use a shared_info structure,
> but since that exists in the first frag we still need a pointer to the
> first frag as well.

ice has an xdp_buff on the ring for XDP multi-buffer. It's more
lightweight than an skb, but still carries the frags, since the frags
are part of shinfo, not the skb.
It's totally fine and we'll end up doing the same here; my question was
as explained below.

>
> Also multi-frag frames are typically not that likely on a normal
> network as most of the frames are less than 1514B in length. In
> addition as I mentioned before a jumbo frame workload will be less
> demanding since the frame rates are so much lower. So when I coded
> this up I had optimized for the non-fragged case with the fragmented
> case being more of an afterthought needed mostly as exception
> handling.

[...]

> That is kind of what I figured. So one thing to watch out for is
> stating performance improvements without providing context on what
> exactly it is you are doing to see that gain. So essentially what we
> have is a microbenchmark that is seeing the gain.
>
> Admittedly my goto used to be IPv4 routing since that exercised both
> the Tx and Rx path for much the same reason. However one thing you
> need to keep in mind is that if you cannot see a gain in the
> full-stack test odds are most users may not notice much of an impact.

Yeah, sure. I think more than half of the optimizations in such drivers
nowadays are unnoticeable to end users :D

[...]

> Either that or they were already worn down by the time you started
> adding this type of stuff.. :)
>
> The one I used to do that would really drive people nuts was:
> for (i = loop_count; i--;)

Oh, nice one! Never thought of something like that hehe.

>
> It is more efficient since I don't have to do the comparison to the
> loop counter, but it is definitely counterintuitive to run loops
> backwards like that. I tried to break myself of the habit of using
> those sort of loops anywhere that wasn't performance critical such as
> driver init.

[...]

> Yep, now the question is how many drivers can be pulled into using
> this library. The issue is going to be all the extra features and
> workarounds outside of your basic Tx/Rx will complicate the code since
> all the drivers implement them a bit differently. One of the reasons
> for not consolidating them was to allow for performance optimizing for
> each driver. By combining them you are going to likely need to add a
> number of new conditional paths to the fast path.

When I was counting the number of spots in the Rx polling function that
need to have switch-cases/ifs in order to be able to merge the code
(e.g. parsing the descriptors), it was something around 4-5 (per
packet). So only testing can show whether adding new branches actually
hurts there.
XDP is relatively easy to unify in one place, as most of the code is
software-only. Ring structures are also easy to merge by wrapping a
couple of driver-specific pointers into static inline accessors (see the
sketch below). The hotpath in general is the easiest part; that's why I
started with it.
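
A sketch of such an accessor, using iavf's descriptor layout as the
example (the function name is hypothetical; the macros are the driver's):

/* The shared hotpath calls a per-driver static inline instead of
 * branching on the descriptor format per field access.
 */
static inline u32 drv_rx_desc_size(const union iavf_rx_desc *rx_desc)
{
	u64 qw1 = le64_to_cpu(rx_desc->wb.qword1.status_error_len);

	return (qw1 & IAVF_RXD_QW1_LENGTH_PBUF_MASK) >>
	       IAVF_RXD_QW1_LENGTH_PBUF_SHIFT;
}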

But anyway, I'd say if one day I had to choose whether to remove 400
LoCs per driver at the cost of -1% in synthetic tests or not do that at
all, I'd go for the former. As discussed above, it's very unlikely for
such changes to hurt real workloads, esp. with usercopy.

>
>
>>>
>>>> [...]
>>>>
>>>> Thanks for the detailed reviews, stuff that Intel often lacks :s :D
>>>
>>> No problem, it was the least I could do since I am responsible for so
>>> much of this code in the earlier drivers anyway. If nothing else I
>>> figured I could provide a bit of history on why some of this was the
>>> way it was.
>> These history bits are nice and interesting to read actually! And also
>> useful since they give some context and understanding of what is
>> obsolete and can be removed/changed.
>
> Yeah, it is easiest to do these sort of refactors when you have
> somebody to answer the "why" of most of this. I recall going through
> this when I was refactoring the igb/ixgbe drivers back in the day and
> having to purge the dead e1000 code throughout. Of course, after this
> refactor it will be all yours right?.. :D

Nope, maybe from the git-blame PoV only :p

Thanks,
Olek

2023-06-02 18:19:15

by Alexander Duyck

[permalink] [raw]
Subject: Re: [Intel-wired-lan] [PATCH net-next v2 03/12] iavf: optimize Rx buffer allocation a bunch

On Fri, Jun 2, 2023 at 9:16 AM Alexander Lobakin
<[email protected]> wrote:
>
> From: Alexander Duyck <[email protected]>
> Date: Fri, 2 Jun 2023 08:04:53 -0700
>
> > On Fri, Jun 2, 2023 at 7:00 AM Alexander Lobakin
> > <[email protected]> wrote:
>
> [...]
>
> >> That's for Rx path, but don't forget that the initial allocation on ifup
> >> is done in the process context. That's what the maintainers and
> >> reviewers usually warn about: to not allocate with %GFP_ATOMIC on ifups.
> >
> > I can see that for the static values like the queue vectors and rings,
> > however for the buffers themselves, but I don't see the point in doing
> > that for the regular buffer allocations. Basically it is adding
> > overhead for something that should have minimal impact as it usually
> > happens early on during boot when the memory should be free anyway so
> > GFP_ATOMIC vs GFP_KERNEL wouldn't have much impact in either case
>
> Queue vectors and rings get allocated earlier than buffers, on device
> probing :D ifup happens later and it depends on the networking scripts
> etc. -- now every init system enables all the interfaces when booting up
> like systemd does. Plus, ifdowns-ifups can occur during the normal
> system functioning -- resets, XDP setup/remove, Ethtool configuration,
> and so on. I wouldn't say Rx buffer allocation happens only on early boot.

I agree it isn't only on early boot, but that is the most common case.
The rings and such tend to get allocated early, and unless there is a
need for some unforeseen change the rings typically are not modified.

I just don't see the point of special-casing the allocations since, if
they fail, we will be turning around and just immediately calling the
GFP_ATOMIC version within 2 seconds anyway to try and fill out the
empty rings.

> >> Oh, I feel like I'm starting to agree :D OK, then the following doesn't
> >> really get out of my head: why do we store skb pointer on the ring then,
> >> if we count 1 skb as 1 unit, so that we won't leave the loop until the
> >> EOP? Only to handle allocation failures? But skb is already allocated at
> >> this point... <confused>
> >
> > The skb is there to essentially hold the frags. Keep in mind that when
> > ixgbe was coded up XDP didn't exist yet.
>
> Ok, maybe I phrased it badly.
> If we don't stop the loop until skb is passed up the stack, how we can
> go out of the loop with an unfinished skb? Previously, I thought lots of
> drivers do that, as you may exhaust your budget prior to reaching the
> last fragment, so you'll get back to the skb on the next poll.
> But if we count 1 skb as budget unit, not descriptor, how we can end up
> breaking the loop prior to finishing the skb? I can imagine only one
> situation: HW gave us some buffers, but still processes the EOP buffer,
> so we don't have any more descriptors to process, but the skb is still
> unfinished. But sounds weird TBH, I thought HW processes frames
> "atomically", i.e. it doesn't give you buffers until they hold the whole
> frame :D

The problem is the frames aren't necessarily written back atomically.
One big issue is descriptor write-back. The hardware will try to
cache-line optimize things in order to improve performance. It is
possible for a single frame to straddle either side of a cache line. As
a result the first half may be written back, the driver then processes
that cache line and finds the next one isn't populated while the
hardware is collecting enough descriptors to write back the next one.
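
This is why the clean loop bails out on the DD bit rather than assuming
a whole frame is there, e.g. (as in the Intel drivers, simplified):

u64 qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);

if (!(qword & BIT(IAVF_RX_DESC_STATUS_DD_SHIFT)))
	break;		/* HW hasn't written this one back yet */

/* read the rest of the descriptor only after seeing the DD bit */
dma_rmb();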

It is also one of the reasons why I went to so much effort to prevent
us from writing to the descriptor ring in the cleanup paths. You never
know when you might be processing an earlier frame and accidentally
wander into a section that is in the process of being written. I think
that is addressed now mostly through the use of completion queues
instead of the single ring that used to process both work and
completions.

> >
> > I think there are drivers that are already getting away from this,
> > such as mvneta, by storing an xdp_buff instead of an skb. In theory we
> > could do away with most of this and just use a shared_info structure,
> > but since that exists in the first frag we still need a pointer to the
> > first frag as well.
>
> ice has xdp_buff on the ring for XDP multi-buffer. It's more lightweight
> than skb, but also carries the frags, since frags is a part of shinfo,
> not skb.
> It's totally fine and we'll end up doing the same here, my question was
> as I explained below.

Okay. I haven't looked at ice that closely so I wasn't aware of that.

> >
> > Also multi-frag frames are typically not that likely on a normal
> > network as most of the frames are less than 1514B in length. In
> > addition as I mentioned before a jumbo frame workload will be less
> > demanding since the frame rates are so much lower. So when I coded
> > this up I had optimized for the non-fragged case with the fragmented
> > case being more of an afterthought needed mostly as exception
> > handling.
>
> [...]
>
> > That is kind of what I figured. So one thing to watch out for is
> > stating performance improvements without providing context on what
> > exactly it is you are doing to see that gain. So essentially what we
> > have is a microbenchmark that is seeing the gain.
> >
> > Admittedly my goto used to be IPv4 routing since that exercised both
> > the Tx and Rx path for much the same reason. However one thing you
> > need to keep in mind is that if you cannot see a gain in the
> > full-stack test odds are most users may not notice much of an impact.
>
> Yeah sure. I think more than a half of optimizations in such drivers
> nowadays is unnoticeable to end users :D
>
> [...]
>
> > Either that or they were already worn down by the time you started
> > adding this type of stuff.. :)
> >
> > The one I used to do that would really drive people nuts was:
> > for (i = loop_count; i--;)
>
> Oh, nice one! Never thought of something like that hehe.
>
> >
> > It is more efficient since I don't have to do the comparison to the
> > loop counter, but it is definitely counterintuitive to run loops
> > backwards like that. I tried to break myself of the habit of using
> > those sort of loops anywhere that wasn't performance critical such as
> > driver init.
>
> [...]
>
> > Yep, now the question is how many drivers can be pulled into using
> > this library. The issue is going to be all the extra features and
> > workarounds outside of your basic Tx/Rx will complicate the code since
> > all the drivers implement them a bit differently. One of the reasons
> > for not consolidating them was to allow for performance optimizing for
> > each driver. By combining them you are going to likely need to add a
> > number of new conditional paths to the fast path.
>
> When I was counting the number of spots in the Rx polling function that
> need to have switch-cases/ifs in order to be able to merge the code
> (e.g. parsing the descriptors), it was something around 4-5 (per
> packet). So it can only be figured out during the testing whether adding
> new branches actually hurts there.

The other thing is you may want to double-check the CPU(s) you are
expected to support, as last I knew switch statements were still
expensive due to all the old Spectre/Meltdown workarounds.

> XDP is relatively easy to unify in one place, most of code is
> software-only. Ring structures are also easy to merge, wrapping a couple
> driver-specific pointers into static inline accessors. Hotpath in
> general is the easiest part, that's why I started from it.
>
> But anyway, I'd say if one day I'd have to choice whether to remove 400
> locs per driver with having -1% in synthetic tests or not do that at
> all, I'd go for the former. As discussed above, it's very unlikely for
> such changes to hurt real workloads, esp. with usercopy.

+1 assuming no serious regressions.

> >
> >
> >>>
> >>>> [...]
> >>>>
> >>>> Thanks for the detailed reviews, stuff that Intel often lacks :s :D
> >>>
> >>> No problem, it was the least I could do since I am responsible for so
> >>> much of this code in the earlier drivers anyway. If nothing else I
> >>> figured I could provide a bit of history on why some of this was the
> >>> way it was.
> >> These history bits are nice and interesting to read actually! And also
> >> useful since they give some context and understanding of what is
> >> obsolete and can be removed/changed.
> >
> > Yeah, it is easiest to do these sort of refactors when you have
> > somebody to answer the "why" of most of this. I recall going through
> > this when I was refactoring the igb/ixgbe drivers back in the day and
> > having to purge the dead e1000 code throughout. Of course, after this
> > refactor it will be all yours right?.. :D
>
> Nope, maybe from the git-blame PoV only :p
>
> Thanks,
> Olek

2023-06-06 13:20:24

by Alexander Lobakin

[permalink] [raw]
Subject: Re: [Intel-wired-lan] [PATCH net-next v2 03/12] iavf: optimize Rx buffer allocation a bunch

From: Alexander Duyck <[email protected]>
Date: Fri, 2 Jun 2023 10:50:02 -0700

Sorry for the silence, had a sorta long weekend :p

> On Fri, Jun 2, 2023 at 9:16 AM Alexander Lobakin
> <[email protected]> wrote:

[...]

>> Ok, maybe I phrased it badly.
>> If we don't stop the loop until skb is passed up the stack, how we can
>> go out of the loop with an unfinished skb? Previously, I thought lots of
>> drivers do that, as you may exhaust your budget prior to reaching the
>> last fragment, so you'll get back to the skb on the next poll.
>> But if we count 1 skb as budget unit, not descriptor, how we can end up
>> breaking the loop prior to finishing the skb? I can imagine only one
>> situation: HW gave us some buffers, but still processes the EOP buffer,
>> so we don't have any more descriptors to process, but the skb is still
>> unfinished. But sounds weird TBH, I thought HW processes frames
>> "atomically", i.e. it doesn't give you buffers until they hold the whole
>> frame :D
>
> The problem is the frames aren't necessarily written back atomically.
> One big issue is descriptor write back. The hardware will try to cache
> line optimize things in order to improve performance. It is possible
> for a single frame to straddle either side of a cache line. As a
> result the first half may be written back, the driver then processes
> that cache line, and finds the next one isn't populated while the
> hardware is collecting enough descriptors to write back the next one.

Ah okay, that's what I was suspecting. So it's not atomic, and the
skb/xdp_buff is stored on the ring to handle such cases, not budget
exhaustion.
Thanks for the detailed explanation. 1 skb = 1 unit feels more logical
to me now :D

>
> It is also one of the reasons why I went to so much effort to prevent
> us from writing to the descriptor ring in the cleanup paths. You never
> know when you might be processing an earlier frame and accidently
> wander into a section that is in the process of being written. I think
> that is addressed now mostly through the use of completion queues
> instead of the single ring that used to process both work and
> completions.

Completion rings are neat: you avoid writing anything to the HW on Rx
polling and, vice versa, any descriptor reads on refilling. My
preference is to not refill anything in NAPI and use a separate
workqueue for that (sketched below), esp. given that most NICs nowadays
have a "refill me please" interrupt.
Please don't look at the idpf code, IIRC from what I've been told they
do it the "old" way and touch both the receive and refill queues on Rx
polling :s :D
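
A hypothetical sketch of that deferral (names invented; the interrupt
only kicks a work item which then refills from process context):

static void my_refill_work(struct work_struct *work)
{
	struct my_rx_queue *rxq = container_of(work, struct my_rx_queue,
					       refill_work);

	/* process context: GFP_KERNEL is fine here */
	my_alloc_rx_buffers(rxq, GFP_KERNEL);
}

static irqreturn_t my_refill_irq(int irq, void *data)
{
	struct my_rx_queue *rxq = data;

	queue_work(system_highpri_wq, &rxq->refill_work);

	return IRQ_HANDLED;
}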

>> ice has xdp_buff on the ring for XDP multi-buffer. It's more lightweight
>> than skb, but also carries the frags, since frags is a part of shinfo,
>> not skb.
>> It's totally fine and we'll end up doing the same here, my question was
>> as I explained below.
>
> Okay. I haven't looked at ice that closely so I wasn't aware of that.

No prob, just FYI. This moves us one step closer to passing something
more lightweight than skb up the stack in non-extreme cases, so that the
stack will take care of it when GROing :)

>>> Yep, now the question is how many drivers can be pulled into using
>>> this library. The issue is going to be all the extra features and
>>> workarounds outside of your basic Tx/Rx will complicate the code since
>>> all the drivers implement them a bit differently. One of the reasons
>>> for not consolidating them was to allow for performance optimizing for
>>> each driver. By combining them you are going to likely need to add a
>>> number of new conditional paths to the fast path.
>>
>> When I was counting the number of spots in the Rx polling function that
>> need to have switch-cases/ifs in order to be able to merge the code
>> (e.g. parsing the descriptors), it was something around 4-5 (per
>> packet). So it can only be figured out during the testing whether adding
>> new branches actually hurts there.
>
> The other thing is you may want to double check CPU(s) you are
> expected to support as last I knew switch statements were still
> expensive due to all the old spectre/meltdown workarounds.

Wait, are switch-cases also affected? I wasn't aware of that. For sure I
didn't even consider using ops/indirect calls, but switch-cases... I've
seen people replace indirections with switch-cases lots of times --
what's the point otherwise :D

Thanks,
Olek

2023-06-06 14:37:16

by Alexander Duyck

[permalink] [raw]
Subject: Re: [Intel-wired-lan] [PATCH net-next v2 03/12] iavf: optimize Rx buffer allocation a bunch

On Tue, Jun 6, 2023 at 5:49 AM Alexander Lobakin
<[email protected]> wrote:
>
> From: Alexander Duyck <[email protected]>
> Date: Fri, 2 Jun 2023 10:50:02 -0700
>
> Sorry for the silence, had sorta long weekend :p
>
> > On Fri, Jun 2, 2023 at 9:16 AM Alexander Lobakin
> > <[email protected]> wrote:
>

[...]

> > The other thing is you may want to double check CPU(s) you are
> > expected to support as last I knew switch statements were still
> > expensive due to all the old spectre/meltdown workarounds.
> Wait, are switch-cases also affected? I wasn't aware of that. For sure I
> didn't even consider using ops/indirect calls, but switch-cases... I saw
> lots o'times people replacing indirections with switch-cases, what's the
> point otherwise :D
>
> Thanks,
> Olek

I think it all depends on how the compiler implements them. I know
switch cases used to be done as indirections in some cases. It is
possible they did away with that in the compilers some time ago or
came up with a fix to work around them. I just remember that back when
that came up, people were going through and replacing switch cases with
if/else blocks for performance-sensitive workloads. Odds are there have
been workarounds added to address that, but it is just something to be
aware of.
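
For illustration, the kind of rewrite being described (hypothetical
parse helpers): a dense switch may be lowered to a jump table, i.e. an
indirect branch that retpolines make expensive, while an if/else chain
stays as direct branches:

/* may become a jump table -> indirect branch under retpoline */
switch (desc_type) {
case DESC_A:
	parse_a(desc);
	break;
case DESC_B:
	parse_b(desc);
	break;
default:
	parse_other(desc);
}

/* always direct, predictor-friendly branches */
if (desc_type == DESC_A)
	parse_a(desc);
else if (desc_type == DESC_B)
	parse_b(desc);
else
	parse_other(desc);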

Thanks,

- Alex