2024-04-15 09:53:24

by Paul Barker

[permalink] [raw]
Subject: [net-next RFC v3 0/7] Improve GbEth performance on Renesas RZ/G2L and related SoCs

This series aims to improve performance of the GbEth IP in the Renesas
RZ/G2L SoC family and the RZ/G3S SoC, which use the ravb driver. Along
the way, we do some refactoring and ensure that napi_complete_done() is
used in accordance with the NAPI documentation for both GbEth and R-Car
code paths.

Much of the performance improvement comes from enabling SW IRQ
Coalescing for all SoCs using the GbEth IP, and NAPI Threaded mode for
single core SoCs using the GbEth IP. These can be enabled/disabled at
runtime via sysfs, but our goal is to set sensible defaults which get
good performance on the affected SoCs.

The rest of the performance improvement comes from using a page pool to
allocate RX buffers, and reducing the allocation size from >8kB to 2kB.

The overall performance impact of this patch series seen in testing with
iperf3 is as follows (see patches 5-7 for more detailed results):
* RZ/G2L:
* TCP TX: +1.8% bandwidth
* TCP RX: +1% bandwidth at 47% less CPU load
* UDP RX: +1% bandwidth at 26% less CPU load

* RZ/G2UL:
* TCP TX: +37% bandwidth
* TCP RX: +43% bandwidth
* UDP TX: -8% bandwidth
* UDP RX: +32500% bandwidth (!)

* RZ/G3S:
* TCP TX: +25% bandwidth
* TCP RX: +76% bandwidth
* UDP TX: -9% bandwidth
* UDP RX: +37900% bandwidth (!)

* RZ/Five:
* TCP TX: +18% bandwidth
* TCP RX: +212% bandwidth
* UDP TX: +2% bandwidth
* UDP RX: +inf bandwidth (test no longer crashes)

There is no significant impact on bandwidth or CPU load in testing on
RZ/G2H or R-Car M3N.

Fixing the crash in UDP RX testing for RZ/Five is a cumulative effect of
patches 1, 2, 5 & 6 so this is very difficult to break out as a bugfix
for backporting.

This series depends on my recent ravb bugfix patches [1] which are not
yet merged.

[1]: https://lore.kernel.org/all/[email protected]/

Changes v2->v3:
* Incorporated feedback on RFC v2 from Sergey.
* Split out bugfixes and rebased. This changed the order of what was
the first 5 patches of v2 and things look a little different so I've
not picked up Reviewed-by tags from v2.
* Further refactoring and tidy up of RX ring refill and
ravb_rx_gbeth().
* Switched to using a page pool to allocate RX buffers.
* Re-tested and provided updated performance figures.

Changes v1->v2:
* Marked as RFC as the series depends on unmerged patches.
* Refactored R-Car code paths as well as GbEth code paths.
* Updated references to the patches this series depends on.

Paul Barker (7):
net: ravb: Simplify poll & receive functions
net: ravb: Align poll function with NAPI docs
net: ravb: Refactor RX ring refill
net: ravb: Refactor GbEth RX code path
net: ravb: Enable SW IRQ Coalescing for GbEth
net: ravb: Use NAPI threaded mode on 1-core CPUs with GbEth IP
net: ravb: Allocate RX buffers via page pool

drivers/net/ethernet/renesas/ravb.h | 13 +-
drivers/net/ethernet/renesas/ravb_main.c | 430 +++++++++++------------
2 files changed, 221 insertions(+), 222 deletions(-)

--
2.39.2



2024-04-15 09:57:04

by Paul Barker

[permalink] [raw]
Subject: [net-next RFC v3 4/7] net: ravb: Refactor GbEth RX code path

We can reduce code duplication in ravb_rx_gbeth() and add comments to
make the code flow easier to understand.

Signed-off-by: Paul Barker <[email protected]>
---
drivers/net/ethernet/renesas/ravb_main.c | 70 ++++++++++++------------
1 file changed, 35 insertions(+), 35 deletions(-)

diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index baa01bd81f2d..12618171f6d5 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -818,47 +818,47 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
stats->rx_missed_errors++;
} else {
die_dt = desc->die_dt & 0xF0;
- switch (die_dt) {
- case DT_FSINGLE:
- skb = ravb_get_skb_gbeth(ndev, entry, desc);
+ skb = ravb_get_skb_gbeth(ndev, entry, desc);
+ if (die_dt == DT_FSINGLE || die_dt == DT_FSTART) {
+ /* Start of packet:
+ * Set initial data length.
+ */
skb_put(skb, desc_len);
+
+ /* Save this SKB if the packet spans multiple
+ * descriptors.
+ */
+ if (die_dt == DT_FSTART)
+ priv->rx_1st_skb = skb;
+ } else {
+ /* Continuing a packet:
+ * Move data into the saved SKB.
+ */
+ skb_copy_to_linear_data_offset(priv->rx_1st_skb,
+ priv->rx_1st_skb->len,
+ skb->data,
+ desc_len);
+ skb_put(priv->rx_1st_skb, desc_len);
+ dev_kfree_skb(skb);
+
+ /* Set skb to point at the whole packet so that
+ * we only need one code path for finishing a
+ * packet.
+ */
+ skb = priv->rx_1st_skb;
+ }
+
+ if (die_dt == DT_FSINGLE || die_dt == DT_FEND) {
+ /* Finishing a packet:
+ * Determine protocol & checksum, hand off to
+ * NAPI and update our stats.
+ */
skb->protocol = eth_type_trans(skb, ndev);
if (ndev->features & NETIF_F_RXCSUM)
ravb_rx_csum_gbeth(skb);
+ stats->rx_bytes += skb->len;
napi_gro_receive(&priv->napi[q], skb);
rx_packets++;
- stats->rx_bytes += desc_len;
- break;
- case DT_FSTART:
- priv->rx_1st_skb = ravb_get_skb_gbeth(ndev, entry, desc);
- skb_put(priv->rx_1st_skb, desc_len);
- break;
- case DT_FMID:
- skb = ravb_get_skb_gbeth(ndev, entry, desc);
- skb_copy_to_linear_data_offset(priv->rx_1st_skb,
- priv->rx_1st_skb->len,
- skb->data,
- desc_len);
- skb_put(priv->rx_1st_skb, desc_len);
- dev_kfree_skb(skb);
- break;
- case DT_FEND:
- skb = ravb_get_skb_gbeth(ndev, entry, desc);
- skb_copy_to_linear_data_offset(priv->rx_1st_skb,
- priv->rx_1st_skb->len,
- skb->data,
- desc_len);
- skb_put(priv->rx_1st_skb, desc_len);
- dev_kfree_skb(skb);
- priv->rx_1st_skb->protocol =
- eth_type_trans(priv->rx_1st_skb, ndev);
- if (ndev->features & NETIF_F_RXCSUM)
- ravb_rx_csum_gbeth(priv->rx_1st_skb);
- stats->rx_bytes += priv->rx_1st_skb->len;
- napi_gro_receive(&priv->napi[q],
- priv->rx_1st_skb);
- rx_packets++;
- break;
}
}
}
--
2.39.2


2024-04-15 09:58:10

by Paul Barker

[permalink] [raw]
Subject: [net-next RFC v3 6/7] net: ravb: Use NAPI threaded mode on 1-core CPUs with GbEth IP

NAPI Threaded mode (along with the previously enabled SW IRQ Coalescing)
is required to improve network stack performance for single core SoCs
using the GbEth IP (currently the RZ/G2L SoC family and the RZ/G3S SoC).

This patch gives the following improvements during testing with iperf3.

* RZ/G2UL:
* TCP TX: +32% bandwidth (638Mbps -> 841Mbps)
* TXP RX: +8.8% bandwidth (667Mbps -> 726Mbps)
* UDP RX: +104% bandwidth (53Mbps -> 108Mbps)

* RZ/G3S:
* TCP TX: 29% bandwidth (529Mbps -> 681Mbps)
* UDP RX: +1290% bandwidth (6.46Mbps -> 90Mbps)

* RZ/Five:
* UDP RX: Test no longer crashes (0 -> 20 Mbps)

This patch gives the following reductions in performance in the same
testing:

* RZ/G2UL:
* UDP TX: -7.5% bandwidth (594Mbps -> 549Mbps)

* RZ/G3S:
* UDP TX: -5% bandwidth (625Mbps -> 594Mbps)

These losses are considered acceptable given the benefits shown above.
If UDP TX bandwidth must be maximised for a particular use case, NAPI
threaded mode can be disabled at runtime via sysfs writes.

The improvement of UDP RX bandwidth for the single core SoCs (RZ/G2UL &
RZ/G3S) is particularly critical.

Signed-off-by: Paul Barker <[email protected]>
---
drivers/net/ethernet/renesas/ravb_main.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index 26b70b996bd3..7434faf0820c 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -2938,8 +2938,11 @@ static int ravb_probe(struct platform_device *pdev)
if (info->nc_queues)
netif_napi_add(ndev, &priv->napi[RAVB_NC], ravb_poll);

- if (info->needs_irq_coalesce)
+ if (info->needs_irq_coalesce) {
netdev_sw_irq_coalesce_default_on(ndev);
+ if (num_present_cpus() == 1)
+ dev_set_threaded(ndev, true);
+ }

/* Network device register */
error = register_netdev(ndev);
--
2.39.2


2024-04-15 09:58:41

by Paul Barker

[permalink] [raw]
Subject: [net-next RFC v3 1/7] net: ravb: Simplify poll & receive functions

We don't need to pass the work budget to ravb_rx() by reference, it's
cleaner to pass this by value and return the amount of work done. This
allows us to simplify the ravb_poll() function and use the common
`work_done` variable name seen in other network drivers for consistency
and ease of understanding.

This is a pure refactor and should not affect behaviour.

Signed-off-by: Paul Barker <[email protected]>
---
drivers/net/ethernet/renesas/ravb.h | 2 +-
drivers/net/ethernet/renesas/ravb_main.c | 29 ++++++++++--------------
2 files changed, 13 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
index b48935ec7e28..71de2a7aa27c 100644
--- a/drivers/net/ethernet/renesas/ravb.h
+++ b/drivers/net/ethernet/renesas/ravb.h
@@ -1039,7 +1039,7 @@ struct ravb_ptp {
};

struct ravb_hw_info {
- bool (*receive)(struct net_device *ndev, int *quota, int q);
+ int (*receive)(struct net_device *ndev, int budget, int q);
void (*set_rate)(struct net_device *ndev);
int (*set_feature)(struct net_device *ndev, netdev_features_t features);
int (*dmac_init)(struct net_device *ndev);
diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index fd2679ce4e3d..33f8043143c1 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -759,7 +759,7 @@ static struct sk_buff *ravb_get_skb_gbeth(struct net_device *ndev, int entry,
}

/* Packet receive function for Gigabit Ethernet */
-static bool ravb_rx_gbeth(struct net_device *ndev, int *quota, int q)
+static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
{
struct ravb_private *priv = netdev_priv(ndev);
const struct ravb_hw_info *info = priv->info;
@@ -778,7 +778,7 @@ static bool ravb_rx_gbeth(struct net_device *ndev, int *quota, int q)
limit = priv->dirty_rx[q] + priv->num_rx_ring[q] - priv->cur_rx[q];
stats = &priv->stats[q];

- for (i = 0; i < limit && rx_packets < *quota; i++, priv->cur_rx[q]++) {
+ for (i = 0; i < limit && rx_packets < budget; i++, priv->cur_rx[q]++) {
entry = priv->cur_rx[q] % priv->num_rx_ring[q];
desc = &priv->rx_ring[q].desc[entry];
if (desc->die_dt == DT_FEMPTY)
@@ -881,13 +881,11 @@ static bool ravb_rx_gbeth(struct net_device *ndev, int *quota, int q)
desc->die_dt = DT_FEMPTY;
}

- stats->rx_packets += rx_packets;
- *quota -= rx_packets;
- return *quota == 0;
+ return rx_packets;
}

/* Packet receive function for Ethernet AVB */
-static bool ravb_rx_rcar(struct net_device *ndev, int *quota, int q)
+static int ravb_rx_rcar(struct net_device *ndev, int budget, int q)
{
struct ravb_private *priv = netdev_priv(ndev);
const struct ravb_hw_info *info = priv->info;
@@ -904,7 +902,7 @@ static bool ravb_rx_rcar(struct net_device *ndev, int *quota, int q)
int i;

limit = priv->dirty_rx[q] + priv->num_rx_ring[q] - priv->cur_rx[q];
- for (i = 0; i < limit && rx_packets < *quota; i++, priv->cur_rx[q]++) {
+ for (i = 0; i < limit && rx_packets < budget; i++, priv->cur_rx[q]++) {
entry = priv->cur_rx[q] % priv->num_rx_ring[q];
desc = &priv->rx_ring[q].ex_desc[entry];
if (desc->die_dt == DT_FEMPTY)
@@ -992,18 +990,16 @@ static bool ravb_rx_rcar(struct net_device *ndev, int *quota, int q)
desc->die_dt = DT_FEMPTY;
}

- stats->rx_packets += rx_packets;
- *quota -= rx_packets;
- return *quota == 0;
+ return rx_packets;
}

/* Packet receive function for Ethernet AVB */
-static bool ravb_rx(struct net_device *ndev, int *quota, int q)
+static int ravb_rx(struct net_device *ndev, int budget, int q)
{
struct ravb_private *priv = netdev_priv(ndev);
const struct ravb_hw_info *info = priv->info;

- return info->receive(ndev, quota, q);
+ return info->receive(ndev, budget, q);
}

static void ravb_rcv_snd_disable(struct net_device *ndev)
@@ -1320,13 +1316,12 @@ static int ravb_poll(struct napi_struct *napi, int budget)
unsigned long flags;
int q = napi - priv->napi;
int mask = BIT(q);
- int quota = budget;
- bool unmask;
+ int work_done;

/* Processing RX Descriptor Ring */
/* Clear RX interrupt */
ravb_write(ndev, ~(mask | RIS0_RESERVED), RIS0);
- unmask = !ravb_rx(ndev, &quota, q);
+ work_done = ravb_rx(ndev, budget, q);

/* Processing TX Descriptor Ring */
spin_lock_irqsave(&priv->lock, flags);
@@ -1345,7 +1340,7 @@ static int ravb_poll(struct napi_struct *napi, int budget)
if (priv->rx_fifo_errors != ndev->stats.rx_fifo_errors)
ndev->stats.rx_fifo_errors = priv->rx_fifo_errors;

- if (!unmask)
+ if (work_done == budget)
goto out;

napi_complete(napi);
@@ -1362,7 +1357,7 @@ static int ravb_poll(struct napi_struct *napi, int budget)
spin_unlock_irqrestore(&priv->lock, flags);

out:
- return budget - quota;
+ return work_done;
}

static void ravb_set_duplex_gbeth(struct net_device *ndev)
--
2.39.2


2024-04-15 10:37:43

by Paul Barker

[permalink] [raw]
Subject: [net-next RFC v3 7/7] net: ravb: Allocate RX buffers via page pool

This patch makes multiple changes that can't be separated:

1) Allocate plain RX buffers via a page pool instead of allocating
SKBs, then use build_skb() when a packet is received.
2) For GbEth IP, reduce the RX buffer size to 2kB.
3) For GbEth IP, merge packets which span more than one RX descriptor
as SKB fragments instead of copying data.

Implementing (1) without (2) would require the use of an order-1 page
pool (instead of an order-0 page pool split into page fragments) for
GbEth.

Implementing (2) without (3) would leave us no space to re-assemble
packets which span more than one RX descriptor.

Implementing (3) without (1) would not be possible as the network stack
expects to use put_page() or page_pool_put_page() to free SKB fragments
after an SKB is consumed.

This patch gives the following improvements during testing with iperf3.

* RZ/G2L:
* TCP RX: same bandwidth at -43% CPU load (70% -> 40%)
* UDP RX: same bandwidth at -17% CPU load (88% -> 74%)

* RZ/G2UL:
* TCP RX: +30% bandwidth (726Mbps -> 941Mbps)
* UDP RX: +417% bandwidth (108Mbps -> 558Mbps)

* RZ/G3S:
* TCP RX: +64% bandwidth (562Mbps -> 920Mbps)
* UDP RX: +420% bandwidth (90Mbps -> 468Mbps)

* RZ/Five:
* TCP RX: +217% bandwidth (145Mbps -> 459Mbps)
* UDP RX: +470% bandwidth (20Mbps -> 114Mbps)

There is no significant impact on bandwidth or CPU load in testing on
RZ/G2H or R-Car M3N.

Signed-off-by: Paul Barker <[email protected]>
---
drivers/net/ethernet/renesas/ravb.h | 10 +-
drivers/net/ethernet/renesas/ravb_main.c | 209 +++++++++++++----------
2 files changed, 128 insertions(+), 91 deletions(-)

diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
index 9c6392ade2f1..4348366c3dc7 100644
--- a/drivers/net/ethernet/renesas/ravb.h
+++ b/drivers/net/ethernet/renesas/ravb.h
@@ -1050,8 +1050,8 @@ struct ravb_hw_info {
netdev_features_t net_features;
int stats_len;
u32 tccr_mask;
+ u32 rx_buffer_size;
u32 rx_max_frame_size;
- u32 rx_max_desc_use;
u32 rx_desc_size;
unsigned aligned_tx: 1;
unsigned needs_irq_coalesce:1; /* Needs software IRQ coalescing */
@@ -1071,6 +1071,11 @@ struct ravb_hw_info {
unsigned half_duplex:1; /* E-MAC supports half duplex mode */
};

+struct ravb_rx_buffer {
+ struct page *page;
+ unsigned int offset;
+};
+
struct ravb_private {
struct net_device *ndev;
struct platform_device *pdev;
@@ -1094,7 +1099,8 @@ struct ravb_private {
struct ravb_tx_desc *tx_ring[NUM_TX_QUEUE];
void *tx_align[NUM_TX_QUEUE];
struct sk_buff *rx_1st_skb;
- struct sk_buff **rx_skb[NUM_RX_QUEUE];
+ struct page_pool *rx_pool;
+ struct ravb_rx_buffer *rx_buffers[NUM_RX_QUEUE];
struct sk_buff **tx_skb[NUM_TX_QUEUE];
u32 rx_over_errors;
u32 rx_fifo_errors;
diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index 7434faf0820c..892a3eadef1e 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -30,6 +30,7 @@
#include <linux/reset.h>
#include <linux/math64.h>
#include <net/ip.h>
+#include <net/page_pool/helpers.h>

#include "ravb.h"

@@ -113,25 +114,6 @@ static void ravb_set_rate_rcar(struct net_device *ndev)
}
}

-static struct sk_buff *
-ravb_alloc_skb(struct net_device *ndev, const struct ravb_hw_info *info,
- gfp_t gfp_mask)
-{
- struct sk_buff *skb;
- u32 reserve;
-
- skb = __netdev_alloc_skb(ndev, info->rx_max_frame_size + RAVB_ALIGN - 1,
- gfp_mask);
- if (!skb)
- return NULL;
-
- reserve = (unsigned long)skb->data & (RAVB_ALIGN - 1);
- if (reserve)
- skb_reserve(skb, RAVB_ALIGN - reserve);
-
- return skb;
-}
-
/* Get MAC address from the MAC address registers
*
* Ethernet AVB device doesn't have ROM for MAC address.
@@ -257,21 +239,10 @@ static void ravb_rx_ring_free(struct net_device *ndev, int q)
{
struct ravb_private *priv = netdev_priv(ndev);
unsigned int ring_size;
- unsigned int i;

if (!priv->rx_ring[q].raw)
return;

- for (i = 0; i < priv->num_rx_ring[q]; i++) {
- struct ravb_rx_desc *desc = ravb_rx_get_desc(priv, q, i);
-
- if (!dma_mapping_error(ndev->dev.parent,
- le32_to_cpu(desc->dptr)))
- dma_unmap_single(ndev->dev.parent,
- le32_to_cpu(desc->dptr),
- priv->info->rx_max_frame_size,
- DMA_FROM_DEVICE);
- }
ring_size = priv->info->rx_desc_size * (priv->num_rx_ring[q] + 1);
dma_free_coherent(ndev->dev.parent, ring_size, priv->rx_ring[q].raw,
priv->rx_desc_dma[q]);
@@ -298,13 +269,14 @@ static void ravb_ring_free(struct net_device *ndev, int q)
priv->tx_ring[q] = NULL;
}

- /* Free RX skb ringbuffer */
- if (priv->rx_skb[q]) {
- for (i = 0; i < priv->num_rx_ring[q]; i++)
- dev_kfree_skb(priv->rx_skb[q][i]);
+ /* Free RX buffers */
+ for (i = 0; i < priv->num_rx_ring[q]; i++) {
+ if (priv->rx_buffers[q][i].page)
+ page_pool_put_page(priv->rx_pool, priv->rx_buffers[q][i].page, 0, true);
}
- kfree(priv->rx_skb[q]);
- priv->rx_skb[q] = NULL;
+ kfree(priv->rx_buffers[q]);
+ priv->rx_buffers[q] = NULL;
+ page_pool_destroy(priv->rx_pool);

/* Free aligned TX buffers */
kfree(priv->tx_align[q]);
@@ -317,35 +289,54 @@ static void ravb_ring_free(struct net_device *ndev, int q)
priv->tx_skb[q] = NULL;
}

+static int
+ravb_alloc_rx_buffer(struct net_device *ndev, int q, u32 entry, gfp_t gfp_mask,
+ __le32 *dptr)
+{
+ struct ravb_private *priv = netdev_priv(ndev);
+ const struct ravb_hw_info *info = priv->info;
+ struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
+ dma_addr_t dma_addr;
+ unsigned int size;
+
+ size = info->rx_buffer_size;
+ rx_buff->page = page_pool_alloc(priv->rx_pool, &rx_buff->offset, &size,
+ gfp_mask);
+ if (unlikely(!rx_buff->page))
+ return -ENOMEM;
+
+ dma_addr = page_pool_get_dma_addr(rx_buff->page) + rx_buff->offset;
+ dma_sync_single_for_device(ndev->dev.parent, dma_addr,
+ info->rx_buffer_size, DMA_FROM_DEVICE);
+ *dptr = cpu_to_le32(dma_addr);
+ return 0;
+}
+
static u32
ravb_rx_ring_refill(struct net_device *ndev, int q, u32 count, gfp_t gfp_mask)
{
struct ravb_private *priv = netdev_priv(ndev);
const struct ravb_hw_info *info = priv->info;
struct ravb_rx_desc *rx_desc;
- dma_addr_t dma_addr;
u32 i, entry;

for (i = 0; i < count; i++) {
entry = (priv->dirty_rx[q] + i) % priv->num_rx_ring[q];
rx_desc = ravb_rx_get_desc(priv, q, entry);
- rx_desc->ds_cc = cpu_to_le16(info->rx_max_desc_use);

- if (!priv->rx_skb[q][entry]) {
- priv->rx_skb[q][entry] = ravb_alloc_skb(ndev, info, gfp_mask);
- if (!priv->rx_skb[q][entry])
- break;
- dma_addr = dma_map_single(ndev->dev.parent,
- priv->rx_skb[q][entry]->data,
- priv->info->rx_max_frame_size,
- DMA_FROM_DEVICE);
- skb_checksum_none_assert(priv->rx_skb[q][entry]);
- /* We just set the data size to 0 for a failed mapping
- * which should prevent DMA from happening...
- */
- if (dma_mapping_error(ndev->dev.parent, dma_addr))
+ if (!priv->rx_buffers[q][entry].page) {
+ if (unlikely(ravb_alloc_rx_buffer(ndev, q, entry, gfp_mask,
+ &rx_desc->dptr))) {
+ /* We just set the data size to 0 for a failed mapping
+ * which should prevent DMA from happening...
+ */
rx_desc->ds_cc = cpu_to_le16(0);
- rx_desc->dptr = cpu_to_le32(dma_addr);
+ break;
+ }
+
+ rx_desc->ds_cc = cpu_to_le16(info->rx_buffer_size
+ - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
+ - ETH_FCS_LEN + sizeof(__sum16));
}
/* Descriptor type must be set after all the above writes */
dma_wmb();
@@ -423,15 +414,32 @@ static int ravb_ring_init(struct net_device *ndev, int q)
{
struct ravb_private *priv = netdev_priv(ndev);
unsigned int num_tx_desc = priv->num_tx_desc;
+ struct page_pool_params params = {
+ .order = 0,
+ .flags = PP_FLAG_DMA_MAP,
+ .pool_size = priv->num_rx_ring[q],
+ .nid = NUMA_NO_NODE,
+ .dev = ndev->dev.parent,
+ .dma_dir = DMA_FROM_DEVICE,
+ };
unsigned int ring_size;
u32 num_filled;

- /* Allocate RX and TX skb rings */
- priv->rx_skb[q] = kcalloc(priv->num_rx_ring[q],
- sizeof(*priv->rx_skb[q]), GFP_KERNEL);
+ /* Allocate RX page pool and buffers */
+ priv->rx_pool = page_pool_create(&params);
+ if (IS_ERR(priv->rx_pool))
+ goto error;
+
+ /* Allocate RX buffers */
+ priv->rx_buffers[q] = kcalloc(priv->num_rx_ring[q],
+ sizeof(*priv->rx_buffers[q]), GFP_KERNEL);
+ if (!priv->rx_buffers[q])
+ goto error;
+
+ /* Allocate TX skb rings */
priv->tx_skb[q] = kcalloc(priv->num_tx_ring[q],
sizeof(*priv->tx_skb[q]), GFP_KERNEL);
- if (!priv->rx_skb[q] || !priv->tx_skb[q])
+ if (!priv->tx_skb[q])
goto error;

if (num_tx_desc > 1) {
@@ -755,25 +763,11 @@ static void ravb_rx_csum(struct sk_buff *skb)
skb_trim(skb, skb->len - sizeof(__sum16));
}

-static struct sk_buff *ravb_get_skb_gbeth(struct net_device *ndev, int entry,
- struct ravb_rx_desc *desc)
-{
- struct ravb_private *priv = netdev_priv(ndev);
- struct sk_buff *skb;
-
- skb = priv->rx_skb[RAVB_BE][entry];
- priv->rx_skb[RAVB_BE][entry] = NULL;
- dma_unmap_single(ndev->dev.parent, le32_to_cpu(desc->dptr),
- ALIGN(priv->info->rx_max_frame_size, 16),
- DMA_FROM_DEVICE);
-
- return skb;
-}
-
/* Packet receive function for Gigabit Ethernet */
static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
{
struct ravb_private *priv = netdev_priv(ndev);
+ const struct ravb_hw_info *info = priv->info;
struct net_device_stats *stats;
struct ravb_rx_desc *desc;
struct sk_buff *skb;
@@ -817,12 +811,23 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
if (desc_status & MSC_CEEF)
stats->rx_missed_errors++;
} else {
+ struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
+ void *rx_addr = page_address(rx_buff->page) + rx_buff->offset;
die_dt = desc->die_dt & 0xF0;
- skb = ravb_get_skb_gbeth(ndev, entry, desc);
+ dma_sync_single_for_cpu(ndev->dev.parent, le32_to_cpu(desc->dptr),
+ desc_len, DMA_FROM_DEVICE);
+
if (die_dt == DT_FSINGLE || die_dt == DT_FSTART) {
/* Start of packet:
- * Set initial data length.
+ * Prepare an SKB and add initial data.
*/
+ skb = napi_build_skb(rx_addr, info->rx_buffer_size);
+ if (unlikely(!skb)) {
+ stats->rx_errors++;
+ page_pool_put_page(priv->rx_pool, rx_buff->page, 0, true);
+ break;
+ }
+ skb_mark_for_recycle(skb);
skb_put(skb, desc_len);

/* Save this SKB if the packet spans multiple
@@ -832,14 +837,23 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
priv->rx_1st_skb = skb;
} else {
/* Continuing a packet:
- * Move data into the saved SKB.
+ * Add this buffer as an RX frag.
*/
- skb_copy_to_linear_data_offset(priv->rx_1st_skb,
- priv->rx_1st_skb->len,
- skb->data,
- desc_len);
- skb_put(priv->rx_1st_skb, desc_len);
- dev_kfree_skb(skb);
+
+ /* rx_1st_skb will be NULL if napi_build_skb()
+ * failed for the first descriptor of a
+ * multi-descriptor packet.
+ */
+ if (unlikely(!priv->rx_1st_skb)) {
+ stats->rx_errors++;
+ page_pool_put_page(priv->rx_pool, rx_buff->page, 0, true);
+ break;
+ }
+
+ skb_add_rx_frag(priv->rx_1st_skb,
+ skb_shinfo(priv->rx_1st_skb)->nr_frags,
+ rx_buff->page, rx_buff->offset,
+ desc_len, info->rx_buffer_size);

/* Set skb to point at the whole packet so that
* we only need one code path for finishing a
@@ -859,7 +873,16 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
stats->rx_bytes += skb->len;
napi_gro_receive(&priv->napi[q], skb);
rx_packets++;
+
+ /* Clear rx_1st_skb so that it will only be
+ * non-NULL when valid.
+ */
+ if (die_dt == DT_FEND)
+ priv->rx_1st_skb = NULL;
}
+
+ /* Mark this RX buffer as consumed. */
+ rx_buff->page = NULL;
}
}

@@ -875,6 +898,7 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
static int ravb_rx_rcar(struct net_device *ndev, int budget, int q)
{
struct ravb_private *priv = netdev_priv(ndev);
+ const struct ravb_hw_info *info = priv->info;
struct net_device_stats *stats = &priv->stats[q];
struct ravb_ex_rx_desc *desc;
struct sk_buff *skb;
@@ -917,13 +941,20 @@ static int ravb_rx_rcar(struct net_device *ndev, int budget, int q)
if (desc_status & MSC_CEEF)
stats->rx_missed_errors++;
} else {
+ struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
+ void *rx_addr = page_address(rx_buff->page) + rx_buff->offset;
u32 get_ts = priv->tstamp_rx_ctrl & RAVB_RXTSTAMP_TYPE;

- skb = priv->rx_skb[q][entry];
- priv->rx_skb[q][entry] = NULL;
- dma_unmap_single(ndev->dev.parent, le32_to_cpu(desc->dptr),
- priv->info->rx_max_frame_size,
- DMA_FROM_DEVICE);
+ skb = napi_build_skb(rx_addr, info->rx_buffer_size);
+ if (unlikely(!skb)) {
+ stats->rx_errors++;
+ page_pool_put_page(priv->rx_pool, rx_buff->page, 0, true);
+ break;
+ }
+ dma_sync_single_for_cpu(ndev->dev.parent, le32_to_cpu(desc->dptr),
+ pkt_len, DMA_FROM_DEVICE);
+ rx_buff->page = NULL;
+ skb_mark_for_recycle(skb);
get_ts &= (q == RAVB_NC) ?
RAVB_RXTSTAMP_TYPE_V2_L2_EVENT :
~RAVB_RXTSTAMP_TYPE_V2_L2_EVENT;
@@ -2588,8 +2619,8 @@ static const struct ravb_hw_info ravb_gen3_hw_info = {
.net_features = NETIF_F_RXCSUM,
.stats_len = ARRAY_SIZE(ravb_gstrings_stats),
.tccr_mask = TCCR_TSRQ0 | TCCR_TSRQ1 | TCCR_TSRQ2 | TCCR_TSRQ3,
+ .rx_buffer_size = SZ_2K + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
.rx_max_frame_size = SZ_2K,
- .rx_max_desc_use = SZ_2K - ETH_FCS_LEN + sizeof(__sum16),
.rx_desc_size = sizeof(struct ravb_ex_rx_desc),
.internal_delay = 1,
.tx_counters = 1,
@@ -2612,8 +2643,8 @@ static const struct ravb_hw_info ravb_gen2_hw_info = {
.net_features = NETIF_F_RXCSUM,
.stats_len = ARRAY_SIZE(ravb_gstrings_stats),
.tccr_mask = TCCR_TSRQ0 | TCCR_TSRQ1 | TCCR_TSRQ2 | TCCR_TSRQ3,
+ .rx_buffer_size = SZ_2K + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
.rx_max_frame_size = SZ_2K,
- .rx_max_desc_use = SZ_2K - ETH_FCS_LEN + sizeof(__sum16),
.rx_desc_size = sizeof(struct ravb_ex_rx_desc),
.aligned_tx = 1,
.gptp = 1,
@@ -2633,8 +2664,8 @@ static const struct ravb_hw_info ravb_rzv2m_hw_info = {
.net_features = NETIF_F_RXCSUM,
.stats_len = ARRAY_SIZE(ravb_gstrings_stats),
.tccr_mask = TCCR_TSRQ0 | TCCR_TSRQ1 | TCCR_TSRQ2 | TCCR_TSRQ3,
+ .rx_buffer_size = SZ_2K + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
.rx_max_frame_size = SZ_2K,
- .rx_max_desc_use = SZ_2K - ETH_FCS_LEN + sizeof(__sum16),
.rx_desc_size = sizeof(struct ravb_ex_rx_desc),
.multi_irqs = 1,
.err_mgmt_irqs = 1,
@@ -2656,8 +2687,8 @@ static const struct ravb_hw_info gbeth_hw_info = {
.net_features = NETIF_F_RXCSUM | NETIF_F_HW_CSUM,
.stats_len = ARRAY_SIZE(ravb_gstrings_stats_gbeth),
.tccr_mask = TCCR_TSRQ0,
+ .rx_buffer_size = SZ_2K,
.rx_max_frame_size = SZ_8K,
- .rx_max_desc_use = 4080,
.rx_desc_size = sizeof(struct ravb_rx_desc),
.aligned_tx = 1,
.tx_counters = 1,
--
2.39.2


2024-04-15 11:23:25

by Niklas Söderlund

[permalink] [raw]
Subject: Re: [net-next RFC v3 1/7] net: ravb: Simplify poll & receive functions

Hi Paul,

I really like this patch!

One nit below.

On 2024-04-15 10:47:58 +0100, Paul Barker wrote:
> We don't need to pass the work budget to ravb_rx() by reference, it's
> cleaner to pass this by value and return the amount of work done. This
> allows us to simplify the ravb_poll() function and use the common
> `work_done` variable name seen in other network drivers for consistency
> and ease of understanding.
>
> This is a pure refactor and should not affect behaviour.
>
> Signed-off-by: Paul Barker <[email protected]>
> ---
> drivers/net/ethernet/renesas/ravb.h | 2 +-
> drivers/net/ethernet/renesas/ravb_main.c | 29 ++++++++++--------------
> 2 files changed, 13 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
> index b48935ec7e28..71de2a7aa27c 100644
> --- a/drivers/net/ethernet/renesas/ravb.h
> +++ b/drivers/net/ethernet/renesas/ravb.h
> @@ -1039,7 +1039,7 @@ struct ravb_ptp {
> };
>
> struct ravb_hw_info {
> - bool (*receive)(struct net_device *ndev, int *quota, int q);
> + int (*receive)(struct net_device *ndev, int budget, int q);
> void (*set_rate)(struct net_device *ndev);
> int (*set_feature)(struct net_device *ndev, netdev_features_t features);
> int (*dmac_init)(struct net_device *ndev);
> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
> index fd2679ce4e3d..33f8043143c1 100644
> --- a/drivers/net/ethernet/renesas/ravb_main.c
> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> @@ -759,7 +759,7 @@ static struct sk_buff *ravb_get_skb_gbeth(struct net_device *ndev, int entry,
> }
>
> /* Packet receive function for Gigabit Ethernet */
> -static bool ravb_rx_gbeth(struct net_device *ndev, int *quota, int q)
> +static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
> {
> struct ravb_private *priv = netdev_priv(ndev);
> const struct ravb_hw_info *info = priv->info;
> @@ -778,7 +778,7 @@ static bool ravb_rx_gbeth(struct net_device *ndev, int *quota, int q)
> limit = priv->dirty_rx[q] + priv->num_rx_ring[q] - priv->cur_rx[q];
> stats = &priv->stats[q];
>
> - for (i = 0; i < limit && rx_packets < *quota; i++, priv->cur_rx[q]++) {
> + for (i = 0; i < limit && rx_packets < budget; i++, priv->cur_rx[q]++) {
> entry = priv->cur_rx[q] % priv->num_rx_ring[q];
> desc = &priv->rx_ring[q].desc[entry];
> if (desc->die_dt == DT_FEMPTY)
> @@ -881,13 +881,11 @@ static bool ravb_rx_gbeth(struct net_device *ndev, int *quota, int q)
> desc->die_dt = DT_FEMPTY;
> }
>
> - stats->rx_packets += rx_packets;
> - *quota -= rx_packets;
> - return *quota == 0;
> + return rx_packets;
> }
>
> /* Packet receive function for Ethernet AVB */
> -static bool ravb_rx_rcar(struct net_device *ndev, int *quota, int q)
> +static int ravb_rx_rcar(struct net_device *ndev, int budget, int q)
> {
> struct ravb_private *priv = netdev_priv(ndev);
> const struct ravb_hw_info *info = priv->info;
> @@ -904,7 +902,7 @@ static bool ravb_rx_rcar(struct net_device *ndev, int *quota, int q)
> int i;
>
> limit = priv->dirty_rx[q] + priv->num_rx_ring[q] - priv->cur_rx[q];
> - for (i = 0; i < limit && rx_packets < *quota; i++, priv->cur_rx[q]++) {
> + for (i = 0; i < limit && rx_packets < budget; i++, priv->cur_rx[q]++) {
> entry = priv->cur_rx[q] % priv->num_rx_ring[q];
> desc = &priv->rx_ring[q].ex_desc[entry];
> if (desc->die_dt == DT_FEMPTY)
> @@ -992,18 +990,16 @@ static bool ravb_rx_rcar(struct net_device *ndev, int *quota, int q)
> desc->die_dt = DT_FEMPTY;
> }
>
> - stats->rx_packets += rx_packets;

Don't we still need to update the statistics? Same for ravb_rx_gbeth().

> - *quota -= rx_packets;
> - return *quota == 0;
> + return rx_packets;
> }
>
> /* Packet receive function for Ethernet AVB */
> -static bool ravb_rx(struct net_device *ndev, int *quota, int q)
> +static int ravb_rx(struct net_device *ndev, int budget, int q)
> {
> struct ravb_private *priv = netdev_priv(ndev);
> const struct ravb_hw_info *info = priv->info;
>
> - return info->receive(ndev, quota, q);
> + return info->receive(ndev, budget, q);
> }
>
> static void ravb_rcv_snd_disable(struct net_device *ndev)
> @@ -1320,13 +1316,12 @@ static int ravb_poll(struct napi_struct *napi, int budget)
> unsigned long flags;
> int q = napi - priv->napi;
> int mask = BIT(q);
> - int quota = budget;
> - bool unmask;
> + int work_done;
>
> /* Processing RX Descriptor Ring */
> /* Clear RX interrupt */
> ravb_write(ndev, ~(mask | RIS0_RESERVED), RIS0);
> - unmask = !ravb_rx(ndev, &quota, q);
> + work_done = ravb_rx(ndev, budget, q);
>
> /* Processing TX Descriptor Ring */
> spin_lock_irqsave(&priv->lock, flags);
> @@ -1345,7 +1340,7 @@ static int ravb_poll(struct napi_struct *napi, int budget)
> if (priv->rx_fifo_errors != ndev->stats.rx_fifo_errors)
> ndev->stats.rx_fifo_errors = priv->rx_fifo_errors;
>
> - if (!unmask)
> + if (work_done == budget)
> goto out;
>
> napi_complete(napi);
> @@ -1362,7 +1357,7 @@ static int ravb_poll(struct napi_struct *napi, int budget)
> spin_unlock_irqrestore(&priv->lock, flags);
>
> out:
> - return budget - quota;
> + return work_done;
> }
>
> static void ravb_set_duplex_gbeth(struct net_device *ndev)
> --
> 2.39.2
>

--
Kind Regards,
Niklas Söderlund

2024-04-15 11:31:26

by Paul Barker

[permalink] [raw]
Subject: Re: [net-next RFC v3 1/7] net: ravb: Simplify poll & receive functions

On 15/04/2024 12:23, Niklas Söderlund wrote:
> Hi Paul,
>
> I really like this patch!
>
> One nit below.
>
> On 2024-04-15 10:47:58 +0100, Paul Barker wrote:
>> We don't need to pass the work budget to ravb_rx() by reference, it's
>> cleaner to pass this by value and return the amount of work done. This
>> allows us to simplify the ravb_poll() function and use the common
>> `work_done` variable name seen in other network drivers for consistency
>> and ease of understanding.
>>
>> This is a pure refactor and should not affect behaviour.
>>
>> Signed-off-by: Paul Barker <[email protected]>
>> ---
>> drivers/net/ethernet/renesas/ravb.h | 2 +-
>> drivers/net/ethernet/renesas/ravb_main.c | 29 ++++++++++--------------
>> 2 files changed, 13 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
>> index b48935ec7e28..71de2a7aa27c 100644
>> --- a/drivers/net/ethernet/renesas/ravb.h
>> +++ b/drivers/net/ethernet/renesas/ravb.h
>> @@ -1039,7 +1039,7 @@ struct ravb_ptp {
>> };
>>
>> struct ravb_hw_info {
>> - bool (*receive)(struct net_device *ndev, int *quota, int q);
>> + int (*receive)(struct net_device *ndev, int budget, int q);
>> void (*set_rate)(struct net_device *ndev);
>> int (*set_feature)(struct net_device *ndev, netdev_features_t features);
>> int (*dmac_init)(struct net_device *ndev);
>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>> index fd2679ce4e3d..33f8043143c1 100644
>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
>> @@ -759,7 +759,7 @@ static struct sk_buff *ravb_get_skb_gbeth(struct net_device *ndev, int entry,
>> }
>>
>> /* Packet receive function for Gigabit Ethernet */
>> -static bool ravb_rx_gbeth(struct net_device *ndev, int *quota, int q)
>> +static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>> {
>> struct ravb_private *priv = netdev_priv(ndev);
>> const struct ravb_hw_info *info = priv->info;
>> @@ -778,7 +778,7 @@ static bool ravb_rx_gbeth(struct net_device *ndev, int *quota, int q)
>> limit = priv->dirty_rx[q] + priv->num_rx_ring[q] - priv->cur_rx[q];
>> stats = &priv->stats[q];
>>
>> - for (i = 0; i < limit && rx_packets < *quota; i++, priv->cur_rx[q]++) {
>> + for (i = 0; i < limit && rx_packets < budget; i++, priv->cur_rx[q]++) {
>> entry = priv->cur_rx[q] % priv->num_rx_ring[q];
>> desc = &priv->rx_ring[q].desc[entry];
>> if (desc->die_dt == DT_FEMPTY)
>> @@ -881,13 +881,11 @@ static bool ravb_rx_gbeth(struct net_device *ndev, int *quota, int q)
>> desc->die_dt = DT_FEMPTY;
>> }
>>
>> - stats->rx_packets += rx_packets;
>> - *quota -= rx_packets;
>> - return *quota == 0;
>> + return rx_packets;
>> }
>>
>> /* Packet receive function for Ethernet AVB */
>> -static bool ravb_rx_rcar(struct net_device *ndev, int *quota, int q)
>> +static int ravb_rx_rcar(struct net_device *ndev, int budget, int q)
>> {
>> struct ravb_private *priv = netdev_priv(ndev);
>> const struct ravb_hw_info *info = priv->info;
>> @@ -904,7 +902,7 @@ static bool ravb_rx_rcar(struct net_device *ndev, int *quota, int q)
>> int i;
>>
>> limit = priv->dirty_rx[q] + priv->num_rx_ring[q] - priv->cur_rx[q];
>> - for (i = 0; i < limit && rx_packets < *quota; i++, priv->cur_rx[q]++) {
>> + for (i = 0; i < limit && rx_packets < budget; i++, priv->cur_rx[q]++) {
>> entry = priv->cur_rx[q] % priv->num_rx_ring[q];
>> desc = &priv->rx_ring[q].ex_desc[entry];
>> if (desc->die_dt == DT_FEMPTY)
>> @@ -992,18 +990,16 @@ static bool ravb_rx_rcar(struct net_device *ndev, int *quota, int q)
>> desc->die_dt = DT_FEMPTY;
>> }
>>
>> - stats->rx_packets += rx_packets;
>
> Don't we still need to update the statistics? Same for ravb_rx_gbeth().

Good catch! This was present in the previous version of the series [1],
but I accidentally dropped it while splitting out the bugfix patches.
I'll add it back in for the next version.

[1]: https://lore.kernel.org/all/[email protected]/

Thanks,

--
Paul Barker


Attachments:
OpenPGP_0x27F4B3459F002257.asc (3.49 kB)
OpenPGP public key
OpenPGP_signature.asc (243.00 B)
OpenPGP digital signature
Download all attachments

2024-04-15 12:16:49

by Niklas Söderlund

[permalink] [raw]
Subject: Re: [net-next RFC v3 7/7] net: ravb: Allocate RX buffers via page pool

Hi Paul,

I think using page pool is a good idea!

On 2024-04-15 10:48:04 +0100, Paul Barker wrote:
> This patch makes multiple changes that can't be separated:
>
> 1) Allocate plain RX buffers via a page pool instead of allocating
> SKBs, then use build_skb() when a packet is received.
> 2) For GbEth IP, reduce the RX buffer size to 2kB.
> 3) For GbEth IP, merge packets which span more than one RX descriptor
> as SKB fragments instead of copying data.
>
> Implementing (1) without (2) would require the use of an order-1 page
> pool (instead of an order-0 page pool split into page fragments) for
> GbEth.
>
> Implementing (2) without (3) would leave us no space to re-assemble
> packets which span more than one RX descriptor.
>
> Implementing (3) without (1) would not be possible as the network stack
> expects to use put_page() or page_pool_put_page() to free SKB fragments
> after an SKB is consumed.
>
> This patch gives the following improvements during testing with iperf3.
>
> * RZ/G2L:
> * TCP RX: same bandwidth at -43% CPU load (70% -> 40%)
> * UDP RX: same bandwidth at -17% CPU load (88% -> 74%)
>
> * RZ/G2UL:
> * TCP RX: +30% bandwidth (726Mbps -> 941Mbps)
> * UDP RX: +417% bandwidth (108Mbps -> 558Mbps)
>
> * RZ/G3S:
> * TCP RX: +64% bandwidth (562Mbps -> 920Mbps)
> * UDP RX: +420% bandwidth (90Mbps -> 468Mbps)
>
> * RZ/Five:
> * TCP RX: +217% bandwidth (145Mbps -> 459Mbps)
> * UDP RX: +470% bandwidth (20Mbps -> 114Mbps)
>
> There is no significant impact on bandwidth or CPU load in testing on
> RZ/G2H or R-Car M3N.
>
> Signed-off-by: Paul Barker <[email protected]>
> ---
> drivers/net/ethernet/renesas/ravb.h | 10 +-
> drivers/net/ethernet/renesas/ravb_main.c | 209 +++++++++++++----------
> 2 files changed, 128 insertions(+), 91 deletions(-)
>
> diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
> index 9c6392ade2f1..4348366c3dc7 100644
> --- a/drivers/net/ethernet/renesas/ravb.h
> +++ b/drivers/net/ethernet/renesas/ravb.h
> @@ -1050,8 +1050,8 @@ struct ravb_hw_info {
> netdev_features_t net_features;
> int stats_len;
> u32 tccr_mask;
> + u32 rx_buffer_size;
> u32 rx_max_frame_size;
> - u32 rx_max_desc_use;
> u32 rx_desc_size;
> unsigned aligned_tx: 1;
> unsigned needs_irq_coalesce:1; /* Needs software IRQ coalescing */
> @@ -1071,6 +1071,11 @@ struct ravb_hw_info {
> unsigned half_duplex:1; /* E-MAC supports half duplex mode */
> };
>
> +struct ravb_rx_buffer {
> + struct page *page;
> + unsigned int offset;
> +};
> +
> struct ravb_private {
> struct net_device *ndev;
> struct platform_device *pdev;
> @@ -1094,7 +1099,8 @@ struct ravb_private {
> struct ravb_tx_desc *tx_ring[NUM_TX_QUEUE];
> void *tx_align[NUM_TX_QUEUE];
> struct sk_buff *rx_1st_skb;
> - struct sk_buff **rx_skb[NUM_RX_QUEUE];
> + struct page_pool *rx_pool;

Don't we need a page pool per queue? Else multiple calls to
ravb_ring_init() and ravb_ring_free() for different queues will
otherwise risk allocating over a previous queue and multiple free the
same one.

> + struct ravb_rx_buffer *rx_buffers[NUM_RX_QUEUE];
> struct sk_buff **tx_skb[NUM_TX_QUEUE];
> u32 rx_over_errors;
> u32 rx_fifo_errors;
> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
> index 7434faf0820c..892a3eadef1e 100644
> --- a/drivers/net/ethernet/renesas/ravb_main.c
> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> @@ -30,6 +30,7 @@
> #include <linux/reset.h>
> #include <linux/math64.h>
> #include <net/ip.h>
> +#include <net/page_pool/helpers.h>
>
> #include "ravb.h"
>
> @@ -113,25 +114,6 @@ static void ravb_set_rate_rcar(struct net_device *ndev)
> }
> }
>
> -static struct sk_buff *
> -ravb_alloc_skb(struct net_device *ndev, const struct ravb_hw_info *info,
> - gfp_t gfp_mask)
> -{
> - struct sk_buff *skb;
> - u32 reserve;
> -
> - skb = __netdev_alloc_skb(ndev, info->rx_max_frame_size + RAVB_ALIGN - 1,
> - gfp_mask);
> - if (!skb)
> - return NULL;
> -
> - reserve = (unsigned long)skb->data & (RAVB_ALIGN - 1);
> - if (reserve)
> - skb_reserve(skb, RAVB_ALIGN - reserve);
> -
> - return skb;
> -}
> -
> /* Get MAC address from the MAC address registers
> *
> * Ethernet AVB device doesn't have ROM for MAC address.
> @@ -257,21 +239,10 @@ static void ravb_rx_ring_free(struct net_device *ndev, int q)
> {
> struct ravb_private *priv = netdev_priv(ndev);
> unsigned int ring_size;
> - unsigned int i;
>
> if (!priv->rx_ring[q].raw)
> return;
>
> - for (i = 0; i < priv->num_rx_ring[q]; i++) {
> - struct ravb_rx_desc *desc = ravb_rx_get_desc(priv, q, i);
> -
> - if (!dma_mapping_error(ndev->dev.parent,
> - le32_to_cpu(desc->dptr)))
> - dma_unmap_single(ndev->dev.parent,
> - le32_to_cpu(desc->dptr),
> - priv->info->rx_max_frame_size,
> - DMA_FROM_DEVICE);
> - }
> ring_size = priv->info->rx_desc_size * (priv->num_rx_ring[q] + 1);
> dma_free_coherent(ndev->dev.parent, ring_size, priv->rx_ring[q].raw,
> priv->rx_desc_dma[q]);
> @@ -298,13 +269,14 @@ static void ravb_ring_free(struct net_device *ndev, int q)
> priv->tx_ring[q] = NULL;
> }
>
> - /* Free RX skb ringbuffer */
> - if (priv->rx_skb[q]) {
> - for (i = 0; i < priv->num_rx_ring[q]; i++)
> - dev_kfree_skb(priv->rx_skb[q][i]);
> + /* Free RX buffers */
> + for (i = 0; i < priv->num_rx_ring[q]; i++) {
> + if (priv->rx_buffers[q][i].page)
> + page_pool_put_page(priv->rx_pool, priv->rx_buffers[q][i].page, 0, true);
> }
> - kfree(priv->rx_skb[q]);
> - priv->rx_skb[q] = NULL;
> + kfree(priv->rx_buffers[q]);
> + priv->rx_buffers[q] = NULL;
> + page_pool_destroy(priv->rx_pool);
>
> /* Free aligned TX buffers */
> kfree(priv->tx_align[q]);
> @@ -317,35 +289,54 @@ static void ravb_ring_free(struct net_device *ndev, int q)
> priv->tx_skb[q] = NULL;
> }
>
> +static int
> +ravb_alloc_rx_buffer(struct net_device *ndev, int q, u32 entry, gfp_t gfp_mask,
> + __le32 *dptr)

Why not pass the struct ravb_rx_desc instead of a dptr? Then the
function can deal with the error case and fill in rx_desc->dptr and
rx_desc->ds_cc directly making the caller simpler.

> +{
> + struct ravb_private *priv = netdev_priv(ndev);
> + const struct ravb_hw_info *info = priv->info;
> + struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
> + dma_addr_t dma_addr;
> + unsigned int size;
> +
> + size = info->rx_buffer_size;
> + rx_buff->page = page_pool_alloc(priv->rx_pool, &rx_buff->offset, &size,
> + gfp_mask);
> + if (unlikely(!rx_buff->page))
> + return -ENOMEM;
> +
> + dma_addr = page_pool_get_dma_addr(rx_buff->page) + rx_buff->offset;
> + dma_sync_single_for_device(ndev->dev.parent, dma_addr,
> + info->rx_buffer_size, DMA_FROM_DEVICE);
> + *dptr = cpu_to_le32(dma_addr);
> + return 0;
> +}
> +
> static u32
> ravb_rx_ring_refill(struct net_device *ndev, int q, u32 count, gfp_t gfp_mask)
> {
> struct ravb_private *priv = netdev_priv(ndev);
> const struct ravb_hw_info *info = priv->info;
> struct ravb_rx_desc *rx_desc;
> - dma_addr_t dma_addr;
> u32 i, entry;
>
> for (i = 0; i < count; i++) {
> entry = (priv->dirty_rx[q] + i) % priv->num_rx_ring[q];
> rx_desc = ravb_rx_get_desc(priv, q, entry);
> - rx_desc->ds_cc = cpu_to_le16(info->rx_max_desc_use);
>
> - if (!priv->rx_skb[q][entry]) {
> - priv->rx_skb[q][entry] = ravb_alloc_skb(ndev, info, gfp_mask);
> - if (!priv->rx_skb[q][entry])
> - break;
> - dma_addr = dma_map_single(ndev->dev.parent,
> - priv->rx_skb[q][entry]->data,
> - priv->info->rx_max_frame_size,
> - DMA_FROM_DEVICE);
> - skb_checksum_none_assert(priv->rx_skb[q][entry]);
> - /* We just set the data size to 0 for a failed mapping
> - * which should prevent DMA from happening...
> - */
> - if (dma_mapping_error(ndev->dev.parent, dma_addr))
> + if (!priv->rx_buffers[q][entry].page) {
> + if (unlikely(ravb_alloc_rx_buffer(ndev, q, entry, gfp_mask,
> + &rx_desc->dptr))) {
> + /* We just set the data size to 0 for a failed mapping
> + * which should prevent DMA from happening...
> + */
> rx_desc->ds_cc = cpu_to_le16(0);
> - rx_desc->dptr = cpu_to_le32(dma_addr);
> + break;
> + }
> +
> + rx_desc->ds_cc = cpu_to_le16(info->rx_buffer_size
> + - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
> + - ETH_FCS_LEN + sizeof(__sum16));

Can a comment be added to why we subtract and add things to the size?

> }
> /* Descriptor type must be set after all the above writes */
> dma_wmb();
> @@ -423,15 +414,32 @@ static int ravb_ring_init(struct net_device *ndev, int q)
> {
> struct ravb_private *priv = netdev_priv(ndev);
> unsigned int num_tx_desc = priv->num_tx_desc;
> + struct page_pool_params params = {
> + .order = 0,
> + .flags = PP_FLAG_DMA_MAP,
> + .pool_size = priv->num_rx_ring[q],
> + .nid = NUMA_NO_NODE,
> + .dev = ndev->dev.parent,
> + .dma_dir = DMA_FROM_DEVICE,
> + };
> unsigned int ring_size;
> u32 num_filled;
>
> - /* Allocate RX and TX skb rings */
> - priv->rx_skb[q] = kcalloc(priv->num_rx_ring[q],
> - sizeof(*priv->rx_skb[q]), GFP_KERNEL);
> + /* Allocate RX page pool and buffers */
> + priv->rx_pool = page_pool_create(&params);

I think we need one pool per queue.

> + if (IS_ERR(priv->rx_pool))
> + goto error;
> +
> + /* Allocate RX buffers */
> + priv->rx_buffers[q] = kcalloc(priv->num_rx_ring[q],
> + sizeof(*priv->rx_buffers[q]), GFP_KERNEL);
> + if (!priv->rx_buffers[q])
> + goto error;
> +
> + /* Allocate TX skb rings */
> priv->tx_skb[q] = kcalloc(priv->num_tx_ring[q],
> sizeof(*priv->tx_skb[q]), GFP_KERNEL);
> - if (!priv->rx_skb[q] || !priv->tx_skb[q])
> + if (!priv->tx_skb[q])
> goto error;
>
> if (num_tx_desc > 1) {
> @@ -755,25 +763,11 @@ static void ravb_rx_csum(struct sk_buff *skb)
> skb_trim(skb, skb->len - sizeof(__sum16));
> }
>
> -static struct sk_buff *ravb_get_skb_gbeth(struct net_device *ndev, int entry,
> - struct ravb_rx_desc *desc)
> -{
> - struct ravb_private *priv = netdev_priv(ndev);
> - struct sk_buff *skb;
> -
> - skb = priv->rx_skb[RAVB_BE][entry];
> - priv->rx_skb[RAVB_BE][entry] = NULL;
> - dma_unmap_single(ndev->dev.parent, le32_to_cpu(desc->dptr),
> - ALIGN(priv->info->rx_max_frame_size, 16),
> - DMA_FROM_DEVICE);
> -
> - return skb;
> -}
> -
> /* Packet receive function for Gigabit Ethernet */
> static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
> {
> struct ravb_private *priv = netdev_priv(ndev);
> + const struct ravb_hw_info *info = priv->info;
> struct net_device_stats *stats;
> struct ravb_rx_desc *desc;
> struct sk_buff *skb;
> @@ -817,12 +811,23 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
> if (desc_status & MSC_CEEF)
> stats->rx_missed_errors++;
> } else {
> + struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
> + void *rx_addr = page_address(rx_buff->page) + rx_buff->offset;
> die_dt = desc->die_dt & 0xF0;
> - skb = ravb_get_skb_gbeth(ndev, entry, desc);
> + dma_sync_single_for_cpu(ndev->dev.parent, le32_to_cpu(desc->dptr),
> + desc_len, DMA_FROM_DEVICE);
> +
> if (die_dt == DT_FSINGLE || die_dt == DT_FSTART) {
> /* Start of packet:
> - * Set initial data length.
> + * Prepare an SKB and add initial data.
> */
> + skb = napi_build_skb(rx_addr, info->rx_buffer_size);
> + if (unlikely(!skb)) {
> + stats->rx_errors++;
> + page_pool_put_page(priv->rx_pool, rx_buff->page, 0, true);
> + break;
> + }
> + skb_mark_for_recycle(skb);
> skb_put(skb, desc_len);
>
> /* Save this SKB if the packet spans multiple
> @@ -832,14 +837,23 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
> priv->rx_1st_skb = skb;
> } else {
> /* Continuing a packet:
> - * Move data into the saved SKB.
> + * Add this buffer as an RX frag.
> */
> - skb_copy_to_linear_data_offset(priv->rx_1st_skb,
> - priv->rx_1st_skb->len,
> - skb->data,
> - desc_len);
> - skb_put(priv->rx_1st_skb, desc_len);
> - dev_kfree_skb(skb);
> +
> + /* rx_1st_skb will be NULL if napi_build_skb()
> + * failed for the first descriptor of a
> + * multi-descriptor packet.
> + */
> + if (unlikely(!priv->rx_1st_skb)) {
> + stats->rx_errors++;
> + page_pool_put_page(priv->rx_pool, rx_buff->page, 0, true);
> + break;
> + }
> +
> + skb_add_rx_frag(priv->rx_1st_skb,
> + skb_shinfo(priv->rx_1st_skb)->nr_frags,
> + rx_buff->page, rx_buff->offset,
> + desc_len, info->rx_buffer_size);
>
> /* Set skb to point at the whole packet so that
> * we only need one code path for finishing a
> @@ -859,7 +873,16 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
> stats->rx_bytes += skb->len;
> napi_gro_receive(&priv->napi[q], skb);
> rx_packets++;
> +
> + /* Clear rx_1st_skb so that it will only be
> + * non-NULL when valid.
> + */
> + if (die_dt == DT_FEND)
> + priv->rx_1st_skb = NULL;
> }
> +
> + /* Mark this RX buffer as consumed. */
> + rx_buff->page = NULL;
> }
> }
>
> @@ -875,6 +898,7 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
> static int ravb_rx_rcar(struct net_device *ndev, int budget, int q)
> {
> struct ravb_private *priv = netdev_priv(ndev);
> + const struct ravb_hw_info *info = priv->info;
> struct net_device_stats *stats = &priv->stats[q];
> struct ravb_ex_rx_desc *desc;
> struct sk_buff *skb;
> @@ -917,13 +941,20 @@ static int ravb_rx_rcar(struct net_device *ndev, int budget, int q)
> if (desc_status & MSC_CEEF)
> stats->rx_missed_errors++;
> } else {
> + struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
> + void *rx_addr = page_address(rx_buff->page) + rx_buff->offset;
> u32 get_ts = priv->tstamp_rx_ctrl & RAVB_RXTSTAMP_TYPE;
>
> - skb = priv->rx_skb[q][entry];
> - priv->rx_skb[q][entry] = NULL;
> - dma_unmap_single(ndev->dev.parent, le32_to_cpu(desc->dptr),
> - priv->info->rx_max_frame_size,
> - DMA_FROM_DEVICE);
> + skb = napi_build_skb(rx_addr, info->rx_buffer_size);
> + if (unlikely(!skb)) {
> + stats->rx_errors++;
> + page_pool_put_page(priv->rx_pool, rx_buff->page, 0, true);
> + break;
> + }
> + dma_sync_single_for_cpu(ndev->dev.parent, le32_to_cpu(desc->dptr),
> + pkt_len, DMA_FROM_DEVICE);
> + rx_buff->page = NULL;
> + skb_mark_for_recycle(skb);
> get_ts &= (q == RAVB_NC) ?
> RAVB_RXTSTAMP_TYPE_V2_L2_EVENT :
> ~RAVB_RXTSTAMP_TYPE_V2_L2_EVENT;
> @@ -2588,8 +2619,8 @@ static const struct ravb_hw_info ravb_gen3_hw_info = {
> .net_features = NETIF_F_RXCSUM,
> .stats_len = ARRAY_SIZE(ravb_gstrings_stats),
> .tccr_mask = TCCR_TSRQ0 | TCCR_TSRQ1 | TCCR_TSRQ2 | TCCR_TSRQ3,
> + .rx_buffer_size = SZ_2K + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
> .rx_max_frame_size = SZ_2K,
> - .rx_max_desc_use = SZ_2K - ETH_FCS_LEN + sizeof(__sum16),
> .rx_desc_size = sizeof(struct ravb_ex_rx_desc),
> .internal_delay = 1,
> .tx_counters = 1,
> @@ -2612,8 +2643,8 @@ static const struct ravb_hw_info ravb_gen2_hw_info = {
> .net_features = NETIF_F_RXCSUM,
> .stats_len = ARRAY_SIZE(ravb_gstrings_stats),
> .tccr_mask = TCCR_TSRQ0 | TCCR_TSRQ1 | TCCR_TSRQ2 | TCCR_TSRQ3,
> + .rx_buffer_size = SZ_2K + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
> .rx_max_frame_size = SZ_2K,
> - .rx_max_desc_use = SZ_2K - ETH_FCS_LEN + sizeof(__sum16),
> .rx_desc_size = sizeof(struct ravb_ex_rx_desc),
> .aligned_tx = 1,
> .gptp = 1,
> @@ -2633,8 +2664,8 @@ static const struct ravb_hw_info ravb_rzv2m_hw_info = {
> .net_features = NETIF_F_RXCSUM,
> .stats_len = ARRAY_SIZE(ravb_gstrings_stats),
> .tccr_mask = TCCR_TSRQ0 | TCCR_TSRQ1 | TCCR_TSRQ2 | TCCR_TSRQ3,
> + .rx_buffer_size = SZ_2K + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
> .rx_max_frame_size = SZ_2K,
> - .rx_max_desc_use = SZ_2K - ETH_FCS_LEN + sizeof(__sum16),
> .rx_desc_size = sizeof(struct ravb_ex_rx_desc),
> .multi_irqs = 1,
> .err_mgmt_irqs = 1,
> @@ -2656,8 +2687,8 @@ static const struct ravb_hw_info gbeth_hw_info = {
> .net_features = NETIF_F_RXCSUM | NETIF_F_HW_CSUM,
> .stats_len = ARRAY_SIZE(ravb_gstrings_stats_gbeth),
> .tccr_mask = TCCR_TSRQ0,
> + .rx_buffer_size = SZ_2K,
> .rx_max_frame_size = SZ_8K,
> - .rx_max_desc_use = 4080,
> .rx_desc_size = sizeof(struct ravb_rx_desc),
> .aligned_tx = 1,
> .tx_counters = 1,
> --
> 2.39.2
>

--
Kind Regards,
Niklas Söderlund

2024-04-19 08:03:14

by Paul Barker

[permalink] [raw]
Subject: Re: [net-next RFC v3 7/7] net: ravb: Allocate RX buffers via page pool

On 15/04/2024 13:16, Niklas Söderlund wrote:
> Hi Paul,
>
> I think using page pool is a good idea!
>
> On 2024-04-15 10:48:04 +0100, Paul Barker wrote:
>> This patch makes multiple changes that can't be separated:
>>
>> 1) Allocate plain RX buffers via a page pool instead of allocating
>> SKBs, then use build_skb() when a packet is received.
>> 2) For GbEth IP, reduce the RX buffer size to 2kB.
>> 3) For GbEth IP, merge packets which span more than one RX descriptor
>> as SKB fragments instead of copying data.
>>
>> Implementing (1) without (2) would require the use of an order-1 page
>> pool (instead of an order-0 page pool split into page fragments) for
>> GbEth.
>>
>> Implementing (2) without (3) would leave us no space to re-assemble
>> packets which span more than one RX descriptor.
>>
>> Implementing (3) without (1) would not be possible as the network stack
>> expects to use put_page() or page_pool_put_page() to free SKB fragments
>> after an SKB is consumed.
>>
>> This patch gives the following improvements during testing with iperf3.
>>
>> * RZ/G2L:
>> * TCP RX: same bandwidth at -43% CPU load (70% -> 40%)
>> * UDP RX: same bandwidth at -17% CPU load (88% -> 74%)
>>
>> * RZ/G2UL:
>> * TCP RX: +30% bandwidth (726Mbps -> 941Mbps)
>> * UDP RX: +417% bandwidth (108Mbps -> 558Mbps)
>>
>> * RZ/G3S:
>> * TCP RX: +64% bandwidth (562Mbps -> 920Mbps)
>> * UDP RX: +420% bandwidth (90Mbps -> 468Mbps)
>>
>> * RZ/Five:
>> * TCP RX: +217% bandwidth (145Mbps -> 459Mbps)
>> * UDP RX: +470% bandwidth (20Mbps -> 114Mbps)
>>
>> There is no significant impact on bandwidth or CPU load in testing on
>> RZ/G2H or R-Car M3N.
>>
>> Signed-off-by: Paul Barker <[email protected]>
>> ---
>> drivers/net/ethernet/renesas/ravb.h | 10 +-
>> drivers/net/ethernet/renesas/ravb_main.c | 209 +++++++++++++----------
>> 2 files changed, 128 insertions(+), 91 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
>> index 9c6392ade2f1..4348366c3dc7 100644
>> --- a/drivers/net/ethernet/renesas/ravb.h
>> +++ b/drivers/net/ethernet/renesas/ravb.h
>> @@ -1050,8 +1050,8 @@ struct ravb_hw_info {
>> netdev_features_t net_features;
>> int stats_len;
>> u32 tccr_mask;
>> + u32 rx_buffer_size;
>> u32 rx_max_frame_size;
>> - u32 rx_max_desc_use;
>> u32 rx_desc_size;
>> unsigned aligned_tx: 1;
>> unsigned needs_irq_coalesce:1; /* Needs software IRQ coalescing */
>> @@ -1071,6 +1071,11 @@ struct ravb_hw_info {
>> unsigned half_duplex:1; /* E-MAC supports half duplex mode */
>> };
>>
>> +struct ravb_rx_buffer {
>> + struct page *page;
>> + unsigned int offset;
>> +};
>> +
>> struct ravb_private {
>> struct net_device *ndev;
>> struct platform_device *pdev;
>> @@ -1094,7 +1099,8 @@ struct ravb_private {
>> struct ravb_tx_desc *tx_ring[NUM_TX_QUEUE];
>> void *tx_align[NUM_TX_QUEUE];
>> struct sk_buff *rx_1st_skb;
>> - struct sk_buff **rx_skb[NUM_RX_QUEUE];
>> + struct page_pool *rx_pool;
>
> Don't we need a page pool per queue? Else multiple calls to
> ravb_ring_init() and ravb_ring_free() for different queues will
> otherwise risk allocating over a previous queue and multiple free the
> same one.

Ack.

>
>> + struct ravb_rx_buffer *rx_buffers[NUM_RX_QUEUE];
>> struct sk_buff **tx_skb[NUM_TX_QUEUE];
>> u32 rx_over_errors;
>> u32 rx_fifo_errors;
>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>> index 7434faf0820c..892a3eadef1e 100644
>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
>> @@ -30,6 +30,7 @@
>> #include <linux/reset.h>
>> #include <linux/math64.h>
>> #include <net/ip.h>
>> +#include <net/page_pool/helpers.h>
>>
>> #include "ravb.h"
>>
>> @@ -113,25 +114,6 @@ static void ravb_set_rate_rcar(struct net_device *ndev)
>> }
>> }
>>
>> -static struct sk_buff *
>> -ravb_alloc_skb(struct net_device *ndev, const struct ravb_hw_info *info,
>> - gfp_t gfp_mask)
>> -{
>> - struct sk_buff *skb;
>> - u32 reserve;
>> -
>> - skb = __netdev_alloc_skb(ndev, info->rx_max_frame_size + RAVB_ALIGN - 1,
>> - gfp_mask);
>> - if (!skb)
>> - return NULL;
>> -
>> - reserve = (unsigned long)skb->data & (RAVB_ALIGN - 1);
>> - if (reserve)
>> - skb_reserve(skb, RAVB_ALIGN - reserve);
>> -
>> - return skb;
>> -}
>> -
>> /* Get MAC address from the MAC address registers
>> *
>> * Ethernet AVB device doesn't have ROM for MAC address.
>> @@ -257,21 +239,10 @@ static void ravb_rx_ring_free(struct net_device *ndev, int q)
>> {
>> struct ravb_private *priv = netdev_priv(ndev);
>> unsigned int ring_size;
>> - unsigned int i;
>>
>> if (!priv->rx_ring[q].raw)
>> return;
>>
>> - for (i = 0; i < priv->num_rx_ring[q]; i++) {
>> - struct ravb_rx_desc *desc = ravb_rx_get_desc(priv, q, i);
>> -
>> - if (!dma_mapping_error(ndev->dev.parent,
>> - le32_to_cpu(desc->dptr)))
>> - dma_unmap_single(ndev->dev.parent,
>> - le32_to_cpu(desc->dptr),
>> - priv->info->rx_max_frame_size,
>> - DMA_FROM_DEVICE);
>> - }
>> ring_size = priv->info->rx_desc_size * (priv->num_rx_ring[q] + 1);
>> dma_free_coherent(ndev->dev.parent, ring_size, priv->rx_ring[q].raw,
>> priv->rx_desc_dma[q]);
>> @@ -298,13 +269,14 @@ static void ravb_ring_free(struct net_device *ndev, int q)
>> priv->tx_ring[q] = NULL;
>> }
>>
>> - /* Free RX skb ringbuffer */
>> - if (priv->rx_skb[q]) {
>> - for (i = 0; i < priv->num_rx_ring[q]; i++)
>> - dev_kfree_skb(priv->rx_skb[q][i]);
>> + /* Free RX buffers */
>> + for (i = 0; i < priv->num_rx_ring[q]; i++) {
>> + if (priv->rx_buffers[q][i].page)
>> + page_pool_put_page(priv->rx_pool, priv->rx_buffers[q][i].page, 0, true);
>> }
>> - kfree(priv->rx_skb[q]);
>> - priv->rx_skb[q] = NULL;
>> + kfree(priv->rx_buffers[q]);
>> + priv->rx_buffers[q] = NULL;
>> + page_pool_destroy(priv->rx_pool);
>>
>> /* Free aligned TX buffers */
>> kfree(priv->tx_align[q]);
>> @@ -317,35 +289,54 @@ static void ravb_ring_free(struct net_device *ndev, int q)
>> priv->tx_skb[q] = NULL;
>> }
>>
>> +static int
>> +ravb_alloc_rx_buffer(struct net_device *ndev, int q, u32 entry, gfp_t gfp_mask,
>> + __le32 *dptr)
>
> Why not pass the struct ravb_rx_desc instead of a dptr? Then the
> function can deal with the error case and fill in rx_desc->dptr and
> rx_desc->ds_cc directly making the caller simpler.

Ack.

>
>> +{
>> + struct ravb_private *priv = netdev_priv(ndev);
>> + const struct ravb_hw_info *info = priv->info;
>> + struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
>> + dma_addr_t dma_addr;
>> + unsigned int size;
>> +
>> + size = info->rx_buffer_size;
>> + rx_buff->page = page_pool_alloc(priv->rx_pool, &rx_buff->offset, &size,
>> + gfp_mask);
>> + if (unlikely(!rx_buff->page))
>> + return -ENOMEM;
>> +
>> + dma_addr = page_pool_get_dma_addr(rx_buff->page) + rx_buff->offset;
>> + dma_sync_single_for_device(ndev->dev.parent, dma_addr,
>> + info->rx_buffer_size, DMA_FROM_DEVICE);
>> + *dptr = cpu_to_le32(dma_addr);
>> + return 0;
>> +}
>> +
>> static u32
>> ravb_rx_ring_refill(struct net_device *ndev, int q, u32 count, gfp_t gfp_mask)
>> {
>> struct ravb_private *priv = netdev_priv(ndev);
>> const struct ravb_hw_info *info = priv->info;
>> struct ravb_rx_desc *rx_desc;
>> - dma_addr_t dma_addr;
>> u32 i, entry;
>>
>> for (i = 0; i < count; i++) {
>> entry = (priv->dirty_rx[q] + i) % priv->num_rx_ring[q];
>> rx_desc = ravb_rx_get_desc(priv, q, entry);
>> - rx_desc->ds_cc = cpu_to_le16(info->rx_max_desc_use);
>>
>> - if (!priv->rx_skb[q][entry]) {
>> - priv->rx_skb[q][entry] = ravb_alloc_skb(ndev, info, gfp_mask);
>> - if (!priv->rx_skb[q][entry])
>> - break;
>> - dma_addr = dma_map_single(ndev->dev.parent,
>> - priv->rx_skb[q][entry]->data,
>> - priv->info->rx_max_frame_size,
>> - DMA_FROM_DEVICE);
>> - skb_checksum_none_assert(priv->rx_skb[q][entry]);
>> - /* We just set the data size to 0 for a failed mapping
>> - * which should prevent DMA from happening...
>> - */
>> - if (dma_mapping_error(ndev->dev.parent, dma_addr))
>> + if (!priv->rx_buffers[q][entry].page) {
>> + if (unlikely(ravb_alloc_rx_buffer(ndev, q, entry, gfp_mask,
>> + &rx_desc->dptr))) {
>> + /* We just set the data size to 0 for a failed mapping
>> + * which should prevent DMA from happening...
>> + */
>> rx_desc->ds_cc = cpu_to_le16(0);
>> - rx_desc->dptr = cpu_to_le32(dma_addr);
>> + break;
>> + }
>> +
>> + rx_desc->ds_cc = cpu_to_le16(info->rx_buffer_size
>> + - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
>> + - ETH_FCS_LEN + sizeof(__sum16));
>
> Can a comment be added to why we subtract and add things to the size?

Ack.

I'll address these in v4.

--
Paul Barker


Attachments:
OpenPGP_0x27F4B3459F002257.asc (3.49 kB)
OpenPGP public key
OpenPGP_signature.asc (243.00 B)
OpenPGP digital signature
Download all attachments

2024-04-19 20:04:22

by Sergey Shtylyov

[permalink] [raw]
Subject: Re: [net-next RFC v3 4/7] net: ravb: Refactor GbEth RX code path

On 4/15/24 12:48 PM, Paul Barker wrote:

> We can reduce code duplication in ravb_rx_gbeth() and add comments to
> make the code flow easier to understand.
>
> Signed-off-by: Paul Barker <[email protected]>
> ---
> drivers/net/ethernet/renesas/ravb_main.c | 70 ++++++++++++------------
> 1 file changed, 35 insertions(+), 35 deletions(-)
>
> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
> index baa01bd81f2d..12618171f6d5 100644
> --- a/drivers/net/ethernet/renesas/ravb_main.c
> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> @@ -818,47 +818,47 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
> stats->rx_missed_errors++;
> } else {
> die_dt = desc->die_dt & 0xF0;
> - switch (die_dt) {
> - case DT_FSINGLE:
> - skb = ravb_get_skb_gbeth(ndev, entry, desc);
> + skb = ravb_get_skb_gbeth(ndev, entry, desc);
> + if (die_dt == DT_FSINGLE || die_dt == DT_FSTART) {

No, please keep using *switch* statement here...

> + /* Start of packet:
> + * Set initial data length.
> + */
> skb_put(skb, desc_len);
> +
> + /* Save this SKB if the packet spans multiple
> + * descriptors.
> + */
> + if (die_dt == DT_FSTART)
> + priv->rx_1st_skb = skb;

Hm, looks like we can do that under *case* DT_FSTART: and do
a fallthru to *case* DT_FSINGLE: after that...

> + } else {
> + /* Continuing a packet:
> + * Move data into the saved SKB.
> + */
> + skb_copy_to_linear_data_offset(priv->rx_1st_skb,
> + priv->rx_1st_skb->len,
> + skb->data,
> + desc_len);
> + skb_put(priv->rx_1st_skb, desc_len);
> + dev_kfree_skb(skb);
> +
> + /* Set skb to point at the whole packet so that
> + * we only need one code path for finishing a
> + * packet.
> + */
> + skb = priv->rx_1st_skb;
> + }
> +
> + if (die_dt == DT_FSINGLE || die_dt == DT_FEND) {

Again, keep using *switch*, please...

> + /* Finishing a packet:
> + * Determine protocol & checksum, hand off to
> + * NAPI and update our stats.
> + */
> skb->protocol = eth_type_trans(skb, ndev);
> if (ndev->features & NETIF_F_RXCSUM)
> ravb_rx_csum_gbeth(skb);
> + stats->rx_bytes += skb->len;
> napi_gro_receive(&priv->napi[q], skb);
> rx_packets++;
[...]

MBR, Sergey

2024-04-19 20:32:38

by Sergey Shtylyov

[permalink] [raw]
Subject: Re: [net-next RFC v3 6/7] net: ravb: Use NAPI threaded mode on 1-core CPUs with GbEth IP

On 4/15/24 12:48 PM, Paul Barker wrote:

> NAPI Threaded mode (along with the previously enabled SW IRQ Coalescing)
> is required to improve network stack performance for single core SoCs
> using the GbEth IP (currently the RZ/G2L SoC family and the RZ/G3S SoC).
>
> This patch gives the following improvements during testing with iperf3.
>
> * RZ/G2UL:
> * TCP TX: +32% bandwidth (638Mbps -> 841Mbps)
> * TXP RX: +8.8% bandwidth (667Mbps -> 726Mbps)
> * UDP RX: +104% bandwidth (53Mbps -> 108Mbps)
>
> * RZ/G3S:
> * TCP TX: 29% bandwidth (529Mbps -> 681Mbps)
> * UDP RX: +1290% bandwidth (6.46Mbps -> 90Mbps)
>
> * RZ/Five:
> * UDP RX: Test no longer crashes (0 -> 20 Mbps)
>
> This patch gives the following reductions in performance in the same
> testing:
>
> * RZ/G2UL:
> * UDP TX: -7.5% bandwidth (594Mbps -> 549Mbps)
>
> * RZ/G3S:
> * UDP TX: -5% bandwidth (625Mbps -> 594Mbps)
>
> These losses are considered acceptable given the benefits shown above.
> If UDP TX bandwidth must be maximised for a particular use case, NAPI
> threaded mode can be disabled at runtime via sysfs writes.
>
> The improvement of UDP RX bandwidth for the single core SoCs (RZ/G2UL &
> RZ/G3S) is particularly critical.
>
> Signed-off-by: Paul Barker <[email protected]>

Reviewed-by: Sergey Shtylyov <[email protected]>

[...]

MBR, Sergey

2024-04-21 15:49:55

by Paul Barker

[permalink] [raw]
Subject: Re: [net-next RFC v3 4/7] net: ravb: Refactor GbEth RX code path

On 19/04/2024 21:03, Sergey Shtylyov wrote:
> On 4/15/24 12:48 PM, Paul Barker wrote:
>
>> We can reduce code duplication in ravb_rx_gbeth() and add comments to
>> make the code flow easier to understand.
>>
>> Signed-off-by: Paul Barker <[email protected]>
>> ---
>> drivers/net/ethernet/renesas/ravb_main.c | 70 ++++++++++++------------
>> 1 file changed, 35 insertions(+), 35 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>> index baa01bd81f2d..12618171f6d5 100644
>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
>> @@ -818,47 +818,47 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>> stats->rx_missed_errors++;
>> } else {
>> die_dt = desc->die_dt & 0xF0;
>> - switch (die_dt) {
>> - case DT_FSINGLE:
>> - skb = ravb_get_skb_gbeth(ndev, entry, desc);
>> + skb = ravb_get_skb_gbeth(ndev, entry, desc);
>> + if (die_dt == DT_FSINGLE || die_dt == DT_FSTART) {
>
> No, please keep using *switch* statement here...

That's a shame - I much prefer this version with reduced code
duplication, especially when we add page pool support. Duplication with
subtle differences leads to bugs, see [1] for an example.

[1]: https://lore.kernel.org/all/[email protected]/

An alternative would be to drop this refactor patch, then when we add
page pool support we instead use separate functions to avoid code
duplication. At the end of the series, the switch would then look
something like this:

switch (die_dt) {
case DT_FSINGLE:
skb = ravb_rx_build_skb(ndev, q, rx_buff, desc_len);
if (!skb)
break;
ravb_rx_finish_skb(ndev, q, skb);
rx_packets++;
break;
case DT_FSTART:
priv->rx_1st_skb = ravb_rx_build_skb(ndev, q, rx_buff, desc_len);
break;
case DT_FMID:
ravb_rx_add_frag(ndev, q, rx_buff, desc_len);
break;
case DT_FEND:
if (ravb_rx_add_frag(ndev, q, rx_buff, desc_len))
break;
ravb_rx_finish_skb(ndev, q, priv->rx_1st_skb);
rx_packets++;
priv->rx_1st_skb = NULL;
}

Would you prefer that approach?

>
>> + /* Start of packet:
>> + * Set initial data length.
>> + */
>> skb_put(skb, desc_len);
>> +
>> + /* Save this SKB if the packet spans multiple
>> + * descriptors.
>> + */
>> + if (die_dt == DT_FSTART)
>> + priv->rx_1st_skb = skb;
>
> Hm, looks like we can do that under *case* DT_FSTART: and do
> a fallthru to *case* DT_FSINGLE: after that...

A fallthrough would just have to be removed again when page pool support
is added in a later patch in this series, since we will need to call
napi_build_skb() before the skb is valid.

>
>> + } else {
>> + /* Continuing a packet:
>> + * Move data into the saved SKB.
>> + */
>> + skb_copy_to_linear_data_offset(priv->rx_1st_skb,
>> + priv->rx_1st_skb->len,
>> + skb->data,
>> + desc_len);
>> + skb_put(priv->rx_1st_skb, desc_len);
>> + dev_kfree_skb(skb);
>> +
>> + /* Set skb to point at the whole packet so that
>> + * we only need one code path for finishing a
>> + * packet.
>> + */
>> + skb = priv->rx_1st_skb;
>> + }
>> +
>> + if (die_dt == DT_FSINGLE || die_dt == DT_FEND) {
>
> Again, keep using *switch*, please...
>
>> + /* Finishing a packet:
>> + * Determine protocol & checksum, hand off to
>> + * NAPI and update our stats.
>> + */
>> skb->protocol = eth_type_trans(skb, ndev);
>> if (ndev->features & NETIF_F_RXCSUM)
>> ravb_rx_csum_gbeth(skb);
>> + stats->rx_bytes += skb->len;
>> napi_gro_receive(&priv->napi[q], skb);
>> rx_packets++;
> [...]
>
> MBR, Sergey

Thanks,

--
Paul Barker


Attachments:
OpenPGP_0x27F4B3459F002257.asc (3.49 kB)
OpenPGP public key
OpenPGP_signature.asc (243.00 B)
OpenPGP digital signature
Download all attachments

2024-04-21 20:24:04

by Sergey Shtylyov

[permalink] [raw]
Subject: Re: [net-next RFC v3 4/7] net: ravb: Refactor GbEth RX code path

On 4/21/24 6:49 PM, Paul Barker wrote:
[...]

>>> We can reduce code duplication in ravb_rx_gbeth() and add comments to
>>> make the code flow easier to understand.
>>>
>>> Signed-off-by: Paul Barker <[email protected]>
>>> ---
>>> drivers/net/ethernet/renesas/ravb_main.c | 70 ++++++++++++------------
>>> 1 file changed, 35 insertions(+), 35 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>>> index baa01bd81f2d..12618171f6d5 100644
>>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
>>> @@ -818,47 +818,47 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>>> stats->rx_missed_errors++;
>>> } else {
>>> die_dt = desc->die_dt & 0xF0;
>>> - switch (die_dt) {
>>> - case DT_FSINGLE:
>>> - skb = ravb_get_skb_gbeth(ndev, entry, desc);
>>> + skb = ravb_get_skb_gbeth(ndev, entry, desc);
>>> + if (die_dt == DT_FSINGLE || die_dt == DT_FSTART) {
>>
>> No, please keep using *switch* statement here...
>
> That's a shame - I much prefer this version with reduced code
> duplication, especially when we add page pool support. Duplication with
> subtle differences leads to bugs, see [1] for an example.
>
> [1]: https://lore.kernel.org/all/[email protected]/

I wasn't clear enough probably, sorry about that.
What I meant to say was that your use of the *if* statement
wasn't actually justified. Please use the *switch* statement
instead.

[...]

> Thanks,

MBR, Sergey