2024-05-28 15:07:09

by Paul Barker

Subject: [net-next PATCH v4 0/7] Improve GbEth performance on Renesas RZ/G2L and related SoCs

This series aims to improve performance of the GbEth IP in the Renesas
RZ/G2L SoC family and the RZ/G3S SoC, which use the ravb driver. Along
the way, we do some refactoring and ensure that napi_complete_done() is
used in accordance with the NAPI documentation for both GbEth and R-Car
code paths.
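
For reference, a minimal sketch of the poll pattern the NAPI documentation
describes and which the refactored code paths now follow (example_rx() and
example_enable_rx_irq() are placeholder names, not actual ravb symbols):

	static int example_poll(struct napi_struct *napi, int budget)
	{
		/* Process up to 'budget' packets; report how many were handled. */
		int work_done = example_rx(napi, budget);

		/* Only complete NAPI if the budget was not exhausted.
		 * napi_complete_done() accounts for busy polling and deferred
		 * IRQ handling, so interrupts are only re-enabled when it
		 * returns true.
		 */
		if (work_done < budget && napi_complete_done(napi, work_done))
			example_enable_rx_irq(napi);

		return work_done;
	}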

Much of the performance improvement comes from enabling SW IRQ
Coalescing for all SoCs using the GbEth IP, and NAPI Threaded mode for
single core SoCs using the GbEth IP. These can be enabled/disabled at
runtime via sysfs, but our goal is to set sensible defaults which get
good performance on the affected SoCs.
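
As a rough sketch, the driver-side defaults added in patches 5/7 and 6/7
boil down to the following at probe time (condensed from those patches;
both settings remain tunable from userspace afterwards):

	if (info->coalesce_irqs) {
		/* Default-enable software IRQ coalescing for this netdev. */
		netdev_sw_irq_coalesce_default_on(ndev);

		/* On single-core SoCs, also run NAPI in a kernel thread. */
		if (num_present_cpus() == 1)
			dev_set_threaded(ndev, true);
	}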

The rest of the performance improvement comes from using a page pool to
allocate RX buffers, and reducing the allocation size from >8kB to 2kB.

The overall performance impact of this patch series seen in testing with
iperf3 is as follows (see patches 5-7 for more detailed results):
* RZ/G2L:
* TCP TX: +1.8% bandwidth
* TCP RX: +1% bandwidth at 47% less CPU load
* UDP RX: +1% bandwidth at 26% less CPU load

* RZ/G2UL:
* TCP TX: +37% bandwidth
* TCP RX: +43% bandwidth
* UDP TX: -8% bandwidth
* UDP RX: +32500% bandwidth (!)

* RZ/G3S:
* TCP TX: +25% bandwidth
* TCP RX: +76% bandwidth
* UDP TX: -9% bandwidth
* UDP RX: +37900% bandwidth (!)

* RZ/Five:
* TCP TX: +18% bandwidth
* TCP RX: +212% bandwidth
* UDP TX: +2% bandwidth
* UDP RX: +inf bandwidth (test no longer crashes)

There is no significant impact on bandwidth or CPU load in testing on
RZ/G2H or R-Car M3N.

Fixing the crash in UDP RX testing for RZ/Five is a cumulative effect
of patches 1, 2, 5 & 6, so it is very difficult to break this out as a
bugfix for backporting.

Changes v3->v4:
* Dependency patches have merged so this is no longer an RFC.
* Fixed update of stats->rx_packets.
* Simplified refactoring following feedback from Niklas and Sergey.
* Renamed needs_irq_coalesce -> coalesce_irqs.
* Used a separate page pool for each RX queue.
* Passed struct ravb_rx_desc to ravb_alloc_rx_buffer() so that we can
simplify the calling function.
* Explained the calculation of rx_desc->ds_cc.
* Added handling of nonlinear SKBs in ravb_rx_csum_gbeth().
* Used Niklas' suggested commit message for patch 2/7.
* Added Sergey's Reviewed-by tags to patches 5/7 and 6/7.

Changes v2->v3:
* Incorporated feedback on RFC v2 from Sergey.
* Split out bugfixes and rebased. This changed the order of what were
the first 5 patches of v2, and things look a little different, so I've
not picked up Reviewed-by tags from v2.
* Further refactoring and tidy up of RX ring refill and
ravb_rx_gbeth().
* Switched to using a page pool to allocate RX buffers.
* Re-tested and provided updated performance figures.

Changes v1->v2:
* Marked as RFC as the series depends on unmerged patches.
* Refactored R-Car code paths as well as GbEth code paths.
* Updated references to the patches this series depends on.

Paul Barker (7):
net: ravb: Simplify poll & receive functions
net: ravb: Consider busypolling status when re-enabling interrupts
net: ravb: Refactor RX ring refill
net: ravb: Refactor GbEth RX code path
net: ravb: Enable SW IRQ Coalescing for GbEth
net: ravb: Use NAPI threaded mode on 1-core CPUs with GbEth IP
net: ravb: Allocate RX buffers via page pool

drivers/net/ethernet/renesas/ravb.h | 13 +-
drivers/net/ethernet/renesas/ravb_main.c | 459 ++++++++++++-----------
2 files changed, 247 insertions(+), 225 deletions(-)


base-commit: 5233a55a5254ea38dcdd8d836a0f9ee886c3df51
--
2.39.2



2024-05-28 15:07:17

by Paul Barker

Subject: [net-next PATCH v4 3/7] net: ravb: Refactor RX ring refill

To reduce code duplication, we add a new RX ring refill function which
can handle both the initial RX ring population (which was split between
ravb_ring_init() and ravb_ring_format()) and the RX ring refill after
polling (in ravb_rx()).

Signed-off-by: Paul Barker <[email protected]>
---
Changes v3->v4:
* Avoid unnecessarily re-ordering code in ravb_ring_init() and
ravb_ring_format().

drivers/net/ethernet/renesas/ravb_main.c | 148 +++++++++--------------
1 file changed, 55 insertions(+), 93 deletions(-)

diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index 472aa80002be..7df7d2e93a3a 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -317,35 +317,42 @@ static void ravb_ring_free(struct net_device *ndev, int q)
priv->tx_skb[q] = NULL;
}

-static void ravb_rx_ring_format(struct net_device *ndev, int q)
+static u32
+ravb_rx_ring_refill(struct net_device *ndev, int q, u32 count, gfp_t gfp_mask)
{
struct ravb_private *priv = netdev_priv(ndev);
+ const struct ravb_hw_info *info = priv->info;
struct ravb_rx_desc *rx_desc;
- unsigned int rx_ring_size;
dma_addr_t dma_addr;
- unsigned int i;
+ u32 i, entry;

- rx_ring_size = priv->info->rx_desc_size * priv->num_rx_ring[q];
- memset(priv->rx_ring[q].raw, 0, rx_ring_size);
- /* Build RX ring buffer */
- for (i = 0; i < priv->num_rx_ring[q]; i++) {
- /* RX descriptor */
- rx_desc = ravb_rx_get_desc(priv, q, i);
- rx_desc->ds_cc = cpu_to_le16(priv->info->rx_max_desc_use);
- dma_addr = dma_map_single(ndev->dev.parent, priv->rx_skb[q][i]->data,
- priv->info->rx_max_frame_size,
- DMA_FROM_DEVICE);
- /* We just set the data size to 0 for a failed mapping which
- * should prevent DMA from happening...
- */
- if (dma_mapping_error(ndev->dev.parent, dma_addr))
- rx_desc->ds_cc = cpu_to_le16(0);
- rx_desc->dptr = cpu_to_le32(dma_addr);
+ for (i = 0; i < count; i++) {
+ entry = (priv->dirty_rx[q] + i) % priv->num_rx_ring[q];
+ rx_desc = ravb_rx_get_desc(priv, q, entry);
+ rx_desc->ds_cc = cpu_to_le16(info->rx_max_desc_use);
+
+ if (!priv->rx_skb[q][entry]) {
+ priv->rx_skb[q][entry] = ravb_alloc_skb(ndev, info, gfp_mask);
+ if (!priv->rx_skb[q][entry])
+ break;
+ dma_addr = dma_map_single(ndev->dev.parent,
+ priv->rx_skb[q][entry]->data,
+ priv->info->rx_max_frame_size,
+ DMA_FROM_DEVICE);
+ skb_checksum_none_assert(priv->rx_skb[q][entry]);
+ /* We just set the data size to 0 for a failed mapping
+ * which should prevent DMA from happening...
+ */
+ if (dma_mapping_error(ndev->dev.parent, dma_addr))
+ rx_desc->ds_cc = cpu_to_le16(0);
+ rx_desc->dptr = cpu_to_le32(dma_addr);
+ }
+ /* Descriptor type must be set after all the above writes */
+ dma_wmb();
rx_desc->die_dt = DT_FEMPTY;
}
- rx_desc = ravb_rx_get_desc(priv, q, i);
- rx_desc->dptr = cpu_to_le32((u32)priv->rx_desc_dma[q]);
- rx_desc->die_dt = DT_LINKFIX; /* type */
+
+ return i;
}

/* Format skb and descriptor buffer for Ethernet AVB */
@@ -353,6 +360,7 @@ static void ravb_ring_format(struct net_device *ndev, int q)
{
struct ravb_private *priv = netdev_priv(ndev);
unsigned int num_tx_desc = priv->num_tx_desc;
+ struct ravb_rx_desc *rx_desc;
struct ravb_tx_desc *tx_desc;
struct ravb_desc *desc;
unsigned int tx_ring_size = sizeof(*tx_desc) * priv->num_tx_ring[q] *
@@ -364,7 +372,13 @@ static void ravb_ring_format(struct net_device *ndev, int q)
priv->dirty_rx[q] = 0;
priv->dirty_tx[q] = 0;

- ravb_rx_ring_format(ndev, q);
+ /* Regular RX descriptors have already been initialized by
+ * ravb_rx_ring_refill(), we just need to initialize the final link
+ * descriptor.
+ */
+ rx_desc = ravb_rx_get_desc(priv, q, priv->num_rx_ring[q]);
+ rx_desc->dptr = cpu_to_le32((u32)priv->rx_desc_dma[q]);
+ rx_desc->die_dt = DT_LINKFIX; /* type */

memset(priv->tx_ring[q], 0, tx_ring_size);
/* Build TX ring buffer */
@@ -408,11 +422,9 @@ static void *ravb_alloc_rx_desc(struct net_device *ndev, int q)
static int ravb_ring_init(struct net_device *ndev, int q)
{
struct ravb_private *priv = netdev_priv(ndev);
- const struct ravb_hw_info *info = priv->info;
unsigned int num_tx_desc = priv->num_tx_desc;
unsigned int ring_size;
- struct sk_buff *skb;
- unsigned int i;
+ u32 num_filled;

/* Allocate RX and TX skb rings */
priv->rx_skb[q] = kcalloc(priv->num_rx_ring[q],
@@ -422,12 +434,17 @@ static int ravb_ring_init(struct net_device *ndev, int q)
if (!priv->rx_skb[q] || !priv->tx_skb[q])
goto error;

- for (i = 0; i < priv->num_rx_ring[q]; i++) {
- skb = ravb_alloc_skb(ndev, info, GFP_KERNEL);
- if (!skb)
- goto error;
- priv->rx_skb[q][i] = skb;
- }
+ /* Allocate all RX descriptors. */
+ if (!ravb_alloc_rx_desc(ndev, q))
+ goto error;
+
+ /* Populate RX ring buffer. */
+ priv->dirty_rx[q] = 0;
+ ring_size = priv->info->rx_desc_size * priv->num_rx_ring[q];
+ memset(priv->rx_ring[q].raw, 0, ring_size);
+ num_filled = ravb_rx_ring_refill(ndev, q, priv->num_rx_ring[q], GFP_KERNEL);
+ if (num_filled != priv->num_rx_ring[q])
+ goto error;

if (num_tx_desc > 1) {
/* Allocate rings for the aligned buffers */
@@ -437,12 +454,6 @@ static int ravb_ring_init(struct net_device *ndev, int q)
goto error;
}

- /* Allocate all RX descriptors. */
- if (!ravb_alloc_rx_desc(ndev, q))
- goto error;
-
- priv->dirty_rx[q] = 0;
-
/* Allocate all TX descriptors. */
ring_size = sizeof(struct ravb_tx_desc) *
(priv->num_tx_ring[q] * num_tx_desc + 1);
@@ -762,11 +773,9 @@ static struct sk_buff *ravb_get_skb_gbeth(struct net_device *ndev, int entry,
static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
{
struct ravb_private *priv = netdev_priv(ndev);
- const struct ravb_hw_info *info = priv->info;
struct net_device_stats *stats;
struct ravb_rx_desc *desc;
struct sk_buff *skb;
- dma_addr_t dma_addr;
int rx_packets = 0;
u8 desc_status;
u16 desc_len;
@@ -854,32 +863,9 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
}

/* Refill the RX ring buffers. */
- for (; priv->cur_rx[q] - priv->dirty_rx[q] > 0; priv->dirty_rx[q]++) {
- entry = priv->dirty_rx[q] % priv->num_rx_ring[q];
- desc = &priv->rx_ring[q].desc[entry];
- desc->ds_cc = cpu_to_le16(priv->info->rx_max_desc_use);
-
- if (!priv->rx_skb[q][entry]) {
- skb = ravb_alloc_skb(ndev, info, GFP_ATOMIC);
- if (!skb)
- break;
- dma_addr = dma_map_single(ndev->dev.parent,
- skb->data,
- priv->info->rx_max_frame_size,
- DMA_FROM_DEVICE);
- skb_checksum_none_assert(skb);
- /* We just set the data size to 0 for a failed mapping
- * which should prevent DMA from happening...
- */
- if (dma_mapping_error(ndev->dev.parent, dma_addr))
- desc->ds_cc = cpu_to_le16(0);
- desc->dptr = cpu_to_le32(dma_addr);
- priv->rx_skb[q][entry] = skb;
- }
- /* Descriptor type must be set after all the above writes */
- dma_wmb();
- desc->die_dt = DT_FEMPTY;
- }
+ priv->dirty_rx[q] += ravb_rx_ring_refill(ndev, q,
+ priv->cur_rx[q] - priv->dirty_rx[q],
+ GFP_ATOMIC);

stats->rx_packets += rx_packets;
return rx_packets;
@@ -889,12 +875,10 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
static int ravb_rx_rcar(struct net_device *ndev, int budget, int q)
{
struct ravb_private *priv = netdev_priv(ndev);
- const struct ravb_hw_info *info = priv->info;
struct net_device_stats *stats = &priv->stats[q];
struct ravb_ex_rx_desc *desc;
unsigned int limit, i;
struct sk_buff *skb;
- dma_addr_t dma_addr;
struct timespec64 ts;
int rx_packets = 0;
u8 desc_status;
@@ -964,31 +948,9 @@ static int ravb_rx_rcar(struct net_device *ndev, int budget, int q)
}

/* Refill the RX ring buffers. */
- for (; priv->cur_rx[q] - priv->dirty_rx[q] > 0; priv->dirty_rx[q]++) {
- entry = priv->dirty_rx[q] % priv->num_rx_ring[q];
- desc = &priv->rx_ring[q].ex_desc[entry];
- desc->ds_cc = cpu_to_le16(priv->info->rx_max_desc_use);
-
- if (!priv->rx_skb[q][entry]) {
- skb = ravb_alloc_skb(ndev, info, GFP_ATOMIC);
- if (!skb)
- break; /* Better luck next round. */
- dma_addr = dma_map_single(ndev->dev.parent, skb->data,
- priv->info->rx_max_frame_size,
- DMA_FROM_DEVICE);
- skb_checksum_none_assert(skb);
- /* We just set the data size to 0 for a failed mapping
- * which should prevent DMA from happening...
- */
- if (dma_mapping_error(ndev->dev.parent, dma_addr))
- desc->ds_cc = cpu_to_le16(0);
- desc->dptr = cpu_to_le32(dma_addr);
- priv->rx_skb[q][entry] = skb;
- }
- /* Descriptor type must be set after all the above writes */
- dma_wmb();
- desc->die_dt = DT_FEMPTY;
- }
+ priv->dirty_rx[q] += ravb_rx_ring_refill(ndev, q,
+ priv->cur_rx[q] - priv->dirty_rx[q],
+ GFP_ATOMIC);

stats->rx_packets += rx_packets;
return rx_packets;
--
2.39.2


2024-05-28 15:07:23

by Paul Barker

Subject: [net-next PATCH v4 1/7] net: ravb: Simplify poll & receive functions

We don't need to pass the work budget to ravb_rx() by reference, it's
cleaner to pass this by value and return the amount of work done. This
allows us to simplify the ravb_poll() function and use the common
`work_done` variable name seen in other network drivers for consistency
and ease of understanding.

This is a pure refactor and should not affect behaviour.

Signed-off-by: Paul Barker <[email protected]>
---
drivers/net/ethernet/renesas/ravb.h | 2 +-
drivers/net/ethernet/renesas/ravb_main.c | 27 +++++++++++-------------
2 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
index b48935ec7e28..71de2a7aa27c 100644
--- a/drivers/net/ethernet/renesas/ravb.h
+++ b/drivers/net/ethernet/renesas/ravb.h
@@ -1039,7 +1039,7 @@ struct ravb_ptp {
};

struct ravb_hw_info {
- bool (*receive)(struct net_device *ndev, int *quota, int q);
+ int (*receive)(struct net_device *ndev, int budget, int q);
void (*set_rate)(struct net_device *ndev);
int (*set_feature)(struct net_device *ndev, netdev_features_t features);
int (*dmac_init)(struct net_device *ndev);
diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index 4d100283c30f..193ad05383a8 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -759,7 +759,7 @@ static struct sk_buff *ravb_get_skb_gbeth(struct net_device *ndev, int entry,
}

/* Packet receive function for Gigabit Ethernet */
-static bool ravb_rx_gbeth(struct net_device *ndev, int *quota, int q)
+static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
{
struct ravb_private *priv = netdev_priv(ndev);
const struct ravb_hw_info *info = priv->info;
@@ -781,7 +781,7 @@ static bool ravb_rx_gbeth(struct net_device *ndev, int *quota, int q)
for (i = 0; i < limit; i++, priv->cur_rx[q]++) {
entry = priv->cur_rx[q] % priv->num_rx_ring[q];
desc = &priv->rx_ring[q].desc[entry];
- if (rx_packets == *quota || desc->die_dt == DT_FEMPTY)
+ if (rx_packets == budget || desc->die_dt == DT_FEMPTY)
break;

/* Descriptor type must be checked before all other reads */
@@ -882,12 +882,11 @@ static bool ravb_rx_gbeth(struct net_device *ndev, int *quota, int q)
}

stats->rx_packets += rx_packets;
- *quota -= rx_packets;
- return *quota == 0;
+ return rx_packets;
}

/* Packet receive function for Ethernet AVB */
-static bool ravb_rx_rcar(struct net_device *ndev, int *quota, int q)
+static int ravb_rx_rcar(struct net_device *ndev, int budget, int q)
{
struct ravb_private *priv = netdev_priv(ndev);
const struct ravb_hw_info *info = priv->info;
@@ -906,7 +905,7 @@ static bool ravb_rx_rcar(struct net_device *ndev, int *quota, int q)
for (i = 0; i < limit; i++, priv->cur_rx[q]++) {
entry = priv->cur_rx[q] % priv->num_rx_ring[q];
desc = &priv->rx_ring[q].ex_desc[entry];
- if (rx_packets == *quota || desc->die_dt == DT_FEMPTY)
+ if (rx_packets == budget || desc->die_dt == DT_FEMPTY)
break;

/* Descriptor type must be checked before all other reads */
@@ -992,17 +991,16 @@ static bool ravb_rx_rcar(struct net_device *ndev, int *quota, int q)
}

stats->rx_packets += rx_packets;
- *quota -= rx_packets;
- return *quota == 0;
+ return rx_packets;
}

/* Packet receive function for Ethernet AVB */
-static bool ravb_rx(struct net_device *ndev, int *quota, int q)
+static int ravb_rx(struct net_device *ndev, int budget, int q)
{
struct ravb_private *priv = netdev_priv(ndev);
const struct ravb_hw_info *info = priv->info;

- return info->receive(ndev, quota, q);
+ return info->receive(ndev, budget, q);
}

static void ravb_rcv_snd_disable(struct net_device *ndev)
@@ -1319,13 +1317,12 @@ static int ravb_poll(struct napi_struct *napi, int budget)
unsigned long flags;
int q = napi - priv->napi;
int mask = BIT(q);
- int quota = budget;
- bool unmask;
+ int work_done;

/* Processing RX Descriptor Ring */
/* Clear RX interrupt */
ravb_write(ndev, ~(mask | RIS0_RESERVED), RIS0);
- unmask = !ravb_rx(ndev, &quota, q);
+ work_done = ravb_rx(ndev, budget, q);

/* Processing TX Descriptor Ring */
spin_lock_irqsave(&priv->lock, flags);
@@ -1344,7 +1341,7 @@ static int ravb_poll(struct napi_struct *napi, int budget)
if (priv->rx_fifo_errors != ndev->stats.rx_fifo_errors)
ndev->stats.rx_fifo_errors = priv->rx_fifo_errors;

- if (!unmask)
+ if (work_done == budget)
goto out;

napi_complete(napi);
@@ -1361,7 +1358,7 @@ static int ravb_poll(struct napi_struct *napi, int budget)
spin_unlock_irqrestore(&priv->lock, flags);

out:
- return budget - quota;
+ return work_done;
}

static void ravb_set_duplex_gbeth(struct net_device *ndev)
--
2.39.2


2024-05-28 15:07:50

by Paul Barker

Subject: [net-next PATCH v4 5/7] net: ravb: Enable SW IRQ Coalescing for GbEth

Software IRQ Coalescing is required to improve network stack performance
in the RZ/G2L SoC family and the RZ/G3S SoC, i.e. the SoCs which use the
GbEth IP.

This patch gives the following improvements during testing with iperf3:

* RZ/G2L:
* TCP RX: same bandwidth with -6% CPU load (76% -> 71%)
* UDP RX: same bandwidth with -10% CPU load (99% -> 89%)

* RZ/G2UL:
* UDP RX: +4200% bandwidth (1.23Mbps -> 53Mbps)

* RZ/G3S:
* UDP RX: +425% bandwidth (1.23Mbps -> 6.46Mbps)

The improvement of UDP RX bandwidth for the single core SoCs (RZ/G2UL &
RZ/G3S) is particularly critical.

Signed-off-by: Paul Barker <[email protected]>
Reviewed-by: Sergey Shtylyov <[email protected]>
---
Changes v3->v4:
* Renamed needs_irq_coalesce -> coalesce_irqs.
* Added Sergey's Reviewed-by tag.

drivers/net/ethernet/renesas/ravb.h | 1 +
drivers/net/ethernet/renesas/ravb_main.c | 4 ++++
2 files changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
index 71de2a7aa27c..6a7aa7dd17e6 100644
--- a/drivers/net/ethernet/renesas/ravb.h
+++ b/drivers/net/ethernet/renesas/ravb.h
@@ -1054,6 +1054,7 @@ struct ravb_hw_info {
u32 rx_max_desc_use;
u32 rx_desc_size;
unsigned aligned_tx: 1;
+ unsigned coalesce_irqs:1; /* Needs software IRQ coalescing */

/* hardware features */
unsigned internal_delay:1; /* AVB-DMAC has internal delays */
diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index c9c5cc641589..e95e0b6e1fe6 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -2667,6 +2667,7 @@ static const struct ravb_hw_info gbeth_hw_info = {
.rx_max_desc_use = 4080,
.rx_desc_size = sizeof(struct ravb_rx_desc),
.aligned_tx = 1,
+ .coalesce_irqs = 1,
.tx_counters = 1,
.carrier_counters = 1,
.half_duplex = 1,
@@ -2943,6 +2944,9 @@ static int ravb_probe(struct platform_device *pdev)
if (info->nc_queues)
netif_napi_add(ndev, &priv->napi[RAVB_NC], ravb_poll);

+ if (info->coalesce_irqs)
+ netdev_sw_irq_coalesce_default_on(ndev);
+
/* Network device register */
error = register_netdev(ndev);
if (error)
--
2.39.2


2024-05-28 15:08:03

by Paul Barker

Subject: [net-next PATCH v4 4/7] net: ravb: Refactor GbEth RX code path

We can reduce code duplication in ravb_rx_gbeth().

Signed-off-by: Paul Barker <[email protected]>
---
Changes v3->v4:
* Used switch statements instead of if statements in ravb_rx_gbeth().

drivers/net/ethernet/renesas/ravb_main.c | 73 +++++++++++++-----------
1 file changed, 40 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index 7df7d2e93a3a..c9c5cc641589 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -817,47 +817,54 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
stats->rx_missed_errors++;
} else {
die_dt = desc->die_dt & 0xF0;
+ skb = ravb_get_skb_gbeth(ndev, entry, desc);
switch (die_dt) {
case DT_FSINGLE:
- skb = ravb_get_skb_gbeth(ndev, entry, desc);
+ case DT_FSTART:
+ /* Start of packet:
+ * Set initial data length.
+ */
skb_put(skb, desc_len);
+
+ /* Save this SKB if the packet spans multiple
+ * descriptors.
+ */
+ if (die_dt == DT_FSTART)
+ priv->rx_1st_skb = skb;
+ break;
+
+ case DT_FMID:
+ case DT_FEND:
+ /* Continuing a packet:
+ * Move data into the saved SKB.
+ */
+ skb_copy_to_linear_data_offset(priv->rx_1st_skb,
+ priv->rx_1st_skb->len,
+ skb->data,
+ desc_len);
+ skb_put(priv->rx_1st_skb, desc_len);
+ dev_kfree_skb(skb);
+
+ /* Set skb to point at the whole packet so that
+ * we only need one code path for finishing a
+ * packet.
+ */
+ skb = priv->rx_1st_skb;
+ }
+
+ switch (die_dt) {
+ case DT_FSINGLE:
+ case DT_FEND:
+ /* Finishing a packet:
+ * Determine protocol & checksum, hand off to
+ * NAPI and update our stats.
+ */
skb->protocol = eth_type_trans(skb, ndev);
if (ndev->features & NETIF_F_RXCSUM)
ravb_rx_csum_gbeth(skb);
+ stats->rx_bytes += skb->len;
napi_gro_receive(&priv->napi[q], skb);
rx_packets++;
- stats->rx_bytes += desc_len;
- break;
- case DT_FSTART:
- priv->rx_1st_skb = ravb_get_skb_gbeth(ndev, entry, desc);
- skb_put(priv->rx_1st_skb, desc_len);
- break;
- case DT_FMID:
- skb = ravb_get_skb_gbeth(ndev, entry, desc);
- skb_copy_to_linear_data_offset(priv->rx_1st_skb,
- priv->rx_1st_skb->len,
- skb->data,
- desc_len);
- skb_put(priv->rx_1st_skb, desc_len);
- dev_kfree_skb(skb);
- break;
- case DT_FEND:
- skb = ravb_get_skb_gbeth(ndev, entry, desc);
- skb_copy_to_linear_data_offset(priv->rx_1st_skb,
- priv->rx_1st_skb->len,
- skb->data,
- desc_len);
- skb_put(priv->rx_1st_skb, desc_len);
- dev_kfree_skb(skb);
- priv->rx_1st_skb->protocol =
- eth_type_trans(priv->rx_1st_skb, ndev);
- if (ndev->features & NETIF_F_RXCSUM)
- ravb_rx_csum_gbeth(priv->rx_1st_skb);
- stats->rx_bytes += priv->rx_1st_skb->len;
- napi_gro_receive(&priv->napi[q],
- priv->rx_1st_skb);
- rx_packets++;
- break;
}
}
}
--
2.39.2


2024-05-28 15:08:06

by Paul Barker

Subject: [net-next PATCH v4 6/7] net: ravb: Use NAPI threaded mode on 1-core CPUs with GbEth IP

NAPI Threaded mode (along with the previously enabled SW IRQ Coalescing)
is required to improve network stack performance for single core SoCs
using the GbEth IP (currently the RZ/G2L SoC family and the RZ/G3S SoC).

This patch gives the following improvements during testing with iperf3.

* RZ/G2UL:
* TCP TX: +32% bandwidth (638Mbps -> 841Mbps)
* TCP RX: +8.8% bandwidth (667Mbps -> 726Mbps)
* UDP RX: +104% bandwidth (53Mbps -> 108Mbps)

* RZ/G3S:
* TCP TX: +29% bandwidth (529Mbps -> 681Mbps)
* UDP RX: +1290% bandwidth (6.46Mbps -> 90Mbps)

* RZ/Five:
* UDP RX: Test no longer crashes (0 -> 20Mbps)

This patch gives the following reductions in performance in the same
testing:

* RZ/G2UL:
* UDP TX: -7.5% bandwidth (594Mbps -> 549Mbps)

* RZ/G3S:
* UDP TX: -5% bandwidth (625Mbps -> 594Mbps)

These losses are considered acceptable given the benefits shown above.
If UDP TX bandwidth must be maximised for a particular use case, NAPI
threaded mode can be disabled at runtime via sysfs writes.

The improvement of UDP RX bandwidth for the single core SoCs (RZ/G2UL &
RZ/G3S) is particularly critical.

Signed-off-by: Paul Barker <[email protected]>
Reviewed-by: Sergey Shtylyov <[email protected]>
---
Changes v3->v4:
* Added Sergey's Reviewed-by tag.

drivers/net/ethernet/renesas/ravb_main.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index e95e0b6e1fe6..dd92f074881a 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -2944,8 +2944,11 @@ static int ravb_probe(struct platform_device *pdev)
if (info->nc_queues)
netif_napi_add(ndev, &priv->napi[RAVB_NC], ravb_poll);

- if (info->coalesce_irqs)
+ if (info->coalesce_irqs) {
netdev_sw_irq_coalesce_default_on(ndev);
+ if (num_present_cpus() == 1)
+ dev_set_threaded(ndev, true);
+ }

/* Network device register */
error = register_netdev(ndev);
--
2.39.2


2024-05-28 15:08:51

by Paul Barker

Subject: [net-next PATCH v4 7/7] net: ravb: Allocate RX buffers via page pool

This patch makes multiple changes that can't be separated:

1) Allocate plain RX buffers via a page pool instead of allocating
SKBs, then use build_skb() when a packet is received.
2) For GbEth IP, reduce the RX buffer size to 2kB.
3) For GbEth IP, merge packets which span more than one RX descriptor
as SKB fragments instead of copying data.

Implementing (1) without (2) would require the use of an order-1 page
pool (instead of an order-0 page pool split into page fragments) for
GbEth.

Implementing (2) without (3) would leave us no space to re-assemble
packets which span more than one RX descriptor.

Implementing (3) without (1) would not be possible as the network stack
expects to use put_page() or page_pool_put_page() to free SKB fragments
after an SKB is consumed.
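
A condensed sketch of the resulting GbEth RX flow (taken from the diff
below, with error handling and DMA syncs omitted):

	/* Each descriptor's buffer is a fragment of an order-0 page pool page. */
	size = info->rx_buffer_size;	/* 2kB for GbEth */
	page = page_pool_alloc(priv->rx_pool[q], &offset, &size, gfp_mask);

	/* First descriptor of a packet: build an skb around the buffer. */
	skb = napi_build_skb(page_address(page) + offset, info->rx_buffer_size);
	skb_mark_for_recycle(skb);
	skb_put(skb, desc_len);

	/* Subsequent descriptors of the same packet: attach the buffer as a
	 * page fragment instead of copying the data.
	 */
	skb_add_rx_frag(priv->rx_1st_skb, skb_shinfo(priv->rx_1st_skb)->nr_frags,
			page, offset, desc_len, info->rx_buffer_size);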

RX checksum offload support is adjusted to handle both linear and
nonlinear (fragmented) packets.

This patch gives the following improvements during testing with iperf3.

* RZ/G2L:
* TCP RX: same bandwidth at -43% CPU load (70% -> 40%)
* UDP RX: same bandwidth at -17% CPU load (88% -> 74%)

* RZ/G2UL:
* TCP RX: +30% bandwidth (726Mbps -> 941Mbps)
* UDP RX: +417% bandwidth (108Mbps -> 558Mbps)

* RZ/G3S:
* TCP RX: +64% bandwidth (562Mbps -> 920Mbps)
* UDP RX: +420% bandwidth (90Mbps -> 468Mbps)

* RZ/Five:
* TCP RX: +217% bandwidth (145Mbps -> 459Mbps)
* UDP RX: +470% bandwidth (20Mbps -> 114Mbps)

There is no significant impact on bandwidth or CPU load in testing on
RZ/G2H or R-Car M3N.

Signed-off-by: Paul Barker <[email protected]>
---
Changes v3->v4:
* Used a separate page pool for each RX queue.
* Passed struct ravb_rx_desc to ravb_alloc_rx_buffer() so that we can
simplify the calling function.
* Explained the calculation of rx_desc->ds_cc.
* Added handling of nonlinear SKBs in ravb_rx_csum_gbeth().

drivers/net/ethernet/renesas/ravb.h | 10 +-
drivers/net/ethernet/renesas/ravb_main.c | 230 ++++++++++++++---------
2 files changed, 146 insertions(+), 94 deletions(-)

diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
index 6a7aa7dd17e6..f2091a17fcf7 100644
--- a/drivers/net/ethernet/renesas/ravb.h
+++ b/drivers/net/ethernet/renesas/ravb.h
@@ -1050,8 +1050,8 @@ struct ravb_hw_info {
netdev_features_t net_features;
int stats_len;
u32 tccr_mask;
+ u32 rx_buffer_size;
u32 rx_max_frame_size;
- u32 rx_max_desc_use;
u32 rx_desc_size;
unsigned aligned_tx: 1;
unsigned coalesce_irqs:1; /* Needs software IRQ coalescing */
@@ -1071,6 +1071,11 @@ struct ravb_hw_info {
unsigned half_duplex:1; /* E-MAC supports half duplex mode */
};

+struct ravb_rx_buffer {
+ struct page *page;
+ unsigned int offset;
+};
+
struct ravb_private {
struct net_device *ndev;
struct platform_device *pdev;
@@ -1094,7 +1099,8 @@ struct ravb_private {
struct ravb_tx_desc *tx_ring[NUM_TX_QUEUE];
void *tx_align[NUM_TX_QUEUE];
struct sk_buff *rx_1st_skb;
- struct sk_buff **rx_skb[NUM_RX_QUEUE];
+ struct page_pool *rx_pool[NUM_RX_QUEUE];
+ struct ravb_rx_buffer *rx_buffers[NUM_RX_QUEUE];
struct sk_buff **tx_skb[NUM_TX_QUEUE];
u32 rx_over_errors;
u32 rx_fifo_errors;
diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index dd92f074881a..bb7f7d44be6e 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -30,6 +30,7 @@
#include <linux/reset.h>
#include <linux/math64.h>
#include <net/ip.h>
+#include <net/page_pool/helpers.h>

#include "ravb.h"

@@ -113,25 +114,6 @@ static void ravb_set_rate_rcar(struct net_device *ndev)
}
}

-static struct sk_buff *
-ravb_alloc_skb(struct net_device *ndev, const struct ravb_hw_info *info,
- gfp_t gfp_mask)
-{
- struct sk_buff *skb;
- u32 reserve;
-
- skb = __netdev_alloc_skb(ndev, info->rx_max_frame_size + RAVB_ALIGN - 1,
- gfp_mask);
- if (!skb)
- return NULL;
-
- reserve = (unsigned long)skb->data & (RAVB_ALIGN - 1);
- if (reserve)
- skb_reserve(skb, RAVB_ALIGN - reserve);
-
- return skb;
-}
-
/* Get MAC address from the MAC address registers
*
* Ethernet AVB device doesn't have ROM for MAC address.
@@ -257,21 +239,10 @@ static void ravb_rx_ring_free(struct net_device *ndev, int q)
{
struct ravb_private *priv = netdev_priv(ndev);
unsigned int ring_size;
- unsigned int i;

if (!priv->rx_ring[q].raw)
return;

- for (i = 0; i < priv->num_rx_ring[q]; i++) {
- struct ravb_rx_desc *desc = ravb_rx_get_desc(priv, q, i);
-
- if (!dma_mapping_error(ndev->dev.parent,
- le32_to_cpu(desc->dptr)))
- dma_unmap_single(ndev->dev.parent,
- le32_to_cpu(desc->dptr),
- priv->info->rx_max_frame_size,
- DMA_FROM_DEVICE);
- }
ring_size = priv->info->rx_desc_size * (priv->num_rx_ring[q] + 1);
dma_free_coherent(ndev->dev.parent, ring_size, priv->rx_ring[q].raw,
priv->rx_desc_dma[q]);
@@ -298,13 +269,14 @@ static void ravb_ring_free(struct net_device *ndev, int q)
priv->tx_ring[q] = NULL;
}

- /* Free RX skb ringbuffer */
- if (priv->rx_skb[q]) {
- for (i = 0; i < priv->num_rx_ring[q]; i++)
- dev_kfree_skb(priv->rx_skb[q][i]);
+ /* Free RX buffers */
+ for (i = 0; i < priv->num_rx_ring[q]; i++) {
+ if (priv->rx_buffers[q][i].page)
+ page_pool_put_page(priv->rx_pool[q], priv->rx_buffers[q][i].page, 0, true);
}
- kfree(priv->rx_skb[q]);
- priv->rx_skb[q] = NULL;
+ kfree(priv->rx_buffers[q]);
+ priv->rx_buffers[q] = NULL;
+ page_pool_destroy(priv->rx_pool[q]);

/* Free aligned TX buffers */
kfree(priv->tx_align[q]);
@@ -317,35 +289,56 @@ static void ravb_ring_free(struct net_device *ndev, int q)
priv->tx_skb[q] = NULL;
}

+static int
+ravb_alloc_rx_buffer(struct net_device *ndev, int q, u32 entry, gfp_t gfp_mask,
+ struct ravb_rx_desc *rx_desc)
+{
+ struct ravb_private *priv = netdev_priv(ndev);
+ const struct ravb_hw_info *info = priv->info;
+ struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
+ dma_addr_t dma_addr;
+ unsigned int size;
+
+ size = info->rx_buffer_size;
+ rx_buff->page = page_pool_alloc(priv->rx_pool[q], &rx_buff->offset, &size,
+ gfp_mask);
+ if (unlikely(!rx_buff->page)) {
+ /* We just set the data size to 0 for a failed mapping
+ * which should prevent DMA from happening...
+ */
+ rx_desc->ds_cc = cpu_to_le16(0);
+ return -ENOMEM;
+ }
+
+ dma_addr = page_pool_get_dma_addr(rx_buff->page) + rx_buff->offset;
+ dma_sync_single_for_device(ndev->dev.parent, dma_addr,
+ info->rx_buffer_size, DMA_FROM_DEVICE);
+ rx_desc->dptr = cpu_to_le32(dma_addr);
+
+ /* The end of the RX buffer is used to store skb shared data, so we need
+ * to ensure that the hardware leaves enough space for this.
+ */
+ rx_desc->ds_cc = cpu_to_le16(info->rx_buffer_size
+ - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
+ - ETH_FCS_LEN + sizeof(__sum16));
+ return 0;
+}
+
static u32
ravb_rx_ring_refill(struct net_device *ndev, int q, u32 count, gfp_t gfp_mask)
{
struct ravb_private *priv = netdev_priv(ndev);
- const struct ravb_hw_info *info = priv->info;
struct ravb_rx_desc *rx_desc;
- dma_addr_t dma_addr;
u32 i, entry;

for (i = 0; i < count; i++) {
entry = (priv->dirty_rx[q] + i) % priv->num_rx_ring[q];
rx_desc = ravb_rx_get_desc(priv, q, entry);
- rx_desc->ds_cc = cpu_to_le16(info->rx_max_desc_use);

- if (!priv->rx_skb[q][entry]) {
- priv->rx_skb[q][entry] = ravb_alloc_skb(ndev, info, gfp_mask);
- if (!priv->rx_skb[q][entry])
+ if (!priv->rx_buffers[q][entry].page) {
+ if (unlikely(ravb_alloc_rx_buffer(ndev, q, entry,
+ gfp_mask, rx_desc)))
break;
- dma_addr = dma_map_single(ndev->dev.parent,
- priv->rx_skb[q][entry]->data,
- priv->info->rx_max_frame_size,
- DMA_FROM_DEVICE);
- skb_checksum_none_assert(priv->rx_skb[q][entry]);
- /* We just set the data size to 0 for a failed mapping
- * which should prevent DMA from happening...
- */
- if (dma_mapping_error(ndev->dev.parent, dma_addr))
- rx_desc->ds_cc = cpu_to_le16(0);
- rx_desc->dptr = cpu_to_le32(dma_addr);
}
/* Descriptor type must be set after all the above writes */
dma_wmb();
@@ -423,15 +416,32 @@ static int ravb_ring_init(struct net_device *ndev, int q)
{
struct ravb_private *priv = netdev_priv(ndev);
unsigned int num_tx_desc = priv->num_tx_desc;
+ struct page_pool_params params = {
+ .order = 0,
+ .flags = PP_FLAG_DMA_MAP,
+ .pool_size = priv->num_rx_ring[q],
+ .nid = NUMA_NO_NODE,
+ .dev = ndev->dev.parent,
+ .dma_dir = DMA_FROM_DEVICE,
+ };
unsigned int ring_size;
u32 num_filled;

- /* Allocate RX and TX skb rings */
- priv->rx_skb[q] = kcalloc(priv->num_rx_ring[q],
- sizeof(*priv->rx_skb[q]), GFP_KERNEL);
+ /* Allocate RX page pool and buffers */
+ priv->rx_pool[q] = page_pool_create(&params);
+ if (IS_ERR(priv->rx_pool[q]))
+ goto error;
+
+ /* Allocate RX buffers */
+ priv->rx_buffers[q] = kcalloc(priv->num_rx_ring[q],
+ sizeof(*priv->rx_buffers[q]), GFP_KERNEL);
+ if (!priv->rx_buffers[q])
+ goto error;
+
+ /* Allocate TX skb rings */
priv->tx_skb[q] = kcalloc(priv->num_tx_ring[q],
sizeof(*priv->tx_skb[q]), GFP_KERNEL);
- if (!priv->rx_skb[q] || !priv->tx_skb[q])
+ if (!priv->tx_skb[q])
goto error;

/* Allocate all RX descriptors. */
@@ -717,7 +727,9 @@ static void ravb_get_tx_tstamp(struct net_device *ndev)

static void ravb_rx_csum_gbeth(struct sk_buff *skb)
{
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
__wsum csum_ip_hdr, csum_proto;
+ skb_frag_t *last_frag;
u8 *hw_csum;

/* The hardware checksum status is contained in sizeof(__sum16) * 2 = 4
@@ -727,12 +739,22 @@ static void ravb_rx_csum_gbeth(struct sk_buff *skb)
if (unlikely(skb->len < sizeof(__sum16) * 2))
return;

- hw_csum = skb_tail_pointer(skb) - sizeof(__sum16);
+ if (skb_is_nonlinear(skb)) {
+ last_frag = &shinfo->frags[shinfo->nr_frags - 1];
+ hw_csum = skb_frag_address(last_frag) + skb_frag_size(last_frag) - sizeof(__sum16);
+ } else {
+ hw_csum = skb_tail_pointer(skb) - sizeof(__sum16);
+ }
+
csum_proto = csum_unfold((__force __sum16)get_unaligned_le16(hw_csum));

hw_csum -= sizeof(__sum16);
csum_ip_hdr = csum_unfold((__force __sum16)get_unaligned_le16(hw_csum));
- skb_trim(skb, skb->len - 2 * sizeof(__sum16));
+
+ if (skb_is_nonlinear(skb))
+ skb_frag_size_sub(last_frag, 2 * sizeof(__sum16));
+ else
+ skb_trim(skb, skb->len - 2 * sizeof(__sum16));

/* TODO: IPV6 Rx checksum */
if (skb->protocol == htons(ETH_P_IP) && !csum_ip_hdr && !csum_proto)
@@ -754,25 +776,11 @@ static void ravb_rx_csum(struct sk_buff *skb)
skb_trim(skb, skb->len - sizeof(__sum16));
}

-static struct sk_buff *ravb_get_skb_gbeth(struct net_device *ndev, int entry,
- struct ravb_rx_desc *desc)
-{
- struct ravb_private *priv = netdev_priv(ndev);
- struct sk_buff *skb;
-
- skb = priv->rx_skb[RAVB_BE][entry];
- priv->rx_skb[RAVB_BE][entry] = NULL;
- dma_unmap_single(ndev->dev.parent, le32_to_cpu(desc->dptr),
- ALIGN(priv->info->rx_max_frame_size, 16),
- DMA_FROM_DEVICE);
-
- return skb;
-}
-
/* Packet receive function for Gigabit Ethernet */
static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
{
struct ravb_private *priv = netdev_priv(ndev);
+ const struct ravb_hw_info *info = priv->info;
struct net_device_stats *stats;
struct ravb_rx_desc *desc;
struct sk_buff *skb;
@@ -816,14 +824,26 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
if (desc_status & MSC_CEEF)
stats->rx_missed_errors++;
} else {
+ struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
+ void *rx_addr = page_address(rx_buff->page) + rx_buff->offset;
die_dt = desc->die_dt & 0xF0;
- skb = ravb_get_skb_gbeth(ndev, entry, desc);
+ dma_sync_single_for_cpu(ndev->dev.parent, le32_to_cpu(desc->dptr),
+ desc_len, DMA_FROM_DEVICE);
+
switch (die_dt) {
case DT_FSINGLE:
case DT_FSTART:
/* Start of packet:
- * Set initial data length.
+ * Prepare an SKB and add initial data.
*/
+ skb = napi_build_skb(rx_addr, info->rx_buffer_size);
+ if (unlikely(!skb)) {
+ stats->rx_errors++;
+ page_pool_put_page(priv->rx_pool[q],
+ rx_buff->page, 0, true);
+ break;
+ }
+ skb_mark_for_recycle(skb);
skb_put(skb, desc_len);

/* Save this SKB if the packet spans multiple
@@ -836,14 +856,23 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
case DT_FMID:
case DT_FEND:
/* Continuing a packet:
- * Move data into the saved SKB.
+ * Add this buffer as an RX frag.
*/
- skb_copy_to_linear_data_offset(priv->rx_1st_skb,
- priv->rx_1st_skb->len,
- skb->data,
- desc_len);
- skb_put(priv->rx_1st_skb, desc_len);
- dev_kfree_skb(skb);
+
+ /* rx_1st_skb will be NULL if napi_build_skb()
+ * failed for the first descriptor of a
+ * multi-descriptor packet.
+ */
+ if (unlikely(!priv->rx_1st_skb)) {
+ stats->rx_errors++;
+ page_pool_put_page(priv->rx_pool[q],
+ rx_buff->page, 0, true);
+ break;
+ }
+ skb_add_rx_frag(priv->rx_1st_skb,
+ skb_shinfo(priv->rx_1st_skb)->nr_frags,
+ rx_buff->page, rx_buff->offset,
+ desc_len, info->rx_buffer_size);

/* Set skb to point at the whole packet so that
* we only need one code path for finishing a
@@ -865,7 +894,16 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
stats->rx_bytes += skb->len;
napi_gro_receive(&priv->napi[q], skb);
rx_packets++;
+
+ /* Clear rx_1st_skb so that it will only be
+ * non-NULL when valid.
+ */
+ if (die_dt == DT_FEND)
+ priv->rx_1st_skb = NULL;
}
+
+ /* Mark this RX buffer as consumed. */
+ rx_buff->page = NULL;
}
}

@@ -882,6 +920,7 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
static int ravb_rx_rcar(struct net_device *ndev, int budget, int q)
{
struct ravb_private *priv = netdev_priv(ndev);
+ const struct ravb_hw_info *info = priv->info;
struct net_device_stats *stats = &priv->stats[q];
struct ravb_ex_rx_desc *desc;
unsigned int limit, i;
@@ -923,13 +962,20 @@ static int ravb_rx_rcar(struct net_device *ndev, int budget, int q)
if (desc_status & MSC_CEEF)
stats->rx_missed_errors++;
} else {
+ struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
+ void *rx_addr = page_address(rx_buff->page) + rx_buff->offset;
u32 get_ts = priv->tstamp_rx_ctrl & RAVB_RXTSTAMP_TYPE;

- skb = priv->rx_skb[q][entry];
- priv->rx_skb[q][entry] = NULL;
- dma_unmap_single(ndev->dev.parent, le32_to_cpu(desc->dptr),
- priv->info->rx_max_frame_size,
- DMA_FROM_DEVICE);
+ skb = napi_build_skb(rx_addr, info->rx_buffer_size);
+ if (unlikely(!skb)) {
+ stats->rx_errors++;
+ page_pool_put_page(priv->rx_pool[q], rx_buff->page, 0, true);
+ break;
+ }
+ dma_sync_single_for_cpu(ndev->dev.parent, le32_to_cpu(desc->dptr),
+ pkt_len, DMA_FROM_DEVICE);
+ rx_buff->page = NULL;
+ skb_mark_for_recycle(skb);
get_ts &= (q == RAVB_NC) ?
RAVB_RXTSTAMP_TYPE_V2_L2_EVENT :
~RAVB_RXTSTAMP_TYPE_V2_L2_EVENT;
@@ -2595,8 +2641,8 @@ static const struct ravb_hw_info ravb_gen3_hw_info = {
.net_features = NETIF_F_RXCSUM,
.stats_len = ARRAY_SIZE(ravb_gstrings_stats),
.tccr_mask = TCCR_TSRQ0 | TCCR_TSRQ1 | TCCR_TSRQ2 | TCCR_TSRQ3,
+ .rx_buffer_size = SZ_2K + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
.rx_max_frame_size = SZ_2K,
- .rx_max_desc_use = SZ_2K - ETH_FCS_LEN + sizeof(__sum16),
.rx_desc_size = sizeof(struct ravb_ex_rx_desc),
.internal_delay = 1,
.tx_counters = 1,
@@ -2619,8 +2665,8 @@ static const struct ravb_hw_info ravb_gen2_hw_info = {
.net_features = NETIF_F_RXCSUM,
.stats_len = ARRAY_SIZE(ravb_gstrings_stats),
.tccr_mask = TCCR_TSRQ0 | TCCR_TSRQ1 | TCCR_TSRQ2 | TCCR_TSRQ3,
+ .rx_buffer_size = SZ_2K + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
.rx_max_frame_size = SZ_2K,
- .rx_max_desc_use = SZ_2K - ETH_FCS_LEN + sizeof(__sum16),
.rx_desc_size = sizeof(struct ravb_ex_rx_desc),
.aligned_tx = 1,
.gptp = 1,
@@ -2640,8 +2686,8 @@ static const struct ravb_hw_info ravb_rzv2m_hw_info = {
.net_features = NETIF_F_RXCSUM,
.stats_len = ARRAY_SIZE(ravb_gstrings_stats),
.tccr_mask = TCCR_TSRQ0 | TCCR_TSRQ1 | TCCR_TSRQ2 | TCCR_TSRQ3,
+ .rx_buffer_size = SZ_2K + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
.rx_max_frame_size = SZ_2K,
- .rx_max_desc_use = SZ_2K - ETH_FCS_LEN + sizeof(__sum16),
.rx_desc_size = sizeof(struct ravb_ex_rx_desc),
.multi_irqs = 1,
.err_mgmt_irqs = 1,
@@ -2663,8 +2709,8 @@ static const struct ravb_hw_info gbeth_hw_info = {
.net_features = NETIF_F_RXCSUM | NETIF_F_HW_CSUM,
.stats_len = ARRAY_SIZE(ravb_gstrings_stats_gbeth),
.tccr_mask = TCCR_TSRQ0,
+ .rx_buffer_size = SZ_2K,
.rx_max_frame_size = SZ_8K,
- .rx_max_desc_use = 4080,
.rx_desc_size = sizeof(struct ravb_rx_desc),
.aligned_tx = 1,
.coalesce_irqs = 1,
--
2.39.2


2024-05-28 16:37:55

by Sergey Shtylyov

Subject: Re: [net-next PATCH v4 1/7] net: ravb: Simplify poll & receive functions

On 5/28/24 6:03 PM, Paul Barker wrote:

> We don't need to pass the work budget to ravb_rx() by reference, it's
> cleaner to pass this by value and return the amount of work done. This
> allows us to simplify the ravb_poll() function and use the common
> `work_done` variable name seen in other network drivers for consistency
> and ease of understanding.
>
> This is a pure refactor and should not affect behaviour.
>
> Signed-off-by: Paul Barker <[email protected]>

[...]

Reviewed-by: Sergey Shtylyov <[email protected]>

MBR, Sergey

2024-05-28 20:50:46

by Sergey Shtylyov

Subject: Re: [net-next PATCH v4 3/7] net: ravb: Refactor RX ring refill

On 5/28/24 6:03 PM, Paul Barker wrote:

> To reduce code duplication, we add a new RX ring refill function which
> can handle both the initial RX ring population (which was split between
> ravb_ring_init() and ravb_ring_format()) and the RX ring refill after
> polling (in ravb_rx()).
>
> Signed-off-by: Paul Barker <[email protected]>

Looks sane...

Reviewed-by: Sergey Shtylyov <[email protected]>

[...]

MBR, Sergey

2024-05-29 18:30:43

by Sergey Shtylyov

Subject: Re: [net-next PATCH v4 4/7] net: ravb: Refactor GbEth RX code path

On 5/28/24 6:03 PM, Paul Barker wrote:

> We can reduce code duplication in ravb_rx_gbeth().
>
> Signed-off-by: Paul Barker <[email protected]>
[...]

> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
> index 7df7d2e93a3a..c9c5cc641589 100644
> --- a/drivers/net/ethernet/renesas/ravb_main.c
> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> @@ -817,47 +817,54 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
> stats->rx_missed_errors++;
> } else {
> die_dt = desc->die_dt & 0xF0;
> + skb = ravb_get_skb_gbeth(ndev, entry, desc);
> switch (die_dt) {

Why not do this instead (as I've asked you already):

case DT_FSTART:
priv->rx_1st_skb = skb;
fallthrough;

> case DT_FSINGLE:
> - skb = ravb_get_skb_gbeth(ndev, entry, desc);


> + case DT_FSTART:
> + /* Start of packet:
> + * Set initial data length.
> + */

Please consider turning that block comment into one-liner...

> skb_put(skb, desc_len);
> +
> + /* Save this SKB if the packet spans multiple
> + * descriptors.
> + */

This one too...
(The current line length limit is 100 columns.)

> + if (die_dt == DT_FSTART)
> + priv->rx_1st_skb = skb;

This needs to be done under *case* DT_FSTART above instead...

> + break;
> +
> + case DT_FMID:
> + case DT_FEND:
> + /* Continuing a packet:
> + * Move data into the saved SKB.
> + */
> + skb_copy_to_linear_data_offset(priv->rx_1st_skb,
> + priv->rx_1st_skb->len,
> + skb->data,
> + desc_len);
> + skb_put(priv->rx_1st_skb, desc_len);
> + dev_kfree_skb(skb);
> +
> + /* Set skb to point at the whole packet so that

Please call it consistently, either SKB or skb (I prefer this one).

> + * we only need one code path for finishing a
> + * packet.
> + */
> + skb = priv->rx_1st_skb;
> + }
> +
> + switch (die_dt) {
> + case DT_FSINGLE:
> + case DT_FEND:
> + /* Finishing a packet:
> + * Determine protocol & checksum, hand off to
> + * NAPI and update our stats.
> + */
> skb->protocol = eth_type_trans(skb, ndev);
> if (ndev->features & NETIF_F_RXCSUM)
> ravb_rx_csum_gbeth(skb);
> + stats->rx_bytes += skb->len;
> napi_gro_receive(&priv->napi[q], skb);
> rx_packets++;

Otherwise, this is a very good patch! Sorry for letting in the duplicate
code earlier! :-)

[...]

MBR, Sergey

2024-05-29 19:07:30

by Paul Barker

Subject: Re: [net-next PATCH v4 4/7] net: ravb: Refactor GbEth RX code path

On 29/05/2024 19:30, Sergey Shtylyov wrote:
> On 5/28/24 6:03 PM, Paul Barker wrote:
>
>> We can reduce code duplication in ravb_rx_gbeth().
>>
>> Signed-off-by: Paul Barker <[email protected]>
> [...]
>
>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>> index 7df7d2e93a3a..c9c5cc641589 100644
>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
>> @@ -817,47 +817,54 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>> stats->rx_missed_errors++;
>> } else {
>> die_dt = desc->die_dt & 0xF0;
>> + skb = ravb_get_skb_gbeth(ndev, entry, desc);
>> switch (die_dt) {
>
> Why not do this instead (as I've asked you already):
>
> case DT_FSTART:
> priv->rx_1st_skb = skb;
> fallthrough;

I've avoided that change to keep patch 7/7 simpler (as we have to move
the assignment of skb later in that patch). I can change this if you
want though.

>
>> case DT_FSINGLE:
>> - skb = ravb_get_skb_gbeth(ndev, entry, desc);
>
>
>> + case DT_FSTART:
>> + /* Start of packet:
>> + * Set initial data length.
>> + */
>
> Please consider turning that block comment into one-liner...

Ack.

>
>> skb_put(skb, desc_len);
>> +
>> + /* Save this SKB if the packet spans multiple
>> + * descriptors.
>> + */
>
> This one too...
> (The current line length limit is 100 columns.)

Ack. I'll re-flow some other lines with a 100 col limit as well - I'm
immediately thinking of the skb_copy_to_linear_data_offset call below.

>
>> + if (die_dt == DT_FSTART)
>> + priv->rx_1st_skb = skb;
>
> This needs to be done under *case* DT_FSTART above instead...

See above comment. We can do this under DT_FSTART in this patch if you
want, but this if condition will then come back in a revised patch 7/7.

>
>> + break;
>> +
>> + case DT_FMID:
>> + case DT_FEND:
>> + /* Continuing a packet:
>> + * Move data into the saved SKB.
>> + */
>> + skb_copy_to_linear_data_offset(priv->rx_1st_skb,
>> + priv->rx_1st_skb->len,
>> + skb->data,
>> + desc_len);
>> + skb_put(priv->rx_1st_skb, desc_len);
>> + dev_kfree_skb(skb);
>> +
>> + /* Set skb to point at the whole packet so that
>
> Please call it consistently, either SKB or skb (I prefer this one).

Ack.

>
>> + * we only need one code path for finishing a
>> + * packet.
>> + */
>> + skb = priv->rx_1st_skb;
>> + }
>> +
>> + switch (die_dt) {
>> + case DT_FSINGLE:
>> + case DT_FEND:
>> + /* Finishing a packet:
>> + * Determine protocol & checksum, hand off to
>> + * NAPI and update our stats.
>> + */
>> skb->protocol = eth_type_trans(skb, ndev);
>> if (ndev->features & NETIF_F_RXCSUM)
>> ravb_rx_csum_gbeth(skb);
>> + stats->rx_bytes += skb->len;
>> napi_gro_receive(&priv->napi[q], skb);
>> rx_packets++;
>
> Otherwise, this is a very good patch! Sorry for letting in the duplicate
> code earlier! :-)
>
> [...]
>
> MBR, Sergey

Thanks for the review!

--
Paul Barker



2024-05-29 20:53:22

by Sergey Shtylyov

Subject: Re: [net-next PATCH v4 7/7] net: ravb: Allocate RX buffers via page pool

On 5/28/24 6:03 PM, Paul Barker wrote:

> This patch makes multiple changes that can't be separated:
>
> 1) Allocate plain RX buffers via a page pool instead of allocating
> SKBs, then use build_skb() when a packet is received.
> 2) For GbEth IP, reduce the RX buffer size to 2kB.
> 3) For GbEth IP, merge packets which span more than one RX descriptor
> as SKB fragments instead of copying data.
>
> Implementing (1) without (2) would require the use of an order-1 page
> pool (instead of an order-0 page pool split into page fragments) for
> GbEth.
>
> Implementing (2) without (3) would leave us no space to re-assemble
> packets which span more than one RX descriptor.
>
> Implementing (3) without (1) would not be possible as the network stack
> expects to use put_page() or page_pool_put_page() to free SKB fragments
> after an SKB is consumed.
>
> RX checksum offload support is adjusted to handle both linear and
> nonlinear (fragmented) packets.
>
> This patch gives the following improvements during testing with iperf3.
>
> * RZ/G2L:
> * TCP RX: same bandwidth at -43% CPU load (70% -> 40%)
> * UDP RX: same bandwidth at -17% CPU load (88% -> 74%)
>
> * RZ/G2UL:
> * TCP RX: +30% bandwidth (726Mbps -> 941Mbps)
> * UDP RX: +417% bandwidth (108Mbps -> 558Mbps)
>
> * RZ/G3S:
> * TCP RX: +64% bandwidth (562Mbps -> 920Mbps)
> * UDP RX: +420% bandwidth (90Mbps -> 468Mbps)
>
> * RZ/Five:
> * TCP RX: +217% bandwidth (145Mbps -> 459Mbps)
> * UDP RX: +470% bandwidth (20Mbps -> 114Mbps)
>
> There is no significant impact on bandwidth or CPU load in testing on
> RZ/G2H or R-Car M3N.
>
> Signed-off-by: Paul Barker <[email protected]>
> ---
> Changes v3->v4:
> * Used a separate page pool for each RX queue.
> * Passed struct ravb_rx_desc to ravb_alloc_rx_buffer() so that we can
> simplify the calling function.
> * Explained the calculation of rx_desc->ds_cc.
> * Added handling of nonlinear SKBs in ravb_rx_csum_gbeth().
>
> drivers/net/ethernet/renesas/ravb.h | 10 +-
> drivers/net/ethernet/renesas/ravb_main.c | 230 ++++++++++++++---------
> 2 files changed, 146 insertions(+), 94 deletions(-)
>
> diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
> index 6a7aa7dd17e6..f2091a17fcf7 100644
> --- a/drivers/net/ethernet/renesas/ravb.h
> +++ b/drivers/net/ethernet/renesas/ravb.h
[...]
> @@ -1094,7 +1099,8 @@ struct ravb_private {
> struct ravb_tx_desc *tx_ring[NUM_TX_QUEUE];
> void *tx_align[NUM_TX_QUEUE];
> struct sk_buff *rx_1st_skb;
> - struct sk_buff **rx_skb[NUM_RX_QUEUE];
> + struct page_pool *rx_pool[NUM_RX_QUEUE];

Don't we need #include <net/page_pool/types.h>

[...]
> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
> index dd92f074881a..bb7f7d44be6e 100644
> --- a/drivers/net/ethernet/renesas/ravb_main.c
> +++ b/drivers/net/ethernet/renesas/ravb_main.c
[...]
> @@ -317,35 +289,56 @@ static void ravb_ring_free(struct net_device *ndev, int q)
> priv->tx_skb[q] = NULL;
> }
>
> +static int
> +ravb_alloc_rx_buffer(struct net_device *ndev, int q, u32 entry, gfp_t gfp_mask,
> + struct ravb_rx_desc *rx_desc)
> +{
> + struct ravb_private *priv = netdev_priv(ndev);
> + const struct ravb_hw_info *info = priv->info;
> + struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
> + dma_addr_t dma_addr;
> + unsigned int size;
> +
> + size = info->rx_buffer_size;
> + rx_buff->page = page_pool_alloc(priv->rx_pool[q], &rx_buff->offset, &size,
> + gfp_mask);
> + if (unlikely(!rx_buff->page)) {
> + /* We just set the data size to 0 for a failed mapping
> + * which should prevent DMA from happening...
> + */
> + rx_desc->ds_cc = cpu_to_le16(0);
> + return -ENOMEM;
> + }
> +
> + dma_addr = page_pool_get_dma_addr(rx_buff->page) + rx_buff->offset;
> + dma_sync_single_for_device(ndev->dev.parent, dma_addr,
> + info->rx_buffer_size, DMA_FROM_DEVICE);

Do we really need this call?

> + rx_desc->dptr = cpu_to_le32(dma_addr);
> +
> + /* The end of the RX buffer is used to store skb shared data, so we need
> + * to ensure that the hardware leaves enough space for this.
> + */
> + rx_desc->ds_cc = cpu_to_le16(info->rx_buffer_size
> + - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))

Please leave the - operator on the previous line...

> + - ETH_FCS_LEN + sizeof(__sum16));

Here as well...

> + return 0;
> +}
> +
> static u32
> ravb_rx_ring_refill(struct net_device *ndev, int q, u32 count, gfp_t gfp_mask)
> {
> struct ravb_private *priv = netdev_priv(ndev);
> - const struct ravb_hw_info *info = priv->info;
> struct ravb_rx_desc *rx_desc;
> - dma_addr_t dma_addr;
> u32 i, entry;
>
> for (i = 0; i < count; i++) {
> entry = (priv->dirty_rx[q] + i) % priv->num_rx_ring[q];
> rx_desc = ravb_rx_get_desc(priv, q, entry);
> - rx_desc->ds_cc = cpu_to_le16(info->rx_max_desc_use);
>
> - if (!priv->rx_skb[q][entry]) {
> - priv->rx_skb[q][entry] = ravb_alloc_skb(ndev, info, gfp_mask);
> - if (!priv->rx_skb[q][entry])
> + if (!priv->rx_buffers[q][entry].page) {
> + if (unlikely(ravb_alloc_rx_buffer(ndev, q, entry,

Well, IIRC Greg KH is against using unlikely() unless you have actually
instrumented the code and this gives an improvement... have you? :-)

[...]
> @@ -727,12 +739,22 @@ static void ravb_rx_csum_gbeth(struct sk_buff *skb)
> if (unlikely(skb->len < sizeof(__sum16) * 2))
> return;
>
> - hw_csum = skb_tail_pointer(skb) - sizeof(__sum16);
> + if (skb_is_nonlinear(skb)) {
> + last_frag = &shinfo->frags[shinfo->nr_frags - 1];
> + hw_csum = skb_frag_address(last_frag) + skb_frag_size(last_frag) - sizeof(__sum16);
> + } else {
> + hw_csum = skb_tail_pointer(skb) - sizeof(__sum16);
> + }

We can do the subtraction only once here...
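
One possible reading of that suggestion (a sketch only, not tested):

	if (skb_is_nonlinear(skb)) {
		last_frag = &shinfo->frags[shinfo->nr_frags - 1];
		hw_csum = skb_frag_address(last_frag) + skb_frag_size(last_frag);
	} else {
		hw_csum = skb_tail_pointer(skb);
	}

	/* Step back over the protocol checksum, then the IP header checksum. */
	hw_csum -= sizeof(__sum16);
	csum_proto = csum_unfold((__force __sum16)get_unaligned_le16(hw_csum));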

[...]
> @@ -816,14 +824,26 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
> if (desc_status & MSC_CEEF)
> stats->rx_missed_errors++;
> } else {
> + struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
> + void *rx_addr = page_address(rx_buff->page) + rx_buff->offset;

Need an empty line here...

> die_dt = desc->die_dt & 0xF0;
> - skb = ravb_get_skb_gbeth(ndev, entry, desc);
> + dma_sync_single_for_cpu(ndev->dev.parent, le32_to_cpu(desc->dptr),
> + desc_len, DMA_FROM_DEVICE);
> +
> switch (die_dt) {
> case DT_FSINGLE:
> case DT_FSTART:
> /* Start of packet:
> - * Set initial data length.
> + * Prepare an SKB and add initial data.

I'd prefer calling it skb in the comments...

[...]
> @@ -865,7 +894,16 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
> stats->rx_bytes += skb->len;
> napi_gro_receive(&priv->napi[q], skb);
> rx_packets++;
> +
> + /* Clear rx_1st_skb so that it will only be
> + * non-NULL when valid.
> + */
> + if (die_dt == DT_FEND)
> + priv->rx_1st_skb = NULL;

Hm, can't we do this under *case* DT_FEND above?
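
For example, something like this (one possible reading of the suggestion,
assuming the finishing switch puts DT_FEND first and falls through):

	switch (die_dt) {
	case DT_FEND:
		/* The multi-descriptor packet is complete, so rx_1st_skb
		 * no longer needs to be tracked.
		 */
		priv->rx_1st_skb = NULL;
		fallthrough;
	case DT_FSINGLE:
		skb->protocol = eth_type_trans(skb, ndev);
		if (ndev->features & NETIF_F_RXCSUM)
			ravb_rx_csum_gbeth(skb);
		stats->rx_bytes += skb->len;
		napi_gro_receive(&priv->napi[q], skb);
		rx_packets++;
	}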

[...]

MBR, Sergey

2024-05-30 09:21:44

by Paul Barker

Subject: Re: [net-next PATCH v4 7/7] net: ravb: Allocate RX buffers via page pool

On 29/05/2024 21:52, Sergey Shtylyov wrote:
> On 5/28/24 6:03 PM, Paul Barker wrote:
>
>> This patch makes multiple changes that can't be separated:
>>
>> 1) Allocate plain RX buffers via a page pool instead of allocating
>> SKBs, then use build_skb() when a packet is received.
>> 2) For GbEth IP, reduce the RX buffer size to 2kB.
>> 3) For GbEth IP, merge packets which span more than one RX descriptor
>> as SKB fragments instead of copying data.
>>
>> Implementing (1) without (2) would require the use of an order-1 page
>> pool (instead of an order-0 page pool split into page fragments) for
>> GbEth.
>>
>> Implementing (2) without (3) would leave us no space to re-assemble
>> packets which span more than one RX descriptor.
>>
>> Implementing (3) without (1) would not be possible as the network stack
>> expects to use put_page() or page_pool_put_page() to free SKB fragments
>> after an SKB is consumed.
>>
>> RX checksum offload support is adjusted to handle both linear and
>> nonlinear (fragmented) packets.
>>
>> This patch gives the following improvements during testing with iperf3.
>>
>> * RZ/G2L:
>> * TCP RX: same bandwidth at -43% CPU load (70% -> 40%)
>> * UDP RX: same bandwidth at -17% CPU load (88% -> 74%)
>>
>> * RZ/G2UL:
>> * TCP RX: +30% bandwidth (726Mbps -> 941Mbps)
>> * UDP RX: +417% bandwidth (108Mbps -> 558Mbps)
>>
>> * RZ/G3S:
>> * TCP RX: +64% bandwidth (562Mbps -> 920Mbps)
>> * UDP RX: +420% bandwidth (90Mbps -> 468Mbps)
>>
>> * RZ/Five:
>> * TCP RX: +217% bandwidth (145Mbps -> 459Mbps)
>> * UDP RX: +470% bandwidth (20Mbps -> 114Mbps)
>>
>> There is no significant impact on bandwidth or CPU load in testing on
>> RZ/G2H or R-Car M3N.
>>
>> Signed-off-by: Paul Barker <[email protected]>
>> ---
>> Changes v3->v4:
>> * Used a separate page pool for each RX queue.
>> * Passed struct ravb_rx_desc to ravb_alloc_rx_buffer() so that we can
>> simplify the calling function.
>> * Explained the calculation of rx_desc->ds_cc.
>> * Added handling of nonlinear SKBs in ravb_rx_csum_gbeth().
>>
>> drivers/net/ethernet/renesas/ravb.h | 10 +-
>> drivers/net/ethernet/renesas/ravb_main.c | 230 ++++++++++++++---------
>> 2 files changed, 146 insertions(+), 94 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
>> index 6a7aa7dd17e6..f2091a17fcf7 100644
>> --- a/drivers/net/ethernet/renesas/ravb.h
>> +++ b/drivers/net/ethernet/renesas/ravb.h
> [...]> @@ -1094,7 +1099,8 @@ struct ravb_private {
>> struct ravb_tx_desc *tx_ring[NUM_TX_QUEUE];
>> void *tx_align[NUM_TX_QUEUE];
>> struct sk_buff *rx_1st_skb;
>> - struct sk_buff **rx_skb[NUM_RX_QUEUE];
>> + struct page_pool *rx_pool[NUM_RX_QUEUE];
>
> Don't we need #include <net/page_pool/types.h>

Yes. I got away with it as ravb_main.c includes
<net/page_pool/helpers.h> before including "ravb.h", but the header
shouldn't assume that.
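
Something like the following near the existing includes in ravb.h should
cover it (just a sketch; for the pointer members alone, a plain
"struct page_pool;" forward declaration would also be enough):

	#include <net/page_pool/types.h>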

>
> [...]
>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>> index dd92f074881a..bb7f7d44be6e 100644
>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> [...]
>> @@ -317,35 +289,56 @@ static void ravb_ring_free(struct net_device *ndev, int q)
>> priv->tx_skb[q] = NULL;
>> }
>>
>> +static int
>> +ravb_alloc_rx_buffer(struct net_device *ndev, int q, u32 entry, gfp_t gfp_mask,
>> + struct ravb_rx_desc *rx_desc)
>> +{
>> + struct ravb_private *priv = netdev_priv(ndev);
>> + const struct ravb_hw_info *info = priv->info;
>> + struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
>> + dma_addr_t dma_addr;
>> + unsigned int size;
>> +
>> + size = info->rx_buffer_size;
>> + rx_buff->page = page_pool_alloc(priv->rx_pool[q], &rx_buff->offset, &size,
>> + gfp_mask);
>> + if (unlikely(!rx_buff->page)) {
>> + /* We just set the data size to 0 for a failed mapping
>> + * which should prevent DMA from happening...
>> + */
>> + rx_desc->ds_cc = cpu_to_le16(0);
>> + return -ENOMEM;
>> + }
>> +
>> + dma_addr = page_pool_get_dma_addr(rx_buff->page) + rx_buff->offset;
>> + dma_sync_single_for_device(ndev->dev.parent, dma_addr,
>> + info->rx_buffer_size, DMA_FROM_DEVICE);
>
> Do we really need this call?

Looking at .config I see CONFIG_DMA_NEED_SYNC=y so yes I think this is
needed.

>
>> + rx_desc->dptr = cpu_to_le32(dma_addr);
>> +
>> + /* The end of the RX buffer is used to store skb shared data, so we need
>> + * to ensure that the hardware leaves enough space for this.
>> + */
>> + rx_desc->ds_cc = cpu_to_le16(info->rx_buffer_size
>> + - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
>
> Please leave the - operator on the previous line...

Ack.

>
>> + - ETH_FCS_LEN + sizeof(__sum16));
>
> Here as well...

Ack.

>
>> + return 0;
>> +}
>> +
>> static u32
>> ravb_rx_ring_refill(struct net_device *ndev, int q, u32 count, gfp_t gfp_mask)
>> {
>> struct ravb_private *priv = netdev_priv(ndev);
>> - const struct ravb_hw_info *info = priv->info;
>> struct ravb_rx_desc *rx_desc;
>> - dma_addr_t dma_addr;
>> u32 i, entry;
>>
>> for (i = 0; i < count; i++) {
>> entry = (priv->dirty_rx[q] + i) % priv->num_rx_ring[q];
>> rx_desc = ravb_rx_get_desc(priv, q, entry);
>> - rx_desc->ds_cc = cpu_to_le16(info->rx_max_desc_use);
>>
>> - if (!priv->rx_skb[q][entry]) {
>> - priv->rx_skb[q][entry] = ravb_alloc_skb(ndev, info, gfp_mask);
>> - if (!priv->rx_skb[q][entry])
>> + if (!priv->rx_buffers[q][entry].page) {
>> + if (unlikely(ravb_alloc_rx_buffer(ndev, q, entry,
>
> Well, IIRC Greg KH is against using unlikely() unless you have actually
> instrumented the code and this gives an improvement... have you? :-)

My understanding was that we should use unlikely() for error checking in
hot code paths where we want the "good" path to be optimised. I can drop
this if I'm wrong though.

>
> [...]
>> @@ -727,12 +739,22 @@ static void ravb_rx_csum_gbeth(struct sk_buff *skb)
>> if (unlikely(skb->len < sizeof(__sum16) * 2))
>> return;
>>
>> - hw_csum = skb_tail_pointer(skb) - sizeof(__sum16);
>> + if (skb_is_nonlinear(skb)) {
>> + last_frag = &shinfo->frags[shinfo->nr_frags - 1];
>> + hw_csum = skb_frag_address(last_frag) + skb_frag_size(last_frag) - sizeof(__sum16);
>> + } else {
>> + hw_csum = skb_tail_pointer(skb) - sizeof(__sum16);
>> + }
>
> We can do the subtraction only once here...

Ack. I'll pull that out of the if.
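
Something along these lines (untested sketch):

	if (skb_is_nonlinear(skb)) {
		last_frag = &shinfo->frags[shinfo->nr_frags - 1];
		hw_csum = skb_frag_address(last_frag) +
			  skb_frag_size(last_frag);
	} else {
		hw_csum = skb_tail_pointer(skb);
	}
	hw_csum -= sizeof(__sum16);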

>
> [...]
>> @@ -816,14 +824,26 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>> if (desc_status & MSC_CEEF)
>> stats->rx_missed_errors++;
>> } else {
>> + struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
>> + void *rx_addr = page_address(rx_buff->page) + rx_buff->offset;
>
> Need an empty line here...

Ack.

>
>> die_dt = desc->die_dt & 0xF0;
>> - skb = ravb_get_skb_gbeth(ndev, entry, desc);
>> + dma_sync_single_for_cpu(ndev->dev.parent, le32_to_cpu(desc->dptr),
>> + desc_len, DMA_FROM_DEVICE);
>> +
>> switch (die_dt) {
>> case DT_FSINGLE:
>> case DT_FSTART:
>> /* Start of packet:
>> - * Set initial data length.
>> + * Prepare an SKB and add initial data.
>
> I'd prefer calling it skb in the comments...

Ack.

>
> [...]
>> @@ -865,7 +894,16 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>> stats->rx_bytes += skb->len;
>> napi_gro_receive(&priv->napi[q], skb);
>> rx_packets++;
>> +
>> + /* Clear rx_1st_skb so that it will only be
>> + * non-NULL when valid.
>> + */
>> + if (die_dt == DT_FEND)
>> + priv->rx_1st_skb = NULL;
>
> Hm, can't we do this under *case* DT_FEND above?

It makes more logical sense to me to do this as the last step, but I
guess it's a little more optimal to do it earlier. I'll move it.

Thanks,

--
Paul Barker


2024-05-30 10:29:27

by Paul Barker

[permalink] [raw]
Subject: Re: [net-next PATCH v4 7/7] net: ravb: Allocate RX buffers via page pool

On 30/05/2024 10:21, Paul Barker wrote:
> On 29/05/2024 21:52, Sergey Shtylyov wrote:
>> On 5/28/24 6:03 PM, Paul Barker wrote:
>>> @@ -865,7 +894,16 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>>> stats->rx_bytes += skb->len;
>>> napi_gro_receive(&priv->napi[q], skb);
>>> rx_packets++;
>>> +
>>> + /* Clear rx_1st_skb so that it will only be
>>> + * non-NULL when valid.
>>> + */
>>> + if (die_dt == DT_FEND)
>>> + priv->rx_1st_skb = NULL;
>>
>> Hm, can't we do this under *case* DT_FEND above?
>
> It makes more logical sense to me to do this as the last step, but I
> guess it's a little more optimal to do it earlier. I'll move it.

Actually, this doesn't even need to be conditional. If die_dt is
DT_FSINGLE, priv->rx_1st_skb will already be NULL so this will be a
no-op. So I'll just simplify this.
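
i.e. the end of the DT_FSINGLE/DT_FEND handling would look something
like this (sketch only):

	stats->rx_bytes += skb->len;
	napi_gro_receive(&priv->napi[q], skb);
	rx_packets++;

	/* Clear rx_1st_skb so that it is only non-NULL while a
	 * multi-descriptor packet is being assembled. This is a
	 * no-op for DT_FSINGLE, where it is already NULL.
	 */
	priv->rx_1st_skb = NULL;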

Thanks,

--
Paul Barker


2024-05-30 20:37:49

by Sergey Shtylyov

[permalink] [raw]
Subject: Re: [net-next PATCH v4 4/7] net: ravb: Refactor GbEth RX code path

On 5/29/24 10:07 PM, Paul Barker wrote:
[...]

>>> We can reduce code duplication in ravb_rx_gbeth().
>>>
>>> Signed-off-by: Paul Barker <[email protected]>
>> [...]
>>
>>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>>> index 7df7d2e93a3a..c9c5cc641589 100644
>>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
>>> @@ -817,47 +817,54 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>>> stats->rx_missed_errors++;
>>> } else {
>>> die_dt = desc->die_dt & 0xF0;
>>> + skb = ravb_get_skb_gbeth(ndev, entry, desc);
>>> switch (die_dt) {
>>
>> Why not do instead (as I've asked you alraedy):

Already. :-)

>>
>> case DT_FSTART:
>> priv->rx_1st_skb = skb;
>> fallthrough;
>
> I've avoided that change to keep patch 7/7 simpler (as we have to move
> the assignment of skb later in that patch). I can change this if you
> want though.

Oh, then please keep it as is! :-)

[...]

> Thanks for the review!

MBR, Sergey

2024-05-31 17:52:03

by Sergey Shtylyov

[permalink] [raw]
Subject: Re: [net-next PATCH v4 7/7] net: ravb: Allocate RX buffers via page pool

On 5/30/24 12:21 PM, Paul Barker wrote:
[...]

>>> This patch makes multiple changes that can't be separated:
>>>
>>> 1) Allocate plain RX buffers via a page pool instead of allocating
>>> SKBs, then use build_skb() when a packet is received.
>>> 2) For GbEth IP, reduce the RX buffer size to 2kB.
>>> 3) For GbEth IP, merge packets which span more than one RX descriptor
>>> as SKB fragments instead of copying data.
>>>
>>> Implementing (1) without (2) would require the use of an order-1 page
>>> pool (instead of an order-0 page pool split into page fragments) for
>>> GbEth.
>>>
>>> Implementing (2) without (3) would leave us no space to re-assemble
>>> packets which span more than one RX descriptor.
>>>
>>> Implementing (3) without (1) would not be possible as the network stack
>>> expects to use put_page() or page_pool_put_page() to free SKB fragments
>>> after an SKB is consumed.
>>>
>>> RX checksum offload support is adjusted to handle both linear and
>>> nonlinear (fragmented) packets.
>>>
>>> This patch gives the following improvements during testing with iperf3.
>>>
>>> * RZ/G2L:
>>> * TCP RX: same bandwidth at -43% CPU load (70% -> 40%)
>>> * UDP RX: same bandwidth at -17% CPU load (88% -> 74%)
>>>
>>> * RZ/G2UL:
>>> * TCP RX: +30% bandwidth (726Mbps -> 941Mbps)
>>> * UDP RX: +417% bandwidth (108Mbps -> 558Mbps)
>>>
>>> * RZ/G3S:
>>> * TCP RX: +64% bandwidth (562Mbps -> 920Mbps)
>>> * UDP RX: +420% bandwidth (90Mbps -> 468Mbps)
>>>
>>> * RZ/Five:
>>> * TCP RX: +217% bandwidth (145Mbps -> 459Mbps)
>>> * UDP RX: +470% bandwidth (20Mbps -> 114Mbps)
>>>
>>> There is no significant impact on bandwidth or CPU load in testing on
>>> RZ/G2H or R-Car M3N.
>>>
>>> Signed-off-by: Paul Barker <[email protected]>
[...]

>>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>>> index dd92f074881a..bb7f7d44be6e 100644
>>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
[...]
>>> + return 0;
>>> +}
>>> +
>>> static u32
>>> ravb_rx_ring_refill(struct net_device *ndev, int q, u32 count, gfp_t gfp_mask)
>>> {
>>> struct ravb_private *priv = netdev_priv(ndev);
>>> - const struct ravb_hw_info *info = priv->info;
>>> struct ravb_rx_desc *rx_desc;
>>> - dma_addr_t dma_addr;
>>> u32 i, entry;
>>>
>>> for (i = 0; i < count; i++) {
>>> entry = (priv->dirty_rx[q] + i) % priv->num_rx_ring[q];
>>> rx_desc = ravb_rx_get_desc(priv, q, entry);
>>> - rx_desc->ds_cc = cpu_to_le16(info->rx_max_desc_use);
>>>
>>> - if (!priv->rx_skb[q][entry]) {
>>> - priv->rx_skb[q][entry] = ravb_alloc_skb(ndev, info, gfp_mask);
>>> - if (!priv->rx_skb[q][entry])
>>> + if (!priv->rx_buffers[q][entry].page) {
>>> + if (unlikely(ravb_alloc_rx_buffer(ndev, q, entry,
>>
>> Well, IIRC Greg KH is against using unlikely() unless you have actually
>> instrumented the code and this gives an improvement... have you? :-)
>
> My understanding was that we should use unlikely() for error checking in
> hot code paths where we want the "good" path to be optimised. I can drop
> this if I'm wrong though.

OK, keep it... :-)

[...]
>>> @@ -865,7 +894,16 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>>> stats->rx_bytes += skb->len;
>>> napi_gro_receive(&priv->napi[q], skb);
>>> rx_packets++;
>>> +
>>> + /* Clear rx_1st_skb so that it will only be
>>> + * non-NULL when valid.
>>> + */
>>> + if (die_dt == DT_FEND)
>>> + priv->rx_1st_skb = NULL;
>>
>> Hm, can't we do this under *case* DT_FEND above?
>
> It makes more logical sense to me to do this as the last step, but I
> guess it's a little more optimal to do it earlier. I'll move it.

Looking at it once more, we can't... unless I'm missing s/th. :-)

> Thanks,

MBR, Sergey

2024-06-01 10:13:20

by Simon Horman

[permalink] [raw]
Subject: Re: [net-next PATCH v4 7/7] net: ravb: Allocate RX buffers via page pool

On Tue, May 28, 2024 at 04:03:39PM +0100, Paul Barker wrote:
> This patch makes multiple changes that can't be separated:
>
> 1) Allocate plain RX buffers via a page pool instead of allocating
> SKBs, then use build_skb() when a packet is received.
> 2) For GbEth IP, reduce the RX buffer size to 2kB.
> 3) For GbEth IP, merge packets which span more than one RX descriptor
> as SKB fragments instead of copying data.
>
> Implementing (1) without (2) would require the use of an order-1 page
> pool (instead of an order-0 page pool split into page fragments) for
> GbEth.
>
> Implementing (2) without (3) would leave us no space to re-assemble
> packets which span more than one RX descriptor.
>
> Implementing (3) without (1) would not be possible as the network stack
> expects to use put_page() or page_pool_put_page() to free SKB fragments
> after an SKB is consumed.
>
> RX checksum offload support is adjusted to handle both linear and
> nonlinear (fragmented) packets.
>
> This patch gives the following improvements during testing with iperf3.
>
> * RZ/G2L:
> * TCP RX: same bandwidth at -43% CPU load (70% -> 40%)
> * UDP RX: same bandwidth at -17% CPU load (88% -> 74%)
>
> * RZ/G2UL:
> * TCP RX: +30% bandwidth (726Mbps -> 941Mbps)
> * UDP RX: +417% bandwidth (108Mbps -> 558Mbps)
>
> * RZ/G3S:
> * TCP RX: +64% bandwidth (562Mbps -> 920Mbps)
> * UDP RX: +420% bandwidth (90Mbps -> 468Mbps)
>
> * RZ/Five:
> * TCP RX: +217% bandwidth (145Mbps -> 459Mbps)
> * UDP RX: +470% bandwidth (20Mbps -> 114Mbps)
>
> There is no significant impact on bandwidth or CPU load in testing on
> RZ/G2H or R-Car M3N.
>
> Signed-off-by: Paul Barker <[email protected]>

Hi Paul,

Some minor feedback from my side.

...

> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c

...

> @@ -298,13 +269,14 @@ static void ravb_ring_free(struct net_device *ndev, int q)
> priv->tx_ring[q] = NULL;
> }
>
> - /* Free RX skb ringbuffer */
> - if (priv->rx_skb[q]) {
> - for (i = 0; i < priv->num_rx_ring[q]; i++)
> - dev_kfree_skb(priv->rx_skb[q][i]);
> + /* Free RX buffers */
> + for (i = 0; i < priv->num_rx_ring[q]; i++) {
> + if (priv->rx_buffers[q][i].page)
> + page_pool_put_page(priv->rx_pool[q], priv->rx_buffers[q][i].page, 0, true);

nit: Networking still prefers code to be 80 columns wide or less.
It looks like that can be trivially achieved here.

Flagged by checkpatch.pl --max-line-length=80

> }
> - kfree(priv->rx_skb[q]);
> - priv->rx_skb[q] = NULL;
> + kfree(priv->rx_buffers[q]);
> + priv->rx_buffers[q] = NULL;
> + page_pool_destroy(priv->rx_pool[q]);
>
> /* Free aligned TX buffers */
> kfree(priv->tx_align[q]);
> @@ -317,35 +289,56 @@ static void ravb_ring_free(struct net_device *ndev, int q)
> priv->tx_skb[q] = NULL;
> }
>
> +static int
> +ravb_alloc_rx_buffer(struct net_device *ndev, int q, u32 entry, gfp_t gfp_mask,
> + struct ravb_rx_desc *rx_desc)
> +{
> + struct ravb_private *priv = netdev_priv(ndev);
> + const struct ravb_hw_info *info = priv->info;
> + struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
> + dma_addr_t dma_addr;
> + unsigned int size;

nit: I would appreciate it if some consideration could be given to
moving this driver towards rather than away from reverse xmas
tree - longest line to shortest - for local variable declarations.

I'm not suggesting a clean-up patch. Rather, that in cases
like this where new code is added, and also in cases where
code is modified, reverse xmas tree is preferred.

Here I would suggest separating the assignment of rx_buff from
its declaration (completely untested!):

struct ravb_private *priv = netdev_priv(ndev);
const struct ravb_hw_info *info = priv->info;
struct ravb_rx_buffer *rx_buff;
dma_addr_t dma_addr;
unsigned int size;

rx_buff = &priv->rx_buffers[q][entry];

Edward Cree's xmastree tool can be helpful here:
https://github.com/ecree-solarflare/xmastree

> +
> + size = info->rx_buffer_size;
> + rx_buff->page = page_pool_alloc(priv->rx_pool[q], &rx_buff->offset, &size,
> + gfp_mask);
> + if (unlikely(!rx_buff->page)) {
> + /* We just set the data size to 0 for a failed mapping
> + * which should prevent DMA from happening...
> + */
> + rx_desc->ds_cc = cpu_to_le16(0);
> + return -ENOMEM;
> + }
> +
> + dma_addr = page_pool_get_dma_addr(rx_buff->page) + rx_buff->offset;
> + dma_sync_single_for_device(ndev->dev.parent, dma_addr,
> + info->rx_buffer_size, DMA_FROM_DEVICE);
> + rx_desc->dptr = cpu_to_le32(dma_addr);
> +
> + /* The end of the RX buffer is used to store skb shared data, so we need
> + * to ensure that the hardware leaves enough space for this.
> + */
> + rx_desc->ds_cc = cpu_to_le16(info->rx_buffer_size
> + - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
> + - ETH_FCS_LEN + sizeof(__sum16));
> + return 0;
> +}

...

> @@ -816,14 +824,26 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
> if (desc_status & MSC_CEEF)
> stats->rx_missed_errors++;
> } else {
> + struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
> + void *rx_addr = page_address(rx_buff->page) + rx_buff->offset;
> die_dt = desc->die_dt & 0xF0;
> - skb = ravb_get_skb_gbeth(ndev, entry, desc);
> + dma_sync_single_for_cpu(ndev->dev.parent, le32_to_cpu(desc->dptr),
> + desc_len, DMA_FROM_DEVICE);
> +
> switch (die_dt) {
> case DT_FSINGLE:
> case DT_FSTART:
> /* Start of packet:
> - * Set initial data length.
> + * Prepare an SKB and add initial data.
> */
> + skb = napi_build_skb(rx_addr, info->rx_buffer_size);
> + if (unlikely(!skb)) {
> + stats->rx_errors++;
> + page_pool_put_page(priv->rx_pool[q],
> + rx_buff->page, 0, true);

Here skb is NULL.

> + break;
> + }
> + skb_mark_for_recycle(skb);
> skb_put(skb, desc_len);
>
> /* Save this SKB if the packet spans multiple
> @@ -836,14 +856,23 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
> case DT_FMID:
> case DT_FEND:
> /* Continuing a packet:
> - * Move data into the saved SKB.
> + * Add this buffer as an RX frag.
> */
> - skb_copy_to_linear_data_offset(priv->rx_1st_skb,
> - priv->rx_1st_skb->len,
> - skb->data,
> - desc_len);
> - skb_put(priv->rx_1st_skb, desc_len);
> - dev_kfree_skb(skb);
> +
> + /* rx_1st_skb will be NULL if napi_build_skb()
> + * failed for the first descriptor of a
> + * multi-descriptor packet.
> + */
> + if (unlikely(!priv->rx_1st_skb)) {
> + stats->rx_errors++;
> + page_pool_put_page(priv->rx_pool[q],
> + rx_buff->page, 0, true);

And here skb seems to be uninitialised.

> + break;
> + }
> + skb_add_rx_frag(priv->rx_1st_skb,
> + skb_shinfo(priv->rx_1st_skb)->nr_frags,
> + rx_buff->page, rx_buff->offset,
> + desc_len, info->rx_buffer_size);
>
> /* Set skb to point at the whole packet so that
> * we only need one code path for finishing a

The code between the hunk above and the hunk below is:

/* Set skb to point at the whole packet so that
* we only need one code path for finishing a
* packet.
*/
skb = priv->rx_1st_skb;
}
switch (die_dt) {
case DT_FSINGLE:
case DT_FEND:
/* Finishing a packet:
* Determine protocol & checksum, hand off to
* NAPI and update our stats.
*/
skb->protocol = eth_type_trans(skb, ndev);
if (ndev->features & NETIF_F_RXCSUM)
ravb_rx_csum_gbeth(skb);
stats->rx_bytes += skb->len;
napi_gro_receive(&priv->napi[q], skb);
rx_packets++;

It seems that the inter-hunk code above may dereference skb when it is NULL
or uninitialised.

Flagged by Smatch.

> @@ -865,7 +894,16 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
> stats->rx_bytes += skb->len;
> napi_gro_receive(&priv->napi[q], skb);
> rx_packets++;
> +
> + /* Clear rx_1st_skb so that it will only be
> + * non-NULL when valid.
> + */
> + if (die_dt == DT_FEND)
> + priv->rx_1st_skb = NULL;
> }
> +
> + /* Mark this RX buffer as consumed. */
> + rx_buff->page = NULL;
> }
> }
>

...

2024-06-03 08:04:06

by Paul Barker

[permalink] [raw]
Subject: Re: [net-next PATCH v4 7/7] net: ravb: Allocate RX buffers via page pool

On 01/06/2024 11:13, Simon Horman wrote:
> On Tue, May 28, 2024 at 04:03:39PM +0100, Paul Barker wrote:
>> This patch makes multiple changes that can't be separated:
>>
>> 1) Allocate plain RX buffers via a page pool instead of allocating
>> SKBs, then use build_skb() when a packet is received.
>> 2) For GbEth IP, reduce the RX buffer size to 2kB.
>> 3) For GbEth IP, merge packets which span more than one RX descriptor
>> as SKB fragments instead of copying data.
>>
>> Implementing (1) without (2) would require the use of an order-1 page
>> pool (instead of an order-0 page pool split into page fragments) for
>> GbEth.
>>
>> Implementing (2) without (3) would leave us no space to re-assemble
>> packets which span more than one RX descriptor.
>>
>> Implementing (3) without (1) would not be possible as the network stack
>> expects to use put_page() or page_pool_put_page() to free SKB fragments
>> after an SKB is consumed.
>>
>> RX checksum offload support is adjusted to handle both linear and
>> nonlinear (fragmented) packets.
>>
>> This patch gives the following improvements during testing with iperf3.
>>
>> * RZ/G2L:
>> * TCP RX: same bandwidth at -43% CPU load (70% -> 40%)
>> * UDP RX: same bandwidth at -17% CPU load (88% -> 74%)
>>
>> * RZ/G2UL:
>> * TCP RX: +30% bandwidth (726Mbps -> 941Mbps)
>> * UDP RX: +417% bandwidth (108Mbps -> 558Mbps)
>>
>> * RZ/G3S:
>> * TCP RX: +64% bandwidth (562Mbps -> 920Mbps)
>> * UDP RX: +420% bandwidth (90Mbps -> 468Mbps)
>>
>> * RZ/Five:
>> * TCP RX: +217% bandwidth (145Mbps -> 459Mbps)
>> * UDP RX: +470% bandwidth (20Mbps -> 114Mbps)
>>
>> There is no significant impact on bandwidth or CPU load in testing on
>> RZ/G2H or R-Car M3N.
>>
>> Signed-off-by: Paul Barker <[email protected]>
>
> Hi Paul,
>
> Some minor feedback from my side.
>
> ...
>
>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>
> ...
>
>> @@ -298,13 +269,14 @@ static void ravb_ring_free(struct net_device *ndev, int q)
>> priv->tx_ring[q] = NULL;
>> }
>>
>> - /* Free RX skb ringbuffer */
>> - if (priv->rx_skb[q]) {
>> - for (i = 0; i < priv->num_rx_ring[q]; i++)
>> - dev_kfree_skb(priv->rx_skb[q][i]);
>> + /* Free RX buffers */
>> + for (i = 0; i < priv->num_rx_ring[q]; i++) {
>> + if (priv->rx_buffers[q][i].page)
>> + page_pool_put_page(priv->rx_pool[q], priv->rx_buffers[q][i].page, 0, true);
>
> nit: Networking still prefers code to be 80 columns wide or less.
> It looks like that can be trivially achieved here.
>
> Flagged by checkpatch.pl --max-line-length=80

Sergey has asked me to wrap to 100 cols [1]. I can only find a reference
to 80 in the docs though [2], so I guess you may be right.

[1]: https://lore.kernel.org/all/[email protected]/
[2]: https://www.kernel.org/doc/html/latest/process/coding-style.html

>
>> }
>> - kfree(priv->rx_skb[q]);
>> - priv->rx_skb[q] = NULL;
>> + kfree(priv->rx_buffers[q]);
>> + priv->rx_buffers[q] = NULL;
>> + page_pool_destroy(priv->rx_pool[q]);
>>
>> /* Free aligned TX buffers */
>> kfree(priv->tx_align[q]);
>> @@ -317,35 +289,56 @@ static void ravb_ring_free(struct net_device *ndev, int q)
>> priv->tx_skb[q] = NULL;
>> }
>>
>> +static int
>> +ravb_alloc_rx_buffer(struct net_device *ndev, int q, u32 entry, gfp_t gfp_mask,
>> + struct ravb_rx_desc *rx_desc)
>> +{
>> + struct ravb_private *priv = netdev_priv(ndev);
>> + const struct ravb_hw_info *info = priv->info;
>> + struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
>> + dma_addr_t dma_addr;
>> + unsigned int size;
>
> nit: I would appreciate it if some consideration could be given to
> moving this driver towards rather than away from reverse xmas
> tree - longest line to shortest - for local variable declarations.
>
> I'm not suggesting a clean-up patch. Rather, that in cases
> like this where new code is added, and also in cases where
> code is modified, reverse xmas tree is preferred.
>
> Here I would suggest separating the assignment of rx_buff from
> its declaration (completely untested!):
>
> struct ravb_private *priv = netdev_priv(ndev);
> const struct ravb_hw_info *info = priv->info;
> struct ravb_rx_buffer *rx_buff;
> dma_addr_t dma_addr;
> unsigned int size;
>
> rx_buff = &priv->rx_buffers[q][entry];
>
> Edward Cree's xmastree tool can be helpful here:
> https://github.com/ecree-solarflare/xmastree

Ack.

>
>> +
>> + size = info->rx_buffer_size;
>> + rx_buff->page = page_pool_alloc(priv->rx_pool[q], &rx_buff->offset, &size,
>> + gfp_mask);
>> + if (unlikely(!rx_buff->page)) {
>> + /* We just set the data size to 0 for a failed mapping
>> + * which should prevent DMA from happening...
>> + */
>> + rx_desc->ds_cc = cpu_to_le16(0);
>> + return -ENOMEM;
>> + }
>> +
>> + dma_addr = page_pool_get_dma_addr(rx_buff->page) + rx_buff->offset;
>> + dma_sync_single_for_device(ndev->dev.parent, dma_addr,
>> + info->rx_buffer_size, DMA_FROM_DEVICE);
>> + rx_desc->dptr = cpu_to_le32(dma_addr);
>> +
>> + /* The end of the RX buffer is used to store skb shared data, so we need
>> + * to ensure that the hardware leaves enough space for this.
>> + */
>> + rx_desc->ds_cc = cpu_to_le16(info->rx_buffer_size
>> + - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
>> + - ETH_FCS_LEN + sizeof(__sum16));
>> + return 0;
>> +}
>
> ...
>
>> @@ -816,14 +824,26 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>> if (desc_status & MSC_CEEF)
>> stats->rx_missed_errors++;
>> } else {
>> + struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
>> + void *rx_addr = page_address(rx_buff->page) + rx_buff->offset;
>> die_dt = desc->die_dt & 0xF0;
>> - skb = ravb_get_skb_gbeth(ndev, entry, desc);
>> + dma_sync_single_for_cpu(ndev->dev.parent, le32_to_cpu(desc->dptr),
>> + desc_len, DMA_FROM_DEVICE);
>> +
>> switch (die_dt) {
>> case DT_FSINGLE:
>> case DT_FSTART:
>> /* Start of packet:
>> - * Set initial data length.
>> + * Prepare an SKB and add initial data.
>> */
>> + skb = napi_build_skb(rx_addr, info->rx_buffer_size);
>> + if (unlikely(!skb)) {
>> + stats->rx_errors++;
>> + page_pool_put_page(priv->rx_pool[q],
>> + rx_buff->page, 0, true);
>
> Here skb is NULL.
>
>> + break;
>> + }
>> + skb_mark_for_recycle(skb);
>> skb_put(skb, desc_len);
>>
>> /* Save this SKB if the packet spans multiple
>> @@ -836,14 +856,23 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>> case DT_FMID:
>> case DT_FEND:
>> /* Continuing a packet:
>> - * Move data into the saved SKB.
>> + * Add this buffer as an RX frag.
>> */
>> - skb_copy_to_linear_data_offset(priv->rx_1st_skb,
>> - priv->rx_1st_skb->len,
>> - skb->data,
>> - desc_len);
>> - skb_put(priv->rx_1st_skb, desc_len);
>> - dev_kfree_skb(skb);
>> +
>> + /* rx_1st_skb will be NULL if napi_build_skb()
>> + * failed for the first descriptor of a
>> + * multi-descriptor packet.
>> + */
>> + if (unlikely(!priv->rx_1st_skb)) {
>> + stats->rx_errors++;
>> + page_pool_put_page(priv->rx_pool[q],
>> + rx_buff->page, 0, true);
>
> And here skb seems to be uninitialised.
>
>> + break;
>> + }
>> + skb_add_rx_frag(priv->rx_1st_skb,
>> + skb_shinfo(priv->rx_1st_skb)->nr_frags,
>> + rx_buff->page, rx_buff->offset,
>> + desc_len, info->rx_buffer_size);
>>
>> /* Set skb to point at the whole packet so that
>> * we only need one code path for finishing a
>
> The code between the hunk above and the hunk below is:
>
> /* Set skb to point at the whole packet so that
> * we only need one code path for finishing a
> * packet.
> */
> skb = priv->rx_1st_skb;
> }
> switch (die_dt) {
> case DT_FSINGLE:
> case DT_FEND:
> /* Finishing a packet:
> * Determine protocol & checksum, hand off to
> * NAPI and update our stats.
> */
> skb->protocol = eth_type_trans(skb, ndev);
> if (ndev->features & NETIF_F_RXCSUM)
> ravb_rx_csum_gbeth(skb);
> stats->rx_bytes += skb->len;
> napi_gro_receive(&priv->napi[q], skb);
> rx_packets++;
>
> It seems that the inter-hunk code above may dereference skb when it is NULL
> or uninitialised.
>
> Flagged by Smatch.

I see what has happened here. I wrote this using if statements first,
then changed to switch statements in response to Sergey's review. So the
break statements were intended to break out of the outer for loop, not
the switch statement. I'll need to replace them with goto statements.
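
For example, the DT_FSTART/DT_FSINGLE error path would end up looking
something like this (a rough sketch; the label name here is made up,
not taken from the posted series):

	skb = napi_build_skb(rx_addr, info->rx_buffer_size);
	if (unlikely(!skb)) {
		stats->rx_errors++;
		page_pool_put_page(priv->rx_pool[q],
				   rx_buff->page, 0, true);
		/* Leave the descriptor loop, not just the switch. */
		goto refill;
	}

with a refill: label placed after the descriptor loop (exact placement
to be confirmed when reworking the patch).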

Thanks for your review!

--
Paul Barker


2024-06-03 12:08:50

by Simon Horman

[permalink] [raw]
Subject: Re: [net-next PATCH v4 7/7] net: ravb: Allocate RX buffers via page pool

On Mon, Jun 03, 2024 at 09:02:51AM +0100, Paul Barker wrote:
> On 01/06/2024 11:13, Simon Horman wrote:
> > On Tue, May 28, 2024 at 04:03:39PM +0100, Paul Barker wrote:

...

> >> @@ -298,13 +269,14 @@ static void ravb_ring_free(struct net_device *ndev, int q)
> >> priv->tx_ring[q] = NULL;
> >> }
> >>
> >> - /* Free RX skb ringbuffer */
> >> - if (priv->rx_skb[q]) {
> >> - for (i = 0; i < priv->num_rx_ring[q]; i++)
> >> - dev_kfree_skb(priv->rx_skb[q][i]);
> >> + /* Free RX buffers */
> >> + for (i = 0; i < priv->num_rx_ring[q]; i++) {
> >> + if (priv->rx_buffers[q][i].page)
> >> + page_pool_put_page(priv->rx_pool[q], priv->rx_buffers[q][i].page, 0, true);
> >
> > nit: Networking still prefers code to be 80 columns wide or less.
> > It looks like that can be trivially achieved here.
> >
> > Flagged by checkpatch.pl --max-line-length=80
>
> Sergey has asked me to wrap to 100 cols [1]. I can only find a reference
> to 80 in the docs though [2], so I guess you may be right.
>
> [1]: https://lore.kernel.org/all/[email protected]/
> [2]: https://www.kernel.org/doc/html/latest/process/coding-style.html

Hi Paul,

If Sergey prefers 100 then I won't argue :)

FWIW, I think what has happened here relates to the kernel, at some point,
going from 80 to 100 columns as the preferred maximum width, while networking
stuck with 80.

...

2024-06-03 12:16:06

by Paul Barker

[permalink] [raw]
Subject: Re: [net-next PATCH v4 7/7] net: ravb: Allocate RX buffers via page pool

On 03/06/2024 13:07, Simon Horman wrote:
> On Mon, Jun 03, 2024 at 09:02:51AM +0100, Paul Barker wrote:
>> On 01/06/2024 11:13, Simon Horman wrote:
>>> On Tue, May 28, 2024 at 04:03:39PM +0100, Paul Barker wrote:
>
> ...
>
>>>> @@ -298,13 +269,14 @@ static void ravb_ring_free(struct net_device *ndev, int q)
>>>> priv->tx_ring[q] = NULL;
>>>> }
>>>>
>>>> - /* Free RX skb ringbuffer */
>>>> - if (priv->rx_skb[q]) {
>>>> - for (i = 0; i < priv->num_rx_ring[q]; i++)
>>>> - dev_kfree_skb(priv->rx_skb[q][i]);
>>>> + /* Free RX buffers */
>>>> + for (i = 0; i < priv->num_rx_ring[q]; i++) {
>>>> + if (priv->rx_buffers[q][i].page)
>>>> + page_pool_put_page(priv->rx_pool[q], priv->rx_buffers[q][i].page, 0, true);
>>>
>>> nit: Networking still prefers code to be 80 columns wide or less.
>>> It looks like that can be trivially achieved here.
>>>
>>> Flagged by checkpatch.pl --max-line-length=80
>>
>> Sergey has asked me to wrap to 100 cols [1]. I can only find a reference
>> to 80 in the docs though [2], so I guess you may be right.
>>
>> [1]: https://lore.kernel.org/all/[email protected]/
>> [2]: https://www.kernel.org/doc/html/latest/process/coding-style.html
>
> Hi Paul,
>
> If Sergey prefers 100 then I won't argue :)
>
> FWIW, I think what has happened here relates to the kernel, at some point,
> going from 80 to 100 columns as the preferred maximum width, while networking
> stuck with 80.

I saw that netdevbpf patchwork is configured for 80 cols and it has
warnings for v4 of this patch [1], so I've already re-wrapped the
changes in this series to 80 cols (excluding a couple of lines where
using slightly more than 80 cols significantly improves readability).
I'm planning to send that in the next hour or so, assuming my tests
pass.

[1]: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
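
For the hunk Simon flagged, the re-wrapped version looks roughly like
this (sketch):

	/* Free RX buffers */
	for (i = 0; i < priv->num_rx_ring[q]; i++) {
		if (priv->rx_buffers[q][i].page)
			page_pool_put_page(priv->rx_pool[q],
					   priv->rx_buffers[q][i].page,
					   0, true);
	}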

--
Paul Barker


2024-06-03 20:46:26

by Sergey Shtylyov

[permalink] [raw]
Subject: Re: [net-next PATCH v4 7/7] net: ravb: Allocate RX buffers via page pool

On 6/3/24 11:02 AM, Paul Barker wrote:
[...]
>>> This patch makes multiple changes that can't be separated:
>>>
>>> 1) Allocate plain RX buffers via a page pool instead of allocating
>>> SKBs, then use build_skb() when a packet is received.
>>> 2) For GbEth IP, reduce the RX buffer size to 2kB.
>>> 3) For GbEth IP, merge packets which span more than one RX descriptor
>>> as SKB fragments instead of copying data.
>>>
>>> Implementing (1) without (2) would require the use of an order-1 page
>>> pool (instead of an order-0 page pool split into page fragments) for
>>> GbEth.
>>>
>>> Implementing (2) without (3) would leave us no space to re-assemble
>>> packets which span more than one RX descriptor.
>>>
>>> Implementing (3) without (1) would not be possible as the network stack
>>> expects to use put_page() or page_pool_put_page() to free SKB fragments
>>> after an SKB is consumed.
>>>
>>> RX checksum offload support is adjusted to handle both linear and
>>> nonlinear (fragmented) packets.
>>>
>>> This patch gives the following improvements during testing with iperf3.
>>>
>>> * RZ/G2L:
>>> * TCP RX: same bandwidth at -43% CPU load (70% -> 40%)
>>> * UDP RX: same bandwidth at -17% CPU load (88% -> 74%)
>>>
>>> * RZ/G2UL:
>>> * TCP RX: +30% bandwidth (726Mbps -> 941Mbps)
>>> * UDP RX: +417% bandwidth (108Mbps -> 558Mbps)
>>>
>>> * RZ/G3S:
>>> * TCP RX: +64% bandwidth (562Mbps -> 920Mbps)
>>> * UDP RX: +420% bandwidth (90Mbps -> 468Mbps)
>>>
>>> * RZ/Five:
>>> * TCP RX: +217% bandwidth (145Mbps -> 459Mbps)
>>> * UDP RX: +470% bandwidth (20Mbps -> 114Mbps)
>>>
>>> There is no significant impact on bandwidth or CPU load in testing on
>>> RZ/G2H or R-Car M3N.
>>>
>>> Signed-off-by: Paul Barker <[email protected]>

[...]

>>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
[...]
>>> @@ -298,13 +269,14 @@ static void ravb_ring_free(struct net_device *ndev, int q)
>>> priv->tx_ring[q] = NULL;
>>> }
>>>
>>> - /* Free RX skb ringbuffer */
>>> - if (priv->rx_skb[q]) {
>>> - for (i = 0; i < priv->num_rx_ring[q]; i++)
>>> - dev_kfree_skb(priv->rx_skb[q][i]);
>>> + /* Free RX buffers */
>>> + for (i = 0; i < priv->num_rx_ring[q]; i++) {
>>> + if (priv->rx_buffers[q][i].page)
>>> + page_pool_put_page(priv->rx_pool[q], priv->rx_buffers[q][i].page, 0, true);
>>
>> nit: Networking still prefers code to be 80 columns wide or less.
>> It looks like that can be trivially achieved here.
>>
>> Flagged by checkpatch.pl --max-line-length=80
>
> Sergey has asked me to wrap to 100 cols [1]. I can only find a reference
> to 80 in the docs though [2], so I guess you may be right.
>
> [1]: https://lore.kernel.org/all/[email protected]/
> [2]: https://www.kernel.org/doc/html/latest/process/coding-style.html

Note that I (mostly) meant the comments...

[...]

MBR, Sergey

2024-06-05 17:19:03

by Sergey Shtylyov

[permalink] [raw]
Subject: Re: [net-next PATCH v4 7/7] net: ravb: Allocate RX buffers via page pool

On 6/3/24 3:15 PM, Paul Barker wrote:
[...]

>>>>> @@ -298,13 +269,14 @@ static void ravb_ring_free(struct net_device *ndev, int q)
>>>>> priv->tx_ring[q] = NULL;
>>>>> }
>>>>>
>>>>> - /* Free RX skb ringbuffer */
>>>>> - if (priv->rx_skb[q]) {
>>>>> - for (i = 0; i < priv->num_rx_ring[q]; i++)
>>>>> - dev_kfree_skb(priv->rx_skb[q][i]);
>>>>> + /* Free RX buffers */
>>>>> + for (i = 0; i < priv->num_rx_ring[q]; i++) {
>>>>> + if (priv->rx_buffers[q][i].page)
>>>>> + page_pool_put_page(priv->rx_pool[q], priv->rx_buffers[q][i].page, 0, true);
>>>>
>>>> nit: Networking still prefers code to be 80 columns wide or less.
>>>> It looks like that can be trivially achieved here.
>>>>
>>>> Flagged by checkpatch.pl --max-line-length=80
>>>
>>> Sergey has asked me to wrap to 100 cols [1]. I can only find a reference
>>> to 80 in the docs though [2], so I guess you may be right.
>>>
>>> [1]: https://lore.kernel.org/all/[email protected]/
>>> [2]: https://www.kernel.org/doc/html/latest/process/coding-style.html
>>
>> Hi Paul,
>>
>> If Sergey prefers 100 then I won't argue :)
>>
>> FWIIW, think what has happened here relates to the Kernel, at some point,
>> going from 80 to 100 columns as the preferred maximum width, while Networking
>> stuck with 80.
>
> I saw that netdevbpf patchwork is configured for 80 cols and it has
> warnings for v4 of this patch [1], so I've already re-wrapped the
> changes in this series to 80 cols (excluding a couple of lines where
> using slightly more than 80 cols significantly improves readability).
> I'm planning to send that in the next hour or so, assuming my tests
> pass.
>
> [1]: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

Sorry for misinforming you about 100 columns -- I had no idea netdev stuck
to 80! :-)

MBR, Sergey