2023-08-10 07:22:46

by Wei Fang

Subject: [PATCH V5 net-next 0/2] net: fec: add XDP_TX feature support

This patch set adds XDP_TX support to the FEC driver. The first patch
adds the initial XDP_TX support, and the second patch improves XDP_TX
performance by avoiding xdp_convert_buff_to_frame(). Please refer to
the commit message of each patch for more details.

Wei Fang (2):
net: fec: add XDP_TX feature support
net: fec: improve XDP_TX performance

drivers/net/ethernet/freescale/fec.h | 6 +-
drivers/net/ethernet/freescale/fec_main.c | 184 +++++++++++++++-------
2 files changed, 130 insertions(+), 60 deletions(-)

--
2.25.1



2023-08-10 07:44:12

by Wei Fang

Subject: [PATCH V5 net-next 1/2] net: fec: add XDP_TX feature support

The XDP_TX feature is not supported before, and all the frames
which are deemed to do XDP_TX action actually do the XDP_DROP
action. So this patch adds the XDP_TX support to FEC driver.

I tested the performance of XDP_TX in XDP_DRV mode and XDP_SKB
mode respectively on i.MX8MP-EVK platform, and as suggested by
Jesper, I also tested the performance of XDP_REDIRECT on the
same platform. And the test steps and results are as follows.

XDP_TX test:
Step 1: One board is used as generator and connects to switch,and
the FEC port of DUT also connects to the switch. Both boards with
flow control off. Then the generator runs the
pktgen_sample03_burst_single_flow.sh script to generate and send
burst traffic to DUT. Note that the size of packet was set to 64
bytes and the procotol of packet was UDP in my test scenario. In
addtion, the SMAC of the packet need to be different from the MAC
of the generator, because the xdp2 program will swap the DMAC and
SMAC of the packet and send it back to the generator. If the SMAC
of the generated packet is the MAC of the generator, the generator
will receive the returned traffic which increase the CPU loading
and significantly degrade the transmit speed of the generator, and
finally it affects the test of XDP_TX performance.

Step 2: The DUT runs the xdp2 program to transmit received UDP
packets back out on the same port where they were received.
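
For reference, the core of such an XDP_TX program looks roughly like
the sketch below. This is only an illustration of the swap-MACs-and-
bounce idea described above (the function and section names are made
up here), not the actual samples/bpf/xdp2 source:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_tx_bounce(struct xdp_md *ctx)
{
        void *data_end = (void *)(long)ctx->data_end;
        void *data = (void *)(long)ctx->data;
        struct ethhdr *eth = data;
        unsigned char tmp[ETH_ALEN];

        /* bounds check required by the verifier */
        if (data + sizeof(*eth) > data_end)
                return XDP_DROP;

        /* swap DMAC and SMAC, then bounce out the ingress port */
        __builtin_memcpy(tmp, eth->h_dest, ETH_ALEN);
        __builtin_memcpy(eth->h_dest, eth->h_source, ETH_ALEN);
        __builtin_memcpy(eth->h_source, tmp, ETH_ALEN);

        return XDP_TX;
}

char _license[] SEC("license") = "GPL";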

root@imx8mpevk:~# ./xdp2 eth0
proto 17: 353918 pkt/s
proto 17: 352923 pkt/s
proto 17: 353900 pkt/s
proto 17: 352672 pkt/s
proto 17: 353912 pkt/s
proto 17: 354219 pkt/s

root@imx8mpevk:~# ./xdp2 -S eth0
proto 17: 160604 pkt/s
proto 17: 160708 pkt/s
proto 17: 160564 pkt/s
proto 17: 160684 pkt/s
proto 17: 160640 pkt/s
proto 17: 160720 pkt/s

The above results show that the XDP_TX performance of XDP_DRV mode
is much better than XDP_SKB mode, more than twice that of XDP_SKB
mode, which is in line with our expectation.

XDP_REDIRECT test:
Step1: Both the generator and the FEC port of the DUT connet to the
switch port. All the ports with flow control off, then the generator
runs the pktgen script to generate and send burst traffic to DUT.
Note that the size of packet was set to 64 bytes and the procotol of
packet was UDP in my test scenario.

Step2: The DUT runs the xdp_redirect program to redirect the traffic
from the FEC port to the FEC port itself.

root@imx8mpevk:~# ./xdp_redirect eth0 eth0
Redirecting from eth0 (ifindex 2; driver fec) to eth0
(ifindex 2; driver fec)
Summary 232,302 rx/s 0 err,drop/s 232,344 xmit/s
Summary 234,579 rx/s 0 err,drop/s 234,577 xmit/s
Summary 235,548 rx/s 0 err,drop/s 235,549 xmit/s
Summary 234,704 rx/s 0 err,drop/s 234,703 xmit/s
Summary 235,504 rx/s 0 err,drop/s 235,504 xmit/s
Summary 235,223 rx/s 0 err,drop/s 235,224 xmit/s
Summary 234,509 rx/s 0 err,drop/s 234,507 xmit/s
Summary 235,481 rx/s 0 err,drop/s 235,482 xmit/s
Summary 234,684 rx/s 0 err,drop/s 234,683 xmit/s
Summary 235,520 rx/s 0 err,drop/s 235,520 xmit/s
Summary 235,461 rx/s 0 err,drop/s 235,461 xmit/s
Summary 234,627 rx/s 0 err,drop/s 234,627 xmit/s
Summary 235,611 rx/s 0 err,drop/s 235,611 xmit/s
Packets received : 3,053,753
Average packets/s : 234,904
Packets transmitted : 3,053,792
Average transmit/s : 234,907

Compared the performance of XDP_TX with XDP_REDIRECT, XDP_TX is also
much better than XDP_REDIRECT. It's also in line with our expectation.

Signed-off-by: Wei Fang <[email protected]>
Suggested-by: Jesper Dangaard Brouer <[email protected]>
Suggested-by: Jakub Kicinski <[email protected]>
Reviewed-by: Larysa Zaremba <[email protected]>
---
V2 changes:
According to Jakub's comments, the V2 patch adds two changes.
1. Call txq_trans_cond_update() in fec_enet_xdp_tx_xmit() to avoid
tx timeout as XDP shares the queues with kernel stack.
2. Tx processing shouldn't call any XDP (or page pool) APIs if the
"budget" is 0.

V3 changes:
1. Remove the second change in V2, because this change has been
separated into another patch and it has been submmitted to the
upstream [1].
[1] https://lore.kernel.org/r/[email protected]

V4 changes:
1. Based on Jakub's comments, add trace_xdp_exception() for the
error path of XDP_TX.

V5: changes:
1. Implement Jesper's "sync_dma_len" suggestion and simultaneously
use page_pool_put_page(pool, page, 0, true) for XDP_TX to avoid
the DMA sync on page recycle, which is suggested by Jakub.
2. Based on Jesper's suggestion, add a benchmark comparison between
XDP_TX and XDP_REDIRECT. So the commit message is also modified
synchronously.
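
To illustrate point 1, a condensed sketch of the corresponding hunks
in the diff below: on the XDP_TX xmit side the page already carries a
page pool DMA mapping, so only the region the CPU may have touched is
synced for the device:

        struct page *page = virt_to_page(frame->data);
        dma_addr_t dma_addr = page_pool_get_dma_addr(page) +
                              sizeof(*frame) + frame->headroom;

        /* page is already mapped by the page pool, no dma_map_single();
         * only sync the bytes the CPU may have written (dma_sync_len)
         */
        dma_sync_single_for_device(&fep->pdev->dev, dma_addr,
                                   dma_sync_len, DMA_BIDIRECTIONAL);
        txq->tx_buf[index].type = FEC_TXBUF_T_XDP_TX;

and on TX completion the page is recycled with dma_sync_size=0, since
the sync for the device was already done at xmit time:

        page_pool_put_page(page->pp, page, 0, true);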
---
drivers/net/ethernet/freescale/fec.h | 1 +
drivers/net/ethernet/freescale/fec_main.c | 104 +++++++++++++++++-----
2 files changed, 84 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
index 5ca9906d7c6a..7bb02a9da2a6 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -548,6 +548,7 @@ enum {
enum fec_txbuf_type {
FEC_TXBUF_T_SKB,
FEC_TXBUF_T_XDP_NDO,
+ FEC_TXBUF_T_XDP_TX,
};

struct fec_tx_buffer {
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 43f14cec91e9..30b01985be7c 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -68,6 +68,7 @@
#include <soc/imx/cpuidle.h>
#include <linux/filter.h>
#include <linux/bpf.h>
+#include <linux/bpf_trace.h>

#include <asm/cacheflush.h>

@@ -75,6 +76,9 @@

static void set_multicast_list(struct net_device *ndev);
static void fec_enet_itr_coal_set(struct net_device *ndev);
+static int fec_enet_xdp_tx_xmit(struct fec_enet_private *fep,
+ int cpu, struct xdp_buff *xdp,
+ u32 dma_sync_len);

#define DRIVER_NAME "fec"

@@ -960,7 +964,8 @@ static void fec_enet_bd_init(struct net_device *dev)
txq->tx_buf[i].skb = NULL;
}
} else {
- if (bdp->cbd_bufaddr)
+ if (bdp->cbd_bufaddr &&
+ txq->tx_buf[i].type == FEC_TXBUF_T_XDP_NDO)
dma_unmap_single(&fep->pdev->dev,
fec32_to_cpu(bdp->cbd_bufaddr),
fec16_to_cpu(bdp->cbd_datlen),
@@ -1423,13 +1428,14 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
break;

xdpf = txq->tx_buf[index].xdp;
- if (bdp->cbd_bufaddr)
+ if (bdp->cbd_bufaddr &&
+ txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO)
dma_unmap_single(&fep->pdev->dev,
fec32_to_cpu(bdp->cbd_bufaddr),
fec16_to_cpu(bdp->cbd_datlen),
DMA_TO_DEVICE);
bdp->cbd_bufaddr = cpu_to_fec32(0);
- if (!xdpf) {
+ if (unlikely(!xdpf)) {
txq->tx_buf[index].type = FEC_TXBUF_T_SKB;
goto tx_buf_done;
}
@@ -1482,7 +1488,13 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
/* Free the sk buffer associated with this last transmit */
dev_kfree_skb_any(skb);
} else {
- xdp_return_frame(xdpf);
+ if (txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO) {
+ xdp_return_frame_rx_napi(xdpf);
+ } else {
+ struct page *page = virt_to_head_page(xdpf->data);
+
+ page_pool_put_page(page->pp, page, 0, true);
+ }

txq->tx_buf[index].xdp = NULL;
/* restore default tx buffer type: FEC_TXBUF_T_SKB */
@@ -1541,7 +1553,7 @@ static void fec_enet_update_cbd(struct fec_enet_priv_rx_q *rxq,

static u32
fec_enet_run_xdp(struct fec_enet_private *fep, struct bpf_prog *prog,
- struct xdp_buff *xdp, struct fec_enet_priv_rx_q *rxq, int index)
+ struct xdp_buff *xdp, struct fec_enet_priv_rx_q *rxq, int cpu)
{
unsigned int sync, len = xdp->data_end - xdp->data;
u32 ret = FEC_ENET_XDP_PASS;
@@ -1551,8 +1563,10 @@ fec_enet_run_xdp(struct fec_enet_private *fep, struct bpf_prog *prog,

act = bpf_prog_run_xdp(prog, xdp);

- /* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
- sync = xdp->data_end - xdp->data_hard_start - FEC_ENET_XDP_HEADROOM;
+ /* Due xdp_adjust_tail and xdp_adjust_head: DMA sync for_device cover
+ * max len CPU touch
+ */
+ sync = xdp->data_end - xdp->data;
sync = max(sync, len);

switch (act) {
@@ -1573,11 +1587,19 @@ fec_enet_run_xdp(struct fec_enet_private *fep, struct bpf_prog *prog,
}
break;

- default:
- bpf_warn_invalid_xdp_action(fep->netdev, prog, act);
- fallthrough;
-
case XDP_TX:
+ err = fec_enet_xdp_tx_xmit(fep, cpu, xdp, sync);
+ if (unlikely(err)) {
+ ret = FEC_ENET_XDP_CONSUMED;
+ page = virt_to_head_page(xdp->data);
+ page_pool_put_page(rxq->page_pool, page, sync, true);
+ trace_xdp_exception(fep->netdev, prog, act);
+ } else {
+ ret = FEC_ENET_XDP_TX;
+ }
+ break;
+
+ default:
bpf_warn_invalid_xdp_action(fep->netdev, prog, act);
fallthrough;

@@ -1619,6 +1641,7 @@ fec_enet_rx_queue(struct net_device *ndev, int budget, u16 queue_id)
struct bpf_prog *xdp_prog = READ_ONCE(fep->xdp_prog);
u32 ret, xdp_result = FEC_ENET_XDP_PASS;
u32 data_start = FEC_ENET_XDP_HEADROOM;
+ int cpu = smp_processor_id();
struct xdp_buff xdp;
struct page *page;
u32 sub_len = 4;
@@ -1697,7 +1720,7 @@ fec_enet_rx_queue(struct net_device *ndev, int budget, u16 queue_id)
/* subtract 16bit shift and FCS */
xdp_prepare_buff(&xdp, page_address(page),
data_start, pkt_len - sub_len, false);
- ret = fec_enet_run_xdp(fep, xdp_prog, &xdp, rxq, index);
+ ret = fec_enet_run_xdp(fep, xdp_prog, &xdp, rxq, cpu);
xdp_result |= ret;
if (ret != FEC_ENET_XDP_PASS)
goto rx_processing_done;
@@ -3766,7 +3789,8 @@ fec_enet_xdp_get_tx_queue(struct fec_enet_private *fep, int index)

static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
struct fec_enet_priv_tx_q *txq,
- struct xdp_frame *frame)
+ struct xdp_frame *frame,
+ u32 dma_sync_len, bool ndo_xmit)
{
unsigned int index, status, estatus;
struct bufdesc *bdp;
@@ -3786,10 +3810,24 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,

index = fec_enet_get_bd_index(bdp, &txq->bd);

- dma_addr = dma_map_single(&fep->pdev->dev, frame->data,
- frame->len, DMA_TO_DEVICE);
- if (dma_mapping_error(&fep->pdev->dev, dma_addr))
- return -ENOMEM;
+ if (ndo_xmit) {
+ dma_addr = dma_map_single(&fep->pdev->dev, frame->data,
+ frame->len, DMA_TO_DEVICE);
+ if (dma_mapping_error(&fep->pdev->dev, dma_addr))
+ return -ENOMEM;
+
+ txq->tx_buf[index].type = FEC_TXBUF_T_XDP_NDO;
+ } else {
+ struct page *page = virt_to_page(frame->data);
+
+ dma_addr = page_pool_get_dma_addr(page) + sizeof(*frame) +
+ frame->headroom;
+ dma_sync_single_for_device(&fep->pdev->dev, dma_addr,
+ dma_sync_len, DMA_BIDIRECTIONAL);
+ txq->tx_buf[index].type = FEC_TXBUF_T_XDP_TX;
+ }
+
+ txq->tx_buf[index].xdp = frame;

status |= (BD_ENET_TX_INTR | BD_ENET_TX_LAST);
if (fep->bufdesc_ex)
@@ -3808,9 +3846,6 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
ebdp->cbd_esc = cpu_to_fec32(estatus);
}

- txq->tx_buf[index].type = FEC_TXBUF_T_XDP_NDO;
- txq->tx_buf[index].xdp = frame;
-
/* Make sure the updates to rest of the descriptor are performed before
* transferring ownership.
*/
@@ -3836,6 +3871,33 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
return 0;
}

+static int fec_enet_xdp_tx_xmit(struct fec_enet_private *fep,
+ int cpu, struct xdp_buff *xdp,
+ u32 dma_sync_len)
+{
+ struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);
+ struct fec_enet_priv_tx_q *txq;
+ struct netdev_queue *nq;
+ int queue, ret;
+
+ if (unlikely(!xdpf))
+ return -EFAULT;
+
+ queue = fec_enet_xdp_get_tx_queue(fep, cpu);
+ txq = fep->tx_queue[queue];
+ nq = netdev_get_tx_queue(fep->netdev, queue);
+
+ __netif_tx_lock(nq, cpu);
+
+ /* Avoid tx timeout as XDP shares the queue with kernel stack */
+ txq_trans_cond_update(nq);
+ ret = fec_enet_txq_xmit_frame(fep, txq, xdpf, dma_sync_len, false);
+
+ __netif_tx_unlock(nq);
+
+ return ret;
+}
+
static int fec_enet_xdp_xmit(struct net_device *dev,
int num_frames,
struct xdp_frame **frames,
@@ -3858,7 +3920,7 @@ static int fec_enet_xdp_xmit(struct net_device *dev,
/* Avoid tx timeout as XDP shares the queue with kernel stack */
txq_trans_cond_update(nq);
for (i = 0; i < num_frames; i++) {
- if (fec_enet_txq_xmit_frame(fep, txq, frames[i]) < 0)
+ if (fec_enet_txq_xmit_frame(fep, txq, frames[i], 0, true) < 0)
break;
sent_frames++;
}
--
2.25.1


2023-08-10 08:33:37

by Wei Fang

Subject: [PATCH V5 net-next 2/2] net: fec: improve XDP_TX performance

As suggested by Jesper and Alexander, we can avoid converting xdp_buff
to xdp_frame in case of XDP_TX to save a bunch of CPU cycles, so that
we can further improve the XDP_TX performance.
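
Concretely, with the xdp_buff kept as-is, the XDP_TX xmit path can
derive the page, the page pool DMA address and the frame length
directly from the xdp_buff, roughly as in the sketch below (see the
diff for the exact code):

        struct page *page = virt_to_page(xdp->data);
        dma_addr_t dma_addr = page_pool_get_dma_addr(page) +
                              (xdp->data - xdp->data_hard_start);
        u16 frame_len = xdp->data_end - xdp->data;

        dma_sync_single_for_device(&fep->pdev->dev, dma_addr,
                                   dma_sync_len, DMA_BIDIRECTIONAL);
        /* only the page needs to be remembered for TX completion */
        txq->tx_buf[index].buf_p = page;
        txq->tx_buf[index].type = FEC_TXBUF_T_XDP_TX;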

Before this patch on i.MX8MP-EVK board, the performance shows as follows.
root@imx8mpevk:~# ./xdp2 eth0
proto 17: 353918 pkt/s
proto 17: 352923 pkt/s
proto 17: 353900 pkt/s
proto 17: 352672 pkt/s
proto 17: 353912 pkt/s
proto 17: 354219 pkt/s

After applying this patch, the performance is improved.
root@imx8mpevk:~# ./xdp2 eth0
proto 17: 369261 pkt/s
proto 17: 369267 pkt/s
proto 17: 369206 pkt/s
proto 17: 369214 pkt/s
proto 17: 369126 pkt/s
proto 17: 369272 pkt/s

Signed-off-by: Wei Fang <[email protected]>
Suggested-by: Alexander Lobakin <[email protected]>
Suggested-by: Jesper Dangaard Brouer <[email protected]>
---
V5 changes:
New patch. Separated from the first patch, to keep track of the changes
and improvements (suggested by Jesper).
---
drivers/net/ethernet/freescale/fec.h | 5 +-
drivers/net/ethernet/freescale/fec_main.c | 134 ++++++++++++----------
2 files changed, 73 insertions(+), 66 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
index 7bb02a9da2a6..a8fbcada6b01 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -552,10 +552,7 @@ enum fec_txbuf_type {
};

struct fec_tx_buffer {
- union {
- struct sk_buff *skb;
- struct xdp_frame *xdp;
- };
+ void *buf_p;
enum fec_txbuf_type type;
};

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 30b01985be7c..ae6e41ad71b8 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -399,7 +399,7 @@ static void fec_dump(struct net_device *ndev)
fec16_to_cpu(bdp->cbd_sc),
fec32_to_cpu(bdp->cbd_bufaddr),
fec16_to_cpu(bdp->cbd_datlen),
- txq->tx_buf[index].skb);
+ txq->tx_buf[index].buf_p);
bdp = fec_enet_get_nextdesc(bdp, &txq->bd);
index++;
} while (bdp != txq->bd.base);
@@ -656,7 +656,7 @@ static int fec_enet_txq_submit_skb(struct fec_enet_priv_tx_q *txq,

index = fec_enet_get_bd_index(last_bdp, &txq->bd);
/* Save skb pointer */
- txq->tx_buf[index].skb = skb;
+ txq->tx_buf[index].buf_p = skb;

/* Make sure the updates to rest of the descriptor are performed before
* transferring ownership.
@@ -862,7 +862,7 @@ static int fec_enet_txq_submit_tso(struct fec_enet_priv_tx_q *txq,
}

/* Save skb pointer */
- txq->tx_buf[index].skb = skb;
+ txq->tx_buf[index].buf_p = skb;

skb_tx_timestamp(skb);
txq->bd.cur = bdp;
@@ -959,27 +959,27 @@ static void fec_enet_bd_init(struct net_device *dev)
fec32_to_cpu(bdp->cbd_bufaddr),
fec16_to_cpu(bdp->cbd_datlen),
DMA_TO_DEVICE);
- if (txq->tx_buf[i].skb) {
- dev_kfree_skb_any(txq->tx_buf[i].skb);
- txq->tx_buf[i].skb = NULL;
- }
- } else {
- if (bdp->cbd_bufaddr &&
- txq->tx_buf[i].type == FEC_TXBUF_T_XDP_NDO)
+ if (txq->tx_buf[i].buf_p)
+ dev_kfree_skb_any(txq->tx_buf[i].buf_p);
+ } else if (txq->tx_buf[i].type == FEC_TXBUF_T_XDP_NDO) {
+ if (bdp->cbd_bufaddr)
dma_unmap_single(&fep->pdev->dev,
fec32_to_cpu(bdp->cbd_bufaddr),
fec16_to_cpu(bdp->cbd_datlen),
DMA_TO_DEVICE);

- if (txq->tx_buf[i].xdp) {
- xdp_return_frame(txq->tx_buf[i].xdp);
- txq->tx_buf[i].xdp = NULL;
- }
+ if (txq->tx_buf[i].buf_p)
+ xdp_return_frame(txq->tx_buf[i].buf_p);
+ } else {
+ struct page *page = txq->tx_buf[i].buf_p;

- /* restore default tx buffer type: FEC_TXBUF_T_SKB */
- txq->tx_buf[i].type = FEC_TXBUF_T_SKB;
+ if (page)
+ page_pool_put_page(page->pp, page, 0, false);
}

+ txq->tx_buf[i].buf_p = NULL;
+ /* restore default tx buffer type: FEC_TXBUF_T_SKB */
+ txq->tx_buf[i].type = FEC_TXBUF_T_SKB;
bdp->cbd_bufaddr = cpu_to_fec32(0);
bdp = fec_enet_get_nextdesc(bdp, &txq->bd);
}
@@ -1386,6 +1386,8 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
struct netdev_queue *nq;
int index = 0;
int entries_free;
+ struct page *page;
+ int frame_len;

fep = netdev_priv(ndev);

@@ -1407,8 +1409,7 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
index = fec_enet_get_bd_index(bdp, &txq->bd);

if (txq->tx_buf[index].type == FEC_TXBUF_T_SKB) {
- skb = txq->tx_buf[index].skb;
- txq->tx_buf[index].skb = NULL;
+ skb = txq->tx_buf[index].buf_p;
if (bdp->cbd_bufaddr &&
!IS_TSO_HEADER(txq, fec32_to_cpu(bdp->cbd_bufaddr)))
dma_unmap_single(&fep->pdev->dev,
@@ -1427,18 +1428,24 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
if (unlikely(!budget))
break;

- xdpf = txq->tx_buf[index].xdp;
- if (bdp->cbd_bufaddr &&
- txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO)
- dma_unmap_single(&fep->pdev->dev,
- fec32_to_cpu(bdp->cbd_bufaddr),
- fec16_to_cpu(bdp->cbd_datlen),
- DMA_TO_DEVICE);
+ if (txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO) {
+ xdpf = txq->tx_buf[index].buf_p;
+ if (bdp->cbd_bufaddr)
+ dma_unmap_single(&fep->pdev->dev,
+ fec32_to_cpu(bdp->cbd_bufaddr),
+ fec16_to_cpu(bdp->cbd_datlen),
+ DMA_TO_DEVICE);
+ } else {
+ page = txq->tx_buf[index].buf_p;
+ }
+
bdp->cbd_bufaddr = cpu_to_fec32(0);
- if (unlikely(!xdpf)) {
+ if (unlikely(!txq->tx_buf[index].buf_p)) {
txq->tx_buf[index].type = FEC_TXBUF_T_SKB;
goto tx_buf_done;
}
+
+ frame_len = fec16_to_cpu(bdp->cbd_datlen);
}

/* Check for errors. */
@@ -1462,7 +1469,7 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
if (txq->tx_buf[index].type == FEC_TXBUF_T_SKB)
ndev->stats.tx_bytes += skb->len;
else
- ndev->stats.tx_bytes += xdpf->len;
+ ndev->stats.tx_bytes += frame_len;
}

/* Deferred means some collisions occurred during transmit,
@@ -1487,20 +1494,16 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)

/* Free the sk buffer associated with this last transmit */
dev_kfree_skb_any(skb);
+ } else if (txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO) {
+ xdp_return_frame_rx_napi(xdpf);
} else {
- if (txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO) {
- xdp_return_frame_rx_napi(xdpf);
- } else {
- struct page *page = virt_to_head_page(xdpf->data);
-
- page_pool_put_page(page->pp, page, 0, true);
- }
-
- txq->tx_buf[index].xdp = NULL;
- /* restore default tx buffer type: FEC_TXBUF_T_SKB */
- txq->tx_buf[index].type = FEC_TXBUF_T_SKB;
+ page_pool_put_page(page->pp, page, 0, true);
}

+ txq->tx_buf[index].buf_p = NULL;
+ /* restore default tx buffer type: FEC_TXBUF_T_SKB */
+ txq->tx_buf[index].type = FEC_TXBUF_T_SKB;
+
tx_buf_done:
/* Make sure the update to bdp and tx_buf are performed
* before dirty_tx
@@ -3230,7 +3233,6 @@ static void fec_enet_free_buffers(struct net_device *ndev)
{
struct fec_enet_private *fep = netdev_priv(ndev);
unsigned int i;
- struct sk_buff *skb;
struct fec_enet_priv_tx_q *txq;
struct fec_enet_priv_rx_q *rxq;
unsigned int q;
@@ -3255,18 +3257,23 @@ static void fec_enet_free_buffers(struct net_device *ndev)
kfree(txq->tx_bounce[i]);
txq->tx_bounce[i] = NULL;

+ if (!txq->tx_buf[i].buf_p) {
+ txq->tx_buf[i].type = FEC_TXBUF_T_SKB;
+ continue;
+ }
+
if (txq->tx_buf[i].type == FEC_TXBUF_T_SKB) {
- skb = txq->tx_buf[i].skb;
- txq->tx_buf[i].skb = NULL;
- dev_kfree_skb(skb);
+ dev_kfree_skb(txq->tx_buf[i].buf_p);
+ } else if (txq->tx_buf[i].type == FEC_TXBUF_T_XDP_NDO) {
+ xdp_return_frame(txq->tx_buf[i].buf_p);
} else {
- if (txq->tx_buf[i].xdp) {
- xdp_return_frame(txq->tx_buf[i].xdp);
- txq->tx_buf[i].xdp = NULL;
- }
+ struct page *page = txq->tx_buf[i].buf_p;

- txq->tx_buf[i].type = FEC_TXBUF_T_SKB;
+ page_pool_put_page(page->pp, page, 0, false);
}
+
+ txq->tx_buf[i].buf_p = NULL;
+ txq->tx_buf[i].type = FEC_TXBUF_T_SKB;
}
}
}
@@ -3789,13 +3796,14 @@ fec_enet_xdp_get_tx_queue(struct fec_enet_private *fep, int index)

static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
struct fec_enet_priv_tx_q *txq,
- struct xdp_frame *frame,
- u32 dma_sync_len, bool ndo_xmit)
+ void *frame, u32 dma_sync_len,
+ bool ndo_xmit)
{
unsigned int index, status, estatus;
struct bufdesc *bdp;
dma_addr_t dma_addr;
int entries_free;
+ u16 frame_len;

entries_free = fec_enet_get_free_txdesc_num(txq);
if (entries_free < MAX_SKB_FRAGS + 1) {
@@ -3811,30 +3819,36 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
index = fec_enet_get_bd_index(bdp, &txq->bd);

if (ndo_xmit) {
- dma_addr = dma_map_single(&fep->pdev->dev, frame->data,
- frame->len, DMA_TO_DEVICE);
+ struct xdp_frame *xdpf = frame;
+
+ dma_addr = dma_map_single(&fep->pdev->dev, xdpf->data,
+ xdpf->len, DMA_TO_DEVICE);
if (dma_mapping_error(&fep->pdev->dev, dma_addr))
return -ENOMEM;

+ frame_len = xdpf->len;
+ txq->tx_buf[index].buf_p = xdpf;
txq->tx_buf[index].type = FEC_TXBUF_T_XDP_NDO;
} else {
- struct page *page = virt_to_page(frame->data);
+ struct xdp_buff *xdpb = frame;
+ struct page *page;

- dma_addr = page_pool_get_dma_addr(page) + sizeof(*frame) +
- frame->headroom;
+ page = virt_to_page(xdpb->data);
+ dma_addr = page_pool_get_dma_addr(page) +
+ (xdpb->data - xdpb->data_hard_start);
dma_sync_single_for_device(&fep->pdev->dev, dma_addr,
dma_sync_len, DMA_BIDIRECTIONAL);
+ frame_len = xdpb->data_end - xdpb->data;
+ txq->tx_buf[index].buf_p = page;
txq->tx_buf[index].type = FEC_TXBUF_T_XDP_TX;
}

- txq->tx_buf[index].xdp = frame;
-
status |= (BD_ENET_TX_INTR | BD_ENET_TX_LAST);
if (fep->bufdesc_ex)
estatus = BD_ENET_TX_INT;

bdp->cbd_bufaddr = cpu_to_fec32(dma_addr);
- bdp->cbd_datlen = cpu_to_fec16(frame->len);
+ bdp->cbd_datlen = cpu_to_fec16(frame_len);

if (fep->bufdesc_ex) {
struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
@@ -3875,14 +3889,10 @@ static int fec_enet_xdp_tx_xmit(struct fec_enet_private *fep,
int cpu, struct xdp_buff *xdp,
u32 dma_sync_len)
{
- struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);
struct fec_enet_priv_tx_q *txq;
struct netdev_queue *nq;
int queue, ret;

- if (unlikely(!xdpf))
- return -EFAULT;
-
queue = fec_enet_xdp_get_tx_queue(fep, cpu);
txq = fep->tx_queue[queue];
nq = netdev_get_tx_queue(fep->netdev, queue);
@@ -3891,7 +3901,7 @@ static int fec_enet_xdp_tx_xmit(struct fec_enet_private *fep,

/* Avoid tx timeout as XDP shares the queue with kernel stack */
txq_trans_cond_update(nq);
- ret = fec_enet_txq_xmit_frame(fep, txq, xdpf, dma_sync_len, false);
+ ret = fec_enet_txq_xmit_frame(fep, txq, xdp, dma_sync_len, false);

__netif_tx_unlock(nq);

--
2.25.1


2023-08-10 14:26:13

by Jesper Dangaard Brouer

Subject: Re: [PATCH V5 net-next 1/2] net: fec: add XDP_TX feature support



On 10/08/2023 08.45, Wei Fang wrote:
> The XDP_TX feature is not supported before, and all the frames
> which are deemed to do XDP_TX action actually do the XDP_DROP
> action. So this patch adds the XDP_TX support to FEC driver.
>
> I tested the performance of XDP_TX in XDP_DRV mode and XDP_SKB
> mode respectively on i.MX8MP-EVK platform, and as suggested by
> Jesper, I also tested the performance of XDP_REDIRECT on the
> same platform. And the test steps and results are as follows.
>
> XDP_TX test:
> Step 1: One board is used as generator and connects to switch,and
> the FEC port of DUT also connects to the switch. Both boards with
> flow control off. Then the generator runs the
> pktgen_sample03_burst_single_flow.sh script to generate and send
> burst traffic to DUT. Note that the size of packet was set to 64
> bytes and the procotol of packet was UDP in my test scenario. In
> addtion, the SMAC of the packet need to be different from the MAC
> of the generator, because the xdp2 program will swap the DMAC and
> SMAC of the packet and send it back to the generator. If the SMAC
> of the generated packet is the MAC of the generator, the generator
> will receive the returned traffic which increase the CPU loading
> and significantly degrade the transmit speed of the generator, and
> finally it affects the test of XDP_TX performance.
>

Great that you included this description.

> Step 2: The DUT runs the xdp2 program to transmit received UDP
> packets back out on the same port where they were received.
>
> root@imx8mpevk:~# ./xdp2 eth0
> proto 17: 353918 pkt/s
> proto 17: 352923 pkt/s
> proto 17: 353900 pkt/s
> proto 17: 352672 pkt/s
> proto 17: 353912 pkt/s
> proto 17: 354219 pkt/s
>

Nice performance results! :-)

> root@imx8mpevk:~# ./xdp2 -S eth0
> proto 17: 160604 pkt/s
> proto 17: 160708 pkt/s
> proto 17: 160564 pkt/s
> proto 17: 160684 pkt/s
> proto 17: 160640 pkt/s
> proto 17: 160720 pkt/s
>
> The above results show that the XDP_TX performance of XDP_DRV mode
> is much better than XDP_SKB mode, more than twice that of XDP_SKB
> mode, which is in line with our expectation.
>
> XDP_REDIRECT test:
> Step1: Both the generator and the FEC port of the DUT connet to the
> switch port. All the ports with flow control off, then the generator
> runs the pktgen script to generate and send burst traffic to DUT.
> Note that the size of packet was set to 64 bytes and the procotol of
> packet was UDP in my test scenario.
>
> Step2: The DUT runs the xdp_redirect program to redirect the traffic
> from the FEC port to the FEC port itself.
>
> root@imx8mpevk:~# ./xdp_redirect eth0 eth0
> Redirecting from eth0 (ifindex 2; driver fec) to eth0
> (ifindex 2; driver fec)
> Summary 232,302 rx/s 0 err,drop/s 232,344 xmit/s
> Summary 234,579 rx/s 0 err,drop/s 234,577 xmit/s
> Summary 235,548 rx/s 0 err,drop/s 235,549 xmit/s
> Summary 234,704 rx/s 0 err,drop/s 234,703 xmit/s
> Summary 235,504 rx/s 0 err,drop/s 235,504 xmit/s
> Summary 235,223 rx/s 0 err,drop/s 235,224 xmit/s
> Summary 234,509 rx/s 0 err,drop/s 234,507 xmit/s
> Summary 235,481 rx/s 0 err,drop/s 235,482 xmit/s
> Summary 234,684 rx/s 0 err,drop/s 234,683 xmit/s
> Summary 235,520 rx/s 0 err,drop/s 235,520 xmit/s
> Summary 235,461 rx/s 0 err,drop/s 235,461 xmit/s
> Summary 234,627 rx/s 0 err,drop/s 234,627 xmit/s
> Summary 235,611 rx/s 0 err,drop/s 235,611 xmit/s
> Packets received : 3,053,753
> Average packets/s : 234,904
> Packets transmitted : 3,053,792
> Average transmit/s : 234,907
>
> Compared the performance of XDP_TX with XDP_REDIRECT, XDP_TX is also
> much better than XDP_REDIRECT. It's also in line with our expectation.
>
> Signed-off-by: Wei Fang <[email protected]>
> Suggested-by: Jesper Dangaard Brouer <[email protected]>
> Suggested-by: Jakub Kicinski <[email protected]>
> Reviewed-by: Larysa Zaremba <[email protected]>
> ---

If you add below code comment you can add my ACK in V6:

Acked-by: Jesper Dangaard Brouer <[email protected]>


> V2 changes:
> According to Jakub's comments, the V2 patch adds two changes.
> 1. Call txq_trans_cond_update() in fec_enet_xdp_tx_xmit() to avoid
> tx timeout as XDP shares the queues with kernel stack.
> 2. Tx processing shouldn't call any XDP (or page pool) APIs if the
> "budget" is 0.
>
> V3 changes:
> 1. Remove the second change in V2, because this change has been
> separated into another patch and it has been submmitted to the
> upstream [1].
> [1] https://lore.kernel.org/r/[email protected]
>
> V4 changes:
> 1. Based on Jakub's comments, add trace_xdp_exception() for the
> error path of XDP_TX.
>
> V5: changes:
> 1. Implement Jesper's "sync_dma_len" suggestion and simultaneously
> use page_pool_put_page(pool, page, 0, true) for XDP_TX to avoid
> the DMA sync on page recycle, which is suggested by Jakub.
> 2. Based on Jesper's suggestion, add a benchmark comparison between
> XDP_TX and XDP_REDIRECT. So the commit message is also modified
> synchronously.
> ---
> drivers/net/ethernet/freescale/fec.h | 1 +
> drivers/net/ethernet/freescale/fec_main.c | 104 +++++++++++++++++-----
> 2 files changed, 84 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
> index 5ca9906d7c6a..7bb02a9da2a6 100644
> --- a/drivers/net/ethernet/freescale/fec.h
> +++ b/drivers/net/ethernet/freescale/fec.h
> @@ -548,6 +548,7 @@ enum {
> enum fec_txbuf_type {
> FEC_TXBUF_T_SKB,
> FEC_TXBUF_T_XDP_NDO,
> + FEC_TXBUF_T_XDP_TX,
> };
>
> struct fec_tx_buffer {
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index 43f14cec91e9..30b01985be7c 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -68,6 +68,7 @@
> #include <soc/imx/cpuidle.h>
> #include <linux/filter.h>
> #include <linux/bpf.h>
> +#include <linux/bpf_trace.h>
>
> #include <asm/cacheflush.h>
>
> @@ -75,6 +76,9 @@
>
> static void set_multicast_list(struct net_device *ndev);
> static void fec_enet_itr_coal_set(struct net_device *ndev);
> +static int fec_enet_xdp_tx_xmit(struct fec_enet_private *fep,
> + int cpu, struct xdp_buff *xdp,
> + u32 dma_sync_len);
>
> #define DRIVER_NAME "fec"
>
> @@ -960,7 +964,8 @@ static void fec_enet_bd_init(struct net_device *dev)
> txq->tx_buf[i].skb = NULL;
> }
> } else {
> - if (bdp->cbd_bufaddr)
> + if (bdp->cbd_bufaddr &&
> + txq->tx_buf[i].type == FEC_TXBUF_T_XDP_NDO)
> dma_unmap_single(&fep->pdev->dev,
> fec32_to_cpu(bdp->cbd_bufaddr),
> fec16_to_cpu(bdp->cbd_datlen),
> @@ -1423,13 +1428,14 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
> break;
>
> xdpf = txq->tx_buf[index].xdp;
> - if (bdp->cbd_bufaddr)
> + if (bdp->cbd_bufaddr &&
> + txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO)
> dma_unmap_single(&fep->pdev->dev,
> fec32_to_cpu(bdp->cbd_bufaddr),
> fec16_to_cpu(bdp->cbd_datlen),
> DMA_TO_DEVICE);
> bdp->cbd_bufaddr = cpu_to_fec32(0);
> - if (!xdpf) {
> + if (unlikely(!xdpf)) {
> txq->tx_buf[index].type = FEC_TXBUF_T_SKB;
> goto tx_buf_done;
> }
> @@ -1482,7 +1488,13 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
> /* Free the sk buffer associated with this last transmit */
> dev_kfree_skb_any(skb);
> } else {
> - xdp_return_frame(xdpf);
> + if (txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO) {
> + xdp_return_frame_rx_napi(xdpf);
> + } else {
> + struct page *page = virt_to_head_page(xdpf->data);
> +

I think this usage of page_pool_put_page() with dma_sync_size=0 requires
a comment, else we will forget why this okay...
I suggest:

/* PP dma_sync_size=0 as xmit already synced DMA for_device */

> + page_pool_put_page(page->pp, page, 0, true);
> + }
>
> txq->tx_buf[index].xdp = NULL;
> /* restore default tx buffer type: FEC_TXBUF_T_SKB */
> @@ -1541,7 +1553,7 @@ static void fec_enet_update_cbd(struct fec_enet_priv_rx_q *rxq,
>
> static u32
> fec_enet_run_xdp(struct fec_enet_private *fep, struct bpf_prog *prog,
> - struct xdp_buff *xdp, struct fec_enet_priv_rx_q *rxq, int index)
> + struct xdp_buff *xdp, struct fec_enet_priv_rx_q *rxq, int cpu)
> {
> unsigned int sync, len = xdp->data_end - xdp->data;
> u32 ret = FEC_ENET_XDP_PASS;
> @@ -1551,8 +1563,10 @@ fec_enet_run_xdp(struct fec_enet_private *fep, struct bpf_prog *prog,
>
> act = bpf_prog_run_xdp(prog, xdp);
>
> - /* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
> - sync = xdp->data_end - xdp->data_hard_start - FEC_ENET_XDP_HEADROOM;
> + /* Due xdp_adjust_tail and xdp_adjust_head: DMA sync for_device cover
> + * max len CPU touch
> + */
> + sync = xdp->data_end - xdp->data;
> sync = max(sync, len);
>
> switch (act) {
> @@ -1573,11 +1587,19 @@ fec_enet_run_xdp(struct fec_enet_private *fep, struct bpf_prog *prog,
> }
> break;
>
> - default:
> - bpf_warn_invalid_xdp_action(fep->netdev, prog, act);
> - fallthrough;
> -
> case XDP_TX:
> + err = fec_enet_xdp_tx_xmit(fep, cpu, xdp, sync);
> + if (unlikely(err)) {
> + ret = FEC_ENET_XDP_CONSUMED;
> + page = virt_to_head_page(xdp->data);
> + page_pool_put_page(rxq->page_pool, page, sync, true);
> + trace_xdp_exception(fep->netdev, prog, act);
> + } else {
> + ret = FEC_ENET_XDP_TX;
> + }
> + break;
> +
> + default:
> bpf_warn_invalid_xdp_action(fep->netdev, prog, act);
> fallthrough;
>
> @@ -1619,6 +1641,7 @@ fec_enet_rx_queue(struct net_device *ndev, int budget, u16 queue_id)
> struct bpf_prog *xdp_prog = READ_ONCE(fep->xdp_prog);
> u32 ret, xdp_result = FEC_ENET_XDP_PASS;
> u32 data_start = FEC_ENET_XDP_HEADROOM;
> + int cpu = smp_processor_id();
> struct xdp_buff xdp;
> struct page *page;
> u32 sub_len = 4;
> @@ -1697,7 +1720,7 @@ fec_enet_rx_queue(struct net_device *ndev, int budget, u16 queue_id)
> /* subtract 16bit shift and FCS */
> xdp_prepare_buff(&xdp, page_address(page),
> data_start, pkt_len - sub_len, false);
> - ret = fec_enet_run_xdp(fep, xdp_prog, &xdp, rxq, index);
> + ret = fec_enet_run_xdp(fep, xdp_prog, &xdp, rxq, cpu);
> xdp_result |= ret;
> if (ret != FEC_ENET_XDP_PASS)
> goto rx_processing_done;
> @@ -3766,7 +3789,8 @@ fec_enet_xdp_get_tx_queue(struct fec_enet_private *fep, int index)
>
> static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
> struct fec_enet_priv_tx_q *txq,
> - struct xdp_frame *frame)
> + struct xdp_frame *frame,
> + u32 dma_sync_len, bool ndo_xmit)
> {
> unsigned int index, status, estatus;
> struct bufdesc *bdp;
> @@ -3786,10 +3810,24 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
>
> index = fec_enet_get_bd_index(bdp, &txq->bd);
>
> - dma_addr = dma_map_single(&fep->pdev->dev, frame->data,
> - frame->len, DMA_TO_DEVICE);
> - if (dma_mapping_error(&fep->pdev->dev, dma_addr))
> - return -ENOMEM;
> + if (ndo_xmit) {
> + dma_addr = dma_map_single(&fep->pdev->dev, frame->data,
> + frame->len, DMA_TO_DEVICE);
> + if (dma_mapping_error(&fep->pdev->dev, dma_addr))
> + return -ENOMEM;
> +
> + txq->tx_buf[index].type = FEC_TXBUF_T_XDP_NDO;
> + } else {
> + struct page *page = virt_to_page(frame->data);
> +
> + dma_addr = page_pool_get_dma_addr(page) + sizeof(*frame) +
> + frame->headroom;
> + dma_sync_single_for_device(&fep->pdev->dev, dma_addr,
> + dma_sync_len, DMA_BIDIRECTIONAL);

The DMA sync for_device is done here, which is what allows recycling
into the page pool (PP) with dma_sync_size=0.

> + txq->tx_buf[index].type = FEC_TXBUF_T_XDP_TX;
> + }
> +
> + txq->tx_buf[index].xdp = frame;
>
> status |= (BD_ENET_TX_INTR | BD_ENET_TX_LAST);
> if (fep->bufdesc_ex)
> @@ -3808,9 +3846,6 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
> ebdp->cbd_esc = cpu_to_fec32(estatus);
> }
>
> - txq->tx_buf[index].type = FEC_TXBUF_T_XDP_NDO;
> - txq->tx_buf[index].xdp = frame;
> -
> /* Make sure the updates to rest of the descriptor are performed before
> * transferring ownership.
> */
> @@ -3836,6 +3871,33 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
> return 0;
> }
>
> +static int fec_enet_xdp_tx_xmit(struct fec_enet_private *fep,
> + int cpu, struct xdp_buff *xdp,
> + u32 dma_sync_len)
> +{
> + struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);
> + struct fec_enet_priv_tx_q *txq;
> + struct netdev_queue *nq;
> + int queue, ret;
> +
> + if (unlikely(!xdpf))
> + return -EFAULT;
> +
> + queue = fec_enet_xdp_get_tx_queue(fep, cpu);
> + txq = fep->tx_queue[queue];
> + nq = netdev_get_tx_queue(fep->netdev, queue);
> +
> + __netif_tx_lock(nq, cpu);
> +
> + /* Avoid tx timeout as XDP shares the queue with kernel stack */
> + txq_trans_cond_update(nq);
> + ret = fec_enet_txq_xmit_frame(fep, txq, xdpf, dma_sync_len, false);
> +
> + __netif_tx_unlock(nq);
> +
> + return ret;
> +}
> +
> static int fec_enet_xdp_xmit(struct net_device *dev,
> int num_frames,
> struct xdp_frame **frames,
> @@ -3858,7 +3920,7 @@ static int fec_enet_xdp_xmit(struct net_device *dev,
> /* Avoid tx timeout as XDP shares the queue with kernel stack */
> txq_trans_cond_update(nq);
> for (i = 0; i < num_frames; i++) {
> - if (fec_enet_txq_xmit_frame(fep, txq, frames[i]) < 0)
> + if (fec_enet_txq_xmit_frame(fep, txq, frames[i], 0, true) < 0)
> break;
> sent_frames++;
> }

2023-08-10 14:49:18

by Jesper Dangaard Brouer

Subject: Re: [PATCH V5 net-next 2/2] net: fec: improve XDP_TX performance



On 10/08/2023 08.45, Wei Fang wrote:
> As suggested by Jesper and Alexander, we can avoid converting xdp_buff
> to xdp_frame in case of XDP_TX to save a bunch of CPU cycles, so that
> we can further improve the XDP_TX performance.
>
> Before this patch on i.MX8MP-EVK board, the performance shows as follows.
> root@imx8mpevk:~# ./xdp2 eth0
> proto 17: 353918 pkt/s
> proto 17: 352923 pkt/s
> proto 17: 353900 pkt/s
> proto 17: 352672 pkt/s
> proto 17: 353912 pkt/s
> proto 17: 354219 pkt/s
>
> After applying this patch, the performance is improved.
> root@imx8mpevk:~# ./xdp2 eth0
> proto 17: 369261 pkt/s
> proto 17: 369267 pkt/s
> proto 17: 369206 pkt/s
> proto 17: 369214 pkt/s
> proto 17: 369126 pkt/s
> proto 17: 369272 pkt/s
>

So approx 4.3% improvement (from ~354k to ~369k pkt/s).

> Signed-off-by: Wei Fang <[email protected]>
> Suggested-by: Alexander Lobakin <[email protected]>
> Suggested-by: Jesper Dangaard Brouer <[email protected]>
> ---
> V5 changes:
> New patch. Separated from the first patch, to keep track of the changes
> and improvements (suggested by Jesper).

Thanks

> ---
> drivers/net/ethernet/freescale/fec.h | 5 +-
> drivers/net/ethernet/freescale/fec_main.c | 134 ++++++++++++----------
> 2 files changed, 73 insertions(+), 66 deletions(-)
>
> diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
> index 7bb02a9da2a6..a8fbcada6b01 100644
> --- a/drivers/net/ethernet/freescale/fec.h
> +++ b/drivers/net/ethernet/freescale/fec.h
> @@ -552,10 +552,7 @@ enum fec_txbuf_type {
> };
>
> struct fec_tx_buffer {
> - union {
> - struct sk_buff *skb;
> - struct xdp_frame *xdp;
> - };
> + void *buf_p;

I want to hear Olek's (Alexander) opinion on this void pointer circus.

The rest of the patch looks correct to me so:

Reviewed-by: Jesper Dangaard Brouer <[email protected]>

--Jesper

> enum fec_txbuf_type type;
> };
>
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index 30b01985be7c..ae6e41ad71b8 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -399,7 +399,7 @@ static void fec_dump(struct net_device *ndev)
> fec16_to_cpu(bdp->cbd_sc),
> fec32_to_cpu(bdp->cbd_bufaddr),
> fec16_to_cpu(bdp->cbd_datlen),
> - txq->tx_buf[index].skb);
> + txq->tx_buf[index].buf_p);
> bdp = fec_enet_get_nextdesc(bdp, &txq->bd);
> index++;
> } while (bdp != txq->bd.base);
> @@ -656,7 +656,7 @@ static int fec_enet_txq_submit_skb(struct fec_enet_priv_tx_q *txq,
>
> index = fec_enet_get_bd_index(last_bdp, &txq->bd);
> /* Save skb pointer */
> - txq->tx_buf[index].skb = skb;
> + txq->tx_buf[index].buf_p = skb;
>
> /* Make sure the updates to rest of the descriptor are performed before
> * transferring ownership.
> @@ -862,7 +862,7 @@ static int fec_enet_txq_submit_tso(struct fec_enet_priv_tx_q *txq,
> }
>
> /* Save skb pointer */
> - txq->tx_buf[index].skb = skb;
> + txq->tx_buf[index].buf_p = skb;
>
> skb_tx_timestamp(skb);
> txq->bd.cur = bdp;
> @@ -959,27 +959,27 @@ static void fec_enet_bd_init(struct net_device *dev)
> fec32_to_cpu(bdp->cbd_bufaddr),
> fec16_to_cpu(bdp->cbd_datlen),
> DMA_TO_DEVICE);
> - if (txq->tx_buf[i].skb) {
> - dev_kfree_skb_any(txq->tx_buf[i].skb);
> - txq->tx_buf[i].skb = NULL;
> - }
> - } else {
> - if (bdp->cbd_bufaddr &&
> - txq->tx_buf[i].type == FEC_TXBUF_T_XDP_NDO)
> + if (txq->tx_buf[i].buf_p)
> + dev_kfree_skb_any(txq->tx_buf[i].buf_p);
> + } else if (txq->tx_buf[i].type == FEC_TXBUF_T_XDP_NDO) {
> + if (bdp->cbd_bufaddr)
> dma_unmap_single(&fep->pdev->dev,
> fec32_to_cpu(bdp->cbd_bufaddr),
> fec16_to_cpu(bdp->cbd_datlen),
> DMA_TO_DEVICE);
>
> - if (txq->tx_buf[i].xdp) {
> - xdp_return_frame(txq->tx_buf[i].xdp);
> - txq->tx_buf[i].xdp = NULL;
> - }
> + if (txq->tx_buf[i].buf_p)
> + xdp_return_frame(txq->tx_buf[i].buf_p);
> + } else {
> + struct page *page = txq->tx_buf[i].buf_p;
>
> - /* restore default tx buffer type: FEC_TXBUF_T_SKB */
> - txq->tx_buf[i].type = FEC_TXBUF_T_SKB;
> + if (page)
> + page_pool_put_page(page->pp, page, 0, false);
> }
>
> + txq->tx_buf[i].buf_p = NULL;
> + /* restore default tx buffer type: FEC_TXBUF_T_SKB */
> + txq->tx_buf[i].type = FEC_TXBUF_T_SKB;
> bdp->cbd_bufaddr = cpu_to_fec32(0);
> bdp = fec_enet_get_nextdesc(bdp, &txq->bd);
> }
> @@ -1386,6 +1386,8 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
> struct netdev_queue *nq;
> int index = 0;
> int entries_free;
> + struct page *page;
> + int frame_len;
>
> fep = netdev_priv(ndev);
>
> @@ -1407,8 +1409,7 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
> index = fec_enet_get_bd_index(bdp, &txq->bd);
>
> if (txq->tx_buf[index].type == FEC_TXBUF_T_SKB) {
> - skb = txq->tx_buf[index].skb;
> - txq->tx_buf[index].skb = NULL;
> + skb = txq->tx_buf[index].buf_p;
> if (bdp->cbd_bufaddr &&
> !IS_TSO_HEADER(txq, fec32_to_cpu(bdp->cbd_bufaddr)))
> dma_unmap_single(&fep->pdev->dev,
> @@ -1427,18 +1428,24 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
> if (unlikely(!budget))
> break;
>
> - xdpf = txq->tx_buf[index].xdp;
> - if (bdp->cbd_bufaddr &&
> - txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO)
> - dma_unmap_single(&fep->pdev->dev,
> - fec32_to_cpu(bdp->cbd_bufaddr),
> - fec16_to_cpu(bdp->cbd_datlen),
> - DMA_TO_DEVICE);
> + if (txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO) {
> + xdpf = txq->tx_buf[index].buf_p;
> + if (bdp->cbd_bufaddr)
> + dma_unmap_single(&fep->pdev->dev,
> + fec32_to_cpu(bdp->cbd_bufaddr),
> + fec16_to_cpu(bdp->cbd_datlen),
> + DMA_TO_DEVICE);
> + } else {
> + page = txq->tx_buf[index].buf_p;
> + }
> +
> bdp->cbd_bufaddr = cpu_to_fec32(0);
> - if (unlikely(!xdpf)) {
> + if (unlikely(!txq->tx_buf[index].buf_p)) {
> txq->tx_buf[index].type = FEC_TXBUF_T_SKB;
> goto tx_buf_done;
> }
> +
> + frame_len = fec16_to_cpu(bdp->cbd_datlen);
> }
>
> /* Check for errors. */
> @@ -1462,7 +1469,7 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
> if (txq->tx_buf[index].type == FEC_TXBUF_T_SKB)
> ndev->stats.tx_bytes += skb->len;
> else
> - ndev->stats.tx_bytes += xdpf->len;
> + ndev->stats.tx_bytes += frame_len;
> }
>
> /* Deferred means some collisions occurred during transmit,
> @@ -1487,20 +1494,16 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
>
> /* Free the sk buffer associated with this last transmit */
> dev_kfree_skb_any(skb);
> + } else if (txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO) {
> + xdp_return_frame_rx_napi(xdpf);
> } else {
> - if (txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO) {
> - xdp_return_frame_rx_napi(xdpf);
> - } else {
> - struct page *page = virt_to_head_page(xdpf->data);
> -
> - page_pool_put_page(page->pp, page, 0, true);
> - }
> -
> - txq->tx_buf[index].xdp = NULL;
> - /* restore default tx buffer type: FEC_TXBUF_T_SKB */
> - txq->tx_buf[index].type = FEC_TXBUF_T_SKB;
> + page_pool_put_page(page->pp, page, 0, true);
> }
>
> + txq->tx_buf[index].buf_p = NULL;
> + /* restore default tx buffer type: FEC_TXBUF_T_SKB */
> + txq->tx_buf[index].type = FEC_TXBUF_T_SKB;
> +
> tx_buf_done:
> /* Make sure the update to bdp and tx_buf are performed
> * before dirty_tx
> @@ -3230,7 +3233,6 @@ static void fec_enet_free_buffers(struct net_device *ndev)
> {
> struct fec_enet_private *fep = netdev_priv(ndev);
> unsigned int i;
> - struct sk_buff *skb;
> struct fec_enet_priv_tx_q *txq;
> struct fec_enet_priv_rx_q *rxq;
> unsigned int q;
> @@ -3255,18 +3257,23 @@ static void fec_enet_free_buffers(struct net_device *ndev)
> kfree(txq->tx_bounce[i]);
> txq->tx_bounce[i] = NULL;
>
> + if (!txq->tx_buf[i].buf_p) {
> + txq->tx_buf[i].type = FEC_TXBUF_T_SKB;
> + continue;
> + }
> +
> if (txq->tx_buf[i].type == FEC_TXBUF_T_SKB) {
> - skb = txq->tx_buf[i].skb;
> - txq->tx_buf[i].skb = NULL;
> - dev_kfree_skb(skb);
> + dev_kfree_skb(txq->tx_buf[i].buf_p);
> + } else if (txq->tx_buf[i].type == FEC_TXBUF_T_XDP_NDO) {
> + xdp_return_frame(txq->tx_buf[i].buf_p);
> } else {
> - if (txq->tx_buf[i].xdp) {
> - xdp_return_frame(txq->tx_buf[i].xdp);
> - txq->tx_buf[i].xdp = NULL;
> - }
> + struct page *page = txq->tx_buf[i].buf_p;
>
> - txq->tx_buf[i].type = FEC_TXBUF_T_SKB;
> + page_pool_put_page(page->pp, page, 0, false);
> }
> +
> + txq->tx_buf[i].buf_p = NULL;
> + txq->tx_buf[i].type = FEC_TXBUF_T_SKB;
> }
> }
> }
> @@ -3789,13 +3796,14 @@ fec_enet_xdp_get_tx_queue(struct fec_enet_private *fep, int index)
>
> static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
> struct fec_enet_priv_tx_q *txq,
> - struct xdp_frame *frame,
> - u32 dma_sync_len, bool ndo_xmit)
> + void *frame, u32 dma_sync_len,
> + bool ndo_xmit)
> {
> unsigned int index, status, estatus;
> struct bufdesc *bdp;
> dma_addr_t dma_addr;
> int entries_free;
> + u16 frame_len;
>
> entries_free = fec_enet_get_free_txdesc_num(txq);
> if (entries_free < MAX_SKB_FRAGS + 1) {
> @@ -3811,30 +3819,36 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
> index = fec_enet_get_bd_index(bdp, &txq->bd);
>
> if (ndo_xmit) {
> - dma_addr = dma_map_single(&fep->pdev->dev, frame->data,
> - frame->len, DMA_TO_DEVICE);
> + struct xdp_frame *xdpf = frame;
> +
> + dma_addr = dma_map_single(&fep->pdev->dev, xdpf->data,
> + xdpf->len, DMA_TO_DEVICE);
> if (dma_mapping_error(&fep->pdev->dev, dma_addr))
> return -ENOMEM;
>
> + frame_len = xdpf->len;
> + txq->tx_buf[index].buf_p = xdpf;
> txq->tx_buf[index].type = FEC_TXBUF_T_XDP_NDO;
> } else {
> - struct page *page = virt_to_page(frame->data);
> + struct xdp_buff *xdpb = frame;
> + struct page *page;
>
> - dma_addr = page_pool_get_dma_addr(page) + sizeof(*frame) +
> - frame->headroom;
> + page = virt_to_page(xdpb->data);
> + dma_addr = page_pool_get_dma_addr(page) +
> + (xdpb->data - xdpb->data_hard_start);
> dma_sync_single_for_device(&fep->pdev->dev, dma_addr,
> dma_sync_len, DMA_BIDIRECTIONAL);
> + frame_len = xdpb->data_end - xdpb->data;
> + txq->tx_buf[index].buf_p = page;
> txq->tx_buf[index].type = FEC_TXBUF_T_XDP_TX;
> }
>
> - txq->tx_buf[index].xdp = frame;
> -
> status |= (BD_ENET_TX_INTR | BD_ENET_TX_LAST);
> if (fep->bufdesc_ex)
> estatus = BD_ENET_TX_INT;
>
> bdp->cbd_bufaddr = cpu_to_fec32(dma_addr);
> - bdp->cbd_datlen = cpu_to_fec16(frame->len);
> + bdp->cbd_datlen = cpu_to_fec16(frame_len);
>
> if (fep->bufdesc_ex) {
> struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
> @@ -3875,14 +3889,10 @@ static int fec_enet_xdp_tx_xmit(struct fec_enet_private *fep,
> int cpu, struct xdp_buff *xdp,
> u32 dma_sync_len)
> {
> - struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);
> struct fec_enet_priv_tx_q *txq;
> struct netdev_queue *nq;
> int queue, ret;
>
> - if (unlikely(!xdpf))
> - return -EFAULT;
> -
> queue = fec_enet_xdp_get_tx_queue(fep, cpu);
> txq = fep->tx_queue[queue];
> nq = netdev_get_tx_queue(fep->netdev, queue);
> @@ -3891,7 +3901,7 @@ static int fec_enet_xdp_tx_xmit(struct fec_enet_private *fep,
>
> /* Avoid tx timeout as XDP shares the queue with kernel stack */
> txq_trans_cond_update(nq);
> - ret = fec_enet_txq_xmit_frame(fep, txq, xdpf, dma_sync_len, false);
> + ret = fec_enet_txq_xmit_frame(fep, txq, xdp, dma_sync_len, false);
>
> __netif_tx_unlock(nq);
>

2023-08-11 03:14:15

by Wei Fang

Subject: RE: [PATCH V5 net-next 1/2] net: fec: add XDP_TX feature support

> If you add below code comment you can add my ACK in V6:
>
Okay, I will add the annotation to the code in V6. Thanks.

> Acked-by: Jesper Dangaard Brouer <[email protected]>
>
> > @@ -1482,7 +1488,13 @@ fec_enet_tx_queue(struct net_device *ndev,
> u16 queue_id, int budget)
> > /* Free the sk buffer associated with this last transmit */
> > dev_kfree_skb_any(skb);
> > } else {
> > - xdp_return_frame(xdpf);
> > + if (txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO) {
> > + xdp_return_frame_rx_napi(xdpf);
> > + } else {
> > + struct page *page = virt_to_head_page(xdpf->data);
> > +
>
> I think this usage of page_pool_put_page() with dma_sync_size=0 requires a
> comment, else we will forget why this okay...
> I suggest:
>
> /* PP dma_sync_size=0 as xmit already synced DMA for_device */
>
> > + page_pool_put_page(page->pp, page, 0, true);
> > + }
> >
> > txq->tx_buf[index].xdp = NULL;
> > /* restore default tx buffer type: FEC_TXBUF_T_SKB */
> @@ -1541,7


2023-08-11 12:55:44

by Simon Horman

Subject: Re: [PATCH V5 net-next 1/2] net: fec: add XDP_TX feature support

On Thu, Aug 10, 2023 at 02:45:13PM +0800, Wei Fang wrote:
> The XDP_TX feature is not supported before, and all the frames
> which are deemed to do XDP_TX action actually do the XDP_DROP
> action. So this patch adds the XDP_TX support to FEC driver.
>
> I tested the performance of XDP_TX in XDP_DRV mode and XDP_SKB
> mode respectively on i.MX8MP-EVK platform, and as suggested by
> Jesper, I also tested the performance of XDP_REDIRECT on the
> same platform. And the test steps and results are as follows.
>
> XDP_TX test:
> Step 1: One board is used as generator and connects to switch,and
> the FEC port of DUT also connects to the switch. Both boards with
> flow control off. Then the generator runs the
> pktgen_sample03_burst_single_flow.sh script to generate and send
> burst traffic to DUT. Note that the size of packet was set to 64
> bytes and the procotol of packet was UDP in my test scenario. In
> addtion, the SMAC of the packet need to be different from the MAC

nit: addtion -> addition

checkpatch.pl --codespell is your friend here.

...

2023-08-11 15:43:43

by Jesper Dangaard Brouer

Subject: Re: [PATCH V5 net-next 1/2] net: fec: add XDP_TX feature support



On 11/08/2023 03.26, Wei Fang wrote:
>> If you add below code comment you can add my ACK in V6:
>>
> Okay, I will add the annotation to the code in V6. Thanks.
>

Great, one adjustment to my suggested comment below.

>> Acked-by: Jesper Dangaard Brouer <[email protected]>
>>
>>> @@ -1482,7 +1488,13 @@ fec_enet_tx_queue(struct net_device *ndev,
>> u16 queue_id, int budget)
>>> /* Free the sk buffer associated with this last transmit */
>>> dev_kfree_skb_any(skb);
>>> } else {
>>> - xdp_return_frame(xdpf);
>>> + if (txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO) {
>>> + xdp_return_frame_rx_napi(xdpf);
>>> + } else {
>>> + struct page *page = virt_to_head_page(xdpf->data);
>>> +
>>
>> I think this usage of page_pool_put_page() with dma_sync_size=0 requires a
>> comment, else we will forget why this okay...
>> I suggest:
>>
>> /* PP dma_sync_size=0 as xmit already synced DMA for_device */
>>

I update my suggestion to:

/* PP dma_sync_size=0 as XDP_TX already synced DMA for_device */

Reading the code path, there is a simple "else" to reach this spot, and
it would be good to hint to the code reader that this path deals with
XDP_TX completion handling.

You are of course free to come up with a better comment yourself.
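
For illustration, with that comment added the XDP_TX completion branch
in fec_enet_tx_queue() would read roughly like this (based on the hunk
quoted above):

} else {
        if (txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO) {
                xdp_return_frame_rx_napi(xdpf);
        } else {
                struct page *page = virt_to_head_page(xdpf->data);

                /* PP dma_sync_size=0 as XDP_TX already synced DMA for_device */
                page_pool_put_page(page->pp, page, 0, true);
        }

        txq->tx_buf[index].xdp = NULL;
        /* restore default tx buffer type: FEC_TXBUF_T_SKB */
        txq->tx_buf[index].type = FEC_TXBUF_T_SKB;
}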

>>> + page_pool_put_page(page->pp, page, 0, true);
>>> + }
>>>
>>> txq->tx_buf[index].xdp = NULL;
>>> /* restore default tx buffer type: FEC_TXBUF_T_SKB */
>> @@ -1541,7
>

--Jesper

2023-08-12 02:56:02

by Wei Fang

Subject: RE: [PATCH V5 net-next 1/2] net: fec: add XDP_TX feature support

> On Thu, Aug 10, 2023 at 02:45:13PM +0800, Wei Fang wrote:
> > The XDP_TX feature is not supported before, and all the frames which
> > are deemed to do XDP_TX action actually do the XDP_DROP action. So
> > this patch adds the XDP_TX support to FEC driver.
> >
> > I tested the performance of XDP_TX in XDP_DRV mode and XDP_SKB mode
> > respectively on i.MX8MP-EVK platform, and as suggested by Jesper, I
> > also tested the performance of XDP_REDIRECT on the same platform. And
> > the test steps and results are as follows.
> >
> > XDP_TX test:
> > Step 1: One board is used as generator and connects to switch,and the
> > FEC port of DUT also connects to the switch. Both boards with flow
> > control off. Then the generator runs the
> > pktgen_sample03_burst_single_flow.sh script to generate and send burst
> > traffic to DUT. Note that the size of packet was set to 64 bytes and
> > the procotol of packet was UDP in my test scenario. In addtion, the
> > SMAC of the packet need to be different from the MAC
>
> nit: addtion -> addition
>
> checkpatch.pl --codespell is your friend here.
>
Thanks for the reminder.